Challenges of Product Matching
The number of products offered via e-shops has grown swiftly as Internet-based e-services have risen in popularity in recent years. According to a recent report, total e-commerce wholesale sales in the U.S. were $96.3 billion in 2016. Despite this, consumers still face a major problem when searching for and buying a product or service. Many different e-shops sell the same product, yet the information each e-shop provides about it varies greatly. There are no global identifiers for products, and most offers are not linked. As a result, there is no straightforward way for individuals to obtain all of the information they require, or the best prices, for the items they wish to purchase. To make it easier for people to find what they want, many product aggregator sites have emerged, such as Google Product Search, PriceGrabber, and Shopzilla. These websites combine and categorize products from a variety of e-commerce stores and merchants. Even so, it is not easy to determine which online stores are selling the same item and to merge that information into a single view. So that people can better navigate and search for products, product aggregator websites must organize all of their products into a taxonomy.
Over the last few years, various methodologies and approaches to product similarity and classification for websites have been investigated. Many studies have addressed the issue of "product similarity matching", and comparable difficulties have been framed in various ways. Both supervised and unsupervised learning have been utilized to solve product matching challenges. (Pranckevičius and Marcinkevicius, 2017)
Data about products may be structured or unstructured. When data is organized into numerous attributes with corresponding values, it is called structured data. Unstructured product data consists of a textual description containing attribute values and, in some cases, attribute names. If the product data is unstructured, attribute values are usually extracted before matching. (Alpaydin, 2010)
Similarity discovery across online items in listings is often a backend procedure. The Listing2Query technique finds similarities based on the text and image semantics of both the listed product description and its images (Thomas Di Martino, 2020). In work on intent-driven similarity for e-commerce, a bidirectional recurrent neural network is used to serialize the importance weights during Listing2Query training. This approach markedly improves similarity search in a simple manner, potentially outperforming the popular BERT method. (Frakes et al., 1992)
As a product aggregator, we constantly receive product offers from e-commerce sites that we would like to include on our website. These offers may be for brand-new products that we do not yet have, or for existing products that are already in our database from another e-shop (Lumenlearning.com, 2015). With such a large volume of incoming offers, human decision-making in real time, or even with days of delay, is impossible. To deal with this effectively, we need some sort of automated method. (Edwardlowe.org, 2022)
Product Aggregator Websites and Taxonomy
More specifically, we have a table in our database with about 50 million rows representing various products and e-shop offers for those products. Comparing an incoming offer against all of those products would be impossible. So, given an incoming offer oi, we must first efficiently find a set C = {c1, …, cN} of N candidate products. Each pair (oi, cj) is then evaluated, and the product with the highest probability is picked. If no product has a high enough likelihood of being a match, a new one is generated, or the offer is assigned for manual assessment if there is any doubt. (Cohen, 2021)
Because they currently locate similar products using text-based methodologies, several e-commerce organizations encounter product recommendation constraints. The idea of searching by similarity is gaining traction among e-commerce giants as a way to give customers a better search and browsing experience (Mustafa, 2020). A recently published report, "Principal Component Analysis with k-means Clustering", first applies principal component analysis as a dimensionality reduction step and then uses k-means clustering to find clusters. The results are used to compare unsupervised machine learning techniques such as BIRCH, k-medoids, mini-batch k-means, Gaussian mixture models, and agglomerative clustering (Dmitriy, 2019). The likelihood that shoppers miss a terrific deal simply because of the price one specific seller charges for a product is extremely high across all available e-commerce merchants. The problem comes in two forms: near-identical items (matching in color, design, kind, and style) offered by various merchants, and equivalent substitute items that appeal to the same purchasers. (Small Business – Chron.com, 2019)
Second, finding similarities among the millions of readily available commodities raises large-scale data processing issues. To tackle these problems, the Amazon data report describes a configurable solution based on a product similarity service that combines deep neural network methods with distributed computing technology. The result is a scalable and flexible solution for both categorization and verification tasks. (Zuo et al., 2020)
Given that our data set consists of both text and images, it is possible to find similarities between words as well as through image matching. Content-based image retrieval on its own, however, is non-trivial. A product description combined with its related images may be able to cover the entire search for near-identical items (Günter Röhrich, 2020). In the recently published study "Content-based image retrieval: A review of current trends", the deep characteristics of items are recovered using a CNN and a sparse representation. In experiments, the proposed CNN-based method produced good speed and accuracy results in comparison to state-of-the-art techniques. (Cogent Engineering, 2021)
Methodologies and Approaches to Product Similarity Matching
The scarcity of links between e-commerce websites makes it hard to determine whether two product offers on different web pages refer to the same product. Product matching is therefore about finding pairs or groups of offers that refer to the same product. Earlier work enriched products with attribute information by using a Naive Bayes algorithm to find attribute-value pairs in product text on the Web. (Vistex, 2021)
Then, hand-written regular expressions are used to find certain features in the title and description of the products. In contrast, other models use named entity recognition to find features. Both of these methods use a CRF model (Paperswithcode, 2020). However, such a model can only find specific attribute-value pairs, which we enhance here: in this report, we add more features to the CRF model, making it able to recognize more attribute-value pairs. (Hyperscience, 2021)
The Silk rule learning framework is used in the first approach to product matching. It combines features from the product descriptions, such as a bag of words, a dictionary, or regular expressions, in different ways.
In this report on product matching, we use neural word embeddings and CNNs to cut down on the labeling work required by supervised approaches. We also review the most important work on using neural word embeddings with deep learning to extract product features. Word embeddings are used in several methods to derive features from product data, both for the matching task and for other product data problems.
Image-based product matching has been applied in a few industries, including fashion and interior design. One method matches products across two scenarios: street photos and images from a web store. It employs a CNN to learn image embeddings, which are then used as input for a classification problem like ours, similar to how we retrieve image features. With the recent advances in deep learning, including neural networks for image processing, there are numerous ways to recommend items based on the visuals they display. (Gandomi et al., 2013)
It is not unexpected that image-based matching approaches work effectively in the domains covered by the works cited above, because those products are primarily visual. This cannot be expected in all fields, however. Because product attributes in industries such as electronics are primarily textual (descriptions, specifications, technical fact sheets), we cannot rely solely on learned image features in this work. (ACM Transactions on Graphics (TOG), 2015)
In product classification, the task is to assign a set of labels from a product hierarchy to an item. It is important to classify products from different websites in the same way, because not all websites use a hierarchy, and those that do are unlikely to use the same one. (Forsey, 2021)
While there are many different ways to categorize product data, (Kozareva, 2014) is one of the few methods that uses neural text embeddings as features for the task, feeding the feature vectors into a linear classification model. The method has since been extended: instead of a linear classification model, a two-level ensemble is used.
Similarity Discovery Across Online Items in Listings
Table 1

| Field | Value |
| --- | --- |
| Product Title | DELL Desktop Optiplex 7010 Intel Core I5 3.40GHz 4GB 500GB HDD Win 10 |
| Product Description | With a 4th Generation I5 Processor plus Windows 8.1 preloaded, you can get things done swiftly. Desktop OptiPlex 7010 |
| Product URL | https://www.example.com/dell-desktop-optiplex-7010 |
| Attributes | Brand: Dell; Type: Optiplex 7010; Memory: 4GB; HDD: 500GB; Operating System: Windows 10; Color: Black |
In this example (Table 1), we see a record for an organized, structured product.
In our product matching method, we use numerous feature extraction approaches to build a set of features useful for the item matching problem, i.e., detecting identical products.
We have a database A of structured products as well as a Web repository P of unstructured product descriptions. Every record a ∈ A includes a title, description, URL, and a set of attribute-value pairs extracted from the product title; attributes may be numeric, categorical, or free-text. Each record p ∈ P has an unstructured textual title and description, as well as a product image. Our goal is to use the structured data from product set A as supervision for detecting duplicate entries in product set P, in conjunction with neural text embeddings built from all of product set P's records.
To be more explicit, we use the structured data as a starting point for training a feature extraction model that can recover attribute-value pairs from the unstructured product descriptions in P. After the feature extraction model is applied, each product p ∈ P is represented by a set of attributes Fp = {f1, f2, …, fn}, where the attributes are numeric or categorical. The attribute vectors are then used to build a deep learning model to find matching products. (Sagar, 2020)
Our method for matching products comprises three basic steps:
- Feature extraction.
- Computing similarity feature vectors.
- Classification.
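The three steps can be wired together as in the following minimal sketch; the helper callables (extract_features, similarity_vector, classifier) are hypothetical placeholders standing in for the models described in the next subsections, not the report's implementation.

```python
# Minimal sketch of the three-step matching pipeline. The helper names
# (extract_features, similarity_vector) are hypothetical placeholders.
from itertools import combinations

def match_products(products, extract_features, similarity_vector, classifier):
    """Return the candidate pairs that the classifier labels as matches."""
    # Step 1: feature extraction for every unstructured product record.
    features = {p["id"]: extract_features(p) for p in products}

    # Step 2: compute a similarity feature vector for each candidate pair.
    pairs = list(combinations(features, 2))
    vectors = [similarity_vector(features[a], features[b]) for a, b in pairs]

    # Step 3: classify each pair as matching / non-matching.
    labels = classifier.predict(vectors)
    return [pair for pair, label in zip(pairs, labels) if label == 1]
```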
Our system’s overall design is depicted in Figure 1. There are two stages to the workflow: training and application.
The pre-processing of structured and unstructured Web product descriptions begins the training phase. We then construct four feature extraction models:
- Dictionary-based – We compile a list of product attributes and their associated values from the structured product descriptions.
- Conditional Random Field (CRF) – A CRF model is trained with a set of discrete features.
- CRF with Text Embeddings – To cope with changing text patterns in product descriptions, we add text embedding features to enhance the training of the previous CRF model.
- Image Feature Extraction Model – In addition to the textual attributes, we develop an image embeddings model.
While the feature extraction portion contains several time-consuming components, such as training the CRF and image embedding models, it is important to remember that these operations are conducted only once and can be completed in an offline pre-processing phase. Products can be matched quickly at run-time because the previously trained CRF and embedding models only need to be applied at that stage. (Zhao et al., 2018)
Following that, we manually label a small training set of matching and non-matching pairs of unstructured product descriptions. The similarity feature vectors of the labeled training product pairs are then computed. These similarity feature vectors are used in the final stage to train a classification algorithm that can distinguish between matching and non-matching pairs. After the training step is completed, we have a trained feature extraction model and a classification model. (dishashree26, 2017)
During the application phase, we construct a set M of all possible candidate matching pairs, resulting in a large number of candidates, i.e., |M| = |P|·(|P| − 1)/2. The feature extraction model is then used to extract attribute-value pairs and create feature similarity vectors. In the final phase, we apply the previously trained classification model to find product pairs that match. (Kruschke, 2001)
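To see why the candidate set M cannot be enumerated naively at our scale, the quadratic growth of |M| = |P|(|P| − 1)/2 can be checked directly:

```python
from math import comb

# Number of candidate pairs grows quadratically with the catalogue size:
# |M| = |P| * (|P| - 1) / 2.
for n in (1_000, 100_000, 1_000_000):
    print(f"|P| = {n:>9,} -> |M| = {comb(n, 2):,}")
```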
Various approaches to extracting features from unstructured and structured product details are being investigated.
Automated Decision-Making and Product Recommendation
To create a dictionary of attributes and values, we use the dataset A of structured products. Let F be the set of all attribute names in A. The dictionary is an inverted index D built from A: for an attribute value v, D(v) yields the name of the corresponding attribute f ∈ F. In Example 1, D(Black) yields the attribute Color, while D(Dell) yields the attribute Brand. Then, to extract attributes from a given product description p ∈ P, we construct all possible token n-grams (n ≤ 4) from the text and compare them to the dictionary values. If there are several matches, we simply choose the longest n-gram, and ties involving multiple n-grams of the same maximal length are broken by random selection. (Egorov, Yuryev and Daraselia, 2004)
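A minimal sketch of this dictionary-based extractor, assuming the inverted index D is a plain Python mapping from a lower-cased value string to its attribute name; the entries shown are illustrative only:

```python
# Dictionary-based attribute extraction: longest-match-first over token
# n-grams (n <= 4), with random tie-breaking between same-length matches.
import random

D = {"black": "Color", "dell": "Brand", "optiplex 7010": "Type", "4gb": "Memory"}

def extract_attributes(description, max_n=4):
    tokens = description.lower().split()
    found = {}
    # Generate all token n-grams, longest first.
    for n in range(max_n, 0, -1):
        ngrams = [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        matches = [g for g in ngrams if g in D]
        random.shuffle(matches)  # ties between same-length n-grams: random pick
        for g in matches:
            found.setdefault(D[g], g)  # keep the first (longest) match per attribute
    return found

print(extract_attributes("DELL Desktop Optiplex 7010 4GB Black"))
# {'Type': 'optiplex 7010', 'Brand': 'dell', 'Memory': '4gb', 'Color': 'black'}
```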
Conditional random field models are a popular way to label text in natural language processing. A conditional sequence model defines a conditional distribution over label sequences given a particular observation sequence. (Prasad, 2019)
We use the Stanford CRF implementation to train the product CRF models in this investigation. We train the CRF model with a large set of discrete features from the Stanford NER model's standard distribution: current word, previous word, next word, character n-grams of the current word (n ≤ 6), current part-of-speech (POS) tag, surrounding POS tag sequence, current word shape, surrounding word shape sequence, presence of the word in the left window (size = 4), and presence of the word in the right window (size = 4).
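Stanford's CRF is a Java tool; as a rough stand-in, the sketch below trains a CRF with the sklearn-crfsuite package on a toy example, using only a small subset of the discrete features listed above (neighbouring words and character n-grams). The labels and data are illustrative:

```python
# CRF sequence labeling sketch with sklearn-crfsuite (a substitute for
# the Stanford CRF used in the report).
import sklearn_crfsuite

def token_features(tokens, i):
    word = tokens[i]
    feats = {
        "word": word.lower(),
        "prev_word": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next_word": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
    }
    # Character n-grams of the current word (n <= 6).
    for n in range(2, 7):
        for j in range(len(word) - n + 1):
            feats[f"char_{n}gram_{word[j:j+n].lower()}"] = True
    return feats

# Toy training data: one tokenized title with BIO-style attribute labels.
X = [[token_features(["Dell", "Optiplex", "7010", "Black"], i) for i in range(4)]]
y = [["B-Brand", "B-Type", "I-Type", "B-Color"]]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
crf.fit(X, y)
print(crf.predict(X))
```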
Once all attribute-value pairs have been extracted from the given dataset of offers, we normalize the attribute values. We apply the same attribute normalization technique throughout, which includes recognizing the attribute type, text standardization, numeric simplification, and number normalization with a unit of measurement. The diagram below shows the layout of a classification tree and the process of building one. (Ic.ac.uk, 2022)
At each node, a decision must be made about whether splitting should continue or whether the node should be a terminal node. If splitting proceeds until all observations at each endpoint belong to the same class, the tree inevitably grows huge and overfits the data, which leads to a high error rate on an unseen test set. Pruning is a strategy for reducing the test error rate, and cost-complexity pruning is a popular method for extracting the best subtrees from a large tree. Conversely, if splitting stops before the nodes separate observations from distinct classes, the tree stays small and may under-fit the data.
Both regression and classification problems can be solved using a Decision Tree (DT). Decision tree methods are simple to understand and visualize. In a classification tree, recursive binary splitting is used to break the feature space into subsets (regions). The tree's terminal nodes are the leaves, which hold the predicted outcomes: every observation is assigned the most common class among the training observations in its region. (Lucidchart, 2022)
Consider an example of a binary classification tree, with a1 and a2 as thresholds on variable x1 and b1, b2, and b3 as thresholds on variable x2. Classification begins at the root node, the tree's top node. If the value of x2 is greater than or equal to the threshold b2, the right child node is chosen; otherwise, the left child node is picked. At the child node, the value of x1 is compared to the a1 or a2 threshold. The process continues until the observation reaches a terminal node (leaf), which defines how it is categorized. (Smartdraw, 2022)
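The growing-then-pruning behaviour described above can be reproduced with scikit-learn's cost-complexity pruning; this is a generic illustration on synthetic data, not the report's model:

```python
# A binary classification tree with cost-complexity pruning on synthetic
# two-feature data (x1, x2 as in the example above).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# An unpruned tree grows until the leaves are pure and tends to overfit;
# ccp_alpha > 0 applies cost-complexity pruning to shrink it.
for alpha in (0.0, 0.01):
    tree = DecisionTreeClassifier(ccp_alpha=alpha, random_state=0)
    tree.fit(X_train, y_train)
    print(f"ccp_alpha={alpha}: depth={tree.get_depth()}, "
          f"test accuracy={tree.score(X_test, y_test):.3f}")
```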
Commonalities in E-Commerce Organizations
For identical products, several e-commerce sites use similar images. As a result, the image can be used as a guide to finding similar items. In this paper, we use deep Convolutional Neural Networks (CNNs), which are among the most widely used image processing algorithms. CNN models are typically made up of a series of convolutional layers connected in a row, followed by fully connected layers. In each convolutional layer, several small filters are convolved over the input image. Different features in the image trigger different filters, since the weights of every filter are randomly initialized. Some filters, for example, may be triggered by a specific structure in the image, while others are triggered by a specific color. Each convolutional layer is typically followed by a pooling layer, which subsamples the data to reduce the number of features. The output of the final convolutional block is connected to a standard feed-forward neural network for the given task. (Patel, 2020; Ishrat Jahan Ananya, 2021)
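A minimal Keras sketch of such a layout follows; the input shape and layer sizes are illustrative, not the network used in this work:

```python
# Stacked convolution + pooling layers followed by a fully connected head.
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(128, 128, 3)),
    tf.keras.layers.Conv2D(32, 3, activation="relu"),  # small filters slide over the image
    tf.keras.layers.MaxPooling2D(),                    # pooling subsamples the feature maps
    tf.keras.layers.Conv2D(64, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(128, activation="relu"),     # standard feed-forward head
    tf.keras.layers.Dense(10, activation="softmax"),   # task-specific output
])
model.summary()
```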
Figure 2
After the feature extraction is finished, we can create an attribute space F = {f1, f2, …, fn} that contains all of the obtained attributes, including the image vectors. To assess how similar each candidate product pair is, we generate a similarity feature vector F(pi, pj). For two products p1 and p2, represented by the attribute values Fp1 = {f1v1, f2v1, …, fnv1} and Fp2 = {f1v2, f2v2, …, fnv2}, we compute the similarity feature vector F(p1, p2) by computing a similarity score for each attribute f in the attribute space F. Let p1.val(f) and p2.val(f) be the p1 and p2 attribute values, respectively. A mixed measure, chosen according to the attribute data type, estimates the similarity between p1 and p2 for attribute f. Jaccard similarity is computed with TF-IDF weights on character n-grams (n ≤ 4), while cosine similarity is computed with TF-IDF weights on word tokens. (Han, Kamber and Pei, 2012)
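A minimal sketch of the per-attribute similarity computation on two toy products; note that the report applies TF-IDF weighting to the character n-grams as well, which is omitted here for brevity:

```python
# Jaccard similarity over character n-grams (n <= 4) and cosine similarity
# over TF-IDF-weighted word tokens, computed per shared attribute.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def char_ngrams(text, max_n=4):
    text = text.lower()
    return {text[i:i + n] for n in range(2, max_n + 1)
            for i in range(len(text) - n + 1)}

def jaccard(a, b):
    sa, sb = char_ngrams(a), char_ngrams(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def tfidf_cosine(a, b):
    tfidf = TfidfVectorizer().fit_transform([a, b])
    return cosine_similarity(tfidf[0], tfidf[1])[0, 0]

p1 = {"Brand": "Dell", "Type": "Optiplex 7010"}
p2 = {"Brand": "DELL", "Type": "OptiPlex 7010 MT"}

# Two similarity scores per shared attribute form the vector F(p1, p2).
vector = []
for f in sorted(p1.keys() & p2.keys()):
    vector.append(jaccard(p1[f], p2[f]))
    vector.append(tfidf_cosine(p1[f], p2[f]))
print(vector)
```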
The product matching problem is modeled as a two-class classification problem in which two products are either matching or non-matching. Because there are far more non-matching pairs than matching pairs, this classification task is substantially imbalanced. After generating the similarity feature vectors, we train four different classifiers that are typically used for this type of task. We attempt to find the classifiers that best handle the high class imbalance of the item matching task, where only a few matched product pairs (positive class) exist alongside a significant number of non-matching product pairs. (Stars project, 2022)
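As an illustration of the imbalance issue, the sketch below trains one of the usual classifiers on synthetic similarity vectors with roughly 1% positives; class_weight="balanced" and an F1-based score are generic mitigations, not necessarily the report's exact setup:

```python
# Class imbalance sketch: far more non-matching pairs than matching ones.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for similarity feature vectors: ~1% positive pairs.
X, y = make_classification(n_samples=20_000, n_features=10, weights=[0.99],
                           random_state=0)

# class_weight="balanced" re-weights errors on the rare positive class.
clf = RandomForestClassifier(n_estimators=100, class_weight="balanced",
                             random_state=0)
# F1 is a better yardstick than accuracy under heavy imbalance.
print(cross_val_score(clf, X, y, scoring="f1", cv=3).mean())
```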
K-means Clustering for Better Search Viewing Experiences
There are two steps in the product categorization process: feature extraction, followed by classification.
- Feature Extraction
We employ both supervised and unsupervised algorithms to extract features.
We make use of feature sets similar to those used in product matching.
To extract attributes from product descriptions for product categorization, we employ a dictionary-based technique. The dictionary-based technique begins with a vocabulary that includes all attribute-value pairs in the structured product dataset. After the features are retrieved from the text, the value of each feature is tokenized and lower-cased, and fragments shorter than three letters are removed. Each attribute's value tokens are then concatenated with the feature name to produce the final feature value. (Liang et al., 2017)
To address the challenge of the variety of new items and their constantly changing descriptions, we employ neural language modeling to extract text embeddings from unstructured product descriptions. Because our product information consists of whole documents for classification purposes, we create text embeddings that cover the entire document. The most widely used neural language model for document-level text embedding is paragraph2vec (doc2vec), an extension of word2vec. (Quan, Wang and Ren, 2014)
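A minimal paragraph2vec sketch with gensim on toy titles; dm=1 selects the DM architecture and dm=0 the DBOW architecture referred to later in this report:

```python
# Whole-document embeddings with gensim's Doc2Vec (paragraph2vec).
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

docs = [
    TaggedDocument("dell optiplex 7010 desktop i5 4gb".split(), ["p1"]),
    TaggedDocument("dell optiplex 7010 mt core i5".split(), ["p2"]),
    TaggedDocument("samsung galaxy s7 smartphone 32gb".split(), ["p3"]),
]

model = Doc2Vec(docs, vector_size=50, min_count=1, epochs=40, dm=1)
# Infer a whole-document embedding for an unseen product description.
vec = model.infer_vector("dell optiplex desktop 4gb".split())
print(model.dv.most_similar([vec], topn=2))
```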
For extracting image embeddings, we employ the same CNN model. In this scenario, we use the image vectors in their entirety, i.e., the full image vectors, for image categorization. (Zheng and Zhu, 2019)
To create a classification model, we feed the relevant features from each feature extraction approach into classifiers such as Support Vector Machines, Naive Bayes, k-Nearest Neighbors (with k = 1), and Random Forest. Unlike product matching, product categorization is a multi-class classification problem, because there are significantly more than two product categories. (Rustam, Sudarsono and Sarwinda, 2019)
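A generic comparison of the four named classifiers on synthetic multi-class data (stand-in features, not the extracted product attributes):

```python
# Comparing SVM, Naive Bayes, k-NN (k=1), and Random Forest on a
# synthetic multi-class categorization task.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=2_000, n_features=20, n_classes=5,
                           n_informative=10, random_state=0)

classifiers = {
    "SVM": SVC(),
    "Naive Bayes": GaussianNB(),
    "k-NN (k=1)": KNeighborsClassifier(n_neighbors=1),
    "Random Forest": RandomForestClassifier(random_state=0),
}
for name, clf in classifiers.items():
    print(name, cross_val_score(clf, X, y, cv=3).mean().round(3))
```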
We’ll look at the product matching pipeline in this section.
Table 2

| Dataset | #Items | #Similar pairs | #Non-similar pairs |
| --- | --- | --- | --- |
| Computers | 325 | 201 | 26,359 |
| TVs | 433 | 326 | 65,978 |
| Smartphones | 312 | 526 | 26,589 |
We start by describing the datasets used, which include a dataset of structured products for supervision and a collection of unstructured product descriptions. We begin with the text feature extraction model, because it is the most important part of our process. A CRF model based purely on discrete features is compared with a CRF model that uses both continuous and discrete features retrieved from word embeddings. We look into the structure of the word vector representations in order to fully understand how they are used in CRF model training. The experiment design is given, followed by the final results of the overall product matching pipeline. In the last phase, the proposed strategy is subjected to an error analysis. (Ecfr.gov, 2022)
For supervision, we use Gemini Product Ads (GPA), and for assessment, we use a piece of the WebDataCommons extraction.
- For our research, we choose a sample of multiple product categories from the Gemini Product Ads database. The dictionary and CRF feature extraction models are based on this dataset.
- We use an extract from WebDataCommons that has over 5 billion entities marked up in one of the three main HTML markup formats (Microdata, Microformats, and RDFa).
Table 3

| Dataset | #training | #test | #atts. | CRF P | CRF R | CRF F1 | CRFemb P | CRFemb R | CRFemb F1 | ΔF1 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Computers | 2,330 | 1,000 | 27 | 94.81 | 93.35 | 94.08 | 93.67 | 93.20 | 93.43 | −0.65 |
| Televisions | 2,436 | 1,040 | 35 | 96.20 | 94.31 | 95.25 | 96.41 | 94.85 | 95.62 | +0.37 |
| Mobile Phones | 2,220 | 1,010 | 35 | 97.62 | 96.13 | 96.87 | 96.72 | 95.84 | 96.27 | −0.60 |
- This dataset focuses on product items that have been tagged with Microdata using the schema.org vocabulary. To do so, we use the s:Product annotation on a subset of instances. The collection has more than 280 million entities, which equates to 280 million RDF quads. The attributes s:name and s:description are used to extract attribute-value pairs, and s:Product/image is used for the image embeddings.
- Although WDC is commonly thought of as structured data, most of our entities are only described by three attributes (title, description, and image), making the data rather unstructured.
- Two entities are regarded as matching products if they include sufficient information to be identifiable and point to the very same product. It is worth mentioning that the entities' product features are not necessarily the same. Two annotators separately annotate the dataset, with disagreements resolved cooperatively.
- Extracting suitable attribute-value pairs with high coverage of the product's textual information is crucial for product matching. We therefore compare the performance of a CRF model that uses only discrete features against a CRF model that uses both continuous and discrete features from word embeddings (CRFemb in the following).
- The CRFemb model is trained using both CBOW and Skip-Gram neural models for the word embeddings. To train the models, we use the whole WDC and GPA datasets. (Hhs.gov, 2021)
To evaluate the efficiency of the product matching technique, we use traditional performance measures: Precision (P), Recall (R), and F-score (F1). The F-score reflects the trade-off between recall and precision and is computed as their harmonic mean. (Analytics Vidhya, 2020)
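As a quick check of the metric, the harmonic mean can be computed directly; the example values reproduce the CRF Computers row of Table 3 (expressed as fractions rather than percentages):

```python
# The F-score as the harmonic mean of precision and recall:
# F1 = 2 * P * R / (P + R).
def f1(precision, recall):
    return 2 * precision * recall / (precision + recall)

print(f1(0.9481, 0.9335))  # ~0.9408, matching the CRF Computers row
```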
Our method is compared against three baselines. First, we use Bag-of-Words TF-IDF cosine similarity to match the products, reporting the highest score across match thresholds: we vary the threshold between zero and one and treat any pair with similarity above the threshold as a matching pair. Different combinations of product attributes were used to determine similarity, but the best results are achieved when only the item title is employed.
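A minimal sketch of this first baseline on toy titles, sweeping the match threshold and keeping the best F1:

```python
# Bag-of-Words TF-IDF cosine baseline with a threshold sweep.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score
from sklearn.metrics.pairwise import cosine_similarity

titles_a = ["dell optiplex 7010 i5 4gb", "samsung galaxy s7 32gb"]
titles_b = ["dell optiplex 7010 desktop", "iphone 6s 64gb silver"]
labels = np.array([1, 0])  # gold matching labels for the aligned pairs

tfidf = TfidfVectorizer().fit(titles_a + titles_b)
sims = cosine_similarity(tfidf.transform(titles_a),
                         tfidf.transform(titles_b)).diagonal()

# Sweep the threshold between 0 and 1 and report the best F1.
best = max((f1_score(labels, sims >= t, zero_division=0), t)
           for t in np.linspace(0, 1, 101))
print(f"best F1={best[0]:.2f} at threshold {best[1]:.2f}")
```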
As a second baseline, we employ generated document vectors. For each dataset, we create both DM and DBOW paragraph2vec models and experiment with several vector sizes. At various matching thresholds, we calculate the cosine similarity between every pair of vectors and report the best result. (Codescracker, 2022)
As a third baseline, we use the Silk Link Discovery Framework, an open-source tool for discovering links between data items in different data sources. Using genetic programming, the tool learns linkage rules based on textual attributes. In this experiment, we extract attributes from the product name using our CRF model, then represent the extracted data in RDF format. The evaluation is carried out using a 10-fold cross-validation procedure. (note, 2022)
An embedding is a relatively low-dimensional space into which high-dimensional vectors can be translated. Embeddings make machine learning on large inputs, such as sparse vectors encoding words, easier. Ideally, an embedding captures some of the input's semantics by grouping semantically similar inputs close together in the embedding space. An embedding can be learned once and then reused across models. (Google Developers, 2020)
Using state-of-the-art image models and fine-tuning them on our dataset is preferable to constructing our own model for embedding creation. When these pre-trained models are used without any fine-tuning, the results are mediocre (average F1 score of 0.57), whereas the fine-tuned model performs significantly better (average F1 score of 0.75). The model used to generate the image embeddings is depicted in the image below.
The model fine-tuning method is inspired by the facial recognition system, with the ArcFace Margin Layer replacing the softmax layer in the model during the fine-tuning phase.
Unlike Softmax, it deliberately optimizes feature embeddings to ensure better similarity between data of the same class, resulting in higher embedding quality.
After creating the embeddings, the goal is to use the k-Nearest-Neighbour algorithm with cosine similarity to make accurate predictions. The sklearn implementation cannot be used for this volume of input data, since it causes an out-of-memory error. As a result, the RAPIDS library is used: an open-source framework for accelerating data science by enabling end-to-end data science pipelines to run entirely on GPUs. (Beaumont, 2020)
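A sketch of the GPU k-NN step, assuming RAPIDS cuML, CuPy, and a GPU are available; since cosine similarity is monotonically related to Euclidean distance on L2-normalized vectors, a plain Euclidean index gives cosine-ordered neighbours:

```python
# GPU k-NN over embeddings with RAPIDS cuML (random data as a stand-in).
import cupy as cp
from cuml.neighbors import NearestNeighbors

embeddings = cp.random.rand(10_000, 512, dtype=cp.float32)
embeddings /= cp.linalg.norm(embeddings, axis=1, keepdims=True)  # L2-normalize

nn = NearestNeighbors(n_neighbors=50)  # Euclidean on unit vectors ~ cosine ranking
nn.fit(embeddings)
distances, indices = nn.kneighbors(embeddings)

# Keep neighbours under a distance threshold as predicted matches
# (each row's nearest neighbour is the point itself, at distance 0).
threshold = 0.9
matches = [idx[d < threshold] for d, idx in zip(distances, indices)]
```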
To obtain the final picture-based predictions, predictions from all of the distinct image models are blended using one of the prediction methodologies outlined further in the document. (Google Cloud Blog, 2020)
TfidfVectorizer and Sentence Transformer are used to encode the product’s text label into word embeddings.
TfidfVectorizer uses TF-IDF values to build word embeddings for each label; TF-IDF captures the importance of a word within the document collection. (Analytics Vidhya, 2021)
Sentence Transformers is a framework that makes it simple to generate vector representations of text using transformer networks such as BERT, RoBERTa, and others. In this application, a pretrained transformer is used to generate sentence embeddings for determining the semantic similarity of text data. (Sbert.net, 2019)
After embedding generation, both the TF-IDF and transformer embeddings can be fed to either of the prediction algorithms for the final prediction calculation.
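A minimal sketch generating both kinds of text embeddings; the checkpoint name "all-MiniLM-L6-v2" is an illustrative pretrained model, not necessarily the one used in this work:

```python
# TF-IDF and transformer embeddings for product text labels.
from sentence_transformers import SentenceTransformer
from sklearn.feature_extraction.text import TfidfVectorizer

labels = ["dell optiplex 7010 desktop i5", "samsung galaxy s7 edge 32gb"]

tfidf_emb = TfidfVectorizer().fit_transform(labels)  # sparse TF-IDF vectors
bert_emb = SentenceTransformer("all-MiniLM-L6-v2").encode(labels)  # dense vectors

print(tfidf_emb.shape, bert_emb.shape)
```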
Cosine similarity determines whether two vectors point in the same direction by calculating the cosine of the angle between them. To generate the final predictions, a minimum similarity threshold is determined, and all data points with a similarity value greater than the threshold are returned as predictions: the higher the similarity value, the more closely the data points are related.
NearestNeighbour is a popular approach for retrieving a required number of data points according to a given distance measure. By deciding on a threshold distance, we can find credible predictions: all samples at a distance less than the threshold are returned as predictions. The shorter the gap between sample points, the closer the association between them.
The first strategy outperforms the second by a small margin: the first uses merged embeddings (both image and text embeddings) to produce predictions, whereas the second merges separate predictions. The open-source library RAPIDS, developed by NVIDIA, is used to implement both prediction algorithms. (Khojasteh, Hadi Abdi et al., 2020)
The average F1 score was used to evaluate performance: the F1 score is computed for each data item, and then the mean of all F1 scores is taken. The F1 score measures a test's accuracy and is computed from the test's precision and recall. (the, 2017)
To gain additional insight into our method's performance, we conducted a manual error analysis of its output. More specifically, we inspected the false negatives ourselves to identify the different types of errors.
- Approach Errors – This category encompasses all failures caused by any of the components in the approach pipeline:
- Inability to retrieve important attribute-value pairs from a product description due to a complex description structure, mistakes, or abbreviations. Even with the neural word embeddings, specifications often include subjective wording, such as "Brand new Thinkpad laptop; updated version…", making it difficult for our CRF model to identify the attribute-value combinations.
- Photos of items in boxes. On some marketplaces, vendors choose to upload photographs of packaged products to show the condition of the items. In this situation, the image embeddings are more akin to those of a boxed product. (Ahrq.gov, 2019)
- Data Errors – We find comparable publication problems in our dataset, so we classify these errors as data errors. The following errors were detected in the dataset:
- Values that are incomplete or oversimplified.
- The names and descriptions of the products are misaligned. This error is exemplified by so-called "click-bait", where the product name refers, for example, to the "Idealpad 154K", while the specification refers to a cover attachment for the same model(s).
- Several product descriptions placed in a single description area. Due to recommended or advertised products, websites frequently feature many descriptions on a single web page. These websites do not adhere to the convention of creating a new schema.org/Product instance for each product description, resulting in many product entities being included in a single schema.org/Product instance. (PCMAG, 2022)
The error analysis outlined above reveals the key shortcomings of this approach. Our matching method, in particular, relies heavily on accurate extraction of product attributes from descriptions. One could argue that certain product domains lack tangible product attributes (such as memory or display size) that can be extracted. (Umn.edu, 2013)
For instance, in domains such as fashion, visual qualities take precedence over textual ones. In such domains, we would not be able to use our matching approach to its maximum potential.
The immense amount of training data necessary to build the word embeddings is another limitation of this method. Studies show that increasing the quantity of data significantly improves the learned embeddings; consequently, training the CRFemb on smaller datasets will not improve its performance. (DATA and ERROR ANALYSIS, n.d.)
Conclusion
To sum up, further enhancements could be made to both the algorithm and the platform. In terms of algorithms, there are two ways to improve the convolutional neural network image matching approach.
One solution would be to address the issue of having few photographs per product class by using data augmentation techniques, which add slightly modified copies of existing data or newly made synthetic images to increase the amount of data. Another strategy that could be pursued in the future is to change the CNN approach from a pure classification problem into a matching problem, which could subsequently be adapted back into a classification problem. This means that instead of receiving an image and attempting to predict which product corresponds to it, the algorithm would take two photographs and use a score to determine whether they are a match, similar to how traditional approaches work.
A categorization based on the matching scores would then be assigned after iterating through all product photos. This method circumvents dataset constraints, because it does not require training on the restricted retail picture dataset. String matching must be paired with its image equivalent to optimize the product matching operation. As previously said, this thesis concentrated on the product image component; yet photographs alone are not the best identifier of a product, as there are circumstances where finding a match is challenging even for the human eye. To distinguish one product from another, features such as name, brand, and packaging size must be examined. As a future upgrade, a similar investigation using string matching approaches should be conducted, with the most accurate one being merged with the developed image methods.
References
ACM Transactions on Graphics (TOG). (2015). [online] Available at: https://dl.acm.org/doi/10.1145/2766959 [Accessed 21 Apr. 2022].
van Bezu, R., Borst, S., Rijkse, R., Verhagen, J., Vandic, D. and Frasincar, F. (2015). Multi-component similarity method for web product duplicate detection. Proceedings of the 30th Annual ACM Symposium on Applied Computing.
ACM Other conferences. (2022). Consideration set generation in commerce search | Proceedings of the 20th international conference on World wide web. [online] Available at: https://dl.acm.org/doi/10.1145/1963405.1963452 [Accessed 21 Apr. 2022].
ACM SIGKDD Explorations Newsletter. (2022). Text mining for product attribute extraction | ACM SIGKDD Explorations Newsletter. [online] Available at: https://dl.acm.org/doi/10.1145/1147234.1147241 [Accessed 22 Apr. 2022].
Ghani, R., Probst, K., Liu, Y. and Fano, A. (2006). Text mining for product attribute extraction. [online] ResearchGate. Available at: https://www.researchgate.net/publication/262407606_Text_mining_for_product_attribute_extraction [Accessed 22 Apr. 2022].
More, A. (2017). Product Matching in eCommerce using deep learning – Walmart Global Tech Blog – Medium. [online] Medium. Available at: https://medium.com/walmartglobaltech/product-matching-in-ecommerce-4f19b6aebaca [Accessed 23 Apr. 2022].
Gopalakrishnan, V., Iyengar, S., Amit Madaan, Rastogi, R. and Sengamedu, S.H. (2012). Matching product titles using web-based enrichment. [online] undefined. Available at: https://www.semanticscholar.org/paper/Matching-product-titles-using-web-based-enrichment-Gopalakrishnan-Iyengar/46783c7884b758a3cd668e7f2a5309a3bae17b77 [Accessed 22 Apr. 2022].
Matching product titles using web-based enrichment – AMiner. (2012). Aminer.org. [online] Available at: https://www.aminer.org/pub/53e9bb0fb7602d97047496e5/matching-product-titles-using-web-based-enrichment [Accessed 22 Apr. 2022].
Gupta, V., Karnick, H., Bansal, A., and Jhala, P. (2016). Product Classification in E-Commerce using Distributional Semantics. arXiv.org. [online] Available at: https://arxiv.org/abs/1606.06083 [Accessed 22 Apr. 2022].
Morris, B. (2003). The components of the Wired Spanning Forest are recurrent. Probability Theory and Related Fields, [online] 125(2), pp.259–265. Available at: https://link.springer.com/article/10.1007/s00440-002-0236-0 [Accessed 22 Apr. 2022].
Petar Ristoski and Mika, P. (2016). Enriching Product Ads with Metadata from HTML Annotations. [online] undefined. Available at: https://www.semanticscholar.org/paper/Enriching-Product-Ads-with-Metadata-from-HTML-Ristoski-Mika/bac211c7a96c000078a0fd959353e5fb940c169e [Accessed 23 Apr. 2022].
Hood, R. (2019). Unravelling product matching in retail with AI – Towards Data Science. [online] Medium. Available at: https://towardsdatascience.com/unravelling-product-matching-with-ai-1a6ef7bd8614 [Accessed 23 Apr. 2022].
Icecreamlabs.com. (2019). Tackling Product Matching for E-commerce with Automation | IceCream Labs. [online] Available at: https://icecreamlabs.com/2018/12/09/leveraging-ai-and-machine-learning-for-product-matching/ [Accessed 23 Apr. 2022].
Product Similarity Matching for Food Retail using Machine Learning HANNA KEREK KTH ROYAL INSTITUTE OF TECHNOLOGY SCHOOL OF ENGINEERING SCIENCES. (n.d.). [online] Available at: https://www.diva-portal.org/smash/get/diva2:1431623/FULLTEXT02.
Cenk Çorapcı (2019). Product Matching with Deep Learning – Cimri Engineering. [online] Medium. Available at: https://engineering.cimri.com/product-matching-with-deep-learning-49d868c54fdb [Accessed 23 Apr. 2022].
Foxcroft, J., Chen, T., Padmanabhan, K. and Antonie, L. (2021). Product Matching Lessons and Recommendations from a Real World Application. [online] ResearchGate. Available at: https://www.researchgate.net/publication/352726054_Product_Matching_Lessons_and_Recommendations_from_a_Real_World_Application [Accessed 23 Apr. 2022].
Bilbro, L. (2021). Machine Learning-based Item Matching for Retailers and Brands. [online] Databricks. Available at: https://databricks.com/blog/2021/05/24/machine-learning-based-item-matching-for-retailers-and-brands.html [Accessed 23 Apr. 2022].
Machine Learning Mastery (2019). How to Load and Visualize Standard Computer Vision Datasets With Keras. [online] Machine Learning Mastery. Available at: https://machinelearningmastery.com/how-to-load-and-visualize-standard-computer-vision-datasets-with-keras/ [Accessed 23 Apr. 2022].
neptune.ai. (2020). Top 8 Image-Processing Python Libraries Used in Machine Learning – neptune.ai. [online] Available at: https://neptune.ai/blog/image-processing-python-libraries-for-machine-learning [Accessed 23 Apr. 2022].
Moffitt, C. (2020). Python Tools for Record Linking and Fuzzy Matching – Practical Business Python. [online] Pbpython.com. Available at: https://pbpython.com/record-linking.html [Accessed 23 Apr. 2022].
Tran, K. (2020). How to Find a Best Match with Python – Towards Data Science. [online] Medium. Available at: https://towardsdatascience.com/how-to-match-two-people-with-python-7583b51ff3f9 [Accessed 23 Apr. 2022].
Python (2020). Matching Algorithms in Python. [online] Data Science Stack Exchange. Available at: https://datascience.stackexchange.com/questions/67505/matching-algorithms-in-python [Accessed 23 Apr. 2022].
Placekey.io. (2022). Using Python for Address Matching: How To + the 6 Best Methods. [online] Available at: https://www.placekey.io/blog/address-matching-python [Accessed 23 Apr. 2022].
GeeksforGeeks. (2011). Python Program for KMP Algorithm for Pattern Searching – GeeksforGeeks. [online] Available at: https://www.geeksforgeeks.org/python-program-for-kmp-algorithm-for-pattern-searching-2/ [Accessed 23 Apr. 2022].
TensorFlow. (2022). Machine learning education | TensorFlow. [online] Available at: https://www.tensorflow.org/resources/learn-ml?gclid=CjwKCAjwx46TBhBhEiwArA_DjFV-ec3zjMajzobNfugb7RMc38r7muzLnJyiTyya9JLP_66nvIdGIBoCH2YQAvD_BwE [Accessed 24 Apr. 2022].
Marr, B. (2021). Are Machine Learning And AI The Same? [online] Bernard Marr. Available at: https://bernardmarr.com/are-machine-learning-and-ai-the-same/ [Accessed 24 Apr. 2022].
Connamara. (2022). Connamara – Matching Engine Technology, EP3TM, Exchange Platform. [online] Available at: https://www.connamara.com/?gclid=CjwKCAjwx46TBhBhEiwArA_DjG99Zur55uNPxJ_Dr-955JehUa0l1FvlSrfj_9mKThSO13Xh849SiBoCgqcQAvD_BwE [Accessed 24 Apr. 2022].
Aguilar, L. (2021). Product Matching on DSS Virtual Salon. [online] Tryolabs. Available at: https://tryolabs.com/blog/2021/08/04/product-matching-on-dss-virtual-salon [Accessed 24 Apr. 2022].
reddit. (2015). r/MachineLearning – How to do product matching? [online] Available at: https://www.reddit.com/r/MachineLearning/comments/3b48zp/how_to_do_product_matching/ [Accessed 24 Apr. 2022].
Udacity. (2021). Python Match-Case Statement: Example & Alternatives | Udacity. [online] Available at: https://www.udacity.com/blog/2021/10/python-match-case-statement-example-alternatives.html [Accessed 24 Apr. 2022].
Google Cloud. (2022). Vertex AI Matching Engine overview | Google Cloud. [online] Available at: https://cloud.google.com/vertex-ai/docs/matching-engine/overview [Accessed 24 Apr. 2022].
Quora. (2021). How do top comparison sites do product data match? Which is the most effective way to do that, considering product data matching is an in… [online] Available at: https://www.quora.com/How-do-top-comparison-sites-do-product-data-match-Which-is-the-most-effective-way-to-do-that-considering-product-data-matching-is-an-inexact-science [Accessed 24 Apr. 2022].
Toth, A. (2017). Experiencing with product data matching. [online] Attilatoth.dev. Available at: https://www.attilatoth.dev/posts/data-matching/ [Accessed 24 Apr. 2022].
Haponik, A. (2021). The best Machine Learning Use Cases in E-commerce (update: June 2021). [online] Addepto. Available at: https://addepto.com/best-machine-learning-use-cases-ecommerce/ [Accessed 24 Apr. 2022].
Kusniyati, H. and Nugraha, A.A. (2020). Analysis of Matric Product Matching Between Cosine Similarity with Term Frequency-Inverse Document Frequency (TF-IDF) and Word2Vec in PT. Pricebook Digital Indonesia. International Journal of Scientific Research in Computer Science, Engineering and Information Technology, [online] pp.105–112. Available at: https://www.academia.edu/44773326/Analysis_of_Matric_Product_Matching_Between_Cosine_Similarity_with_Term_Frequency_Inverse_Document_Frequency_TF_IDF_and_Word2Vec_in_PT_Pricebook_Digital_Indonesia [Accessed 24 Apr. 2022].
Indico Data. (2016). Deep Learning in Fashion (Part 3): Clothing Matching Tutorial – Indico Data. [online] Available at: https://indicodata.ai/blog/fashion-matching-tutorial/ [Accessed 24 Apr. 2022].
Analytics Mayhem. (2021). Propensity Score Matching in Python | Analytics Mayhem. [online] Available at: https://analyticsmayhem.com/digital-analytics/propensity-score-matching-python/ [Accessed 24 Apr. 2022].
Mieczysław Pawłowski (2021). Machine Learning Based Product Classification for eCommerce. [online] ResearchGate. Available at: https://www.researchgate.net/publication/351390294_Machine_Learning_Based_Product_Classification_for_eCommerce [Accessed 24 Apr. 2022].
reddit. (2016). r/MachineLearning – Utilizing ML for our product matching algorithm? [online] Available at: https://www.reddit.com/r/MachineLearning/comments/4qnhwl/utilizing_ml_for_our_product_matching_algorithm/ [Accessed 24 Apr. 2022].