An association rules based method for classifying product offers from e-shopping

Abstract

Price comparison services are widely used by e-shopping customers. Such e-shopping sites receive product offers from thousands of online stores, and in order to provide price comparison, product categorization, and searching, it is necessary to match different offers referring to the same real-world product. This is a hard task, since these sites need to classify millions of product offers into thousands of classes, and distinct descriptions may exist for the same product, as well as very similar descriptions of distinct products. In this work, we propose a method that uses association rules to classify product offers from e-shopping web sites, matching offers against offers without the need for a product catalog. This is a supervised learning method that trains a classifier whose generated model comprises a set of association rules to identify product offer classes. Experimental evaluations show that our method is effective and efficient, and obtains better results than three baselines on several datasets with distinct characteristics. It is able to deal with large datasets containing thousands of classes and different types of products, such as electronics and books. Moreover, we propose and evaluate strategies to reduce its execution time, and we evaluate its weaknesses.

Keywords

Association rule, entity resolution, product classification, product matching, product offer, e-commerce

1. Introduction

The comparison of product prices is a very important service provided by e-shopping web sites such as Google Shopping (http://www.google.com/shopping), Shopping.com, and Shopping UOL (http://shopping.uol.com.br).

Such e-shopping sites receive product offers from thousands of online stores, through data transfer or crawling. Frequently, product offers are not represented by structured data, but rather by textual descriptions that mix the name and technical characteristics of a product. The main challenge is to match up different offers that refer to the same real-world product. Spelling variants, acronyms, abbreviated forms, and misspellings compound the problem.

As an illustration, Fig. 1 shows some of the results of searching for the product HP Photosmart C4780 in Google Shopping. It is noticeable that Google performs product matching, since the first result refers to offers from 4 stores. However, it was not able to identify that the second result also refers to the same product.

Figure 1.

Product offers related to search HP Photosmart C4780 in Google Shopping.

In addition to identifying when different descriptions refer to the same entity, as in the first two descriptions of Fig. 1, solutions for the problem of matching product offers also need to identify that very similar descriptions can refer to distinct products. For example, “HP OfficeJet Pro 8600 911a Wireless” and “HP OfficeJet Pro 8600 911n Wireless” refer to two distinct printers with different prices.

Given a set of entity references, such as product offer descriptions, the process of identifying which of them correspond to the same real-world entity is known as entity resolution [20, 10, 5, 17, 12]. The matching of products is a specific case of entity resolution, where the entities are products. Traditional entity resolution approaches deal with structured data, applying similarity functions to each attribute of the records and combining the results to find duplicate records. The product matching problem is harder because product offers come from thousands of merchants, which use different descriptions of the products. Such descriptions may be as short as “HP C4780” or contain several technical characteristics of the product in a free-text field.

In this work, we address the problem of aggregating product offers from e-shopping web sites by matching product offers against other offers. This problem is more challenging than matching product offers against a catalog of structured products [16] because the data are unclean and unstructured. Matching of product offers is highly relevant in many scenarios where a catalog is not available. For example, applications that monitor product prices across different web sites typically do not have a catalog. Furthermore, aggregated offers could be used as a starting point for the construction of product catalogs.

Few studies have addressed the problem of matching product offers against product offers. In [19], the authors preprocess product offers to extract product codes and use them for matching. However, the effectiveness of their approach depends on the product category, since not all categories have product codes. Furthermore, their method needs to submit queries to a web search engine, which may make the method non-scalable. The approaches in [13, 21] also need to submit queries to a web search engine, and they did not provide experiments with a large number of distinct products. An evaluation of several traditional entity resolution approaches in [18] reports poor precision and recall in the product offer matching task. The authors also concluded that learning-based approaches, which obtain better results than non-learning approaches, do not scale to large input sets. There is thus room for further study in this area.

This leads to two questions. Can we develop an effective and scalable method for matching product offers, given that we need to deal with a large number (tens of thousands) of distinct products? Can we develop a generic method that does not depend on characteristics of specific product categories?

Our hypothesis is that we can treat the problem of matching product offers as a classification problem, and train a classifier to identify sets of tokens (words) in the product offer descriptions that discriminate different products, i.e., sets of tokens that occur in the descriptions of only one class of product offer. For example, in Fig. 1, the set of tokens {C4780, Printer} occurs only in the two descriptions of the printer, and can distinguish it from its cartridge, which can be identified by the set of tokens {C4780, Cartridge}.

We propose a method to classify product offers from e-shopping web sites that uses association rules [3] to find sets of tokens that distinguish each product offer class. This is a supervised learning method that trains a classifier whose generated model comprises a set of association rules, used to identify product offer classes. The association rules are of the form $\mathcal{X}\rightarrow c_{i}$ , where $\mathcal{X}$ is a set of tokens, and $c_{i}$ is a product offer class (e.g., {C4780, Printer} $\rightarrow c_{1}$ ). In the test phase, this method generates sets of tokens from the product offer description string to be tested, and tries to match them with the antecedents of the rules in the model generated in the training phase.

Our method uses only the textual description of product offers, which is usually present for all product types. Therefore, it does not depend on the specific attributes of some categories of products. We have evaluated it on datasets containing several categories of products, such as electronics, perfumeries, fashion accessories, and books. Some categories of product contain an implicit, built-in, product code in their offer descriptions (e.g., C4780 in the example of Fig. 1). Our method is able to identify implicit product codes in the product offer descriptions, when they exist, and it is also generic enough to classify product offers that do not have codes.

We evaluated the effectiveness and the efficiency of our method experimentally on several datasets with distinct characteristics. The results indicate that our method outperforms three baselines in most situations. It is able to classify a large dataset containing thousands of instances and classes in a reasonable execution time.

In [24], a preliminary proposal of our method was presented, applied to disambiguate publication venue titles in bibliographic citations. In this work, we changed some strategies of its basic algorithm, introducing the concept of support of a rule in the training phase and considering a new vote schema to choose the best rule in the test phase. We also proposed and evaluated new alternative strategies to the basic algorithm, and we evaluated several facets of the method, in terms of effectiveness and efficiency, in the application of classifying product offers. The main motivation to apply that method to classify product offers came from the fact that product offers usually contain implicit codes that uniquely identify each product, similar to bibliographic citations, which usually contain acronyms. However, unlike the bibliographic case, for which existing methods obtain good results, product matching is a much harder problem to solve [18].

Our contributions can be summarized as follows:

• We propose an association rules based method for classifying product offers from e-shopping web sites, matching offers against offers without the need for a product catalog. Unlike other associative classifiers, our method uses only the textual description of the entities, and we propose a new way of using support and confidence to find good rules;

• Our method is able to classify at product level, which enables customers to compare prices (currently and historically) and obtain other information about a product, such as customer reviews;

• It is able to classify large datasets containing thousands of classes, which is the main challenge in managing e-shopping products. It works well for different types of products, such as electronics and books;

• We discuss the computational complexity of our method, and propose and evaluate strategies to reduce its execution time;

• We experimentally evaluate our method, demonstrating its effectiveness and efficiency.

The remainder of this paper is organized as follows. In Section 2, we discuss related work on the classification of products and on association rules. In Section 3, we present our method, which uses association rules to classify product offers, and in Section 4, we discuss its computational complexity. In Section 5, we describe our experiments, evaluation metrics, and results. Finally, in Section 6, we present our conclusions and directions for future work.

2. Related work

E-shopping web sites usually organize their products hierarchically in categories, such as that in Fig. 2(a). They also keep a catalog of products, which contains detailed specifications for each product, organized in structures of attribute-value pairs, such as in the example of Fig. 2(b). Works in the literature have studied different tasks in classifying product offers into hierarchies and catalogs. Product offers can be matched to a catalog [16], or they can be classified at some level of the hierarchy, into classes such as “Clothing”, “Components” or “Printers” [32, 23, 8], or at product level (e.g., “HP PhotoSmart C4780”) [19, 13, 21]. In order to compare product prices, product offers need to be classified at product level, which is the focus of our work. This type of classification is much more challenging due to the large number of classes. It is also more challenging than matching product offers against a catalog due to the absence of structured and cleaned product data.

Figure 2.

(a) Example of a hierarchy of products and (b) examples of attribute-value pairs of two products in a catalog of products.

Regarding product offers and catalogs, [16] described their system, used by Bing Shopping, for matching unstructured offers to structured product descriptions in a catalog. They adopted a probabilistic approach to find the product in the catalog that has the largest probability of matching the given offer. Their matching function takes into account matches and mismatches in attribute values between offer-product pairs, treats missing attribute values, and weights the importance of different attributes. In another work, [22] introduced the problem of product synthesis, which aims at identifying new products from a set of product offers and adding them to a catalog, together with their structured attributes. Their solution addressed issues involved in data extraction from offers, schema reconciliation, and data fusion.

In order to classify product offers in categories above the product level, [8] presented a study about classification methods for large-scale categorization. They also proposed a probabilistic approach to model the classification problem, using a belief network. Their work differs from ours because we classify product offers in categories at product level. In our experiments, we used a dataset extracted from the same data source as their datasets; however, our dataset has many more classes than theirs.

Other works also presented approaches to classify products in classes above the product level. In [23], the authors evaluated Naive Bayes classifiers for classifying product offers in Yahoo! Shopping. They studied the effects of data transformation on text classification with a Naive Bayes classifier. Several heuristic feature transformations were tested, such as IDF and normalization by the length of the text. In [32], the author presented a system for product categorization that uses a variation of the vector space model [26], modified to represent product attributes. The author considered textual and numeric attributes, such as product description, manufacturer name, and price. In our work, we used only the textual description of products. In a related study, [1] addressed the problem of matching product categories from multiple web sites. They proposed an improved approach for the word sense disambiguation process.

The focus of our work is to classify product offers in categories at product level, matching offers against offers. The studies in [19, 13, 21, 18] addressed this problem. In [19], the authors used an entity resolution approach that preprocesses product offers to extract product codes and use them for matching. A product code is a manufacturer-specific identifier that typically appears in the product name and description. To extract product codes, they manually created a list of regular expressions and used the web as an external knowledge source to verify the candidate codes. They employed a learning-based approach, combining several matchers on several attributes to derive a match decision for every pair of entities. The effectiveness of their approach depends on the product category, since not all categories have product codes. Furthermore, their approach depends on submitting queries to a web search engine, and therefore is not scalable. Our method does not provide a specific strategy to extract product codes as theirs does. However, if a code exists, it is naturally extracted and embedded in the rules, and may contribute to improving the effectiveness of our method.

The approaches in [13, 21] also match product offers against product offers. However, they are not scalable, since they need to submit queries to a web search engine to enrich the descriptions of products and other entities before matching them. The authors also did not evaluate these approaches using a large amount of data containing a large number of classes.

The product matching problem was evaluated in [18] using several entity resolution approaches. The authors concluded that it is not sufficiently solved by conventional approaches based on the similarity of attribute values. Furthermore, the learning-based approaches, which obtained the best results, are not scalable even for a small number of classes and instances. Entity resolution aims at identifying equivalent entities or duplicates within a data source or between data sources. Some surveys and tutorials on entity resolution can be found in [10, 17, 20, 12, 14]. A web-based entity resolution approach to treat product matching and bibliographic data was proposed by [25].

Our work uses association rules, a data mining technique that can find relationships among item sets in a dataset [3]. It is based on associative classification, which combines association rules and classification to build a classification model. Several associative classification algorithms have been proposed in recent years. A recent survey can be found in [30]. The algorithms vary according to their methodologies in rule learning, ranking, pruning, and prediction procedures. The main differences between our work and the others are in the generation of rules and in the way we use the support and confidence concepts. Support is measured only among the instances of the same class, rather than in a global context. Confidence is not used to measure the accuracy of a rule. Instead, we use only rules with 100% confidence. As a result, we generate a small number of rules when compared to works that use confidence to rank rules, and we propose an efficient data structure, based on an inverted index, to assist in generating rules. This way of using confidence may be applied to other contexts where implicit codes are important pieces of evidence for classification.

Association rules were also used in [28] to disambiguate author names in bibliographic citations. The authors proposed supervised learning methods where tokens in coauthor names, work title, and publication venue titles are used as features to train classifiers that exploit rules associating citation features to specific authors. Association rules were also used to classify documents in [29]. That approach combines textual features of documents and links to form an associative classifier. Both works explore specific features of their applications, author name disambiguation and document classification. They also use the confidence to measure the accuracy of a rule. Our method uses only tokens in a text and does not use other evidence such as links or prices, so it may be applied to disambiguate other types of entities described by text only.

Our work is an extension of the work in [24], which proposed a method that uses association rules to disambiguate publication venue titles originating from bibliographic citations. The disambiguator is a supervised learning method that uses a publication venue authority file [11] to train a classifier, whose generated model is a set of association rules to identify publication venues. In this work, we changed the strategies of that algorithm, and carried out a complete evaluation of its effectiveness and efficiency on the product matching problem. We added the support of a rule, which decreases the number of rare tokens that generate bad rules in the classification model. In the prediction phase, we do not need to generate all itemsets: the new vote schema prioritizes short rules and stops generating itemsets as soon as it reaches a decision. We also proposed an alternative strategy that decreases the number of tokens to be combined. All these changes were experimentally evaluated, demonstrating an improvement in the effectiveness and efficiency of the new algorithm.

3. Our method for classifying product offers

The product offer matching problem may be seen as follows. Given a set of product offer descriptions originating from web stores, the objective is to map them into real classes of products known to the e-shopping site, as illustrated in Fig. 3(a). Our solution to this problem uses a supervised machine learning technique, as illustrated in Fig. 3(b). It uses a set of manually classified product offers to train an associative classifier to predict the class of other unclassified product offers. In the training phase, the classified product offer descriptions are tokenized, cleaned, and indexed. Sets of tokens (itemsets) are generated from the tokens of each product offer description, and they are used to generate association rules that relate them to the correct product offer class. In the prediction (or test) phase, each unclassified product offer description is also tokenized, cleaned, and has its sets of tokens generated. The classification module uses the learning model created in the training phase to produce the product offer matching results. The following sections detail each one of these steps.

Figure 3.

(a) The product offer matching problem (b) Our solution to the problem.

3.1 Problem formulation

The task of classifying product offers may be formulated as follows. Let $P=\{p_{1},p_{2},\ldots,p_{k}\}$ be a set of product offers. Each product offer $p_{i}$ has a list of attributes, such as a product description, product price, and the name of the offering store. Let $C=\{c_{1},c_{2},\ldots,c_{l}\}$ be a set of $l$ classes, with their respective labels; in this case, a set of categories of product offers. The objective is to produce a classification function, which maps each product offer $p_{i}$ into one of the predefined classes of the set $C$ .

Our proposal for solving the classification problem uses a supervised machine learning technique. In this case, we are given an input dataset, called the training data and denoted as $\mathcal{D}$ , which consists of examples of product offer instances for which the correct product offer class is known. Each instance generates a set of $m$ features $\{f_{1},f_{2},\ldots,f_{m}\}$ . These features are tokens (words), extracted from string attributes, such as the product offer description or name. The training data is used to produce a learning model, using association rules, which relates the features in the training data to the correct product offer class. The test data, denoted as $\mathcal{T}$ , for the classification problem, consists of a set of product offers for which the features are known, while the correct product offer class is unknown. The learning model, which is a function that maps a set of features $\{f_{1},f_{2},\ldots,f_{m}\}$ to a class $c_{i}\in C$ , is used to predict the correct product offer class in the test set.

The learning function uses an associative classifier, which exploits associations among tokens in the product offers that uniquely identify each class. Such associations are uncovered using rules of the form $\mathcal{X}\rightarrow c_{i}$ , where $\mathcal{X}\subseteq\{f_{1},f_{2},\ldots,f_{m}\}$ and $c_{i}\in C$ . For example, the product offer description Officejet J3680 All-in-One Printer, Fax, Scanner, Copier, HEWCB071A, which belongs to the $c_{1}$ class, whose product is Officejet J3680, may produce the two association rules {J3680} $\rightarrow c_{1}$ and {HEWCB071A} $\rightarrow c_{1}$ , which indicate that the set of tokens {J3680}, and the set {HEWCB071A}, both uniquely identify the product offers of the $c_{1}$ class (Officejet J3680).

In order to produce association rules that uniquely identify each class, the associative classifier only learns rules that have a confidence of 100%. According to [3], a rule $\mathcal{X}\rightarrow\mathcal{Y}$ holds in the dataset $\mathcal{D}$ , with confidence $c$ , if $c$ % of instances in $\mathcal{D}$ that contain $\mathcal{X}$ also contain $\mathcal{Y}$ . Consequently, the generated model does not contain rules $\mathcal{X}\rightarrow c_{i}$ and $\mathcal{X}\rightarrow c_{j}$ , such that $i\neq j$ . This strategy is not perfect, since it may not produce rules for all classes. Such a situation occurs when every set of tokens in the product offers of a class is also contained in some product offer of a distinct class. In this case, no rule is generated for that class. To handle this situation, that is, to predict the class of an instance for which no rule is found in the learning model in the test phase, our method uses another strategy, to be explained later.
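The 100% confidence condition can be illustrated with a minimal Python sketch. The toy training set, the `confidence` helper, and the representation of each instance as a (class, token set) pair are assumptions made for illustration only; they are not taken from the authors' implementation.

```python
from collections import Counter
from itertools import combinations

# Toy training data: (class label, set of tokens). Illustrative only.
training = [
    ("c1", {"hp", "officejet", "j3680", "printer"}),
    ("c1", {"officejet", "j3680", "hewcb071a"}),
    ("c2", {"hp", "officejet", "pro", "8000", "printer"}),
    ("c3", {"hp", "officejet", "pro", "8000", "wireless", "printer"}),
]

def confidence(itemset, label, data):
    """Fraction of the instances containing `itemset` that belong to `label`."""
    matching = [cls for cls, tokens in data if itemset <= tokens]
    if not matching:
        return 0.0
    return Counter(matching)[label] / len(matching)

print(confidence({"j3680"}, "c1", training))        # 1.0
print(confidence({"pro", "8000"}, "c2", training))  # 0.5
print(confidence({"wireless"}, "c3", training))     # 1.0

# Every token set of the c2 instance also occurs in the c3 instance,
# so no itemset reaches 100% confidence for c2 and c2 gets no rule.
c2_tokens = sorted(training[2][1])
c2_has_rule = any(
    confidence(set(combo), "c2", training) == 1.0
    for k in range(1, len(c2_tokens) + 1)
    for combo in combinations(c2_tokens, k)
)
print(c2_has_rule)  # False
```

The last check reproduces, on toy data, the situation described above in which a class receives no rule because its tokens are contained in the description of another class.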

The associative classifier also checks the support of a rule. The rule $\mathcal{X}\rightarrow c_{i}$ has support $s$ in $\mathcal{D}$ , if $s$ % of instances of the class $c_{i}$ in $\mathcal{D}$ contain $\mathcal{X}$ . Unlike the original concept presented in [3], the support of a rule is measured only among the instances of the same class, rather than in a global context. The aim of the support of a rule is to decrease the number of tokens that are rare among the instances of a class and that generate bad rules in the classification model. The traditional concept of support [3] is not adequate for our method because it could lose important tokens that occur only once or a few times in the dataset. Table 1 presents our main mathematical notations.

Table 1
Mathematical notations

Symbol	Definition
$P$	Set of product offers $P=\{p_{1},p_{2},\ldots,p_{k}\}$
$C$	Set of classes $C=\{c_{1},c_{2},\ldots,c_{l}\}$
$\mathcal{D}$	Training data: set of product offers for which the correct class $c_{i}$ is known
$\mathcal{T}$	Test data: set of product offers for which the correct class $c_{i}$ is unknown
$\{f_{1},f_{2},\ldots,f_{m}\}$	Set of features (tokens) extracted from product offer descriptions
$\mathcal{X}\rightarrow c_{i}$	An association rule, where $\mathcal{X}\subseteq\{f_{1},f_{2},\ldots,f_{m}\}$ and $c_{i}\in C$
$d_{j,i}$	String describing the $j^{\rm th}$ product offer instance for the class $c_{i}$
$\mathcal{R}_{c_{i}}^{d_{j,i}}$	Set of rules, $\mathcal{X}\rightarrow c_{i}$ , predicting the class $c_{i}$ originated from the product offer description $d_{j,i}$ ( $\mathcal{X}\subseteq d_{j,i}$ )
$\mathcal{R}_{c_{i}}$	Set of rules, $\mathcal{X}\rightarrow c_{i}$ , predicting the class $c_{i}$
$\mathcal{R}$	Set of all rules generated by the learning model, $\mathcal{R}_{c_{i}}^{d_{j,i}}\subseteq\mathcal{R}_{c_{i}}\subseteq\mathcal{R}$
$k$ -itemset	A set containing $k$ items (features, tokens)
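The difference between the class-local support defined above and a global support in the sense of [3] can be sketched as follows. The function names, the toy data, and the use of fractions instead of percentages are illustrative assumptions.

```python
def class_support(itemset, label, data):
    """Support measured only among the instances of class `label`."""
    in_class = [tokens for cls, tokens in data if cls == label]
    if not in_class:
        return 0.0
    return sum(itemset <= tokens for tokens in in_class) / len(in_class)

def global_support(itemset, label, data):
    """Classic support: fraction of ALL instances that contain `itemset`
    and belong to class `label`."""
    hits = sum(cls == label and itemset <= tokens for cls, tokens in data)
    return hits / len(data)

# One offer for class c1 and nine offers for nine other classes (toy data).
data = [("c1", {"hp", "officejet", "j3680"})]
data += [(f"c{i}", {"hp", "printer", f"model{i}"}) for i in range(2, 11)]

print(class_support({"j3680"}, "c1", data))   # 1.0
print(global_support({"j3680"}, "c1", data))  # 0.1
```

In this toy case the code-like token occurs once in the whole dataset, so a global minimum support above 10% would discard it, while the class-local measure keeps it because it occurs in every instance of its class.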

3.2 Training phase

The training data $\mathcal{D}$ can be provided by a data source, such as an e-shopping database, containing product offer descriptions. In the input data, there are one or more instances for each distinct class. Each input instance in $\mathcal{D}$ is a string describing a product offer for a class $c_{i}$ , represented by $d_{j,i}$ , a description $j$ for the class $c_{i}$ . Let $\mathcal{R}_{c_{i}}^{d_{j,i}}$ be the set of rules $\mathcal{X}\rightarrow c_{i}$ , where $\mathcal{X}\subseteq d_{j,i}$ ( $d_{j,i}$ contains all features in $\mathcal{X}$ ). That is, $\mathcal{R}_{c_{i}}^{d_{j,i}}$ is composed of rules predicting the class $c_{i}$ originated from the product offer description $d_{j,i}$ . Let $\mathcal{R}_{c_{i}}$ be the set of rules predicting the class $c_{i}$ , and let $\mathcal{R}$ be the set of all rules generated by the learning model. Then, $\mathcal{R}_{c_{i}}^{d_{j,i}}\subseteq\mathcal{R}_{c_{i}}\subseteq\mathcal{R}$ .

Algorithm 1 shows the steps of the training phase. It receives as input a set of product offers, for which the classes are known, and a minimum support for generating rules, and returns a set of association rules that have a confidence of 100% and the minimum support to predict product offer classes. The algorithm’s first step (Lines 1–6) inserts distinct tokens from the instances into an inverted index structure [4]. This structure is composed of key-value pairs, where the key is a token and the value is an occurrence list of this token, containing, in each position, the class $c_{i}$ and its specific description identification $j$ . The construction of this structure is performed by the InsertInvertedIndex function. The Tokenize function splits a description into tokens. Before being tokenized, each string is preprocessed by removing punctuation marks, symbols such as $()[]\{\}$ , and stopwords (articles, prepositions, and conjunctions), and by converting letters to lowercase. Strings with hyphens (-) and slashes (/) are processed differently. The processing uses (-) or (/) in two ways, as a separator and as an aggregator, yielding three or more tokens. For example, to process the string “DSC-W730”, the following tokens will be generated: “dsc”, “w730”, and “dscw730”.
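The tokenization and indexing step described above can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the stopword list, the exact set of removed symbols, and the names `tokenize` and `build_inverted_index` are hypothetical, and the authors' Tokenize and InsertInvertedIndex functions may differ in detail.

```python
import re
from collections import defaultdict

# Assumed stopword list (articles, prepositions, conjunctions); illustrative.
STOPWORDS = {"a", "an", "the", "of", "in", "on", "for", "and", "or", "with"}

def tokenize(description):
    """Lowercase, strip punctuation, and expand hyphen/slash compounds."""
    tokens = []
    # Remove punctuation and brackets, but keep '-' and '/' for the next step.
    cleaned = re.sub(r'[,;:!?()\[\]{}"]', " ", description.lower())
    for word in cleaned.split():
        parts = [p for p in re.split(r"[-/]", word) if p]
        candidates = list(parts)              # '-' and '/' as separators
        if len(parts) > 1:
            candidates.append("".join(parts))  # '-' and '/' as aggregators
        for tok in candidates:
            if tok not in STOPWORDS and tok not in tokens:
                tokens.append(tok)
    return tokens

def build_inverted_index(instances):
    """instances: iterable of (class_id, description_id, description).
    Returns token -> set of (class_id, description_id) occurrences."""
    index = defaultdict(set)
    for class_id, desc_id, description in instances:
        for tok in tokenize(description):
            index[tok].add((class_id, desc_id))
    return index

print(tokenize("DSC-W730"))  # ['dsc', 'w730', 'dscw730']

index = build_inverted_index([
    ("c1", 1, "Sony DSC-W730 camera"),
    ("c2", 1, "Sony DSC-W830 camera"),
])
print(sorted(index["dsc"]))   # [('c1', 1), ('c2', 1)]
print(sorted(index["w730"]))  # [('c1', 1)]
```

Storing the class together with the description identifier in each occurrence makes it possible to count both the distinct classes of a token and the number of descriptions of a class that contain it.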

Algorithm 1
Training Phase
Input: Examples for training $\mathcal{D}$
Input: Minimum support $ms$
Output: The set of rules $\mathcal{R}$
1: for each instance $d_{j,i}\in\mathcal{D}$ do
2:   $S_{0}\leftarrow$ Tokenize( $d_{j,i}$ )
3:   for each token $tk\in S_{0}$ do
4:     InsertInvertedIndex( $tk$ , $j$ , $c_{i}$ )
5:   end for
6: end for
7: $\mathcal{R}\leftarrow\emptyset$
8: for each instance $d_{j,i}\in\mathcal{D}$ do
9:   $S_{0}\leftarrow$ Tokenize( $d_{j,i}$ )
10:   $m\leftarrow$ Length( $S_{0}$ )
11:   for $(k\leftarrow 1;k\leqslant m;k++)$ do
12:     $S_{k}\leftarrow$ GenItemSets( $k$ , $S_{k-1}$ )
13:     for each $k$ -itemset $it\subseteq S_{k}$ do
14:       if SizeOccurrenceList( $it$ ) $=1$ then
15:         //100% confidence rule
16:         if SupportRule( $it\rightarrow c_{i}$ ) $\geqslant$ $ms$ then
17:           InsertRule( $it\rightarrow c_{i}$ , $\mathcal{R}$ )
18:         end if
19:         RemoveItemSet( $it$ , $S_{k}$ )
20:       end if
21:     end for
22:   end for
23: end for
24: return $\mathcal{R}$

The second step of the algorithm (Lines 7–23) is an iterative process to create association rules. The GenItemSets function generates sets of items with $k$ tokens ( $k$ -itemsets), combining the itemsets with $k-1$ tokens obtained from the previous iteration. An algorithm to combine item sets is presented in [3]. Each $k$ -itemset is searched in the inverted index, and if it occurs in only one class and has the minimum support, a rule is created containing the $k$ -itemset as antecedent and the class id as consequent (Lines 12–17). The SizeOccurrenceList function returns the number of distinct classes in which a $k$ -itemset appears in some product offer description. This function searches the inverted index, using each token in a $k$ -itemset as a key and retrieving its occurrence list. When $k>1$ , it intersects the occurrence lists of the tokens to find the result. If the returned number is equal to 1, the rule formed by the $k$ -itemset and its class has a confidence of 100%. The algorithm also checks the minimum support of a rule by using the SupportRule function. If a rule also meets the minimum support, it is inserted in the set of rules by the InsertRule function.
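The rule generation loop described above can be sketched in Python as follows. This is a simplified in-memory sketch, not the authors' implementation: instances are assumed to be already tokenized, support is expressed as a fraction rather than a percentage, the candidate generation is one simple Apriori-style choice, and the name `train` and the toy data are hypothetical.

```python
from itertools import combinations

def train(instances, min_support=0.0):
    """instances: list of (class_id, tokens).
    Returns a dict mapping frozenset(antecedent tokens) -> class_id."""
    # Inverted index: token -> set of positions of instances containing it.
    index = {}
    for pos, (_, tokens) in enumerate(instances):
        for tok in tokens:
            index.setdefault(tok, set()).add(pos)
    class_sizes = {}
    for cls, _ in instances:
        class_sizes[cls] = class_sizes.get(cls, 0) + 1

    rules = {}
    for cls, tokens in instances:
        candidates = {frozenset([t]) for t in tokens}  # 1-itemsets
        k = 1
        while candidates:
            kept = set()
            for itemset in candidates:
                # Intersect the occurrence lists of the tokens in the itemset.
                occ = set.intersection(*(index[t] for t in itemset))
                classes = {instances[p][0] for p in occ}
                if len(classes) == 1:  # 100% confidence rule
                    if len(occ) / class_sizes[cls] >= min_support:
                        rules[itemset] = cls
                    # Pruned either way: not combined in the next iteration.
                else:
                    kept.add(itemset)
            k += 1
            # (k)-itemsets whose (k-1)-subsets all survived the pruning.
            candidates = {
                a | b
                for a, b in combinations(kept, 2)
                if len(a | b) == k
                and all(frozenset(s) in kept for s in combinations(a | b, k - 1))
            }
    return rules

# Toy, already tokenized training data (illustrative only).
instances = [
    ("c1", ["hp", "officejet", "j3680", "printer"]),
    ("c1", ["hewlett", "hp", "officejet", "j3680", "printer", "hewcb071a"]),
    ("c2", ["hp", "officejet", "pro", "8000", "printer"]),
    ("c3", ["hewlett", "hp", "officejet", "pro", "8000", "wireless", "printer"]),
]
rules = train(instances)
for antecedent, cls in sorted(rules.items(), key=lambda r: sorted(r[0])):
    print(sorted(antecedent), "->", cls)
```

On this toy data the sketch yields single-token rules for the code-like tokens, two-token rules for class c3, and no rule for class c2, whose tokens are all contained in the c3 description.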

If a $k$ -itemset forms a 100% confidence rule, then any $l$ -itemset, $l>k$ , that includes this $k$ -itemset also does. The algorithm therefore uses a pruning strategy that avoids combining this $k$ -itemset in the next iteration (RemoveItemSet function in Line 19).
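The reasoning behind this pruning step can be written out explicitly. The notation below is introduced here for illustration only and is not part of the paper: $\mathrm{occ}(\mathcal{X})$ denotes the set of training instances that contain every token of $\mathcal{X}$ , and $\mathcal{D}_{c_{i}}$ denotes the training instances of class $c_{i}$ .

```latex
\begin{align*}
\mathcal{X}\subseteq\mathcal{Z}
  &\;\Longrightarrow\; \mathrm{occ}(\mathcal{Z})\subseteq\mathrm{occ}(\mathcal{X}),\\
\mathrm{occ}(\mathcal{X})\subseteq\mathcal{D}_{c_{i}}
  &\;\Longrightarrow\; \mathrm{occ}(\mathcal{Z})\subseteq\mathcal{D}_{c_{i}}.
\end{align*}
```

Hence, whenever $\mathcal{X}\rightarrow c_{i}$ has 100% confidence, every superset $\mathcal{Z}$ that occurs in the training data also yields a 100% confidence rule for $c_{i}$ . Such a rule is redundant, because any description that contains $\mathcal{Z}$ also contains $\mathcal{X}$ .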

Example 1. Table 2 illustrates an example of training data, where the class $c_{1}$ , whose product is Officejet J3680, has 4 instances. Table 3 shows the rules generated from the training data of Table 2. The rules are organized according to the minimum support used to generate them. Observe that when the minimum support is increased, the number of rules is reduced, in an attempt to eliminate tokens that do not contribute to the identification of product offers. Notice that there is no rule for class $c_{2}$ (Officejet Pro 8000), since the tokens of Instance 5 form a subset of the tokens of Instance 6 of class $c_{3}$ (Officejet Pro 8000 Wireless). They are in two distinct classes, because one is a wireless printer and the other is not. Moreover, notice that our method does not provide special treatment for synonyms, but they may be uncovered naturally. For example, in the class $c_{1}$ , “J3680”, “HEWCB071A”, and “CB071A” are synonymous codes, and they generated rules for the same class.

Table 2
An example of training data

Class	#	Product offer description
$c_{1}$ Officejet J3680	1	HP Officejet J3680 all-in-one printer, fax, scanner, copier
	2	Officejet J3680 all-in-one printer, fax, scanner, copier, HEWCB071A
	3	HP Officejet J3680 All-in-One – multifunction (fax/copier/scanner)
	4	HEWCB071A HP Officejet J3680 all-in-one printer CB071A hewlett-packard
$c_{2}$ Officejet Pro 8000	5	HP Officejet Pro 8000 printer
$c_{3}$ Officejet Pro 8000 wireless	6	Hewlett Hp Officejet Pro 8000 wireless printer Cb9297a#b1h
$c_{4}$ Officejet Pro K5400	7	HP Officejet Pro K5400 color printer

Table 3
Rules generated from the training data of Table 2

Class	#Instance	Generated rules	
		Support (0%)	Support (50%)
$c_{1}$ Officejet J3680	1–4	$\textit{fax}\rightarrow c_{1}$; $\textit{allinone}\rightarrow c_{1}$; $\textit{packard}\rightarrow c_{1}$; $\textit{multifunction}\rightarrow c_{1}$; $\textit{j3680}\rightarrow c_{1}$; $\textit{hewlettpackard}\rightarrow c_{1}$; $\textit{copier}\rightarrow c_{1}$; $\textit{cb071a}\rightarrow c_{1}$; $\textit{hewcb071a}\rightarrow c_{1}$; $\textit{scanner}\rightarrow c_{1}$	$\textit{fax}\rightarrow c_{1}$; $\textit{allinone}\rightarrow c_{1}$; $\textit{j3680}\rightarrow c_{1}$; $\textit{copier}\rightarrow c_{1}$; $\textit{hewcb071a}\rightarrow c_{1}$; $\textit{scanner}\rightarrow c_{1}$
$c_{2}$ Officejet Pro 8000	5		
$c_{3}$ Officejet Pro 8000 wireless	6	$\textit{cb9297a\#b1h}\rightarrow c_{3}$; $\textit{wireless}\rightarrow c_{3}$; $\textit{hewlett, pro}\rightarrow c_{3}$; $\textit{8000, hewlett}\rightarrow c_{3}$	$\textit{cb9297a\#b1h}\rightarrow c_{3}$; $\textit{wireless}\rightarrow c_{3}$; $\textit{hewlett, pro}\rightarrow c_{3}$; $\textit{8000, hewlett}\rightarrow c_{3}$
$c_{4}$ Officejet Pro K5400	7	$\textit{k5400}\rightarrow c_{4}$; $\textit{color}\rightarrow c_{4}$	$\textit{k5400}\rightarrow c_{4}$; $\textit{color}\rightarrow c_{4}$
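The rule generation step described above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the function name `train_rules`, the hand-tokenized `TABLE2` lists (including the split of "hewlett-packard" into three tokens), the definition of support as the fraction of a class's training instances that contain the itemset, and the decision never to extend an itemset that is confined to a single class are all assumptions made for the example.

```python
from collections import defaultdict
from itertools import combinations


def train_rules(training, min_support=0.0, max_k=3):
    """Return {frozenset(itemset): class_id} for 100%-confidence rules.

    training: list of (class_id, tokens) pairs.
    Assumption: support = fraction of the class's instances containing the itemset.
    """
    class_size = defaultdict(int)
    index = defaultdict(set)  # inverted index: token -> ids of instances containing it
    for i, (cls, tokens) in enumerate(training):
        class_size[cls] += 1
        for tok in set(tokens):
            index[tok].add(i)

    rules = {}
    candidates = {frozenset([tok]) for tok in index}  # 1-itemsets
    k = 1
    while candidates and k <= max_k:
        survivors = set()
        for itemset in candidates:
            # Intersect the occurrence lists of all tokens in the itemset.
            occ = set.intersection(*(index[tok] for tok in itemset))
            classes = {training[i][0] for i in occ}
            if len(classes) == 1:  # occurs in one class only: 100% confidence
                cls = next(iter(classes))
                if len(occ) / class_size[cls] >= min_support:
                    rules[itemset] = cls
                # Assumption: not extended further, whether or not it met support.
            elif len(classes) > 1:
                survivors.add(itemset)
        # Apriori-style join: keep a (k+1)-itemset only if every k-subset
        # survived, so supersets of rule-forming itemsets are never built.
        candidates = set()
        for a, b in combinations(survivors, 2):
            union = a | b
            if len(union) == k + 1 and all(
                frozenset(sub) in survivors for sub in combinations(union, k)
            ):
                candidates.add(union)
        k += 1
    return rules


# Hand-tokenized version of Table 2 (the tokenization is an assumption).
TABLE2 = [
    ("c1", "hp officejet j3680 allinone printer fax scanner copier".split()),
    ("c1", "officejet j3680 allinone printer fax scanner copier hewcb071a".split()),
    ("c1", "hp officejet j3680 allinone multifunction fax copier scanner".split()),
    ("c1", "hewcb071a hp officejet j3680 allinone printer cb071a "
           "hewlett packard hewlettpackard".split()),
    ("c2", "hp officejet pro 8000 printer".split()),
    ("c3", "hewlett hp officejet pro 8000 wireless printer cb9297a#b1h".split()),
    ("c4", "hp officejet pro k5400 color printer".split()),
]

rules_0 = train_rules(TABLE2, min_support=0.0)
rules_50 = train_rules(TABLE2, min_support=0.5)
```

Under these assumptions, raising the minimum support from 0% to 50% drops the $c_{1}$ rules built from tokens that appear in only one of its four instances, and no rule is produced for $c_{2}$, whose tokens all occur in other classes.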

3.3 Test phase

The test data $\mathcal{T}$ is composed of a set of product offer descriptions. Algorithm 2 shows the details of the test phase, which predicts the class of a description $d\in\mathcal{T}$. First, $d$, of size $m$, is tokenized and its $k$-itemsets, $1\leqslant k\leqslant m$, are generated by an iterative process (Lines 1–5). Second, each itemset is matched against the antecedents of the rules $\mathcal{R}$ in the learning model. All rules whose antecedents match any itemset form the set of candidate rules, $\mathcal{R}_{d}$ (Lines 6–10). Third, using a voting scheme, the consequent of each rule in $\mathcal{R}_{d}$ is counted, and the class $c_{i}\in C$, if any, with the highest count is chosen as the class of the product offer description $d$ (Lines 11–17). Notice that the algorithm prioritizes shorter rules, with fewer antecedent tokens, which usually include product codes: as soon as the rules matched for some $k$ yield a single winning class, larger itemsets are not examined.

In the case of a tie in the voting for all itemset sizes, or when no rule in $\mathcal{R}$ has an antecedent that matches the itemsets from $d$, our method uses a similarity function (e.g., Jaccard [15] or Cosine [26]) to choose the class of $d$. In this case, each product offer description used in the training phase is compared with the string $d$, and the class of the description with the highest similarity is chosen as the class of $d$ (Line 19).

Algorithm 2
Test Phase
Input: $\mathcal{R}$, $\mathcal{D}$, $d\in\mathcal{T}$
Output: The predicted class of $d$
1: $S_{0}\leftarrow$ Tokenize($d$)
2: $m\leftarrow$ Length($S_{0}$)
3: for $(k\leftarrow 1;k\leqslant m;k++)$ do
4:   $\mathcal{R}_{d}\leftarrow\emptyset$
5:   $S_{k}\leftarrow$ GenItemSets($k$, $S_{k-1}$)
6:   for each $k$-itemset $it\in S_{k}$ do
7:     for each $\mathcal{X}\rightarrow c\in\mathcal{R}$ such that $it=\mathcal{X}$ do
8:       $\mathcal{R}_{d}\leftarrow\mathcal{R}_{d}\cup\{\mathcal{X}\rightarrow c\}$
9:     end for
10:   end for
11:   for each $r\in\mathcal{R}_{d}$ in the form $\mathcal{X}\rightarrow c$ do
12:     $c.count++$
13:   end for
14:   $pc\leftarrow c_{i}$ such that $c_{i}.count>c_{j}.count$ $\forall j\neq i$
15:   if $pc\neq\emptyset$ then
16:     return $pc$ //the predicted class of $d$
17:   end if
18: end for
19: return PredictBySimilarity($\mathcal{D}$, $d$) //tie in voting or no rule
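
The voting loop of the test phase can be illustrated with a short Python sketch. It is a simplified reading of Algorithm 2 rather than the authors' code: the function name `classify`, the `max_k` cap, the hand-written token lists, and the rule dictionaries (a subset of the rules listed in Tables 3 and 4) are assumptions for illustration, and the similarity fallback of Line 19 is represented only by returning `None`.

```python
from collections import Counter
from itertools import combinations

# A subset of the rules from Tables 3 and 4 for a minimum support of 0%.
RULES_0 = {
    frozenset(["hewlettpackard"]): "c1",
    frozenset(["packard"]): "c1",
    frozenset(["allinone"]): "c1",
    frozenset(["j3680"]): "c1",
    frozenset(["k5400"]): "c4",
    frozenset(["color"]): "c4",
    frozenset(["hewlett", "pro"]): "c3",
    frozenset(["8000", "hewlett"]): "c3",
}
# With a minimum support of 50%, the two low-support c1 rules are absent.
LOW_SUPPORT = {frozenset(["hewlettpackard"]), frozenset(["packard"])}
RULES_50 = {ante: cls for ante, cls in RULES_0.items() if ante not in LOW_SUPPORT}


def classify(tokens, rules, max_k=3):
    """Return the winning class, or None when the similarity fallback is needed."""
    distinct = sorted(set(tokens))
    for k in range(1, min(max_k, len(distinct)) + 1):
        votes = Counter()  # reset for every k, as in Line 4 of Algorithm 2
        for itemset in combinations(distinct, k):
            cls = rules.get(frozenset(itemset))  # hash lookup of the antecedent
            if cls is not None:
                votes[cls] += 1
        top = votes.most_common(2)
        # Accept only a strict winner; a tie moves on to larger itemsets.
        if top and (len(top) == 1 or top[0][1] > top[1][1]):
            return top[0][0]
    return None


# Hand-tokenized test instances of Example 2 (the tokenization is an assumption).
inst1 = "hewlettpackard hewlett packard hp officejet j3680 allinone".split()
inst2 = "hewlettpackard hewlett packard officejet pro k5400 color inkjet printer".split()
inst3 = "hp officejet pro 8000 printer".split()

predictions_0 = [classify(d, RULES_0) for d in (inst1, inst2, inst3)]
predictions_50 = [classify(d, RULES_50) for d in (inst1, inst2, inst3)]
```

With these inputs the sketch reproduces the behavior discussed in Example 2: the second instance is tied at $k=1$ under 0% support and is then captured by a 2-itemset rule of $c_{3}$, while under 50% support it is assigned to $c_{4}$ at $k=1$.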

Example 2. Table 4 illustrates three instances of test data, and the rules from Table 3 that match the itemsets generated from the test strings. For Instance #1, when $k=1$, all matching rules in the training model indicate $c_{1}$ as the correct class, so it is chosen by voting (Lines 14–16 of Algorithm 2). For Instance #2, when $k=1$ and the minimum support is equal to 0%, a tie occurs between the rules of the classes $c_{1}$ and $c_{4}$. In this case, the algorithm is not able to classify the test instance using $k=1$, and so it generates itemsets of size $k=2$. For this size, only the rule $\textit{hewlett, pro}\rightarrow c_{3}$ matches, which incorrectly classifies the test instance as belonging to the class $c_{3}$. However, when the minimum support is equal to 50%, the class $c_{4}$ is correctly chosen by the algorithm using $k=1$. Furthermore, for Instance #3, no rule in the training model matches this string for any $k$-itemset. In such a case, the decision is made with a similarity metric, such as Cosine (Line 19 of Algorithm 2).

3.4 Alternatives to the basic algorithm

In addition to the algorithm described previously, we propose and evaluate two alternative strategies, which constitute small variations of the basic algorithm.

Alternative 1: This involves limiting the number of tokens in product offer description strings. Our hypothesis is that the most interesting rules are formed by the first tokens in the strings, which usually describe the title and code of the product offers. The remaining tokens usually describe their technical characteristics, which are less important for the identification of the product offers. Sometimes, implicit codes are also inserted at the end of the strings. We can therefore use only a fixed number of tokens from the beginning and the end of the strings. Such a strategy also reduces the number of tokens that must be combined to form itemsets, which constitutes a pruning strategy, discussed in Section 4. In our experiments, based on the observation of our datasets and after experimenting with several different numbers of tokens, we used the first ten and the last three tokens of each input instance, in both the training and test phases.

Alternative 2: This alternative applies when there is a tie among rules in the test phase for itemsets of size 1. It chooses the class of the rule whose token occurs nearest the beginning of the product offer description string. Our hypothesis is that if a test instance contains an implicit code, this code occurs in most cases near the beginning of the string. For example, consider a training model that contains the rules {5200tn $\rightarrow c_{x}$} and {prints $\rightarrow c_{y}$}. Consider also the test instance “HP laserjet 5200tn printer trade compliant up to 35 ppm prints”, whose correct class is $c_{x}$ and whose only itemsets of size 1 that match training rules are {5200tn} and {prints}. This instance contains the implicit code “5200tn”. With this strategy, the tie between the classes $c_{x}$ and $c_{y}$ is resolved in favor of $c_{x}$, because the token “5200tn” occurs before the token “prints” in the test instance.
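
Both alternatives are simple to express in code. The sketch below is one possible reading, not the authors' implementation: the helper names `limit_tokens` and `break_tie_by_position` are hypothetical, whitespace tokenization is assumed, and Alternative 2 is interpreted as "the earliest token in the description that matches a rule of a tied class decides".

```python
def limit_tokens(tokens, head=10, tail=3):
    """Alternative 1: keep only the first `head` and the last `tail` tokens."""
    if len(tokens) <= head + tail:
        return list(tokens)
    return list(tokens[:head]) + list(tokens[-tail:])


def break_tie_by_position(tokens, single_token_rules, tied_classes):
    """Alternative 2 (one reading): the earliest matching token decides the tie."""
    for tok in tokens:
        cls = single_token_rules.get(tok)
        if cls in tied_classes:
            return cls
    return None


# The example from the text: two single-token rules that tie at k = 1.
rules = {"5200tn": "cx", "prints": "cy"}
offer = "hp laserjet 5200tn printer trade compliant up to 35 ppm prints".split()
winner = break_tie_by_position(offer, rules, {"cx", "cy"})
```

Here `winner` is the class of the rule on "5200tn", because that token precedes "prints" in the description. The example offer has 11 tokens, so `limit_tokens` with the 10 + 3 setting leaves it unchanged.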

Table 4
An example of test data and the rules from Table 3 that match with the itemsets generated by the test instances

#Inst.	Test instance	Rules found in training model	
		Support (0%)	Support (50%)
1	Hewlett-Packard – HP Officejet J3680 all-in-one	$\textit{hewlettpackard}\rightarrow c_{1}$; $\textit{allinone}\rightarrow c_{1}$; $\textit{packard}\rightarrow c_{1}$; $\textit{j3680}\rightarrow c_{1}$	$\textit{allinone}\rightarrow c_{1}$; $\textit{j3680}\rightarrow c_{1}$
2	Hewlett-Packard OfficeJet Pro K5400 color inkjet printer	$\textit{hewlettpackard}\rightarrow c_{1}$; $\textit{packard}\rightarrow c_{1}$; $\textit{k5400}\rightarrow c_{4}$; $\textit{color}\rightarrow c_{4}$; $\textit{hewlett, pro}\rightarrow c_{3}$	$\textit{k5400}\rightarrow c_{4}$; $\textit{color}\rightarrow c_{4}$
3	HP OfficeJet Pro 8000 printer	–	–

Table 5

Example of a pruning strategy that does not generate the bold itemsets when the 1-itemset $\{D\}$ forms a rule

Size of the $k$-itemset	Itemsets
k=1	A, B, C, D, E
k=2	AB, AC, AD, AE, BC, BD, BE, CD, CE, DE
k=3	ABC, ABD, ABE, ACD, ACE, ADE, BCD, BCE, BDE, CDE
k=4	ABCD, ABCE, ABDE, ACDE, BCDE
k=5	ABCDE

Figure 4.

Evolution of the results for the micro-F1 metric and the percentage of product offers classified by using only rules according to variations in maximum itemset size.

4. Computational complexity

In the training phase, the time complexity of our method is dominated by the number of product offer description strings to be trained on, and the number of tokens in these strings. In the worst case, all tokens in each string need to be combined to form the itemsets. Let $n$ be the number of product offer description strings, and let $m$ be the average number of tokens in these strings. Then, the time complexity of the method is $O(n\cdot 2^{m})$.

In order to reduce the execution time, we adopt some pruning strategies. The first is to avoid generating a rule whose antecedent is a superset of the antecedent of another rule; that is, the algorithm does not continue to combine a $k$-itemset that generates a 100% confidence rule (Line 19 of Algorithm 1). For example, consider an instance formed by the tokens $\{A,B,C,D,E\}$. Table 5 shows the itemsets pruned when the 1-itemset $\{D\}$ forms a rule. The itemsets of sizes from 2 to 5 that contain $D$, shown in bold, are not generated by the algorithm.
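
The effect of this first pruning strategy on the $\{A,B,C,D,E\}$ example can be checked by brute-force enumeration. The snippet below is only an illustration: it lists every itemset and separates those that are proper supersets of the rule-forming itemset $\{D\}$, whereas the actual algorithm never builds the pruned itemsets in the first place.

```python
from itertools import combinations

tokens = "ABCDE"
rule_itemset = frozenset("D")  # the 1-itemset assumed to form a rule

generated, pruned = [], []
for k in range(1, len(tokens) + 1):
    for combo in combinations(tokens, k):
        itemset = frozenset(combo)
        # A proper superset of a rule-forming itemset is never generated.
        if rule_itemset < itemset:
            pruned.append("".join(combo))
        else:
            generated.append("".join(combo))

pruned_by_size = {k: [s for s in pruned if len(s) == k] for k in range(2, 6)}
```

Of the 31 non-empty itemsets, the 15 that contain $D$ and have size 2 to 5 fall into `pruned`, which leaves 16 itemsets to be examined.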

The second pruning strategy uses the minimum support of a rule. A $k$-itemset that does not occur in a minimum number of product offer descriptions of the same class is removed from the set of $k$-itemsets, to avoid combining it to form $(k+1)$-itemsets. The GenItemSets function of Algorithm 1 (Line 12) can be modified to include only itemsets that reach the minimum support $ms$ for at least one class.

The third pruning strategy limits the size $k$ of the itemsets to be generated. The stop condition, $k\leqslant m$ , in Line 11 of Algorithm 1, is changed to $k\leqslant\max$ , where $\max$ is the maximum size of the itemsets to be generated. In our experiments, we observe that the most interesting rules are found by using the shortest itemsets, in particular when product offers contain implicit identifiers. Figure 4 illustrates the results for the micro-F1 metric, and the percentage of product offers classified by using only rules (for further explanation see Section 5.4), according to variations in the maximum size of itemsets for an electronics and informatics dataset. Notice that for itemsets of sizes greater than 4, the results are stable.

The fourth pruning strategy limits the number of tokens in product offer description strings. This is Alternative 1, discussed in Section 3.4. In this strategy, the number of tokens in the input strings, $m$, can be considered a constant, and the time complexity of the method becomes linear, $O(n)$.
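
To give a feeling for how much the third and fourth strategies shrink the search space, the number of candidate itemsets for a single description can be counted directly: a description with $m$ distinct tokens has $\sum_{k=1}^{\max}\binom{m}{k}$ itemsets of size at most $\max$, and $2^{m}-1$ itemsets when no size limit applies. The helper name `itemsets_up_to` below is hypothetical, and the values 13 (the token limit of Alternative 1), 35 (the longest description in Table 6), and 3 (the maximum itemset size adopted in Section 5.4) are plugged in only as an illustration of worst-case counts before any other pruning.

```python
from math import comb


def itemsets_up_to(m, max_k):
    """Number of itemsets of size 1..max_k that m distinct tokens can form."""
    return sum(comb(m, k) for k in range(1, min(m, max_k) + 1))


all_sizes_13 = itemsets_up_to(13, 13)   # no size limit: 2**13 - 1
capped_13 = itemsets_up_to(13, 3)       # 13 + 78 + 286
capped_35 = itemsets_up_to(35, 3)       # longest description, size limit only
```

For 13 tokens the count drops from 8191 to 377 once the itemset size is capped at 3, and it no longer depends on the length of the original description.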

The same analysis of complexity can be done for the test phase of the algorithm, and the third and fourth pruning strategies are also applicable. Notice that a search in the set of rules, $\mathcal{R}$ , in the learning model can be done in $O(1)$ using a hash table to keep the rules.

5. Experimental evaluation

In this section we describe our experiments, to evaluate the feasibility of using association rules to classify product offers.

5.1 Datasets

We evaluate our method using distinct datasets, as we describe in the following.

UOL datasets – obtained by crawling the Shopping UOL e-commerce site (http://shopping.uol.com.br/). On this site, product offers are hierarchically organized by categories, and we use the product level categories (the leaf level of the hierarchy) to evaluate our classifier. We collected data from classes that had at least two product offers linked to them. These data include, for each instance, its product offer description, price, and store. We use only the product offer description. The data were gathered in February 2014. In order to evaluate different facets of our method, we divided this collection into three distinct datasets: (i) UOL-electronics dataset, formed of the product categories mobile phone, phone, electronics and informatics, and appliances, where product offers usually have an implicit, built-in, product code in their descriptions; (ii) UOL-non-electronics dataset, formed of the product categories perfumery and cosmetics, fashion accessories and jewelry, toys and games, sport and fitness, and babies and children, where product offers usually do not have an implicit product code; and (iii) UOL-book dataset, formed of the book category, which also does not exhibit implicit product codes.

Printer dataset – composed of printer descriptions, obtained by querying Google Shopping, and used by [25].

Abt-Buy and Amazon-Google datasets – composed of product descriptions, obtained from Abt.com and Buy.com, and from Amazon and Google Shopping, respectively. They were used by [18] to evaluate entity resolution approaches. We adapted them to form classes of products instead of matching pairs.

Table 6 presents the following statistics on our datasets: number of product offers, number of classes, average and range of the number of tokens per instance.

Table 6

Statistics on the datasets

Dataset	#Offers	#Classes	#Tokens (average)	#Tokens (range)
UOL-electronics	9,552	2,218	12.6	2–35
UOL-non-electronics	26,640	5,299	6.4	1–28
UOL-book	385,797	93,886	5.7	1–33
Printer	2,167	157	7.5	1–16
Abt-Buy	1,097	1,075	8.6	2–37
Amazon-Google	1,300	1,105	7.3	1–29

5.2 Evaluation metrics

In order to evaluate the quality of our classifier, we employed the metrics micro-average and macro-average $F_{1}$ . The $F_{1}$ measure is defined as:

$\mathit{F_{1}}=\frac{2rp}{r+p},$

where $p$ is the precision of the classifier, and $r$ is its recall.

The micro-average $F_{1}$ , or simply micro-F1, corresponds to a global $F_{1}$ value obtained by computing precision and recall over all classes. Micro-F1 is also known as accuracy; that is, the fraction of the test instances assigned to their correct classes by the classifier. The macro-average $F_{1}$ , or simply macro-F1, is computed by averaging $F_{1}$ across all classes [4].
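
As a worked illustration of the two averages, the following sketch computes micro-F1 and macro-F1 for a single-label classifier. The function names are hypothetical, and averaging macro-F1 over every class that appears in either the true or the predicted labels is an assumption; the text does not state how classes with no prediction are handled.

```python
from collections import Counter


def f1(p, r):
    """F1 = 2rp / (r + p), defined as 0 when both p and r are 0."""
    return 2 * r * p / (r + p) if (r + p) > 0 else 0.0


def micro_macro_f1(y_true, y_pred):
    """Return (micro-F1, macro-F1) for non-empty, equally long label lists."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # the predicted class gains a false positive
            fn[truth] += 1  # the true class gains a false negative
    classes = set(y_true) | set(y_pred)
    per_class = []
    for c in classes:
        prec = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        rec = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        per_class.append(f1(prec, rec))
    macro = sum(per_class) / len(per_class)
    total_tp, total_fp, total_fn = sum(tp.values()), sum(fp.values()), sum(fn.values())
    micro = f1(total_tp / (total_tp + total_fp), total_tp / (total_tp + total_fn))
    return micro, macro


micro, macro = micro_macro_f1(list("aabc"), list("abbc"))
```

In this toy case three of the four instances are correct, so micro-F1 equals the accuracy, 0.75, while macro-F1 averages the per-class values 2/3, 2/3, and 1.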

We used a stratified ten-fold cross-validation technique to measure the experimental results; that is, each dataset was randomly divided into 10 folds, ensuring that each class was proportionally represented in each fold, and the experiments were run 10 times, each time using 9 folds for training and a distinct fold for testing. The results reported in this document are an average of the results among the 10 runs. The same strategy was also applied for the baseline methods, each time using the same sample used by our method for training and testing.
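
One simple way to build such folds is to shuffle the instances of each class and deal them out to the folds in turn. The sketch below is an assumption about how the split could be done, not a description of the tooling actually used; `stratified_folds` is a hypothetical helper. Note that a class with fewer than 10 instances cannot appear in every fold, which matters for datasets where many classes have only two offers.

```python
import random
from collections import defaultdict


def stratified_folds(labels, n_folds=10, seed=0):
    """Return n_folds lists of instance indices with classes spread evenly."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)
    folds = [[] for _ in range(n_folds)]
    position = 0  # running counter, so small classes do not all land in fold 0
    for label in sorted(by_class):
        members = by_class[label]
        rng.shuffle(members)
        for i in members:
            folds[position % n_folds].append(i)
            position += 1
    return folds


labels = ["a"] * 20 + ["b"] * 10
folds = stratified_folds(labels)
# Fold t is the test set of run t; the other nine folds form the training set.
```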

5.3 Baselines

We compared our method against three baselines: a Jaccard [15] based method, a Cosine [26] based method, and an SVM (Support Vector Machine) [27] based method. Jaccard and Cosine are well-known string similarity functions, and SVM is one of the best text classifiers [4].

In the Jaccard based method, the Jaccard Coefficient Similarity is used to compare each instance in the test dataset against each instance in the training dataset. The class of the instance in the training dataset showing the highest similarity with each tested instance is chosen to be the class of the tested instance.

In the Cosine based method, we used the same strategy as the Jaccard method, except that the Cosine Similarity is used to compare the instances. Each instance is represented as a vector of token weights, using the distinct tokens in the training dataset. The token weights are computed as the inverse-document-frequency (IDF).
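
The two similarity baselines amount to a nearest-neighbor search over the training descriptions. The following sketch illustrates the idea under stated assumptions: descriptions are treated as token sets, the IDF weight is taken as $\log(N/\mathit{df})$, tokens unseen in training get weight 0, and all function names are hypothetical.

```python
import math
from collections import Counter


def jaccard(a, b):
    """Jaccard coefficient between two token collections."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0


def idf_weights(training):
    """IDF per token over the training descriptions (assumed log(N / df))."""
    n = len(training)
    df = Counter(tok for _, tokens in training for tok in set(tokens))
    return {tok: math.log(n / count) for tok, count in df.items()}


def cosine_idf(a, b, idf):
    """Cosine similarity between IDF-weighted token vectors."""
    a, b = set(a), set(b)
    dot = sum(idf.get(t, 0.0) ** 2 for t in a & b)
    norm_a = math.sqrt(sum(idf.get(t, 0.0) ** 2 for t in a))
    norm_b = math.sqrt(sum(idf.get(t, 0.0) ** 2 for t in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def nearest_class(query, training, similarity):
    """Class of the training instance most similar to the query."""
    return max(training, key=lambda item: similarity(query, item[1]))[0]


training = [("c1", ["hp", "officejet", "j3680"]),
            ("c4", ["hp", "officejet", "k5400"])]
query = ["hp", "k5400", "color"]
idf = idf_weights(training)
by_jaccard = nearest_class(query, training, jaccard)
by_cosine = nearest_class(query, training, lambda q, t: cosine_idf(q, t, idf))
```

Every test instance is compared with every training instance, so the cost of these baselines grows with the product of the two set sizes.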

In the SVM based method, each product is associated with a class and an SVM classifier for that class is trained. Each instance is represented as a feature vector, with each token corresponding to a feature, and its IDF value being the feature weight. For the experiments, we used the package LibLinear (http://www.csie.ntu.edu.tw/~cjlin/liblinear/).

5.4 Results and discussions

We ran a set of experiments to evaluate our basic method, as well as its variants and pruning strategies. In the following, we describe each one of these experiments and discuss their results.

Maximum itemset size

In some situations, no rule in the training model matches the itemsets generated by a test string, or there is a tie among the classes of the rules found. In these cases, our method needs to use another similarity function to decide the class of the test instance. In this experiment, we want to evaluate a pruning strategy related to the generation of rules, so we consider only those test instances whose class could be uncovered by using only rules, without using another similarity function. Situations where our method needs an auxiliary similarity function will be evaluated in other experiments in this section.

In Section 4, we discussed some pruning strategies to improve the execution time of our algorithm. The third pruning strategy discussed was about limiting the size of the itemsets to be generated. This experiment evaluates such a pruning strategy. Figure 4 illustrates the results of the micro-F1 metric, and the percentage of product offers classified by using only rules according to variations in the maximum size of the itemsets, for the UOL-electronics dataset. This experiment uses a minimum support equal to 30%.

Using only itemsets of size 1, our method obtains the highest result of micro-F1 for classification; however, it classifies only 66.73% of the product offers in the test dataset. As the maximum size of the itemsets increases, the number of classified product offers also increases, but the results for micro-F1 decrease. For itemset sizes higher than 3, both results remain stable. We observe a similar behavior in the experiments with the other datasets, which is what motivated the proposal of our pruning strategy. That is, we can limit the size of the itemsets to be generated, thereby gaining in efficiency, without losing effectiveness.

In the next experiments presented, we adopted a maximum itemset size equal to 3.

Product code identification

Our hypothesis was that our method could identify implicit product codes in offer descriptions. In the UOL-electronics dataset, most of the descriptions have an implicit product code that could be used to identify their classes. Most of the product codes are formed of one or two tokens. Analyzing Fig. 4, we can see that for itemsets of size 1 and 2, our method can classify up to 84.74% of the product offers. Looking in detail at the results, we observe that most of the hits were due to rules formed by product codes.

However, our method was not able to classify more product offers by using rules formed by product codes, due to noisy tokens that occur in the descriptions. By noisy tokens, we denote those tokens that are rare in the collection of product offers and are treated as product identifiers by our method. Examples of noisy tokens can be seen in Tables 2–4. For a support equal to 0%, the test instance #2 of Table 4 is incorrectly classified, due to the noisy tokens “hewlettpackard” and “packard” that point to class $c_{1}$, while the correct class is $c_{4}$, provided by the rule formed by the product code “k5400”. In this case, the presence of noisy tokens resulted in a tie among classes, and the product code was not able to classify the product offer. In the case of a tie among classes, the classification is performed by using other tokens in larger itemsets, or the offer is not classified by rules.

Minimum support

In this experiment, we evaluated the influence of the minimum support for generating rules on the quality of the classification results. Figure 5 illustrates the evolution of the results for the macro-F1 and micro-F1 metrics, and the percentage of product offers classified by using only rules according to variations in minimum support. As the minimum support increases, the quality of the classification also increases, but the percentage of classified product offers decreases. For a minimum support higher than 40%, the results for all quality metrics remain stable or decrease.

Figure 5.

Evolution of the results for the macro-F1 and micro-F1 metrics, and the percentage of product offers classified by using only rules according to variations in minimum support.

We observe a similar behavior in the experiments with the other datasets. Aiming at a trade-off between the quality of classification and the number of product offers classified by using only rules, the experiments suggest using a minimum support of around 30%. In the next experiments presented, we adopted a minimum support equal to 30%.

The minimum support also has an influence on the efficiency of the method, since a $k$-itemset that does not reach the minimum support is not used to generate $(k+1)$-itemsets. This is our second pruning strategy, as discussed in Section 4.

Comparison with baselines

This experiment compares our method against the baselines for classifying product offers. For those test instances where our method was not able to decide their classes by using rules, it used the SVM-based method as a similarity function. Table 7 presents the results on UOL-electronics, UOL-non-electronics, and Printer datasets, using the micro-F1 and macro-F1 metrics. Our method obtains the best numbers on all datasets for all metrics. Statistically, considering a 95% confidence level, we can state that our method is superior to all baselines on the UOL-electronics and UOL-non-electronics datasets, and tied with SVM on the Printer dataset.

Table 7

Results comparing our method against the baselines

	UOL-electronics (%)		UOL-non-electronics (%)		Printer (%)
Method	MacF1	MicF1	MacF1	MicF1	MacF1	MicF1
Our method	87.25	88.88	81.96	84.75	97.86	98.34
SVM	84.54	86.81	78.90	82.53	97.72	98.20
Cosine	82.03	84.54	76.97	80.40	89.22	90.40
Jaccard	74.10	77.52	73.30	77.54	77.65	81.63

Comparing the quality of the results among the three datasets, the more instances and more classes a dataset contains, the worse the results were on it. Our method obtained the highest gains on the UOL-non-electronics dataset, which contains more instances and classes.

We also evaluated the performance of our method for predicting the class of a product offer by using only rules. That is, when no rule is found in the training model to predict the class of a test instance, or there is a tie among classes for all itemsets, these test instances were not considered in this experiment. In such cases, our method would use an auxiliary similarity function to decide the class of the instances. As shown in Figs 4 and 5, our method was able to classify 87.28% of the test instances by using only rules in the UOL-electronics dataset, for maximum itemset size equal to 3, and minimum support equal to 30%. The test instances classified by our method, by using only rules, were given to our baselines to be classified. Table 8 presents the results on the UOL-electronics, UOL-non-electronics, and Printer datasets, using the micro-F1 and macro-F1 metrics.

Table 8

Results comparing our method against the baselines for the test instances classified by using only rules by our method

	UOL-electronics (%)		UOL-non-electronics (%)		Printer (%)
Method	MacF1	MicF1	MacF1	MicF1	MacF1	MicF1
Our method	91.37	92.49	90.53	92.09	99.53	99.71
SVM	88.48	90.15	86.88	88.89	99.42	99.56
Cosine	85.62	87.55	85.68	87.56	91.20	91.85
Jaccard	77.42	80.25	81.48	84.04	79.92	83.36

As in the previous experiment, our method is statistically superior to all baselines on the UOL-electronics and UOL-non-electronics datasets, and tied with SVM on the Printer dataset. This experiment demonstrates that when our method is able to classify an instance using rules, it is usually better than the baselines. However, the number of test instances classified using rules varies by dataset. For the UOL-electronics and Printer datasets, in which most instances contain an implicit product code, our method classified 87.28% and 93.98% of the test instances, respectively, while for the UOL-non-electronics dataset it classified 69.22%. This result is an indication that we need to investigate ways to increase the number of instances classified by rules, or improve the results of those classified. In the following experiments, we investigated other variants of our basic method.

Alternatives to the basic algorithm

In Section 3.4, we described two alternative strategies on top of our basic algorithm. In this experiment, we evaluate them. Table 9 presents the results on the UOL-electronics, UOL-non-electronics, and Printer datasets, using the micro-F1 and macro-F1 metrics. “Our method” corresponds to our basic method, “Alternative 1” and “Alternative 2” correspond to the alternative strategies with the same name as described in Section 3.4.

Table 9

Results comparing our basic method against its alternative strategies

	UOL-electronics (%)		UOL-non-electronics (%)		Printer (%)
Method	MacF1	MicF1	MacF1	MicF1	MacF1	MicF1
Our method	87.25	88.88	81.96	84.75	97.86	98.34
Alternative 1	86.83	88.46	81.93	84.67	97.86	98.34
Alternative 2	87.58	89.17	81.99	84.77	97.28	97.88

The results are statistically similar. However, “Alternative 1” is particularly interesting when the number of tokens in product offer descriptions is high. It limits the number of tokens to be combined to generate itemsets, which improves the execution time. In this experiment, we set a limit of 13 tokens. On the UOL-electronics dataset, which has the highest number of tokens per instance, “Alternative 1” was six times faster than the basic algorithm in the training phase.

“Alternative 2” may also improve the execution time, since the decision in the case of a tie among rules is made by using itemsets of size 1, avoiding the generation of larger itemsets. However, in our experiments, this alternative had an execution time similar to the basic algorithm, because the number of decisions taken due to ties among rules was low.

Training using few instances per class

Several entity resolution approaches to match problems were evaluated by [18]. For matching product entities from online shopping, none of the approaches obtained good results. In this experiment, we used their datasets to evaluate our method and our baselines. Our results are not comparable with their results, because the problem formulation is different. We adapted their datasets to our classification task, as we described in Section 5.1, for the Abt-Buy and Amazon-Google datasets.

We executed four experiments on these datasets: (1) training with the Abt dataset and testing with the Buy dataset, (2) training with the Buy dataset and testing with the Abt dataset, (3) training with the Amazon dataset and testing with the Google dataset, and (4) training with the Google dataset and testing with the Amazon dataset. Therefore, we did not use cross-validation in these experiments. The main characteristic of these experiments is that in the training dataset, most classes contain only one instance. Tables 10 and 11 present the results.

Table 10

Results comparing our method against the baselines on Abt-Buy dataset

	Abt-Buy dataset (%), training Abt and test Buy		Abt-Buy dataset (%), training Buy and test Abt
Method	MacF1	MicF1	MacF1	MicF1
Our method	78.07	81.85	74.30	78.35
SVM	81.15	84.23	73.71	77.24
Cosine	81.97	84.97	77.81	81.87
Jaccard	76.77	80.57	69.91	74.75

Table 11

Results comparing our method against the baselines on Amazon-Google datasets

	Amazon-Google dataset (%), training Amazon and test Google		Amazon-Google dataset (%), training Google and test Amazon
Method	MacF1	MicF1	MacF1	MicF1
Our method	63.25	64.91	70.20	75.29
SVM	55.05	56.08	58.39	63.98
Cosine	75.07	77.92	72.92	78.28
Jaccard	72.85	75.21	69.66	75.02

Our method, as well as the SVM-based method, did not obtain the best results on these experiments. The main reason for this was the small number of instances per class, to create the training model. Our method could not take advantage of the minimum support, because most of the classes contained only one instance. These datasets also contain many distinct products, which generated many rules with noisy tokens by our method; that is, many rules were generated for tokens that were not identifiers of classes. A simple method, such as the Cosine-based method, obtained the best results on these experiments, although it required a higher execution time than our method.

However, our method worked well when there were at least two instances per class in the training set. The UOL datasets are imbalanced. In the UOL-electronics dataset, for example, 37% of the classes contain 2 instances, 85% of the classes contain 6 or fewer instances, and the other 15% contain up to 53 instances.

Behavior on a large dataset

In this experiment, we evaluate our method, and the baselines, on the UOL-book dataset, which has thousands of instances and classes. We evaluate the performance for predicting the class of instances by our method, using only rules. Our method was able to classify 80.43% of the test instances. These test instances were given to our baselines to be classified. However, as we explain in the next section, we were not able to run the baselines on this dataset. Table 12 presents the results.

Table 12

Results of our method on the UOL-book dataset

	UOL-book dataset (%)	
Method	MacF1	MicF1
Our method	96.12	96.67

Table 13

Execution time, in seconds, and standard deviation for our method and the baselines

		Our method		SVM		Cosine	Jaccard
Dataset	Metric	Training	Test	Training	Test	Test	Test
UOL-non-electronics	Exec. time	100.97	16.45	427.05	15.51	2,119.10	2,202.83
	Std. Dev.	0.66	1.08	14.12	1.83	38.87	38.81
UOL-book	Exec. time	122,402.53	17,003.89	–	–	–	–
	Std. Dev.	956.05	154.13	–	–	–	–

Our method obtained good results on this dataset, better than on the other datasets (except on Printer dataset). It is important to highlight that this dataset is difficult to classify, due to its large number of classes. Although the product offers in this dataset do not have implicit codes in their descriptions, the titles of the books are usually different from each other, and our method was able to identify the combination of tokens that define each class. This dataset has the shortest average number of tokens per instance, and stopwords were not removed. There are some descriptions formed from only stopwords, for instance, “Who, I?”.

Execution time

We measured the execution time of our method, and of the baselines, for classifying the two largest datasets, UOL-non-electronics and UOL-book. We measured the time taken to execute each one of the 10 folds of the cross-validation, and report here the average time among these 10 executions, as well as the standard deviation. The experiments with the UOL-non-electronics dataset were performed on a computer with the following configuration: x86_64 architecture, processor Intel(R) Core(TM)2 Quad 2.66 GHz, 4 GB RAM, and operating system Ubuntu 12.04.4 LTS 64-bit. The experiments with the UOL-book dataset were performed on a computer with the following configuration: i686 architecture, processor Intel(R) Xeon(R) E5620 2.4 GHz, 4 GB RAM, and operating system Ubuntu 12.04.5 LTS 32-bit. Table 13 presents the results.

Our method presented the lowest execution times, except for the test on UOL-non-electronics, for which SVM was slightly faster. The Cosine and Jaccard based methods were too slow, because they compute a full Cartesian product between the instances in the test subset and those in the training subset.
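
The cost of that Cartesian product can be estimated from the dataset sizes in Table 6. The short calculation below is only a back-of-the-envelope illustration: the helper name `pairwise_comparisons` is hypothetical, and it assumes that each of the 10 folds holds roughly one tenth of the offers.

```python
def pairwise_comparisons(n_offers, n_folds=10):
    """Similarity computations needed to classify one test fold."""
    test = n_offers // n_folds
    train = n_offers - test
    return test * train


non_electronics = pairwise_comparisons(26_640)   # UOL-non-electronics
book = pairwise_comparisons(385_797)             # UOL-book
ratio = book / non_electronics
```

Under this assumption, a single fold of UOL-non-electronics already requires about 64 million description comparisons, and UOL-book requires roughly 200 times as many.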

We could not obtain execution times for the baselines on the UOL-book dataset. These experiments ran for several days, and were not completed. This dataset contains thousands of instances and classes. Our method was able to classify them within a reasonable time. The SVM-based method is not viable for classifying datasets with a large number of classes [9]. SVM can take only binary decisions, i.e., an instance belongs to or does not belong to a given class. With multiple classes, usually, a different classifier needs to be learned for each class, or for each pair of classes [4].

Experiments using other machine learning algorithms

We tried to compare our method against the machine learning algorithms Random Forests [6] and Naive Bayes [31], running in the Weka tool [31], but without success. Both algorithms were less effective than our method on the Printer and UOL-electronics datasets, and their execution times were much longer. We could not run them on the larger datasets, UOL-non-electronics and UOL-book, because they ran out of memory, even on a machine with twice the memory of the one used to run our method. Running a feature selection algorithm to reduce the dimensionality did not solve the problem either.

The problem is that our datasets have a large number of features and classes, and those algorithms do not work well for datasets with such characteristics. For Random Forests, according to [33], the suggestion of Breiman [6] to select features in a subspace works well for data of moderate dimensionality (fewer than 100 features), but is not suitable for very high-dimensional data consisting of thousands of features. The authors of [33] proposed a strategy that improves on Breiman’s, but they evaluated it only on datasets containing up to 25 classes. For Naive Bayes, according to [7], the conventional algorithm cannot be directly applied to high-dimensional data classification, because it essentially assumes that all features are equally important for classification, which hardly holds in high-dimensional spaces.

Limitations and failure cases

In this section, we discuss the limitations of our method, and present some cases for which it failed. We discuss the reasons for these failures, and illustrate them with some examples.

Our method is limited in solving cases where the tokens of all instances of a class are subsets of the tokens of instances of other classes. In this case, no rule is generated for the class that contains the subsets of tokens, and our method depends on the other similarity function in order to classify product offers for this class. For example, suppose the class $c_{1}$ has only one training instance, with the description “HP Officejet Pro 8000 Printer”, and the class $c_{2}$ has an instance with the description “Hewlett Hp Officejet Pro 8000 Wireless Printer Cb9297a#b1h”. As the tokens of the instance of $c_{1}$ are a subset of the tokens of an instance of $c_{2}$, no rule is generated for the class $c_{1}$. In our datasets, there are also cases where two distinct classes have the same set of tokens, which appear in a different order in the product offer descriptions. For example, “Peace and War” and “War and Peace” are two descriptions of books from distinct classes.
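The subset limitation can be reproduced with a small sketch. The code below is a simplified stand-in for rule generation, not the algorithm evaluated in this work: it keeps an itemset as a candidate rule antecedent for a class only if that itemset does not occur in any training instance of another class, and it ignores support and confidence. The lowercase whitespace tokenization, the function names, and the `max_size` parameter are assumptions made for illustration.

```python
from itertools import combinations


def tokens(description):
    """Lowercase whitespace tokenization (an assumed preprocessing step)."""
    return frozenset(description.lower().split())


def exclusive_itemsets(training, max_size=2):
    """Map each class to the itemsets that occur only in that class.

    training: list of (description, class_label) pairs.
    """
    instances = [(tokens(d), label) for d, label in training]
    rules = {}
    for toks, label in instances:
        for size in range(1, max_size + 1):
            for combo in combinations(sorted(toks), size):
                itemset = frozenset(combo)
                # Discard the itemset if any instance of another class contains it.
                seen_elsewhere = any(
                    itemset <= other
                    for other, other_label in instances
                    if other_label != label
                )
                if not seen_elsewhere:
                    rules.setdefault(label, set()).add(itemset)
    return rules


training = [
    ("HP Officejet Pro 8000 Printer", "c1"),
    ("Hewlett Hp Officejet Pro 8000 Wireless Printer Cb9297a#b1h", "c2"),
]
rules = exclusive_itemsets(training)
```

Under these assumptions, `rules` has no entry for `c1`, because every itemset of its only instance also occurs in the instance of `c2`, while `c2` keeps itemsets such as the single token `wireless`. The same tokenization maps “Peace and War” and “War and Peace” to an identical token set.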

Our method is also limited when the number of instances per class is too small, as in the experiments with the Abt-Buy and Amazon-Google datasets. In such cases, it cannot take advantage of the minimum support, which proved to be an important strategy.

The main cause of failure of our method is rules formed by noisy tokens, i.e., tokens that are rare in the training dataset, but are not product identifiers. Such rules may produce an error when these tokens occur in test instances of other classes. For example, in our dataset, the training instance “Blu-ray Player Philips Bdp2100x/78 Full Hd, Bd-live, Dolby Truehd, Easylink, Simplyshare, Output HDMI” generated a rule with the token “Simplyshare” as its antecedent. The test instance “Blu-ray Player 3D Philips Bdp3380x/78 Full Hd, Wi-Fi Ready, Dolby Truehd, Simplyshare, Output HDMI”, which belongs to another class but also contains the noisy token “Simplyshare”, was erroneously classified by that rule. Rules formed by noisy tokens may also be generated when some characteristic of a product is described in several forms. For example, for digital cameras, we found descriptions containing “16.1 MP”, “16.1 MP”, and “16.1 megapixels”, which are similar. One of these forms, which occurred in only one class, generated a rule for that class.
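The “Simplyshare” failure can be illustrated with a minimal sketch of rule matching, in which a rule fires whenever its antecedent is a subset of the tokens of a test offer. The tokenization (lowercasing and treating commas as separators), the function names, and the class label `class_bdp2100` are hypothetical and serve only this example.

```python
def tokenize(description):
    """Lowercase the description and split on whitespace and commas (assumed)."""
    return frozenset(description.lower().replace(",", " ").split())


def matching_rules(description, rules):
    """Return the (antecedent, class_label) rules whose antecedent occurs in the offer."""
    toks = tokenize(description)
    return [(antecedent, label) for antecedent, label in rules if antecedent <= toks]


# Rule learned from the training instance of the Bdp2100x/78 player;
# "class_bdp2100" is a hypothetical label for that class.
rules = [(frozenset({"simplyshare"}), "class_bdp2100")]

test_offer = (
    "Blu-ray Player 3D Philips Bdp3380x/78 Full Hd, Wi-Fi Ready, "
    "Dolby Truehd, Simplyshare, Output HDMI"
)
fired = matching_rules(test_offer, rules)
```

Here the rule fires on an offer for a different product, because the token it relies on describes a feature shared by both players rather than an identifier of either one.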

Our method also failed due to misclassified data in our datasets. For example, the instances “Smartphone Samsung Galaxy S4 I9500 3G GSM unlocked” and “Mobile Phone Samsung Smartphone Galaxy S4 GT-I9505 Black Box”, from two distinct classes, were put together in the same class in the UOL-electronics dataset.

6. Conclusions and future work

In this work, we have proposed and evaluated a new method that uses association rules to classify product offers from e-shopping sites, matching offers against offers, without using a product catalog. Classification at product level is particularly important, in that it enables customers to compare prices. However, it is a difficult problem to solve automatically, due to the high number of classes. Our method proved to be effective and efficient for solving this type of problem.

Our method uses a supervised machine learning technique, which exploits associations among tokens contained in descriptions of product offers, and generates a set of rules that uniquely identify each class in the training dataset. It is simple, does not require the adjustment of complex parameters, and has good computational complexity. The experiments we performed show that our method obtains better results than three state-of-the-art baselines, on several datasets with distinct characteristics. We found that the larger the number of classes and instances, the better our method performs relative to the baselines. We also presented and discussed some situations where our method failed, as well as its limitations.

Our main contributions, which distinguish our method from existing research on e-commerce, are: (i) our method is able to classify a large dataset, containing thousands of instances and classes, within a reasonable execution time, whereas other methods in the literature have not been evaluated on datasets with such characteristics; (ii) it is able to identify implicit product codes in product offer descriptions, when they exist, and it is also generic enough to classify product offers that do not have codes; and (iii) it works well for different types of products, such as electronics and books. In addition, it classifies at product level, using product offer descriptions only, and does not need a cleaned and structured catalog of products. A few other methods share these characteristics; however, they were not evaluated on large datasets and are not scalable, since they depend on submitting queries to a search engine.

In terms of e-commerce management, our method may help increase the scalability and consistency of product offer classification. Besides price comparison, classification at product level may improve customer product reviews and product specifications, satisfying consumer needs and expectations. It may help find historical product prices and facilitate the use of prediction algorithms for forecasting future prices, as discussed in [2]. By using tokens that distinguish each product, our method may improve approaches for extracting concepts and hashtags, and for identifying web pages that discuss products [34]. Moreover, our method may be used for grouping product offers, providing more context for another method that matches them to a product catalog.

Although our experiments have focused on product offers, our method is generic and may be applied in any situation where entity references are described by short strings. We experimented with strings containing up to 37 tokens, although the average length was shorter. Very long strings could have an impact on efficiency.

Regarding future research, we are working on a mechanism to detect new product offers and add them to the training model. This mechanism is based on the reliability of the predicted class of a product offer, estimated by combining the number of rules predicting it with a similarity function. We are also developing strategies for removing noisy tokens from the instances, in order to improve the generation of rules. Finally, we are developing a distributed version of our algorithm to run in the Hadoop environment (http://hadoop.apache.org/), in order to improve its efficiency.

Acknowledgments

This work was partially supported by the FAPEMIG grant CEX-APQ-01834-14 and an individual scholarship from CAPES. We also thank João A. Silva for his contributions in some experiments.

References

1. Aanen S.S., Vandic and Frasincar, Automated product taxonomy mapping in an e-commerce environment, Expert Systems with Applications 42(3) (2015), 1298–1313.

2. Agrawal and Ieong, Aggregating web offers to determine product prices, in: Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’12, ACM, New York, NY, USA (2012), 435–443.

3. Agrawal and Srikant, Fast algorithms for mining association rules, in: Proceedings of the 20th International Conference on Very Large Data Bases, Santiago, Chile (1994), 487–499.

4. Baeza-Yates and Ribeiro-Neto, Modern Information Retrieval: The Concepts and Technology Behind Search, Addison-Wesley Professional, 2011.

5. Benjelloun, Garcia-Molina, Menestrina, Whang S.E. and Widom, Swoosh: A generic approach to entity resolution, The VLDB Journal 18(1) (January 2009), 255–276.

6. Breiman, Random forests, Machine Learning 45(1) (October 2001), 5–32.

7. Chen and Wang, Automated feature weighting in naive bayes for high-dimensional data classification, in: Proceedings of the 21st ACM International Conference on Information and Knowledge Management, CIKM ’12, ACM, New York, NY, USA (2012), 1243–1252.

8. Cortez, Herrera M.R., da Silva A.S., de Moura E.S. and Neubert, Lightweight methods for large-scale product categorization, Journal of the American Society for Information Science and Technology 62(9) (September 2011), 1839–1848.

9. Crammer and Singer, On the algorithmic implementation of multiclass kernel-based vector machines, The Journal of Machine Learning Research 2 (2002), 265–292.

10. Elmagarmid A.K., Ipeirotis P.G. and Verykios V.S., Duplicate record detection: A survey, IEEE Transactions on Knowledge and Data Engineering 19(1) (2007), 1–16.

11. French J.C., Powell A.L. and Schulman, Using clustering strategies for creating authority files, Journal of the American Society for Information Science 51(8) (2000), 774–786.

12. Getoor and Machanavajjhala, Entity resolution: Theory, practice and open challenges, Proceedings of the VLDB Endowment 5(12) (2012), 2018–2019. Tutorial available at http://www.cs.umd.edu/~getoor/Tutorials/ER_VLDB2012.pdf.

13. Gopalakrishnan, Iyengar, Madaan, Rastogi and Sengamedu, Matching product titles using web-based enrichment, in: Proceedings of the 21st ACM International Conference on Information and Knowledge Management, CIKM ’12, ACM, New York, NY, USA (2012), 605–614.

14. Ioannou, Rassadko and Velegrakis, On Generating Benchmark Data for Entity Matching, Journal on Data Semantics 2(1) (2013), 37–56.

15. Jaccard, Étude comparative de la distribution florale dans une portion des Alpes et des Jura, Bulletin de la Société Vaudoise des Sciences Naturelles 37 (1901), 547–579.

16. Kannan, Givoni I.E., Agrawal and Fuxman, Matching unstructured product offers to structured product specifications, in: Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, New York, NY, USA, San Diego, USA (2011), 404–412.

17. Köpcke and Rahm, Frameworks for entity matching: A comparison, Data & Knowledge Engineering 69(2) (2010), 197–210.

18. Köpcke, Thor and Rahm, Evaluation of entity resolution approaches on real-world match problems, Proceedings of the VLDB Endowment 3(1–2) (2010), 484–493.

19. Köpcke, Thor, Thomas and Rahm, Tailoring entity resolution for matching product offers, in: Proceedings of the 15th International Conference on Extending Database Technology, ACM, New York, NY, USA, Berlin, Germany (2012), 545–550.

20. Koudas, Sarawagi and Srivastava, Record linkage: Similarity measures and algorithms, in: Proceedings of the ACM SIGMOD International Conference on Management of Data, ACM, New York, NY, USA, Chicago, USA (June 2006), 802–803.

21. Londhe, Gopalakrishnan, Zhang, Ngo H.Q. and Srihari, Matching titles with cross title web-search enrichment and community detection, Proceedings of the VLDB Endowment 7(12) (Aug 2014), 1167–1178.

22. Nguyen, Fuxman, Paparizos, Freire and Agrawal, Synthesizing products for online catalogs, Proceedings of the VLDB Endowment 4(7) (April 2011), 409–418.

23. Pavlov, Balasubramanyan, Dom, Kapur and Parikh, Document preprocessing for naive bayes classification and clustering with mixture of multinomials, in: Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, New York, NY, USA, Seattle, USA (2004), 829–834.

24. Pereira D.A., da Silva E.E.B. and Esmin A.A.A., Disambiguating publication venue titles using association rules, in: Proceedings of the IEEE/ACM Joint Conference on Digital Libraries, London, UK (September 2014), 77–86.

25. Pereira D.A., Ribeiro-Neto, Ziviani, Laender A.H.F. and Gonçalves M.A., A generic web-based entity resolution framework, Journal of the American Society for Information Science and Technology 62(5) (May 2011), 919–932.

26. Salton and McGill M.J., Introduction to Modern Information Retrieval, McGraw-Hill, Inc., New York, NY, USA, 1983.

27. Vapnik V.N., The Nature of Statistical Learning Theory, Springer-Verlag, New York, USA, 1995.

28. Veloso, Ferreira A.A., Gonçalves M.A., Laender A.H.F. and Jr. W.M., Cost-effective on-demand associative author name disambiguation, Information Processing and Management 48(4) (2012), 680–697.

29. Veloso, Jr. W.M., Cristo, Gonçalves M.A. and Zaki M.J., Multi-evidence, multi-criteria, lazy associative document classification, in: Proceedings of the 15th ACM International Conference on Information and Knowledge Management, ACM, New York, NY, USA, Arlington, Virginia, USA (2006), 218–227.

30. Wedyan, Review and comparison of associative classification data mining approaches, International Journal of Computer, Electrical, Automation, Control and Information Engineering 8(1) (2014), 34–45.

31. Witten I.H., Frank and Hall M.A., Data Mining: Practical Machine Learning Tools and Techniques, third edition, Morgan Kaufmann, 2011.

32. Wolin, Automatic classification in product catalogs, in: Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM, New York, NY, USA (2002), 351–352.

33. Huang J.Z., Williams, Wang and Ye, Classifying very high-dimensional data with random forests built from small subspaces, International Journal of Data Warehousing and Mining 8(2) (2012), 44–63.

34. Zhang, Mukherjee and Soetarman, Concept extraction and e-commerce applications, Electronic Commerce Research and Applications 12(4) (2013), 289–296. Social Commerce-Part 2.
