An Association Rules based Method for Classifying Product Offers from e-shopping

Claudiane Maria Oliveira
Denilson Alves Pereira

Departamento de Ciência da Computação, Universidade Federal de Lavras, PO Box 3037, 37.200-000, Lavras, Brazil
[email protected]
[email protected]

Corresponding author: Denilson Alves Pereira

Keywords: Association Rule, Entity Resolution, Product Classification, Product Matching, Product Offer, e-commerce.
Abstract

Price comparison services are widely used by e-shopping customers. Such e-shopping sites receive product offers from thousands of online stores, and in order to provide price comparison, product categorization, and searching, it is necessary to match different offers referring to the same real-world product. This is a hard task, since they need to classify millions of product offers into thousands of classes, and distinct descriptions may exist for the same product, as well as very similar descriptions of distinct products. In this work, we propose a method that uses association rules to classify product offers from e-shopping web sites, matching offers against offers, without the need for a product catalog. This is a supervised learning method that trains a classifier, whose generated model comprises a set of association rules to identify product offer classes. Experimental evaluations show that our method is effective and efficient, and obtains better results than three baselines on several datasets with distinct characteristics. It is able to deal with large datasets containing thousands of classes and different types of products, such as electronics and books. Moreover, we propose and evaluate strategies to reduce its execution time, and we evaluate its weaknesses.
1 Introduction

The comparison of product prices is a very important service, provided by e-shopping web sites such as Google Shopping (http://www.google.com/shopping), Shopping.com, and Shopping UOL (http://shopping.uol.com.br). Such e-shopping sites receive product offers from thousands of online stores, through the transferring of data or crawling. Frequently, product offers are not represented by structured data, but rather by textual descriptions, mixing the name and technical characteristics of a product. The main challenge is to match up different offers that refer to the same real-world product. Spelling variants, acronyms, abbreviated forms, and misspellings compound to worsen the problem. As an illustration, Figure 1 shows some of the results of searching for the product "HP Photosmart C4780" in Google Shopping. It is noticeable that Google performs product matching, since the first result refers to offers from 4 stores. However, it was not able to identify that the second result also refers to the same product. In addition to identifying when different descriptions refer to the same entity, as in the first two descriptions of Figure 1, solutions for the problem of matching product offers also need to identify that very similar descriptions can refer to distinct products. For example, "HP OfficeJet Pro 8600 911a Wireless" and "HP OfficeJet Pro 8600 911n Wireless" refer to two distinct printers with different prices.
Figure 1: Product offers related to the search "HP Photosmart C4780" in Google Shopping
Given a set of entity references, such as product offer descriptions, the process of identifying which of them correspond to the same real-world entity is known as entity resolution [20, 10, 5, 17, 12]. The matching of products is a specific case of entity resolution, where the entities are products. Traditional entity resolution approaches deal with structured data, applying similarity functions to each attribute of the records and combining the results to find duplicated records. The product matching problem is harder because product offers come from thousands of merchants, which use different descriptions of the products. Such descriptions may be as short as "HP C4780" or contain several technical characteristics of the product in a free textual field.

In this work, we address the problem of aggregating product offers from e-shopping web sites by matching product offers against other offers. This problem is more challenging than matching product offers against a catalog of structured products [16], due to uncleaned and non-structured data. Matching of product offers is highly relevant in many scenarios where a catalog is not available. For example, applications that monitor product prices from different web sites typically do not have a catalog. Furthermore, aggregated offers could be used as a starting point for the construction of product catalogs.
Few studies have addressed the problem of matching product offers against product offers. In [19], the authors preprocess product offers to extract product codes and use them for matching. However, the effectiveness of their approach depends on the product category, since not all categories have product codes. Furthermore, their method needs to submit queries to a web search engine, which may make the method non-scalable. The approaches in [13] and [21] also need to submit queries to a web search engine, and they did not provide experiments with a large number of distinct products. Evaluations of several traditional entity resolution approaches made by [18] report poor results for precision and recall metrics in the product offer matching task. They also concluded that learning-based approaches, which obtain better results than non-learning approaches, do not scale with large input sets. Therefore, there is room for more studies in this area, and we pose the questions: can we develop an effective and scalable method for matching product offers, knowing that we need to deal with a large number (tens of thousands) of distinct products? Can we develop a generic method that does not depend on characteristics of specific product categories?

Our hypothesis is that we can treat the problem of matching product offers as a classification problem, and train a classifier to identify sets of tokens (words) in the product offer descriptions that discriminate different products, i.e., sets of tokens that occur in the descriptions of only one class of product offer. For example, in Figure 1, the set of tokens {C4780, Printer} occurs only in the two descriptions of the printer, and can distinguish it from its cartridge, which can be identified by the set of tokens {C4780, Cartridge}.
We propose a method to classify product offers from e-shopping web sites that uses association rules [3] to find sets of tokens that distinguish each product offer class. This is a supervised learning method that trains a classifier whose generated model comprises a set of association rules, used to identify product offer classes. The association rules are of the form X → ci, where X is a set of tokens and ci is a product offer class (e.g., {C4780, Printer} → c1). In the test phase, this method generates sets of tokens from the product offer description string to be tested, and tries to match them with the antecedents of the rules in the model generated in the training phase.

Our method uses only the textual description of product offers, which is usually present for all product types. Therefore, it does not depend on the specific attributes of some categories of products. We have evaluated it on datasets containing several categories of products, such as electronics, perfumeries, fashion accessories, and books. Some categories of products contain an implicit, built-in product code in their offer descriptions (e.g., C4780 in the example of Figure 1). Our method is able to identify implicit product codes in the product offer descriptions, when they exist, and it is also generic enough to classify product offers that do not have codes. We evaluated the effectiveness and the efficiency of our method experimentally, on several datasets with distinct characteristics, which indicates that our method is better than three baselines in most situations. It is able to classify a large dataset containing thousands of instances and classes in a reasonable execution time.

In [24], a preliminary proposal of our method was presented, applied to disambiguate publication venue titles in bibliographic citations. In this work, we changed some strategies of their basic algorithm, introducing the concept of support of a rule in the training phase and considering a new vote schema to choose the best rule in the test phase. We also propose and evaluate new alternative strategies to the basic algorithm, and we evaluate several facets of the method, in terms of effectiveness and efficiency, in the application of classifying product offers. The main motivation to apply that method to classify product offers came from the fact that product offers usually contain implicit codes that uniquely identify each product, similar to bibliographic citations, which usually contain acronyms. Unlike bibliographic data, for which existing methods obtain good results, product matching is a much harder problem to solve [18].

Our contributions can be summarized as follows:
• We propose an association rules based method for classifying product offers from e-shopping web sites, matching offers against offers, without the need for a product catalog. Differently from other associative classifiers, our method uses only the textual description of the entities, and we propose a new way of using support and confidence to find good rules;

• Our method is able to classify at product level, which enables customers to compare prices (currently and historically) and obtain other information about a product, such as customer reviews;

• It is able to classify large datasets, containing thousands of classes, which is the main challenge in managing e-shopping products. It works well for different types of products, such as electronics and books;

• We discuss the computational complexity of our method, and propose and evaluate strategies to reduce its execution time;

• We evaluate our method experimentally, demonstrating its effectiveness and efficiency.
The remainder of this paper is organized as follows: In Section 2, we discuss related work on the classification of products and on association rules. In Section 3, we present our method, which uses association rules to classify product offers; and in Section 4, we discuss its computational complexity. In Section 5, we describe our experiments, evaluation metrics, and results. Finally, in Section 6, we present our conclusions and directions for future work.
2 Related Work

E-shopping web sites usually organize their products hierarchically in categories, such as that in Figure 2 (a). They also keep a catalog of products, which contains detailed specifications for each product, organized in structures of attribute-value pairs, such as in the example of Figure 2 (b). Works in the literature have studied different tasks on classifying product offers in hierarchies and catalogs. Product offers can be matched to a catalog [16], or they can be classified at some level of the hierarchy, in classes such as Clothing, Components, or Printers [32, 23, 8], or at product level (e.g., HP PhotoSmart C4780) [19, 13, 21]. In order to compare product prices, product offers need to be classified at product level, which is the focus of our work. This type of classification is much more challenging due to the large number of classes. It is also more challenging than matching product offers against a catalog, due to the absence of structured and cleaned product data.

Related to product offers and catalogs, [16] described their system, used by Bing Shopping, for matching unstructured offers to structured product descriptions in a catalog. They adopted a probabilistic approach to find the product in the catalog that has the largest probability of matching the given offer. Their matching function takes into account matches and mismatches in attribute values between offer-product pairs, treats missing attribute values, and weights the importance of different attributes. In another work, [22] introduced the problem of product synthesis, which aims at identifying new products from a set of product offers and adding them to a catalog, together with their structured attributes. Their solution addressed issues involved in data extraction from offers, schema reconciliation, and data fusion.

In order to classify product offers in categories above the product level, [8] presented a study about classification methods for large-scale categorization. They also proposed a probabilistic approach to model the classification problem, using a belief network. Their work differs from ours because we classify product offers in categories at product level.
In our experiments, we used a dataset extracted from the same data source as their datasets; however, our dataset has many more classes than theirs.

Figure 2: (a) Example of a hierarchy of products and (b) examples of attribute-value pairs of two products in a catalog of products

Other works also presented approaches to classify products in classes above the product level. In [23], they evaluated Naive Bayes classifiers for classifying product offers in Yahoo! Shopping. They studied the effects of data transformation on text classification with the Naive Bayes classifier. Several heuristic feature transformations were experimented with, such as IDF and normalization by the length of the text. In [32], the author presented a system for product categorization that uses a variation of the vector space model [26], modified to represent product attributes. The author considered textual and numeric attributes, such as product description, manufacturer name, and price. In our work, we used only the textual description of products.
In a related study, [1] address the problem of matching product categories from multiple web sites. They proposed an improved approach for the word sense disambiguation process.

The focus of our work is to classify product offers in categories at product level, matching offers against offers. The studies in [19], [13], [21], and [18] addressed this problem. In [19], the authors used an entity resolution approach that preprocesses product offers to extract product codes and uses them for matching. A product code is a manufacturer-specific identifier that typically appears in the product name and description. To extract product codes, they manually created a list of regular expressions and used the web as an external knowledge source to verify the candidate codes. They employed a learning-based approach, combining several matchers on several attributes to derive a match decision for every pair of entities. The effectiveness of their approach depends on the product category, since not all categories have product codes. Furthermore, their approach depends on submitting queries to a web search engine, and therefore is not scalable. Our method does not provide a specific strategy to extract product codes like theirs; however, if a code exists, it is naturally extracted and embedded in the rules, and may contribute to improve the effectiveness of our method.

The approaches in [13] and [21] also match product offers against product offers; however, they are not scalable, since they need to submit queries to a web search engine to enrich the descriptions of products and other entities before matching them. They also did not evaluate their approaches using a large amount of data containing a large number of classes.

The product matching problem was evaluated by [18] using several entity resolution approaches, and they concluded that it is not sufficiently solved with conventional approaches based on the similarity of attribute values. Furthermore, the learning-based approaches, which obtained the best results, are not scalable even for a small number of classes and instances.

Entity resolution aims at identifying equivalent entities or duplicates within a data source or between data sources. Some surveys and tutorials on entity resolution can be found in [10], [17], [20], [12], and [14]. A web-based entity resolution approach to treat product matching and bibliographic data was proposed by [25].

Our work uses association rules, a data mining technique that can find relationships among item sets in a dataset [3]. It is based on associative classification, which combines association rules and classification to build a classification model. Several associative classification algorithms have been proposed in recent years. A recent survey can be found in [30]. The algorithms vary according to their different methodologies in rule learning, ranking, pruning, and prediction procedures. The main differences between our work and the others are in the generation of rules and in the way we use the support and confidence concepts. Support is measured only among the instances of the same class, rather than in a global context. Confidence does not measure the accuracy of a rule; instead, we use only rules with 100% confidence. As a result, we generate a small number of rules when compared to works that use confidence to rank rules, and we propose an efficient data structure, based on an inverted index, to assist in generating rules. Such a form of using confidence may be applied to other contexts where implicit codes are important pieces of evidence for classification.

Association rules were also used in [28] to disambiguate author names in bibliographic citations. They proposed supervised learning methods where tokens in coauthor names, work titles, and publication venue titles are used as features to train classifiers that exploit rules associating citation features to specific authors.
Association rules were also used to classify documents in [29]. Their approach combines textual features of documents and links to form an associative classifier. Both works explore specific features of their applications, author name disambiguation and document classification. They also use confidence to measure the accuracy of a rule. Our method uses only the tokens in a text; it does not use other evidence, such as links or prices, so it may be applied to disambiguate other types of entities described by text only.

Our work is an extension of the work in [24]. They proposed a method that uses association rules to disambiguate publication venue titles originating from bibliographic citations. The disambiguator is a supervised learning method that uses a publication venue authority file [11] to train a classifier, whose generated model is a set of association rules to identify publication venues. In this work, we changed the strategies of their algorithm and promoted a complete evaluation of its effectiveness and efficiency on the product matching problem. We added the support of a rule, which decreases the number of rare tokens that generate bad rules in the classification model. In the prediction phase, we do not need to generate all itemsets; the new vote schema prioritizes short rules and stops generating itemsets as soon as it finds a decision. We also proposed an alternative strategy that decreases the number of tokens to be combined. All these changes were experimentally evaluated, demonstrating improvement in the effectiveness and efficiency of the new algorithm.
3 Our Method for Classifying Product Offers

The product offer matching problem may be seen as follows. Given a set of product offer descriptions, originated from web stores, the objective is to map them into real classes of products known by the e-shopping site, as illustrated in Figure 3 (a). Our solution to this problem uses a supervised machine learning technique, as illustrated in Figure 3 (b). It uses a set of manually classified product offers to train an associative classifier to predict the class of other unclassified product offers. In the training phase, the classified product offer descriptions are tokenized, cleaned, and indexed. Sets of tokens (itemsets) are generated from the tokens of each product offer description, and they are used to generate association rules that relate them to the correct product offer class. In the prediction (or test) phase, each unclassified product offer description is also tokenized, cleaned, and has its sets of tokens generated. The classification module uses the learning model created in the training phase to produce the product offer matching results. The following sections detail each one of these steps.

Figure 3: (a) The product offer matching problem (b) Our solution to the problem
3.1 Problem Formulation

The task of classifying product offers may be formulated as follows. Let P = {p1, p2, ..., pk} be a set of product offers. Each product offer pi has a list of attributes, such as a product description, product price, and the name of the offering store. Let C = {c1, c2, ..., cl} be a set of l classes, with their respective labels; in this case, a set of categories of product offers. The objective is to produce a classification function, which maps each product offer pi into one of the predefined classes of the set C.
Our proposal for solving the classification problem uses a supervised machine learning technique. In this case, we are given an input dataset, called the training data and denoted as D, which consists of examples of product offer instances for which the correct product offer class is known. Each instance generates a set of m features {f1, f2, ..., fm}. These features are tokens (words), extracted from string attributes, such as the product offer description or name. The training data is used to produce a learning model, using association rules, which relates the features in the training data to the correct product offer class. The test data, denoted as T, consists of a set of product offers for which the features are known, while the correct product offer class is unknown. The learning model, which is a function that maps a set of features {f1, f2, ..., fm} to a class ci ∈ C, is used to predict the correct product
offer class in the test set.

The learning function uses an associative classifier, which exploits associations among tokens in the product offers that uniquely identify each class. Such associations are uncovered using rules of the form X → ci, where X ⊆ {f1, f2, ..., fm} and ci ∈ C. For example, the product offer description "Officejet J3680 All-in-One Printer, Fax, Scanner, Copier, HEWCB071A", which belongs to the c1 class, whose product is Officejet J3680, may produce the two association rules {J3680} → c1 and {HEWCB071A} → c1, which indicate that the set of tokens {J3680} and the set {HEWCB071A} both uniquely identify the product offers of the c1 class (Officejet J3680).
In order to produce association rules that uniquely identify each class, the associative classifier only learns rules that have a confidence of 100%. According to [3], a rule X → Y holds in the dataset D with confidence c if c% of the instances in D that contain X also contain Y. Subsequently, the generated model does not contain rules X → ci and X → cj such that i ≠ j. This strategy is not perfect, since it may not produce rules for all classes. Such a situation occurs when the sets of tokens in all product offers of a class are contained in some sets of tokens of distinct classes. In this case, no rule is generated for that class. To handle such a situation, that is, to predict the class of an instance for which no rule is found in the learning model in the test phase, our method uses another strategy, to be explained later.

The associative classifier also checks the support of a rule. The rule X → ci has support s in D if s% of the instances of the class ci in D contain X. Differently from the original concept presented in [3], the support of a rule is measured only among the instances of the same class, rather than in a global context. The aim of the support of a rule is to decrease the number of rare tokens among instances of a class that generate bad rules in the classification model. The traditional concept of support [3] is not adequate for our method because it could lose important tokens that occur only once or a few times in the dataset.

Table 1 presents our main mathematical notations.

Table 1: Mathematical notations
P : Set of product offers, P = {p1, p2, ..., pk}
C : Set of classes, C = {c1, c2, ..., cl}
D : Training data, a set of product offers for which the correct class ci is known
T : Test data, a set of product offers for which the correct class ci is unknown
{f1, f2, ..., fm} : Set of features (tokens) extracted from product offer descriptions
X → ci : An association rule, where X ⊆ {f1, f2, ..., fm} and ci ∈ C
dj,i : String describing the j-th product offer instance for the class ci
Rci^{dj,i} : Set of rules X → ci predicting the class ci, originated from the product offer description dj,i (X ⊆ dj,i)
Rci : Set of rules X → ci predicting the class ci
R : Set of all rules generated by the learning model, Rci^{dj,i} ⊆ Rci ⊆ R
k-itemset : A set containing k items (features, tokens)
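Stated compactly under the notation of Table 1, with class(d) denoting the class of an instance d (this is only a restatement of the two definitions above, not a new formulation):

conf(X → ci) = |{d ∈ D : X ⊆ d and class(d) = ci}| / |{d ∈ D : X ⊆ d}|

supp(X → ci) = |{d ∈ D : X ⊆ d and class(d) = ci}| / |{d ∈ D : class(d) = ci}|

The learning model keeps only rules with conf(X → ci) = 100% and supp(X → ci) ≥ ms, where ms is the minimum support parameter of Algorithm 1.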
3.2 Training Phase
The training data D can be provided by a data source, such as an e-shopping database, containing product offer descriptions. In the input data, there are one or more instances for each distinct class. Each input instance in D is a string describing a product offer for a class ci, represented by dj,i, a description j for the class ci. Let Rci^{dj,i} be the set of rules X → ci, where X ⊆ dj,i (dj,i contains all features in X). That is, Rci^{dj,i} is composed of rules predicting the class ci, originated from the product offer description dj,i. Let Rci be the set of rules predicting the class ci, and let R be the set of all rules generated by the learning
model. Then, Rci^{dj,i} ⊆ Rci ⊆ R.

Algorithm 1 shows the steps of the training phase. It receives as input a set of product offers for which the classes are known, and a minimum support for generating rules, and it returns a set of association rules that have a confidence of 100% and the minimum support to predict product offer classes. The algorithm's first step (Lines 1-6) inserts the distinct tokens of the instances into an inverted index structure [4]. This structure is composed of key-value pairs, where the key is a token and the value is an occurrence list of this token, containing, in each position, the class ci and its specific description identification j. The construction of this structure is performed by the InsertInvertedIndex function. The Tokenize function splits a description into tokens. Before being tokenized, each string is preprocessed, removing punctuation marks, symbols such as ()[]{}, and stopwords (articles, prepositions, and conjunctions), and converting letters to lowercase. Strings with hyphens (-) and slashes (/) are processed differently. The processing uses (-) or (/) in two ways, as a separator and as an aggregator, yielding up to three or more tokens. For example, processing the string DSC-W730 generates the following tokens: dsc, w730, and dscw730.
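For concreteness, the preprocessing just described can be sketched as follows (a minimal illustration in Python; the stopword list shown is illustrative, not the one used in our experiments):

import re

STOPWORDS = {"the", "a", "an", "of", "and", "or", "for", "in", "to"}  # illustrative list

def tokenize(description: str) -> list[str]:
    """Split a product offer description into lowercase tokens,
    removing punctuation and stopwords; hyphens and slashes act
    both as separators and as aggregators."""
    s = description.lower()
    # remove punctuation marks and bracket symbols, keeping - and / for special handling
    s = re.sub(r"[()\[\]{},.;:!?\"']", " ", s)
    tokens = []
    for word in s.split():
        if "-" in word or "/" in word:
            parts = re.split(r"[-/]", word)
            tokens.extend(p for p in parts if p)          # separator role
            tokens.append(re.sub(r"[-/]", "", word))      # aggregator role
        else:
            tokens.append(word)
    return [t for t in tokens if t and t not in STOPWORDS]

print(tokenize("DSC-W730"))  # ['dsc', 'w730', 'dscw730']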
Algorithm 1 Training Phase
Require: Examples for training D
Require: Minimum support ms
Ensure: The set of rules R
1: for each instance dj,i ∈ D do
2:   S0 ← Tokenize(dj,i)
3:   for each token tk ∈ S0 do
4:     InsertInvertedIndex(tk, j, ci)
5:   end for
6: end for
7: R ← ∅
8: for each instance dj,i ∈ D do
9:   S0 ← Tokenize(dj,i)
10:  m ← Length(S0)
11:  for (k ← 1; k ≤ m; k++) do
12:    Sk ← GenItemSets(k, Sk−1)
13:    for each k-itemset it ⊆ Sk do
14:      if SizeOccurrenceList(it) = 1 then
15:        // 100% confidence rule
16:        if SupportRule(it → ci) ≥ ms then
17:          InsertRule(it → ci, R)
18:        end if
19:        RemoveItemSet(it, Sk)
20:      end if
21:    end for
22:  end for
23: end for
24: return R
The second step of the algorithm (Lines 7-23) is an iterative process to create association rules. The GenItemSets function generates sets of items with k tokens (k-itemsets), combining the items with k−1 tokens obtained from the previous iteration. An algorithm to combine item sets is presented in [3]. Each k-itemset is searched in the inverted index, and if it occurs in only one class and has the minimum support, a rule is created containing the k-itemset as antecedent and the class id as consequent (Lines 12-17). The SizeOccurrenceList function returns the number of distinct classes in which a k-itemset appears in some product offer description. This function searches the inverted index, using each token in a k-itemset as a key and retrieving its occurrence list. When k > 1, it performs an intersect operation on the occurrence lists of each token to find the result. If the size is equal to 1, the rule formed by the k-itemset and its class has a confidence of 100%. The algorithm also checks for the minimum support of a rule by using the SupportRule function. If a rule also meets the minimum support, it is inserted into the set of rules by the InsertRule function.

If a k-itemset forms a 100% confidence rule, then any l-itemset, l > k, that includes this k-itemset also does. Then, the algorithm uses a pruning strategy, to avoid combining this k-itemset in the next iteration (RemoveItemSet function in Line 19).
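The training phase can be rendered compactly as follows (an illustrative Python sketch of Algorithm 1, not the implementation evaluated in Section 5; the maximum itemset size of 3 anticipates the pruning strategy of Section 4, and min_support is given as a fraction):

from collections import defaultdict
from itertools import combinations

def train(instances, min_support, max_k=3):
    """instances: list of (tokens, class_id) pairs, already tokenized.
    Returns the rule set as a dict {frozenset itemset: class_id}."""
    # Step 1: inverted index, token -> set of (class_id, instance_id)
    index = defaultdict(set)
    class_sizes = defaultdict(int)
    for j, (tokens, ci) in enumerate(instances):
        class_sizes[ci] += 1
        for tk in set(tokens):
            index[tk].add((ci, j))

    rules = {}
    for tokens, ci in instances:
        items = sorted(set(tokens))
        pruned = set()  # itemsets that already formed a 100% confidence rule
        for k in range(1, min(len(items), max_k) + 1):
            for itemset in combinations(items, k):
                fs = frozenset(itemset)
                if any(p <= fs for p in pruned):
                    continue  # a sub-itemset already formed a rule
                # occurrence list: instances containing every token of the itemset
                occ = set.intersection(*(index[t] for t in itemset))
                classes = {c for c, _ in occ}
                if classes == {ci}:              # 100% confidence
                    pruned.add(fs)               # do not extend this itemset
                    support = len(occ) / class_sizes[ci]   # per-class support
                    if support >= min_support:
                        rules[fs] = ci
    return rules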
Example 1. Table 2 illustrates an example of training data, where the class c1, whose product is Officejet J3680, has 4 instances. Table 3 shows the rules generated from the training data of Table 2. The rules are organized according to the minimum support used to generate them. Observe that when the minimum support is increased, the number of rules is reduced, in an attempt to eliminate tokens that do not contribute to the identification of product offers. Notice that there is no rule for class c2 (Officejet Pro 8000), since the tokens of Instance 5 form a subset of the tokens of Instance 6 of class c3 (Officejet Pro 8000 Wireless). They are in two distinct classes because one is a wireless printer and the other is not. Moreover, notice that our method does not provide special treatment for synonyms, but they may be uncovered naturally. For example, in the class c1, J3680, HEWCB071A, and CB071A are synonym codes, and they generated rules for the same class.
Table 2: An example of training data

Class c1 (Officejet J3680):
  #1: HP Officejet J3680 All-in-One Printer, Fax, Scanner, Copier
  #2: Officejet J3680 All-in-One Printer, Fax, Scanner, Copier, HEWCB071A
  #3: HP Officejet J3680 All-in-One - multifunction ( fax / copier / Scanner ) HEWCB071A
  #4: HP Officejet J3680 All-in-One Printer CB071A Hewlett-Packard

Class c2 (Officejet Pro 8000):
  #5: HP Officejet Pro 8000 Printer

Class c3 (Officejet Pro 8000 Wireless):
  #6: Hewlett Hp Officejet Pro 8000 Wireless Printer Cb9297a#b1h

Class c4 (Officejet Pro K5400):
  #7: HP Officejet Pro K5400 Color Printer

Table 3: Rules generated from the training data of Table 2
Class c1 (Officejet J3680), instances 1-4:
  Support 0%: fax → c1; allinone → c1; packard → c1; multifunction → c1; j3680 → c1; hewlettpackard → c1; copier → c1; cb071a → c1; hewcb071a → c1; scanner → c1
  Support 50%: fax → c1; allinone → c1; j3680 → c1; copier → c1; hewcb071a → c1; scanner → c1

Class c2 (Officejet Pro 8000), instance 5:
  Support 0%: no rules
  Support 50%: no rules

Class c3 (Officejet Pro 8000 Wireless), instance 6:
  Support 0%: cb9297a#b1h → c3; wireless → c3; {hewlett, pro} → c3; {8000, hewlett} → c3
  Support 50%: cb9297a#b1h → c3; wireless → c3; {hewlett, pro} → c3; {8000, hewlett} → c3

Class c4 (Officejet Pro K5400), instance 7:
  Support 0%: k5400 → c4; color → c4
  Support 50%: k5400 → c4; color → c4

3.3 Test Phase
The test data T is composed of a set of product offer descriptions. Algorithm 2 shows the details of the test phase, for predicting the class of a description d ∈ T. First, d, of size m, is tokenized and its k-itemsets, 1 ≤ k ≤ m, are generated by an iterative process (Lines 1-5). Second, each itemset is matched against the antecedents of the rules R in the learning model. All rules whose antecedents match any itemset form the set of candidate rules, Rd (Lines 6-10). Third, using a vote schema, the consequent of each rule in Rd is counted, and the class ci ∈ C, if any, with the highest count is chosen as the class of the product offer description d (Lines 11-17). Notice that the algorithm prioritizes shorter rules, with fewer antecedents, which usually include product codes.

In the case of a tie in the voting for all itemsets, or if no rule is found in R whose antecedent matches the itemsets from d, our method uses a similarity function (e.g., Jaccard [15] or Cosine [26]) to choose the class of d. In this case, each product offer description used in the training phase is compared with the string d, and the class corresponding to the string with the highest similarity is chosen as the class of the product offer description d (Line 19).

Algorithm 2 Test Phase
Require: R, D, d ∈ T
Ensure: The predicted class of d
1: S0 ← Tokenize(d)
2: m ← Length(S0)
3: for (k ← 1; k ≤ m; k++) do
4:   Rd ← ∅
5:   Sk ← GenItemSets(k, Sk−1)
6:   for each k-itemset it ⊆ Sk do
7:     for each X → c ⊂ R such that it = X do
8:       Rd ← Rd ∪ X → c
9:     end for
10:  end for
11:  for each r ∈ Rd in the form X → c do
12:    c.count++
13:  end for
14:  pc ← ci such that ci.count > cj.count ∀ j ≠ i
15:  if pc ≠ ∅ then
16:    return pc // the predicted class of d
17:  end if
18: end for
19: return PredictBySimilarity(D, d) // tie in voting or no rule
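A sketch of the test phase in the same style (illustrative only; the fallback shown uses Jaccard, while Cosine may equally be used, as noted above):

from itertools import combinations

def predict(rules, training_instances, d_tokens, max_k=3):
    """Predict the class of a tokenized description with the rule set
    produced by train(); falls back to a similarity function on ties
    or when no rule matches."""
    items = sorted(set(d_tokens))
    for k in range(1, min(len(items), max_k) + 1):
        votes = {}
        for itemset in combinations(items, k):
            ci = rules.get(frozenset(itemset))
            if ci is not None:
                votes[ci] = votes.get(ci, 0) + 1
        if votes:
            ranked = sorted(votes.items(), key=lambda kv: -kv[1])
            if len(ranked) == 1 or ranked[0][1] > ranked[1][1]:
                return ranked[0][0]   # unique winner at this itemset size
    return predict_by_similarity(training_instances, d_tokens)

def predict_by_similarity(training_instances, d_tokens):
    """Jaccard fallback: class of the most similar training description."""
    d = set(d_tokens)
    def jaccard(tokens):
        return len(d & set(tokens)) / len(d | set(tokens))
    tokens, ci = max(training_instances, key=lambda inst: jaccard(inst[0]))
    return ci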
Example 2. Table 4 illustrates an example of three instances in test data, and the rules from the training data of Table 3 that match the itemsets generated by the test strings. For Instance #1, when k = 1 and the minimum support is equal to 0%, all rules found in the training model indicate c1 as the correct class, and by voting it is the chosen class (Lines 14-16 of Algorithm 2). For Instance #2, when k = 1 and the minimum support is equal to 0%, a tie occurs between the rules of the classes c1 and c4. In this case, the algorithm is not able to classify the test instance using k = 1, and so it generates itemsets of size k = 2. For this size, only the rule {hewlett, pro} → c3 is present, which incorrectly classifies the test instance as belonging to the class c3. However, when the minimum support is equal to 50%, the class c4 is correctly chosen by the algorithm using k = 1. Furthermore, for Instance #3, there is no rule in the training model that matches this string for any k-itemset. In such a case, the decision is taken with the use of a similarity metric, such as Cosine (Line 19 of Algorithm 2).

Table 4: An example of test data and the rules from Table 3 that match the itemsets generated by the test instances
Instance #1: Hewlett-Packard - HP Officejet J3680 All-in-One
  Rules found (support 0%): hewlettpackard → c1; allinone → c1; packard → c1; j3680 → c1
  Rules found (support 50%): allinone → c1; j3680 → c1

Instance #2: Hewlett-Packard OfficeJet Pro K5400 Color Inkjet Printer
  Rules found (support 0%): hewlettpackard → c1; packard → c1; k5400 → c4; color → c4; {hewlett, pro} → c3
  Rules found (support 50%): k5400 → c4; color → c4

Instance #3: HP OfficeJet Pro 8000 Printer
  Rules found (support 0%): none
  Rules found (support 50%): none

3.4 Alternatives to the Basic Algorithm
In addition to the algorithm described previously, we propose and evaluate two other alternative strategies, which constitute small variations of the basic algorithm.

Alternative 1: This involves limiting the number of tokens in product offer description strings. Our hypothesis is that the most interesting rules are formed by the first tokens in the strings, which usually describe the title and code of the product offers. The remaining tokens usually describe their technical characteristics, which are less important for the identification of the product offers. Sometimes, implicit codes are also inserted at the end of the strings. This way, we can use only a fixed number of tokens at the beginning and the end of the strings. Such a strategy also reduces the number of tokens that must be combined to form itemsets, which constitutes a pruning strategy, discussed in Section 4. In our experiments, based on the observation of our datasets and after experimenting with several different numbers of tokens, we used the first ten and the last three tokens of each input instance, in both the training and test phases.

Alternative 2: This alternative involves choosing the class of the rule containing the token that occurs closest to the beginning of a product offer description string, in the case of a tie among rules in the test phase of the algorithm for itemsets of size equal to 1. Our hypothesis is that if a test instance contains an implicit code, such a code occurs in most cases near the beginning of the string. For example, consider a training model that contains the rules {5200tn → cx} and {prints → cy}. Consider also a situation where the instance "HP laserjet 5200tn printer trade compliant up to 35 ppm prints" in the test phase, for which the correct class is cx, matched only the itemsets {5200tn} and {prints} to the training rules, for itemsets of size equal to 1. Such an instance contains an implicit code, 5200tn. For this strategy, the tiebreaker between the classes cx and cy would be resolved in favor of cx, because the token 5200tn occurs before the token prints in the test instance.
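Alternative 1 amounts to truncating the token list before itemsets are generated; a one-function sketch, with the ten-plus-three cut used in our experiments exposed as parameters:

def limit_tokens(tokens, head=10, tail=3):
    """Alternative 1: keep only the first `head` and last `tail` tokens
    of a description, where titles and implicit codes usually appear."""
    if len(tokens) <= head + tail:
        return tokens
    return tokens[:head] + tokens[-tail:]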
4 Computational Complexity

In the training phase, the time complexity of our method is dominated by the number of product offer description strings to be trained and the number of tokens in these strings. In the worst case, all tokens in each string need to be combined to form the itemsets. Let n be the number of product offer description strings, and let m be the average number of tokens in these strings. Then, the time complexity of the method is O(n * 2^m).
In order to reduce the execution time, we adopt some pruning strategies. The first is to avoid generating a rule whose antecedent is a superset of the antecedent of another rule; that is, the algorithm avoids continuing to combine a k-itemset that generates a 100% confidence rule (Line 19 of Algorithm 1). For example, consider an instance formed by the tokens {A, B, C, D, E}. Table 5 shows the itemsets pruned when the 1-itemset {D} forms a rule. The itemsets of sizes from 2 to 5 containing {D}, marked with an asterisk (*), are not generated by the algorithm.

Table 5: Example of a pruning strategy; the itemsets marked with an asterisk (*) are not generated when the 1-itemset {D} forms a rule

Size of the k-itemset
k=1: A, B, C, D, E
k=2: AB, AC, AD*, AE, BC, BD*, BE, CD*, CE, DE*
k=3: ABC, ABD*, ABE, ACD*, ACE, ADE*, BCD*, BCE, BDE*, CDE*
k=4: ABCD*, ABCE, ABDE*, ACDE*, BCDE*
k=5: ABCDE*
The second pruning strategy uses the minimum support of a rule. A k-itemset that does not occur in a minimum number of product offer descriptions of the same class is removed, to avoid combining it to form (k+1)-itemsets. The GenItemSets function of Algorithm 1 (Line 12) can be modified to include only itemsets that reach the minimum support ms for at least one class.
one class. The third pruning strategy limits the size The stop condition,
max,
where
max
k ≤ m,
k
of the itemsets to be generated.
in Line 11 of Algorithm 1, is changed to
k ≤
is the maximum size of the itemsets to be generated. In our
21
experiments, we observe that the most interesting rules are found by using the shortest itemsets, in particular when product oers contain implicit identiers. Figure 4 illustrates the results for the micro-F1 metric, and the percentage of product oers classied by using only rules (for further explanation see Section 5.4), according to variations in the maximum size of itemsets for an electronics and informatics dataset.
Notice that for itemsets of sizes greater than
4,
the
results are stable.
Figure 4: Evolution of the results for the micro-F1 metric and the percentage of product offers classified by using only rules, according to variations in the maximum itemset size.
The fourth pruning strategy limits the number of tokens in product offer description strings. It is the Alternative 1 discussed in Section 3.4. In this strategy, the number of tokens in the input strings, m, can be considered a constant, and the time complexity of the method becomes linear, O(n).

The same complexity analysis can be done for the test phase of the algorithm, and the third and fourth pruning strategies are also applicable there. Notice that a search in the set of rules, R, in the learning model can be done in O(1) by using a hash table to keep the rules.
5 Experimental Evaluation

In this section, we describe our experiments, carried out to evaluate the feasibility of using association rules to classify product offers.
5.1 Datasets

We evaluate our method using distinct datasets, described in the following.

UOL datasets - obtained by crawling the Shopping UOL e-commerce site (http://shopping.uol.com.br/). On this site, product offers are hierarchically organized by categories, and we use the product-level categories (the leaf level of the hierarchy) to evaluate our classifier. We collected data from classes that had at least two product offers linked to them. These data include, for each instance, its product offer description, price, and store. We use only the product offer description. The gathering of this data was carried out in February 2014. In order to evaluate different facets of our method, we divided this collection into three distinct datasets: (i) the UOL-electronics dataset, formed of the product categories mobile phone, phone, electronics and informatics, and appliances, where product offers usually have an implicit, built-in product code in their descriptions; (ii) the UOL-non-electronics dataset, formed of the product categories perfumery and cosmetics, fashion accessories and jewelry, toys and games, sport and fitness, and babies and children, where product offers usually do not have an implicit product code; and (iii) the UOL-book dataset, formed of the book category, which also does not exhibit implicit product codes.

Printer dataset - composed of printer descriptions, obtained by querying Google Shopping, and used by [25].

Abt-Buy and Amazon-Google datasets - composed of product descriptions, obtained from Abt.com and Buy.com, and from Amazon and Google Shopping, respectively. They were used by [18] to evaluate entity resolution approaches. We adapted them to form classes of products instead of matching pairs.

Table 6 presents the following statistics on our datasets: the number of product offers, the number of classes, and the average and range of the number of tokens per instance.

Table 6: Statistics on the datasets
Dataset              #Offers   #Classes  #Tokens (average)  #Tokens (range)
UOL-electronics      9,552     2,218     12.6               2-35
UOL-non-electronics  26,640    5,299     6.4                1-28
UOL-book             385,797   93,886    5.7                1-33
Printer              2,167     157       7.5                1-16
Abt-Buy              1,097     1,075     8.6                2-37
Amazon-Google        1,300     1,105     7.3                1-29

5.2 Evaluation Metrics
In order to evaluate the quality of our classifier, we employed the metrics micro-average and macro-average F1. The F1 measure is defined as:

F1 = 2rp / (r + p),

where p is the precision of the classifier, and r is its recall. The micro-average F1, or simply micro-F1, corresponds to a global F1 value obtained by computing precision and recall over all classes. Micro-F1 is also known as accuracy; that is, the fraction of the test instances assigned to their correct classes by the classifier. The macro-average F1, or simply macro-F1, is computed by averaging F1 across all classes [4].
We used a stratified ten-fold cross-validation technique to measure the experimental results; that is, each dataset was randomly divided into 10 folds, ensuring that each class was proportionally represented in each fold, and the experiments were run 10 times, each time using 9 folds for training and a distinct fold for testing. The results reported in this document are an average of the results among the 10 runs. The same strategy was also applied to the baseline methods, each time using the same sample used by our method for training and testing.
5.3 Baselines

We compared our method against three baselines: a Jaccard [15], a Cosine [26], and an SVM (Support Vector Machine) [27] based method. Jaccard and Cosine are well-known string similarity functions, and SVM is one of the best text classifiers [4].

In the Jaccard based method, the Jaccard coefficient similarity is used to compare each instance in the test dataset against each instance in the training dataset. The class of the instance in the training dataset showing the highest similarity with each tested instance is chosen to be the class of the tested instance.

In the Cosine based method, we used the same strategy as in the Jaccard method, except that the Cosine similarity is used to compare the instances. Each instance is represented as a vector of token weights, using the distinct tokens in the training dataset. The token weights are computed as the inverse document frequency (IDF).

In the SVM based method, each product is associated with a class and an SVM classifier for that class is trained. Each instance is represented as a feature vector, with each token corresponding to a feature and its IDF value being the feature weight. For the experiments, we used the package LibLinear (http://www.csie.ntu.edu.tw/~cjlin/liblinear/).
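A sketch of the Cosine baseline as described (illustrative Python; training instances are token lists paired with class ids):

import math
from collections import Counter

def idf(training):
    """IDF weights over the distinct tokens of the training dataset."""
    n = len(training)
    df = Counter(t for tokens, _ in training for t in set(tokens))
    return {t: math.log(n / df[t]) for t in df}

def cosine(a, b, w):
    common = set(a) & set(b)
    dot = sum(w.get(t, 0.0) ** 2 for t in common)
    na = math.sqrt(sum(w.get(t, 0.0) ** 2 for t in set(a)))
    nb = math.sqrt(sum(w.get(t, 0.0) ** 2 for t in set(b)))
    return dot / (na * nb) if na and nb else 0.0

def cosine_baseline(training, test_tokens):
    """Class of the training instance most similar to the test instance."""
    w = idf(training)
    tokens, ci = max(training, key=lambda inst: cosine(inst[0], test_tokens, w))
    return ci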
5.4 Results and Discussions

We ran a set of experiments to evaluate our basic method, as well as its variants and pruning strategies. In the following, we describe each one of these experiments and discuss their results.

Maximum Itemset Size

There are some situations where no rule is found in the training model that matches the itemsets generated by a test string, or where there is a tie among the classes of the rules found. In these cases, our method needs to use another similarity function to decide the class of the test instance. In this experiment, we want to evaluate a pruning strategy related to the generation of rules, so we consider only those test instances whose class could be uncovered by using only rules, without using another similarity function. Situations where our method needs an auxiliary similarity function are evaluated in other experiments in this section.

In Section 4, we discussed some pruning strategies to improve the execution time of our algorithm. The third pruning strategy was about limiting the size of the itemsets to be generated. This experiment evaluates such a pruning strategy. Figure 4 illustrates the results of the micro-F1 metric, and the percentage of product offers classified by using only rules, according to variations in the maximum size of the itemsets, for the UOL-electronics dataset. This experiment uses a minimum support equal to 30%.

Using only itemsets of size 1, our method obtains the highest micro-F1 result for classification; however, it classifies only 66.73% of the product offers in the test dataset. As the maximum size of the itemsets increases, the number of classified product offers also increases, but the results for micro-F1 decrease. For itemset sizes higher than 3, both results remain stable. We observe a similar behavior in the experiments with the other datasets, which is what motivated the proposal of our pruning strategy. That is, we can limit the size of the itemsets to be generated, thereby gaining in efficiency, without losing effectiveness. In the experiments presented next, we adopted a maximum itemset size equal to 3.
Product Code Identification

Our hypothesis was that our method could identify implicit product codes in offer descriptions. In the UOL-electronics dataset, most of the descriptions have an implicit product code that could be used to identify their classes. Most of the product codes are formed of one or two tokens. Analyzing Figure 4, we can see that for itemsets of sizes 1 and 2, our method can classify up to 84.74% of the product offers. Looking in detail at the results, we observe that most of the hits were due to rules formed by product codes. However, our method was not able to classify more product offers by using rules formed by product codes, due to noisy tokens that occur in the descriptions. By noisy tokens, we denote those tokens that are rare in the collection of product offers and are considered as product identifiers by our method. Examples of noisy tokens can be seen in Tables 2, 3, and 4. For a support equal to 0%, the test instance #2 of Table 4 is incorrectly classified, due to the noisy tokens hewlettpackard and packard that point to class c1, while the correct class is c4, provided by the rule formed by the product code k5400. In this case, the presence of noisy tokens resulted in a tie among classes, and the product code was not able to classify the product offer. In the case of a tie among classes, the classification is performed by using other tokens in itemsets of higher size, or the instance is not classified by rules.
Minimum Support

In this experiment, we evaluated the influence of the minimum support for generating rules on the quality of the classification results. Figure 5 illustrates the evolution of the results for the macro-F1 and micro-F1 metrics, and the percentage of product offers classified by using only rules, according to variations in minimum support. As the minimum support increases, the quality of the classification also increases, but the percentage of classified product offers decreases. For a minimum support higher than 40%, the results for all quality metrics remain stable or decrease. We observe a similar behavior in the experiments with the other datasets. Aiming at a trade-off between the quality of classification and the number of product offers classified by using only rules, the experiments suggest using a minimum support of around 30%. In the experiments presented next, we adopted a minimum support equal to 30%.

The minimum support also has an influence on the efficiency of the method, since a k-itemset that does not reach the minimum support is not used to generate (k+1)-itemsets. This is our second pruning strategy, as discussed in Section 4.

Figure 5: Evolution of the results for the macro-F1 and micro-F1 metrics, and the percentage of product offers classified by using only rules, according to variations in minimum support.
Comparison with Baselines

This experiment compares our method against the baselines for classifying product offers. For those test instances where our method was not able to decide their classes by using rules, it used the SVM-based method as a similarity function. Table 7 presents the results on the UOL-electronics, UOL-non-electronics, and Printer datasets, using the micro-F1 and macro-F1 metrics. Our method obtains the best numbers on all datasets for all metrics. Statistically, considering a 95% confidence level, we can state that our method is superior to all baselines on the UOL-electronics and UOL-non-electronics datasets, and tied with SVM on the Printer dataset.

Comparing the quality of the results among the three datasets, the more instances and classes a dataset contains, the worse the results were on it. Our method obtained the highest gains on the UOL-non-electronics dataset, which contains more instances and classes.

Table 7: Results comparing our method against the baselines (%)

            UOL-electronics     UOL-non-electronics    Printer
Method      MicF1   MacF1       MicF1   MacF1          MicF1   MacF1
Our method  88.88   87.25       84.75   81.96          98.34   97.86
SVM         86.81   84.54       82.53   78.90          98.20   97.72
Cosine      84.54   82.03       80.40   76.97          90.40   89.22
Jaccard     77.52   74.10       77.54   73.30          81.63   77.65
We also evaluated the performance of our method for predicting the class of a product offer by using only rules. That is, when no rule is found in the training model to predict the class of a test instance, or there is a tie among classes for all itemsets, these test instances were not considered in this experiment. In such cases, our method would use an auxiliary similarity function to decide the class of the instances. As shown in Figures 4 and 5, our method was able to classify 87.28% of the test instances by using only rules in the UOL-electronics dataset, for a maximum itemset size equal to 3 and a minimum support equal to 30%. The test instances classified by our method by using only rules were given to our baselines to be classified. Table 8 presents the results on the UOL-electronics, UOL-non-electronics, and Printer datasets, using the micro-F1 and macro-F1 metrics.

Table 8: Results comparing our method against the baselines for the test instances classified by using only rules by our method (%)

            UOL-electronics     UOL-non-electronics    Printer
Method      MicF1   MacF1       MicF1   MacF1          MicF1   MacF1
Our method  92.49   91.37       92.09   90.53          99.71   99.53
SVM         90.15   88.48       88.89   86.88          99.56   99.42
Cosine      87.55   85.62       87.56   85.68          91.85   91.20
Jaccard     80.25   77.42       84.04   81.48          83.36   79.92
As in the previous experiment, our method is statistically superior to all baselines on the UOL-electronics and UOL-non-electronics datasets, and tied with SVM on the Printer dataset. This experiment demonstrates that when our method is able to classify an instance using rules, it is usually better than the baselines. However, the number of test instances classified using rules varies by dataset. For the UOL-electronics and Printer datasets, in which most instances contain an implicit product code, our method classified 87.28% and 93.98% of the test instances, respectively, while for the UOL-non-electronics dataset it classified 69.22%. This result is an indication that we need to investigate ways to increase the number of instances classified by rules, or to improve the results of those classified. In the following experiments, we investigated other variants of our basic method.
Alternatives to the Basic Algorithm

In Section 3.4, we described two alternative strategies on top of our basic algorithm. In this experiment, we evaluate them. Table 9 presents the results on the UOL-electronics, UOL-non-electronics, and Printer datasets, using the micro-F1 and macro-F1 metrics. "Our method" corresponds to our basic method; Alternative 1 and Alternative 2 correspond to the alternative strategies with the same names as described in Section 3.4.

Table 9: Results comparing our basic method against its alternative strategies (%)

               UOL-electronics     UOL-non-electronics    Printer
Method         MicF1   MacF1       MicF1   MacF1          MicF1   MacF1
Our method     88.88   87.25       84.75   81.96          98.34   97.86
Alternative 1  88.46   86.83       84.67   81.93          98.34   97.86
Alternative 2  89.17   87.58       84.77   81.99          97.88   97.28
The results are statistically similar. However, Alternative 1 is particularly interesting when the number of tokens in product offer descriptions is high. It limits the number of tokens to be combined to generate itemsets, which improves the execution time. In this experiment, we set a limit of 13 tokens. On the UOL-electronics dataset, which has the highest number of tokens per instance, the execution time of Alternative 1 was six times faster than that of the basic algorithm in the training phase. Alternative 2 may also improve the execution time, since the decision in the case of a tie among rules is made by using itemsets of size 1, avoiding the generation of larger itemsets. However, in our experiments, this alternative had an execution time similar to the basic algorithm, because the number of decisions taken due to ties among rules was low.
Training Using Few Instances per Class

Several entity resolution approaches for match problems were evaluated by [18]. For matching product entities from online shopping, none of the approaches obtained good results. In this experiment, we used their datasets to evaluate our method and our baselines. Our results are not comparable with theirs, because the problem formulation is different. We adapted their datasets to our classification task, as described in Section 5.1, for the Abt-Buy and Amazon-Google datasets.

We executed four experiments on these datasets: (1) training with the Abt dataset and testing with the Buy dataset, (2) training with the Buy dataset and testing with the Abt dataset, (3) training with the Amazon dataset and testing with the Google dataset, and (4) training with the Google dataset and testing with the Amazon dataset. Therefore, we did not use cross-validation in these experiments. The main characteristic of these experiments is that, in the training dataset, most classes contain only one instance. Tables 10 and 11 present the results.

Table 10: Results comparing our method against the baselines on the Abt-Buy dataset (%)

            Training Abt, Test Buy    Training Buy, Test Abt
Method      MacF1    MicF1            MacF1    MicF1
Our method  78.07    81.85            74.30    78.35
SVM         81.15    84.23            73.71    77.24
Cosine      81.97    84.97            77.81    81.87
Jaccard     76.77    80.57            69.91    74.75
Our method, as well as the SVM-based method, did not obtain the best results on these experiments. The main reason for this was the small number of instances per class, to create the training model.
Our method could not
take advantage of the minimum support, because most of the classes contained only one instance. These datasets also contain many distinct products, which generated many rules with noisy tokens by our method; that is, many rules were
31
Table 11:
Results comparing our method against the baselines on Amazon-
Google datasets.
Method Our method SVM Cosine Jaccard
Amazon-Google Dataset (%) Training Amazon and Test Google MacF1 MicF1 63.25 64.91 55.05 56.08 75.07 77.92 72.85 75.21
Amazon-Google Dataset (%) Training Google and Test Amazon MacF1 MicF1 70.20 75.29 58.39 63.98 72.92 78.28 69.66 75.02
A simple method, such as the Cosine-based method, obtained the best results in these experiments, although it required a longer execution time than our method. However, our method worked well when there were at least two instances per class in the training set. The UOL datasets are imbalanced: in UOL-electronics, for example, 37% of the classes contain 2 instances, 85% of the classes contain 6 or fewer instances, and the remaining 15% contain up to 53 instances.
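For reference, MacF1 averages the per-class F1 scores, weighting all classes equally, while MicF1 pools the decisions over all instances before computing F1; in a single-label, multi-class setting such as ours, MicF1 equals accuracy. A minimal sketch of the computation using scikit-learn, with illustrative labels:

from sklearn.metrics import f1_score

# Hypothetical class labels, for illustration only.
y_true = ["hp-c4780", "hp-8600-911a", "hp-8600-911n", "hp-c4780"]
y_pred = ["hp-c4780", "hp-8600-911a", "hp-8600-911n", "hp-8600-911a"]

# MacF1 averages per-class F1 scores, weighting all classes equally;
# MicF1 pools decisions over all instances before computing F1.
mac_f1 = f1_score(y_true, y_pred, average="macro")
mic_f1 = f1_score(y_true, y_pred, average="micro")
print(f"MacF1 = {mac_f1:.4f}, MicF1 = {mic_f1:.4f}")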
Behavior on a Large Dataset

In this experiment, we evaluate our method, and the baselines, on the UOL-book dataset, which has thousands of instances and classes. We evaluate the performance of our method for predicting the class of instances using only rules. The method was able to classify 80.43% of the test instances in this way. These test instances were given to our baselines to be classified. However, as we explain in the next section, we were not able to run the baselines on this dataset. Table 12 presents the results.
Table 12: Results of our method on the UOL-book dataset.

              UOL-book dataset (%)
Method        MacF1        MicF1
Our method    96.12        96.67
Our method obtained good results on this dataset, better than on the other datasets (except the Printer dataset). It is important to highlight that this dataset is difficult to classify, due to its large number of classes. Although the product offers in this dataset do not have implicit codes in their descriptions, the titles of the books are usually different from each other, and our method was able to identify the combination of tokens that defines each class. This dataset has the shortest average number of tokens per instance, and stopwords were not removed. Some descriptions are formed only of stopwords, for instance, Who, I?.
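The rule-first behavior evaluated above, in which an instance deferred by the rules falls back to a similarity computation, can be pictured with the following minimal sketch; the data structures, the first-match policy, and the Jaccard fallback are simplifications for illustration and do not reproduce our exact implementation.

def classify(description, rules, training, sim_threshold=0.5):
    # `rules` maps a frozenset of antecedent tokens to a class; `training`
    # maps a class to a representative token set. Both structures, and the
    # thresholded Jaccard fallback, are illustrative simplifications.
    tokens = frozenset(description.lower().split())
    # 1. Try the association rules: fire a rule whose antecedent is
    #    contained in the offer's tokens (here: first match wins).
    for antecedent, cls in rules.items():
        if antecedent <= tokens:
            return cls, "rule"
    # 2. Fallback: Jaccard similarity against training instances.
    best_cls, best_sim = None, 0.0
    for cls, ref_tokens in training.items():
        sim = len(tokens & ref_tokens) / len(tokens | ref_tokens)
        if sim > best_sim:
            best_cls, best_sim = cls, sim
    return (best_cls, "similarity") if best_sim >= sim_threshold else (None, "rejected")

# Hypothetical rule store and training data, for illustration only.
rules = {frozenset({"bdp2100x/78"}): "philips-bdp2100"}
training = {"philips-bdp2100": frozenset({"blu-ray", "player", "philips", "bdp2100x/78"})}
print(classify("Blu-ray Player Philips Bdp2100x/78 Full HD", rules, training))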
Execution Time

We measured the execution time of our method, and of the baselines, for classifying the two largest datasets, UOL-non-electronics and UOL-book. We measured the time taken to execute each one of the 10 folds of the cross-validation, and report here the average time among these 10 executions, as well as the standard deviation. The experiments with the UOL-non-electronics dataset were performed on a computer with the following configuration: x86_64 architecture, Intel(R) Core(TM)2 Quad 2.66GHz processor, 4GB RAM, and Ubuntu 12.04.4 LTS 64-bit operating system. The experiments with the UOL-book dataset were performed on a computer with the following configuration: i686 architecture, Intel(R) Xeon(R) E5620 2.4GHz processor, 4GB RAM, and Ubuntu 12.04.5 LTS 32-bit operating system. Table 13 presents the results.
Table 13: Execution time, in seconds, and standard deviation, for our method and the baselines.

                                   Our method               SVM                 Cosine      Jaccard
Dataset              Metric        Training     Test        Training   Test     Test        Test
UOL-non-electronics  Exec. Time    100.97       16.45       427.05     15.51    2,119.10    2,202.83
                     Std. Dev.     0.66         1.08        14.12      1.83     38.87       38.81
UOL-book             Exec. Time    122,402.53   17,003.89   -          -        -           -
                     Std. Dev.     956.05       154.13      -          -        -           -
Our method presented the lowest execution times, except for the test on UOL-non-electronics, for which SVM was slightly faster. The Cosine- and Jaccard-based methods were too slow; they execute a full cartesian product between the instances in the test subset and those in the training subset. We could not obtain execution times for the baselines on the UOL-book dataset. These experiments ran for several days and did not complete. This dataset contains thousands of instances and classes. Our method was able to classify them within a reasonable time. The SVM-based method is not viable for classifying datasets with a large number of classes [9]. SVM makes only binary decisions, i.e., an instance belongs or does not belong to a given class. With multiple classes, usually a different classifier needs to be learned for each class, or for each pair of classes [4].
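The cost of the Cosine- and Jaccard-based baselines follows directly from this structure: every test instance is compared against every training instance, i.e., O(|test| x |train|) similarity computations. A schematic sketch, in which the term-frequency vectors and the nearest-neighbor decision are illustrative simplifications of the actual baselines:

from collections import Counter
from math import sqrt

def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two token-frequency vectors.
    dot = sum(a[t] * b[t] for t in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def nearest_neighbor_classify(test_docs, train_docs, train_labels):
    # Full cartesian product: O(|test| * |train|) similarity computations,
    # which is what made these baselines infeasible on the UOL-book dataset.
    predictions = []
    for doc in test_docs:
        tv = Counter(doc.lower().split())
        sims = [cosine(tv, Counter(d.lower().split())) for d in train_docs]
        predictions.append(train_labels[max(range(len(sims)), key=sims.__getitem__)])
    return predictions

# Hypothetical data, for illustration only.
print(nearest_neighbor_classify(
    ["hp photosmart c4780 printer"],
    ["hp photosmart c4780", "hp officejet pro 8600"],
    ["hp-c4780", "hp-8600"]))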
Experiments Using Other Machine Learning Algorithms

We tried to compare our method against the machine learning algorithms Random Forests [6] and Naive Bayes [31], running in the Weka tool [31], but we did not succeed. The effectiveness of both algorithms was worse than that of our method on the Printer and UOL-electronics datasets, and their execution times were much longer. We could not run these algorithms on the larger datasets, UOL-non-electronics and UOL-book, due to lack of memory, even on a machine with twice the memory of the one on which we ran our method. We did not succeed even after running a feature selection algorithm to reduce the dimensionality. The problem is that our datasets have a large number of features and classes, and those algorithms do not work well for datasets with such characteristics. For Random Forests, according to [33], the suggestion of Breiman [6] to select features in a subspace works well for data of moderate dimensionality (fewer than 100 features), but is not suitable for very high-dimensional data consisting of thousands of features. They proposed a strategy that improves on Breiman's, but they evaluated it only on datasets containing up to 25 classes. For Naive Bayes, according to [7], conventional Naive Bayes cannot be directly applied to high-dimensional data classification, because it essentially assumes that all features are equally important for classification, which hardly holds in high-dimensional spaces.
Limitations and Failure Cases

In this section, we discuss the limitations of our method and present some cases in which it failed. We discuss the reasons for these failures and illustrate them with examples. Our method is limited in solving cases where the tokens of all instances of a class are subsets of the tokens of instances of other classes. In this case, no rule is generated for the class containing the subsets of tokens, and our method depends on the similarity function in order to classify product offers of this class. For example, the class c1 has only one training instance, with the description HP Officejet Pro 8000 Printer, and the class c2 has an instance with the description Hewlett Hp Officejet Pro 8000 Wireless Printer Cb9297a#b1h. As the tokens of the instance of c1 are a subset of the tokens of an instance of c2, no rule is generated for the class c1.
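This limitation can be diagnosed before training: if every instance of a class has a token set contained in an instance of some other class, no discriminating rule can exist for it. A minimal check, assuming a simple class-to-token-sets mapping (the structure and names are illustrative):

def classes_without_discriminating_rules(training):
    # Return classes whose every instance's token set is a subset of an
    # instance of some other class; for these, no rule can be generated
    # and classification must rely on the similarity fallback.
    # `training` maps class -> list of token sets (illustrative structure).
    flagged = []
    for cls, instances in training.items():
        others = [ts for c, lst in training.items() if c != cls for ts in lst]
        if all(any(inst <= other for other in others) for inst in instances):
            flagged.append(cls)
    return flagged

training = {
    "c1": [{"hp", "officejet", "pro", "8000", "printer"}],
    "c2": [{"hewlett", "hp", "officejet", "pro", "8000", "wireless",
            "printer", "cb9297a#b1h"}],
}
print(classes_without_discriminating_rules(training))  # ['c1']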
In our datasets, there are cases where two distinct classes have the same set of tokens, appearing in a different order in the product offer descriptions. For example, Peace and War and War and Peace are two descriptions of books from distinct classes. Our method also proved to be limited in solving cases where the number of instances per class is too small, as in the experiments with the Abt-Buy and Amazon-Google datasets. In such cases, our method is not able to take advantage of the minimum support, which was demonstrated to be an important strategy. The main cause of failure for our method is rules formed by noisy tokens, i.e., tokens that are rare in the training dataset but are not product identifiers. Such rules may result in an error when these tokens occur in test instances of other classes. As an example from our dataset, the training instance Blu-ray Player Philips Bdp2100x/78 Full Hd, Bd-live, Dolby Truehd, Easylink, Simplyshare, Output HDMI generated a rule containing the token Simplyshare as antecedent. The test instance, from another class, Blu-ray Player 3D Philips Bdp3380x/78 Full Hd, Wi-Fi Ready, Dolby Truehd, Simplyshare, Output HDMI, which contains the noisy token Simplyshare, was
erroneously classified using the rule containing the noisy token. Rules formed by noisy tokens may also be generated when some characteristics of products are described in several forms. For example, for digital cameras, we found descriptions containing 16.1MP, 16.1 MP, and 16.1 megapixels, which are similar. One of these forms, which occurred in only one class, generated a rule for that class. Our method also failed due to misclassified data in our datasets. For example, the instances Smartphone Samsung Galaxy S4 I9500 3G GSM unlocked and Mobile Phone Samsung Smartphone Galaxy S4 GT-I9505 Black Box, from two distinct classes, were put together in the same class in the UOL-electronics dataset.
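A possible mitigation for the noisy-token failures above, in the spirit of the noise-removal strategies we mention as future work, is to flag rule antecedents built solely from tokens that are rare across the training descriptions. The document-frequency threshold in this sketch is a hypothetical heuristic and not a parameter of our method:

from collections import Counter

def risky_antecedents(rules, training_descriptions, min_df=2):
    # Flag rules whose antecedent tokens all occur in fewer than `min_df`
    # training descriptions; such tokens (e.g., 'simplyshare') may be
    # descriptive attributes rather than product identifiers.
    # The min_df threshold is a hypothetical heuristic, for illustration.
    df = Counter()
    for desc in training_descriptions:
        df.update(set(desc.lower().split()))
    return [ant for ant in rules if all(df[t] < min_df for t in ant)]

rules = [frozenset({"simplyshare"}), frozenset({"bdp2100x/78"})]
descs = ["Blu-ray Player Philips Bdp2100x/78 Full Hd Dolby Truehd Simplyshare",
         "Blu-ray Player Philips Bdp2100x/78 Full Hd Bd-live Easylink"]
print(risky_antecedents(rules, descs))  # flags only the 'simplyshare' rule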
6 Conclusions and Future Work

In this work, we have proposed and evaluated a new method that uses association rules to classify product offers from e-shopping sites, matching offers against offers, without using a product catalog. Classification at the product level is particularly important, in that it enables customers to compare prices. However, it is a difficult problem to solve automatically, due to the high number of classes. Our method proved to be effective and efficient for solving this type of problem. Our method uses a supervised machine learning technique, which exploits associations among tokens contained in descriptions of product offers, and generates a set of rules that uniquely identify each class in the training dataset. It is simple, does not require the adjustment of complex parameters, and has good computational complexity. The experiments we performed show that our method obtains better results than three state-of-the-art baselines, on several datasets with distinct characteristics. We found that the larger the number of classes and instances, the better our method behaves compared to the baselines. We also presented and discussed some situations in which our method failed, as well
as its limitations. Our main contributions, which distinguish our method from existing research on e-commerce, are: (i) our method is able to classify a large dataset, containing thousands of instances and classes, within a reasonable execution time; other methods in the literature have not been evaluated on datasets with such characteristics; (ii) it is able to identify implicit product codes in product offer descriptions, when they exist, and it is also generic enough to classify product offers that do not have codes; and (iii) it works well for different types of products, such as electronics and books. Moreover, it classifies at the product level, using product offer descriptions only, and does not need a clean and structured catalog of products. A few other methods present such characteristics; however, they did not experiment with large datasets and are not scalable, since they depend on submitting queries to a search engine. In terms of e-commerce management, our method may contribute to increasing the scalability and consistency of product offer classification. Besides price comparison, classification at the product level may improve customer product reviews and product specifications, satisfying consumer needs and expectations. It may help to track historical product prices and facilitate the use of prediction algorithms for forecasting future prices, as discussed by [2]. By using tokens that distinguish each product, our method may improve approaches for extracting concepts and hashtags, and for identifying web pages that talk about products [34]. Moreover, our method may be used for grouping product offers, creating more context for another method to match them to a product catalog. Although our experiments have focused on product offers, our method is generic and may be applied in any situation where the entity references are described by short strings. We experimented with strings containing up to 37 tokens, although the average string was shorter. Very long strings could have an impact on efficiency. Regarding future research, we are working on a mechanism to detect new product offers and add them to the training model. Such a mechanism is based on
the reliability of predicting the class of a product offer, using a combination of the number of rules predicting it and a similarity function. We are also developing strategies for the removal of noisy tokens from the instances, in order to improve the generation of rules. Finally, we are developing a distributed version of our algorithm to run in the Hadoop environment (http://hadoop.apache.org/), in order to improve its efficiency.
Acknowledgements

This work was partially supported by FAPEMIG grant CEX-APQ-01834-14 and by an individual scholarship from CAPES. We also thank João A. Silva for his contributions to some of the experiments.
References

[1] S. S. Aanen, D. Vandic, and F. Frasincar. Automated product taxonomy mapping in an e-commerce environment. Expert Systems with Applications, 42(3):1298-1313, 2015.

[2] R. Agrawal and S. Ieong. Aggregating web offers to determine product prices. In Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '12, pages 435-443, New York, NY, USA, 2012. ACM.

[3] R. Agrawal and R. Srikant. Fast algorithms for mining association rules. In Proceedings of the 20th International Conference on Very Large Data Bases, pages 487-499, Santiago, Chile, 1994.

[4] R. Baeza-Yates and B. Ribeiro-Neto. Modern Information Retrieval: The Concepts and Technology behind Search. Addison-Wesley Professional, 2011.

[5] O. Benjelloun, H. Garcia-Molina, D. Menestrina, Q. Su, S. E. Whang, and J. Widom. Swoosh: A generic approach to entity resolution. The VLDB Journal, 18(1):255-276, January 2009.

[6] L. Breiman. Random forests. Machine Learning, 45(1):5-32, October 2001.

[7] L. Chen and S. Wang. Automated feature weighting in naive bayes for high-dimensional data classification. In Proceedings of the 21st ACM International Conference on Information and Knowledge Management, CIKM '12, pages 1243-1252, New York, NY, USA, 2012. ACM.

[8] E. Cortez, M. R. Herrera, A. S. da Silva, E. S. de Moura, and M. Neubert. Lightweight methods for large-scale product categorization. Journal of the American Society for Information Science and Technology, 62(9):1839-1848, September 2011.

[9] K. Crammer and Y. Singer. On the algorithmic implementation of multiclass kernel-based vector machines. The Journal of Machine Learning Research, 2:265-292, 2002.

[10] A. K. Elmagarmid, P. G. Ipeirotis, and V. S. Verykios. Duplicate record detection: A survey. IEEE Transactions on Knowledge and Data Engineering, 19(1):1-16, 2007.

[11] J. C. French, A. L. Powell, and E. Schulman. Using clustering strategies for creating authority files. Journal of the American Society for Information Science, 51(8):774-786, 2000.

[12] L. Getoor and A. Machanavajjhala. Entity resolution: Theory, practice and open challenges. Proceedings of the VLDB Endowment, 5(12):2018-2019, 2012. Tutorial available at http://www.cs.umd.edu/~getoor/Tutorials/ER_VLDB2012.pdf.

[13] V. Gopalakrishnan, S. Iyengar, A. Madaan, R. Rastogi, and S. Sengamedu. Matching product titles using web-based enrichment. In Proceedings of the 21st ACM International Conference on Information and Knowledge Management, CIKM '12, pages 605-614, New York, NY, USA, 2012. ACM.

[14] E. Ioannou, N. Rassadko, and Y. Velegrakis. On generating benchmark data for entity matching. Journal on Data Semantics, 2(1):37-56, 2013.

[15] P. Jaccard. Étude comparative de la distribution florale dans une portion des Alpes et des Jura. Bulletin de la Société Vaudoise des Sciences Naturelles, 37:547-579, 1901.

[16] A. Kannan, I. E. Givoni, R. Agrawal, and A. Fuxman. Matching unstructured product offers to structured product specifications. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 404-412, San Diego, USA, 2011. ACM, New York, NY, USA.

[17] H. Köpcke and E. Rahm. Frameworks for entity matching: A comparison. Data & Knowledge Engineering, 69(2):197-210, 2010.

[18] H. Köpcke, A. Thor, and E. Rahm. Evaluation of entity resolution approaches on real-world match problems. Proceedings of the VLDB Endowment, 3(1-2):484-493, 2010.

[19] H. Köpcke, A. Thor, S. Thomas, and E. Rahm. Tailoring entity resolution for matching product offers. In Proceedings of the 15th International Conference on Extending Database Technology, pages 545-550, Berlin, Germany, 2012. ACM, New York, NY, USA.

[20] N. Koudas, S. Sarawagi, and D. Srivastava. Record linkage: Similarity measures and algorithms. In Proceedings of the ACM SIGMOD International Conference on Management of Data, pages 802-803, Chicago, USA, June 2006. ACM, New York, NY, USA.

[21] N. Londhe, V. Gopalakrishnan, A. Zhang, H. Q. Ngo, and R. Srihari. Matching titles with cross title web-search enrichment and community detection. Proceedings of the VLDB Endowment, 7(12):1167-1178, August 2014.

[22] H. Nguyen, A. Fuxman, S. Paparizos, J. Freire, and R. Agrawal. Synthesizing products for online catalogs. Proceedings of the VLDB Endowment, 4(7):409-418, April 2011.

[23] D. Pavlov, R. Balasubramanyan, B. Dom, S. Kapur, and J. Parikh. Document preprocessing for naive bayes classification and clustering with mixture of multinomials. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 829-834, Seattle, USA, 2004. ACM, New York, NY, USA.

[24] D. A. Pereira, E. E. B. da Silva, and A. A. A. Esmin. Disambiguating publication venue titles using association rules. In Proceedings of the IEEE/ACM Joint Conference on Digital Libraries, pages 77-86, London, UK, September 2014.

[25] D. A. Pereira, B. Ribeiro-Neto, N. Ziviani, A. H. F. Laender, and M. A. Gonçalves. A generic web-based entity resolution framework. Journal of the American Society for Information Science and Technology, 62(5):919-932, May 2011.

[26] G. Salton and M. J. McGill. Introduction to Modern Information Retrieval. McGraw-Hill, Inc., New York, NY, USA, 1983.

[27] V. N. Vapnik. The Nature of Statistical Learning Theory. Springer-Verlag, New York, USA, 1995.

[28] A. Veloso, A. A. Ferreira, M. A. Gonçalves, A. H. F. Laender, and W. M. Jr. Cost-effective on-demand associative author name disambiguation. Information Processing and Management, 48(4):680-697, 2012.

[29] A. Veloso, W. M. Jr., M. Cristo, M. A. Gonçalves, and M. J. Zaki. Multi-evidence, multi-criteria, lazy associative document classification. In Proceedings of the 15th ACM International Conference on Information and Knowledge Management, pages 218-227, Arlington, Virginia, USA, 2006. ACM, New York, NY, USA.

[30] S. Wedyan. Review and comparison of associative classification data mining approaches. International Journal of Computer, Electrical, Automation, Control and Information Engineering, 8(1):34-45, 2014.

[31] I. H. Witten, E. Frank, and M. A. Hall. Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann, third edition, 2011.

[32] B. Wolin. Automatic classification in product catalogs. In Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 351-352, New York, NY, USA, 2002. ACM.

[33] B. Xu, J. Z. Huang, G. Williams, Q. Wang, and Y. Ye. Classifying very high-dimensional data with random forests built from small subspaces. International Journal of Data Warehousing and Mining, 8(2):44-63, 2012.

[34] Y. Zhang, R. Mukherjee, and B. Soetarman. Concept extraction and e-commerce applications. Electronic Commerce Research and Applications, 12(4):289-296, 2013. Social Commerce - Part 2.