USING SUPPORT VECTOR MACHINES TO RECOGNIZE PRODUCTS IN E-COMMERCE PAGES

José Martins Junior, Edson dos Santos Moreira
ICMC - Instituto de Ciências Matemáticas e de Computação, Universidade de São Paulo
P.O. Box 668, 13560-970, São Carlos, SP, Brazil
[email protected], [email protected]
Abstract
Common search engines locate Web pages by means of a lexical comparison between sets of keywords and the contents of hypertexts. These mechanisms are usually inefficient when the search concerns concepts or objects, such as goods and services offered on electronic commerce sites. The Semantic Web was announced in 2000 to address this problem, with its implementation initially estimated to take ten years. The DEEPSIA project was conceived to provide a purchaser-centered solution for finding products for sale on the Web. Testing DEEPSIA on Brazilian sites showed low efficacy in deciding whether or not pages contained relevant data. We present here the application and evaluation of the Support Vector Machines method to recognize Web pages containing products for sale.
Key Words E-Commerce, Support Vector Machines, DEEPSIA, intelligent agents, text classification.
1. Introduction
In 1990, Tim Berners-Lee presented a new application to exhibit documents. Hypertexts, so called because of their capacity to embed links to other documents in the text, were displayed by a graphical tool, "WorldWideWeb", designed by Berners-Lee. The "Web", as it became known, was mainly responsible for the huge use of the Internet and its consequent growth. Since then, a large number of hypertext documents have been created; Google's homepage (www.google.com) currently mentions more than 3 billion of them. In recent years, several resources have improved the Web framework, making it possible to include and exhibit other media types, such as pictures, movies and sound, thus making user interaction more interesting. The problem concerning its scalability and future use does not relate to page formatting, but to the appropriate organization and classification of its contents. This problem also affects e-Commerce applications, regarding the ways we find goods and services offered on the Web.
Several search engines, such as Google, Altavista (www.altavista.com) and Yahoo (www.yahoo.com), work based on keywords given by their users. Their efficiency depends on the set of words that the user provides to predict the probable contents of the desired pages. This kind of agent therefore searches pages by lexical comparison between a provided set of words and the contents of pages on the Web. Marketplaces on the Web provide specialized tools to organize, search and present information about products for sale; unfortunately, the searches are limited to the domain of contracted suppliers. Another possibility is to provide intelligence to these agents, letting them search for pages that express concepts of the desired object. The problem addressed here refers to the language currently used to create Web pages, HTML, which does not offer appropriate ways to represent formal concepts. Regarding this matter, Berners-Lee presented the Semantic Web in 2000. This initiative, instituted by the WWW Consortium, aims to develop new features, such as XML (Extensible Markup Language) and RDF (Resource Description Framework), to represent concepts of objects in hypertext pages. These concepts will constitute a global ontology to be shared on the Web. The initial period estimated for the Semantic Web implementation was ten years. The DEEPSIA system [1], in whose context the work described here was done, was developed to provide a consumer-centered solution, using text classification methods to find textual descriptions of products for sale on Web pages. This paper describes some of the work done by the Brazilian partner in the DEEPSIA consortium to adapt the text classification operations of the system when testing it in Brazil. Section 2 describes e-Commerce and e-Business concepts; Section 3 presents the DEEPSIA system and its main characteristics; Section 4 discusses the application of text classification methods to decide whether or not a page contains textual descriptions of products for sale; Section 5 shows the main results of the performed tests; conclusions and further work are discussed in Section 6.
2. E-Commerce on the Web
Information systems to support online electronic business (e-Business) are widely developed on the Web. Some factors have contributed to their adoption, such as the simple and friendly interface of Web applications, which provides multimedia features to display product contents, and the wide access to the Internet. These factors, associated with the current wave of economic globalization, made the Web a suitable medium to advertise and trade goods and services. The evolution of the infrastructure, for instance the logistic support for credit card operations and the integration of banking systems, allied with the reach of the Internet, makes it possible to serve market niches that used to be inaccessible due to geographic limitations. Trade transactions through the Internet have promoted new ways of operating e-Commerce. Several works were recently published involving aspects of information systems evolving toward e-Business, such as the creation of formal methods [2], the definition of architectures [3] and infrastructure [4][5], the use of agents [3] and ontologies [6], and information retrieval [7]. Nowadays there are several kinds of business applications on the Web; a taxonomy of the business models available on the Web was presented in [8]. A common business model on the Web is the Marketplace Exchange. Some Brazilian portals, like BuscaPé (www.buscape.com.br) and BondFaro (www.bondfaro.com.br), are efficient at searching products for consumers. The problem encountered in these portals relates to the search domain: they establish associations with contracted suppliers' sites and restrict the search to them. The solution is a good deal for the subscribed suppliers; unfortunately, the limitation of the search domain disfavors the consumer's side of the transaction. In any business, the consumer (person or enterprise) aims to find the best offer according to the price/benefit ratio, and the marketplace model reduces the consumer's choices to a set of contracted suppliers.
3. DEEPSIA's project
The Dynamic on-linE IntErnet Purchasing System based on Intelligent Agents (DEEPSIA – IST-1999-20483) was instituted under the European research programme Information Society Technologies (IST). The DEEPSIA consortium was composed of enterprises and research institutes from several European countries: the ComArch enterprise (Poland); UNINOVA (Portugal); Université Libre de Bruxelles (Belgium); University of Sunderland (England); the Zeus enterprise (Greece); and the Comunicación Interativa enterprise (Spain). Universidade de São Paulo (Brazil) is the Brazilian partner in the project.
DEEPSIA aims to promote the inclusion of Small and Medium Enterprises (SMEs) in electronic commerce. The solution required the development of a computational infrastructure, based on a multi-agent system, that supports the electronic purchasing process (via Internet) performed by any SME, using the concept of a Product Catalog stored locally on the SME side. DEEPSIA's strategy is centered on the consumer and aims to ease the incorporation of technological knowledge by the newcomer SME (new to the electronic commerce environment), by providing access to product offers appropriate to its particular needs. The computational infrastructure developed by DEEPSIA offers a friendly interface based on a personalized product catalog that is automatically updated with information obtained from electronic portals on the Internet. The catalog maintains data about products provided by specific suppliers previously subscribed to the system, and provides the consumer with a set of suitable offers regarding quality, diversity, use and delivery period. The architecture of the system, proposed in [1], contains three modules or subsystems, as shown in Figure 1: the Dynamic Catalogue (DC), the Portal Interface Agent (PIA) and the Multi-Agent System (MAS).
Figure 1 – Architecture of the DEEPSIA system

The Multi-Agent System (MAS) is described as an autonomous system that collects data, plus a semi-automatic process to update the catalog, composed of a set of agents with specific tasks. An initial knowledge base, which reflects the SME's strategic objectives, is assigned to the individual behavior of each agent. The internal modules of the MAS describe specialized kinds of agents, such as:
• Web Crawler Agent (WCA): its main function is to search the Web for pages that contain data of interest to the user, based on a recursive process initiated by a seed and previously trained by an offline process. It executes the first selection and classification of pages for later processing by the Miner Agent. For each obtained page the WCA must
decide whether or not it contains textual descriptions of product(s) for sale, using supervised machine learning techniques applied to text classification methods, such as k-NN and C4.5 (a minimal sketch of this decision loop is given after this list);
• Miner Agent (MA): this agent analyzes the Web pages obtained by the WCA and stores the significant information in a database. Its knowledge base is composed of the concept ontology (defined by the user) and a set of rules that associate the concepts; it is used by the MA to describe relevant data and is updated on each request.
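The WCA's per-page decision can be pictured as a fetch-classify-forward loop. The sketch below is not DEEPSIA code: it is a minimal Python illustration, assuming a hypothetical classify_page function that wraps whichever trained text classifier is in use (k-NN, C4.5 or, as in Section 4, SVM) and a hypothetical forward_to_miner hook standing in for the hand-off to the Miner Agent.

```python
from collections import deque
from urllib.parse import urljoin
from urllib.request import urlopen
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collects href targets so the crawl can recurse from the seed page."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed_url, classify_page, forward_to_miner, max_pages=100):
    """Recursive crawl started by a seed; each fetched page is classified
    and, if judged to describe products for sale, passed on for mining."""
    queue, visited = deque([seed_url]), set()
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        if url in visited:
            continue
        visited.add(url)
        try:
            html = urlopen(url, timeout=10).read().decode("latin-1")
        except (OSError, ValueError):
            continue                      # unreachable or malformed URL: skip
        if classify_page(html):           # the trained classifier decides
            forward_to_miner(url, html)   # hand over to the Miner Agent
        extractor = LinkExtractor()
        extractor.feed(html)
        queue.extend(urljoin(url, link) for link in extractor.links)
```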
3.1 USP collaboration in the DEEPSIA project
The Universidade de São Paulo (São Carlos, Brazil) has represented the Brazilian side of the project since 2001, funded by CNPq, the Brazilian federal research agency. Two research groups form the work team: ICMC's Intermedia Laboratory (Instituto de Ciências Matemáticas e de Computação) and NUMA/EESC (Núcleo de Manufatura Avançada / Escola de Engenharia de São Carlos). Based on user requirements gathered by NUMA while testing the system with Brazilian SMEs, the ICMC group adapted the DEEPSIA system by translating the interface and the ontology to Portuguese, developing new searching tools and customizing features of the system interface. Testing the DEEPSIA system on Brazilian sites revealed low efficacy in the decision process of the Crawler Agent. This result motivated the research group to study and apply other methods to recognize the textual contents of Web pages.
4. Deciding: does the page present a product?
There are three textual components of a Web page that convey meaning to humans: the contents of the page's body, its title and its URL. The only one that Web applications handle directly, and that is understandable to computers, is the URL; unfortunately, the concepts described in the hypertext cannot be inferred from the URL alone. A solution consists in abstracting these concepts into a format useful to computers, correctly classifying pages as objects that are meaningful to humans. This is the main objective of the Text Classification (TC) research area. Joachims [9] defined TC as a process to group documents into different categories or classes. Automated text classification refers to the automatic construction of classifiers using inductive processes (learners). The goal is to build a classifier for a category ci by observing the characteristics of a set of documents previously (and manually) assigned to ci by a specialist of the domain involved [10]. This approach denotes a type of machine learning called supervised learning, in which a new document is classified by a predefined classifier that was trained with tagged documents. A general process involving the automated text classification tasks was described in [11] and was adopted in the present work.
4.1 Document acquisition
This task defines the means to obtain the textual documents that constitute the training, test and classification sets. The document sets used in the training and test phases are composed of pairs: documents (di) and the classes they represent (ci) [10]. In this project [12], the following classes were implemented:
– DocumentExample – obtains and stores locally positive and negative examples of a predefined category; the category, in this case, distinguishes pages whose textual contents do or do not contain descriptions of products for sale. The class also creates a directory structure to separate the examples. Each example is identified by its URL and saved as an individual file in the respective directory; the URL is stored in the first line of the file and in an index file. Figure 2 represents this process (a minimal sketch follows the figure);
– WebPage – retrieves Web pages from the provided URLs.
Figure 2 – Obtaining process
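A minimal sketch of the acquisition step of Section 4.1, assuming the layout implied by the text (one subdirectory per class, one file per example, the URL on the first line, plus an index file). The function and file names are illustrative; they are not the project's DocumentExample and WebPage classes.

```python
import os
from urllib.request import urlopen

def store_example(url, category, root="examples"):
    """Fetch a page and save it under its category (e.g. 'positive' or
    'negative'); the URL goes on the first line and into an index file."""
    directory = os.path.join(root, category)
    os.makedirs(directory, exist_ok=True)
    html = urlopen(url, timeout=10).read().decode("latin-1")
    count = len(os.listdir(directory))           # simple sequential naming
    path = os.path.join(directory, f"{category}_{count:04d}.html")
    with open(path, "w", encoding="latin-1") as f:
        f.write(url + "\n")                       # URL stored on the first line
        f.write(html)
    with open(os.path.join(root, "index.txt"), "a", encoding="latin-1") as idx:
        idx.write(f"{path}\t{url}\t{category}\n")
    return path

# Example usage (hypothetical URL):
# store_example("http://www.example.com.br/produto/123", "positive")
```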
4.2 Document preprocessing
The statistical approach aims to obtain an attribute-value representation of the textual content. This representation is usually called a bag of words [13][10] and describes a vector containing attribute-weight pairs. Building such vectors relies on some tasks, such as attribute identification, weight assignment and representation reduction [11]. The attribute identification task requires operations such as removing HTML tags, identifying and extracting individual words, and ignoring terms that appear in a stop list.
An interesting point described by Joachims [14] refers to a general rule to select features (words) from each textual document: only words that occur in at least three documents of the training set and do not appear in the stop list are accepted as valid features. The words found are stored, in order, in a text file (words) that constitutes a dictionary for the domain of the classes involved. Each line of the file contains a distinct word, which the algorithm treats as a different feature, referenced by a numeric index given by its line number in the file. There are two approaches to calculate the weights of the features:
• Boolean: the values 0 and 1 represent, respectively, the absence or presence of a term (attribute) in the textual content;
• Numeric: statistical metrics based on the term frequency (TF) in the document. Each word (attribute) corresponds to a feature with weight TF(wi, x), the number of times the word wi occurs in document x. A variant of this approach, called TFIDF (Term Frequency Inverse Document Frequency), includes the frequency of the term over all documents as an inverse measure of its capacity to represent a specific document [9][14]. Calculating IDF(wi) (the Inverse Document Frequency of a word wi) first requires DF(wi) (the Document Frequency), the number of documents in which wi occurs. Equation 1 presents the IDF formula, where n is the total number of documents (a short sketch of the whole weighting scheme follows the equation).
IDF(wi) = log( n / DF(wi) )    (1)
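The weighting scheme just described (TF, DF and IDF, together with the rule of keeping only words that occur in at least three training documents and are not stop words) can be condensed into a short sketch. The tokenizer and the stop list below are deliberately simplistic placeholders, not the ones used in the project.

```python
import math
import re
from collections import Counter

STOP_LIST = {"a", "o", "de", "e", "para", "com"}   # illustrative stop words

def tokenize(text):
    """Very simple tokenizer; the project used a richer TextTokenizer."""
    return [w for w in re.findall(r"[a-zà-ú]+", text.lower())
            if w not in STOP_LIST]

def build_dictionary(documents, min_df=3):
    """Keep only words appearing in at least min_df documents (Joachims' rule).
    Returns a mapping word -> DF(word)."""
    df = Counter()
    for text in documents:
        df.update(set(tokenize(text)))
    return {w: n for w, n in df.items() if n >= min_df}

def tfidf_vector(text, df, n_documents):
    """Attribute-value (bag-of-words) representation with TFIDF weights,
    following Equation 1: IDF(w) = log(n / DF(w))."""
    tf = Counter(w for w in tokenize(text) if w in df)
    return {w: count * math.log(n_documents / df[w]) for w, count in tf.items()}
```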
To provide the correct data representation for the SVM method, the preprocessing phase was divided into the following tasks, which required the development and implementation [12] of several classes:
• Parsing: extracting the text from each HTML file stored by DocumentExample. Classes:
– HtmlParser – reads the contents of HTML files through a HtmlReader class. It removes HTML tags, comments, JavaScript code and style definitions. It also interprets special characters (ISO 8859-1) through a Charset class and stores the extracted text (in lower case) in a file with the same name as the HTML file, but with a ".txt" extension;
– HtmlReader – provides the sequential reading of characters from the HTML files;
– Charset – translates ISO 8859-1 codes into the respective special characters.
• Tokenizing: obtaining the valid words from the text files. Classes:
– TextTokenizer – reads tokens from the text file, removes dots, commas and other invalid characters from their edges and verifies whether they are valid words, using a StopList class. It also recognizes the price pattern: if a token represents a number in a price format, the generic sentence "pricestring" is assumed;
– StopList – maintains a set of predefined sentences that must be ignored as valid words.
• Dictionary: defining a specific dictionary (referred to as the words file) for the domain. Class:
– Dictionary – obtains all the valid words from each text file using a TextTokenizer class. Each word found in at least three documents is stored, in order, in a text file called words. The complete set of words is maintained in a HashTable.
• Weighting: calculating the TFIDF weight of each feature of the text files. Classes:
– FeaturesFrequences – performs the single access to each text file in this task. It counts the number of times each feature (a valid word in the Dictionary) appears in the document and creates, for each document, a Set structure containing feature/TF pairs;
– FeaturesIDF – receives the Set structures from FeaturesFrequences and provides them to the FeaturesDF class. It then obtains the DF (Document Frequency) value of each feature from FeaturesDF and calculates its IDF (Inverse Document Frequency) weight, which is stored, in order, in a file (words.idf) as a float value;
– FeaturesDF – counts the documents that contain each word (Document Frequency), based on the Set structures received from FeaturesIDF.
• Vectors: creating the feature vectors of each example. Class:
– FeaturesVectors – obtains the TF and IDF weights of each feature, through FeaturesFrequences and FeaturesIDF instances, and calculates the TFIDF (TF*IDF) value. It stores each vector, containing all feature/TFIDF pairs of an example, on a single line of the train.dat (or test.dat) file (the file format is sketched after this list).
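To connect these classes with the learning step, the feature vectors must be written one example per line in the input format read by SVMlight: the target value (+1 or -1) followed by index:value pairs with indices in increasing order. The sketch below is an illustration under those assumptions, not the project's FeaturesVectors class; it takes already-computed word/TFIDF dictionaries as input and also shows one possible form of the price-pattern substitution performed by TextTokenizer.

```python
import re

PRICE_PATTERN = re.compile(r"r\$\s*\d+[.,]\d{2}")   # e.g. "R$ 19,90"

def normalize_prices(text):
    """Replace price-formatted numbers by a generic token, mimicking the
    'pricestring' substitution performed by TextTokenizer."""
    return PRICE_PATTERN.sub(" pricestring ", text.lower())

def write_svmlight_file(examples, dictionary, path):
    """examples: list of (word -> TFIDF weight dict, label) with label +1 or -1.
    Writes one 'label index:weight ...' line per example (train.dat/test.dat),
    with feature indices in increasing order as SVMlight requires."""
    # give each dictionary word a 1-based feature index (its "line number")
    index = {w: i + 1 for i, w in enumerate(sorted(dictionary))}
    with open(path, "w") as out:
        for weights, label in examples:
            pairs = sorted((index[w], v) for w, v in weights.items() if w in index)
            line = " ".join(f"{i}:{v:.6f}" for i, v in pairs)
            out.write(f"{label:+d} {line}\n")
```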
Figure 3 presents a generic view of the initial and final states of the examples produced by the acquisition and preprocessing phases. The initial state represents the complete domain, composed of all Brazilian Web pages.
Figure 3 – Web pages and vectorial representation
4.3 Knowledge Extraction
Several algorithms can be applied at this stage to implement different methods for automated text classification; Support Vector Machines (SVM) was adopted in the present work. SVM was developed by Vapnik [15] based on the Structural Risk Minimization principle of statistical learning theory. It is a method for recognizing patterns defined in a vector space, where the problem consists of finding a decision surface that distinctly separates the data into two classes (positive and negative), so it can be applied to solve classification problems. The use of SVM for text classification was introduced by Joachims [14], who presented a comparative study between SVM and other algorithms, such as Bayes, Rocchio, C4.5 and k-NN. To facilitate the learning process, each category is assigned to a specific binary classification problem [16]. SVMlight is an implementation of the SVM algorithm, developed by Joachims, that can be freely used for scientific research. It basically consists of two modules: svm_learn, the training module, and svm_classify, the classification module.
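Given train.dat and test.dat in the format of Section 4.2, SVMlight is driven through these two command-line modules. A minimal sketch of how they would typically be invoked is shown below; it assumes the svm_learn and svm_classify binaries are available on the PATH, and the file names are the ones used earlier.

```python
import subprocess

def train_and_classify(train_file="train.dat", test_file="test.dat",
                       model_file="model", predictions_file="predictions"):
    """Train an SVM model with svm_learn and classify the test examples
    with svm_classify (both part of Joachims' SVMlight package)."""
    subprocess.run(["svm_learn", train_file, model_file], check=True)
    subprocess.run(["svm_classify", test_file, model_file, predictions_file],
                   check=True)
    # svm_classify writes one real-valued prediction per line; the sign
    # gives the predicted class (+1 = page with products, -1 = without)
    with open(predictions_file) as f:
        return [1 if float(line) > 0 else -1 for line in f]
```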
4.4 Knowledge Evaluation
For classification problems involving classes described by discrete values, some quality measures [11] can be obtained from the analysis of the results, as sketched after the list below:
• Precision: the proportion of the examples classified as positive that are truly positive;
• Recall: the proportion of the truly positive examples that were classified as positive;
• Accuracy: the proportion of correct predictions over all examples.
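All three measures follow directly from the counts of true/false positives and negatives such as those reported in Table 2. A minimal sketch, checked against the Som Livre row of Table 2 (FP = 1, FN = 7, TP = 47, TN = 18):

```python
def precision(tp, fp):
    return tp / (tp + fp) if tp + fp else float("nan")

def recall(tp, fn):
    return tp / (tp + fn) if tp + fn else float("nan")

def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)

# Som Livre row of Table 2: FP=1, FN=7, TP=47, TN=18
tp, tn, fp, fn = 47, 18, 1, 7
print(f"accuracy  = {accuracy(tp, tn, fp, fn):.2%}")   # 89.04%
print(f"precision = {precision(tp, fp):.2%}")          # 97.92%
print(f"recall    = {recall(tp, fn):.2%}")             # 87.04%
```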
5. Obtained results
Several tests were performed to verify the learning ability of the implemented solution. Table 1 presents the examples of the training set that were used to generate the model (classifier). This set produced a dictionary containing 11205 words. The training time, on an AMD Athlon 1 GHz with 512 MB of RAM, was 20 s to create the dictionary and 2 min 50 s to generate the vectors.
Main URL                      Positive   Negative
www.americanas.com.br              212         68
www.bondfaro.com.br                 40         50
www.buscape.com.br                  20         19
www.estadao.com.br/economia          0         25
www.extra.com.br                    33         25
globoshopping.globo.com             21          2
www.kalunga.com.br                  54         17
www.livcultura.com.br               33         33
www.livrariasaraiva.com.br          51         46
www.magazineluiza.com.br           103         93
www.mercadolivre.com.br             50         71
shopping.uol.com.br                 27         10
www.shoptime.com.br                 36         26
www.submarino.com.br                45         42
Total                              725        527

Table 1 – The training set

The generated model was then applied to the classification of positive and negative examples obtained from other sites. Table 2 presents the classification results obtained when applying the solution to Web pages from commercial and non-commercial sites, and Table 3 shows the evaluation of the precision, recall and accuracy measures.
Main URL                      FP   FN   TP   TN
www.banespa.com.br             1    0    0   59
www.icmc.usp.br                0    0    0   90
www.receita.fazenda.gov.br     0    0    0   40
www.pontofrio.com.br           0    0   85    0
www.rihappy.com.br             0    0   33    3
www.somlivre.com.br            1    7   47   18

Table 2 – Classification results (FP = false positives, FN = false negatives, TP = true positives, TN = true negatives)
Site               Accuracy   Precision    Recall
Banespa S/A          98,33%           –         –
ICMC (USP)          100,00%           –         –
Receita Federal     100,00%           –         –
Ponto Frio          100,00%     100,00%   100,00%
Ri Happy            100,00%     100,00%   100,00%
Som Livre            89,04%      97,92%    87,04%

Table 3 – Result evaluation
6. Conclusion
The evaluation of the presented results shows that a statistical method such as SVM can be used to solve the problem addressed here, which consists in deciding whether or not a page contains textual descriptions of product(s) for sale. The efficacy of the classifier depends on the quality of the positive and negative examples selected to compose the training set. Another important point addressed in this work is the adaptation of the tasks involved to the domain under analysis, such as the price recognition pattern presented in Section 4. Finally, we suggest as further work the application of the presented method to other decision problems in text classification, as well as testing it on other domains.
7. Acknowledgement
The authors are grateful to the DEEPSIA consortium (IST-1999-20483) and to CNPq (Proc. 68.0263/01-2) for financial support.
References
[1] A.S. Garção, P.A. Sousa, J.P. Pimentão, B.R. Santos, V. Blasquéz, L. Obratanski, Annex to DEEPSIA's Deliverable 4 – System Architecture, Technical Report of IST Project 1999-20483, January 2001.
[2] M. Song, A. Pereira, G. Gorgulho, S. Campos, W. Meira Jr., Model Checking Patterns for e-Commerce Systems, Proc. First International Seminar on Advanced Research in Electronic Business, Rio de Janeiro, Brazil, 2002, 2-10.
[3] J. Magalhães, C. Lucena, A Multi-Agent Product Line Architecture for Websearching, Proc. First International Seminar on Advanced Research in Electronic Business, Rio de Janeiro, Brazil, 2002, 12-16.
[4] B. Coutinho, G. Teodoro, T. Tavares, R. Pinto, D. Nogueira, W. Meira Jr., Assessing the impact of distribution on e-business services, Proc. First International Seminar on Advanced Research in Electronic Business, Rio de Janeiro, Brazil, 2002, 147-154.
[5] F. Milagres, E.S. Moreira, J. Pimentão, P. Sousa, Security Analysis of a Multi-Agent System in EU's DEEPSIA Project, Proc. First International Seminar on Advanced Research in Electronic Business, Rio de Janeiro, Brazil, 2002, 155-162.
[6] J.F. Herrera, J. Martins Junior, E.S. Moreira, A Model for Data Manipulation and Ontology Navigation in DEEPSIA, Proc. First International Seminar on Advanced Research in Electronic Business, Rio de Janeiro, Brazil, 2002, 139-145.
[7] R. Baeza-Yates, C. Badue, W. Meira Jr., B. Ribeiro-Neto, N. Ziviani, Distributed Architecture for Information Retrieval, Proc. First International Seminar on Advanced Research in Electronic Business, Rio de Janeiro, Brazil, 2002, 114-122.
[8] M. Rappa, Managing the Digital Enterprise: Business Models on the Web, E-Commerce Learning Center of North Carolina State University, http://ecommerce.ncsu.edu/research.html, last accessed on 20/04/2003.
[9] T. Joachims, A Probabilistic Analysis of the Rocchio Algorithm with TFIDF for Text Categorization, Universität Dortmund, 1997.
[10] F. Sebastiani, Machine Learning in Automated Text Categorization, Technical Report IEI-B4-31-12-99, Istituto di Elaborazione della Informazione, Consiglio Nazionale delle Ricerche, Pisa, Italy, 1999, 63pp.
[11] C.Y. Imamura, Pré-processamento para Extração de Conhecimento de Bases Textuais, Master's dissertation, Instituto de Ciências Matemáticas e de Computação, Universidade de São Paulo, São Carlos, Brazil, 2001, 92pp.
[12] J. Martins Junior, Classificação de páginas na Internet, Master's dissertation, Instituto de Ciências Matemáticas e de Computação, Universidade de São Paulo, São Carlos, Brazil, 2003, 81pp.
[13] D. Mladenic, Machine Learning on non-homogeneous, distributed text data, PhD thesis, Faculty of Computer and Information Science, University of Ljubljana, Ljubljana, 1998, 108pp.
[14] T. Joachims, Text Categorization with Support Vector Machines: Learning with Many Relevant Features, Universität Dortmund, 1998.
[15] V. Vapnik, The Nature of Statistical Learning Theory (Springer-Verlag, 1995).
[16] T. Joachims, Transductive Inference for Text Classification using Support Vector Machines, Proc. International Conference on Machine Learning (ICML), 1999.