Knowledge-Based Systems 21 (2008) 946–950

Contents lists available at ScienceDirect

Knowledge-Based Systems journal homepage: www.elsevier.com/locate/knosys

A knowledge-based question answering system for B2C eCommerce

Ali Ghobadi Tapeh a,*, Maseud Rahgozar b

a Database Research Group (DBRG), School of ECE, University of Tehran, Tehran, Iran
b Control and Intelligent Processing Center of Excellence, School of ECE, University of Tehran, Tehran, Iran

Article info

Article history:
Received 17 June 2007
Received in revised form 8 April 2008
Accepted 13 April 2008
Available online 20 April 2008

Keywords:
Comparative shopping
Question answering
Semantic similarity
Semantic correspondence

Abstract

Business-to-Consumer (B2C) eCommerce has evolved through several generations. The latest B2C models are comparative shopping systems that connect to multiple vendors' databases and collect the information requested by the user. The comparative result is then displayed in tabular form in the user's browser. Although this is much better than comparing multiple sites manually, the user still faces inconsistent user interfaces when linked from the comparison site to the actual purchasing site, and therefore has to learn the logic of each site's interface. In this paper, we propose a question answering system for retail (B2C) eCommerce based on natural language processing techniques. The system takes a question in natural language, decomposes it into keywords, and extracts its constraints automatically. The corresponding answers are then retrieved from the vendors' Web sites by exploiting the question constraints.

© 2008 Elsevier B.V. All rights reserved.

* Corresponding author. E-mail addresses: [email protected] (A.G. Tapeh), [email protected] (M. Rahgozar).
0950-7051/$ - see front matter © 2008 Elsevier B.V. All rights reserved. doi:10.1016/j.knosys.2008.04.005

1. Introduction

eCommerce began with the introduction of EDI between companies and of ATMs for banking [1,2]. The introduction of Web browsers opened a new age by combining the open Internet with easy-to-use interfaces [1,2]. B2C ordinarily refers to online trading and auctions, for example online stock trading markets and online auctions for computers and other goods. More generally, B2C eCommerce refers to the emerging commerce model in which businesses and consumers interact electronically. One of the best-known examples of B2C eCommerce is Amazon.com, an online bookstore that launched its site in 1995. In B2C eCommerce the focus is on enticing prospects and converting them into customers, retaining them, and sharing the value created in the process. The ultimate goal is to convert shoppers into buyers as aggressively and consistently as possible. In a typical B2C setting, information flows between business and consumer through the Internet. This flow includes product orders and service requests from customers, and product information, specifications, and services provided by the business.

B2C eCommerce is the predominant commercial experience of Web users. A typical scenario involves a user visiting one or several online shops, browsing their offers, and selecting and ordering products. Ideally, a user would collect information about the price, terms, and conditions (such as availability) of all, or at least all major, online shops and then select the best offer. But manual browsing is too time-consuming to be conducted on this scale, so a user typically visits one or very few online stores before making a decision.

B2C eCommerce has, however, evolved through several generations. The latest models are comparative shopping catalogs such as pricescan.com [3], which visit several shops, extract product and price information, and compile a market overview. The comparative result is then displayed in tabular form in the user's browser. This approach suffers from several drawbacks. First, these models must be granted access by vendors before they can query the vendors' databases. Since some vendors may not grant access, their product information will not appear in the results. We proposed a knowledge-based approach to resolve this problem in [15]: product and price information is understood and extracted directly from the Web pages of vendors' sites to build a virtual catalog. Second, the user still faces inconsistent user interfaces when linked from the comparison site to the actual purchasing site, and therefore has to learn the logic of each site's interface. For example, the user has to break a question down into keywords that fit the logic of the interface and give them to the system. The user thus cannot ask a question in natural language (such as English) and get an answer. Keywords tied to the interface logic of third-generation systems are not a good way for a user to communicate with a system [4]. First, a user may not be interested in extracting keywords from a question, or may be unable to do so. Second, a few keywords usually cannot cover the complete meaning of the question. In most cases users are looking for clear responses to their questions, whereas third-generation systems return a collection of answers related to the question that probably, but not certainly, contains the correct one.

In recent years, Question Answering (QA) systems have evolved out of the field of Information Retrieval to better meet the needs of information seekers. Unlike simple keyword-based information retrieval systems, they aim to communicate directly with users in natural language. They accept natural language questions and return exact answers, eliminating the burden of formulating queries and reading many irrelevant documents to find the answer. Open-domain QA systems deal with unrestricted questions over large-scale text corpora, typically by means of statistical approaches, whereas restricted-domain systems concentrate on a controlled domain of interest (e.g. weather forecasts or UNIX technical manuals). MELISA [5] is a good example of a restricted-domain QA system.

In this paper, we propose a QA system for B2C eCommerce. At present the system answers questions in the domain of digital cameras, but it can be extended to any retail domain. The system exploits an initial knowledge base, which gives it some advantages over open-domain QA systems (i.e. systems that have no specific domain knowledge [6–8]).

The remainder of the paper presents our approach in detail. After a short overview of related work in Section 2, Section 3 describes the system architecture. Section 4 explains how we define the concepts, relations, and instances of the initial knowledge. Section 5 describes our approach to analyzing NL questions. Section 6 reports the experiments we conducted on digital camera advertisements on the Web.
Finally, Section 7 presents the conclusion of this work.

2. Related work

Halo [9] is one of the most ambitious recent investments in knowledge-based question answering systems, "a staged, long-term research and development initiative toward the development of a 'Digital Aristotle' capable of answering novel questions and solving advanced problems in a broad range of scientific disciplines." In the pilot phase of the project, the state of the art in knowledge representation and reasoning was applied to a limited syllabus in chemistry with promising results [10]. Phase two of the project aims to promote technologies that ease knowledge entry and formulation for domain experts, reducing the cost of such knowledge-based systems. In [16], an application of knowledge-based QA systems is studied for a home agent robot. In [17], the authors present an approach for augmenting online text with knowledge-based question answering capabilities. As they argue, their prototypes were well received, but none achieved regular usage, primarily because of the incompleteness of the underlying knowledge bases. In [11], a system is introduced that considers four aspects (syntactic, lexical, semantic, and world knowledge) in the information extraction process [4,11]. In [13], another QA system, like [12], is introduced. TeLQAS [4] is another domain-specific knowledge-based QA system, which employs a reasoning engine built on an extended version of the Human Plausible Reasoning theory. The knowledge base of that system was filled manually with logical statements about fiber optics.

3. System architecture

In the proposed system, an agent makes natural language negotiation with the user possible. This agent analyzes the user's NL questions and extracts their keywords and conditions. In the next step, the extracted keywords are given to another agent, called the web crawler, which searches for and retrieves the related pages that include the same keywords. The retrieved pages are then passed to the information extraction agent, which extracts the user's exact answers using the question's keywords and conditions. Finally, the extracted results are displayed to the user by the user negotiation agent. Fig. 1 illustrates the architecture of the proposed QA system for B2C eCommerce. In this article we focus on the user negotiation agent; the other parts of the architecture are investigated in [15].

4. Knowledge extraction

Knowledge is defined as the concepts, their relationships, and the concept instances of a specific domain. Concepts and relationships are identified and defined by domain experts.
When we apply the knowledge to a Web page, the objects and relationships on the page are identified and associated with the concepts and relationships in the knowledge's conceptual model. Thus the strings on a Web page are recognized and understood in terms of the answers. Fig. 2 shows part of the knowledge extraction for digital camera advertisements. We have defined 10 relations between concepts, most of which are used to describe UML associations. The main relations defined in our knowledge extraction are PRO, OFR, OFD, ATT, VAL, SIM, and ISA. PRO means that a manufacturer produces a product. OFR defines offer relations between sellers and costs. OFD means that a price is offered by a seller. A concept has attributes, which are defined by ATT relations between concepts and their attributes. For some attributes we have defined values through VAL relations. SIM, which is the core of knowledge extraction, defines semantic correspondence between concepts. Finally, ISA defines kind-of relations between concepts.

Fig. 1. Architecture of the proposed QA system for B2C eCommerce.

Fig. 2. Digital camera-Ads knowledge representation (partial).

5. Question analysis

In the proposed system, the user can ask questions about sellers and products in natural language (i.e. English). Fig. 3 lists some typical user questions about digital cameras. The user negotiation agent must make the NL questions machine understandable, and it uses a question analyzer component for this job. This component analyzes the user's NL questions and extracts their keywords and conditions. The extracted keywords are given to the web crawler agent, which searches for and retrieves the related pages that include the same keywords. The retrieved pages are then passed to the information extraction agent, which extracts the user's exact answers using the question's keywords and conditions. During the analysis, the question analyzer uses the knowledge to understand the meaning of the question's words.

The system assumes that questions are asked in the format of the examples shown in Fig. 3. Under this assumption, most questions are combinations of words from the sets A1, A2, A3, and A4 shown in Fig. 4. First, the analyzer decomposes the question phrase with a part-of-speech tagger component and determines the role of each word in the phrase. Any word that belongs to set A4 is deleted. For example, consider question 'a' from Fig. 3. After part-of-speech tagging it is decomposed as follows:

(Canon)N (Powershot)N (with)R (image)N (sensor)N (of)R (5.0)N# (and)C (minimum)J (price)N

The words 'and', 'of', and 'with' belong to set A4 and are therefore omitted from the question. We thus obtain the following set of words:

W = {Canon, Powershot, image, sensor, 5.0, minimum, price}
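The deletion of A4 words after tagging can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the paper does not name its tagger, so the tagged input for question 'a' is written by hand, and treating the tags R and C (plus a hypothetical article tag "A" and the comma) as markers of set A4 is an assumption based on the example. The names `A4_TAGS` and `filter_a4` are hypothetical.

```python
# Hand-written tagger output for question 'a'; the paper's tagger is not specified.
TAGGED_QUESTION_A = [
    ("Canon", "N"), ("Powershot", "N"), ("with", "R"), ("image", "N"),
    ("sensor", "N"), ("of", "R"), ("5.0", "N#"), ("and", "C"),
    ("minimum", "J"), ("price", "N"),
]

# Assumption: these tags identify words of set A4 (the tag "A" for articles is hypothetical).
A4_TAGS = {"R", "C", "A", ","}


def filter_a4(tagged_words):
    """Return the initial keyword list W: every word whose tag is not an A4 tag."""
    return [word for word, tag in tagged_words if tag not in A4_TAGS]


W = filter_a4(TAGGED_QUESTION_A)
print(W)
```

A list is used instead of a set so that word order is preserved, since the later steps depend on which words are adjacent.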

Now we can extract the keywords and conditions of the question from the set W. The simplest way to determine the keywords is to treat every word in W as an independent keyword. In some cases, however, a combination of several words should be considered as one keyword (e.g. 'Canon Powershot' in question 'a'). Therefore, combinational keywords and the semantic correspondence of the question's words must also be considered in order to retrieve more relevant pages.

5.1. Extraction of combinational keywords

The question analyzer uses the knowledge to identify combinational keywords. If a combination of words in W is equivalent to one of A1's items or their values, it is considered a combinational keyword. In other words, for any wi ∈ W, if (wi + wi+1) ∈ A1 holds, we replace wi and wi+1 in W with the single keyword (wi + wi+1). For example, after extracting the combinational keywords of question 'a', its W is as follows:

W = {Canon Powershot, image sensor, 5.0, minimum, price}

In this example, 'Canon Powershot' is a digital camera model and 'image sensor' is one of the features of digital cameras. This information comes from the system knowledge.

5.2. Extraction of final keywords

Some words in the set W cannot be used as keywords without processing their relations and dependencies with other words. Consider the word 'minimum' in question 'a'. It cannot be an independent keyword because it is an adjective for the word 'price', indicating the minimum value among the product's prices. The words 'less' and 'best' in questions 'c' and 'd', respectively, are further examples of words that are not independent keywords. In addition, a question may contain words that do not appear in the Web pages, although words with a similar meaning do; for example, 'seller' is similar in meaning to 'vendor' and 'carrier'. Therefore, similarity in meaning must also be considered when extracting the final keywords of a question. As mentioned above, the system uses the knowledge to identify and extract such similarities. Finally, because the numbers in most questions specify the upper or lower end of a range, they are usually not suitable keywords either. Fig. 5 shows the final keywords extraction process, where the set KWI holds the final keywords and is initially equal to W. For example, applying this algorithm to question 'a' gives:

KWI = {Canon Powershot, image sensor, price, Seller, Vendor, Carrier, Store Name}

a) Canon Powershot with image sensor of 5.0 and minimum price
b) Canon Powershot SD600
c) Sony with price of $200 or less
d) Best sellers of Canon Powershot SD600

Fig. 3. Some typical user questions about digital cameras.

A1 = {Product Attributes, Seller Name, Manufacturer Name, Product Type, Product Name and Model}
A2 = {"Less", "Over", "Minimum", "Maximum", "Best", "Price", "Dollar", "Dollars"}
A3 = {"$", Digits}
A4 = {",", Article, Relative and Extra Letters}

Fig. 4. User questions' words.
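Using set A1 from Fig. 4, the combinational keyword rule of Section 5.1 can be sketched as a single left-to-right pass over W. The sketch below is a minimal illustration, not the authors' Java implementation: `A1_PHRASES` is a hypothetical stand-in for the A1 entries and values held in the knowledge base, and the case-insensitive comparison is an assumption.

```python
# Hypothetical stand-in for the A1 items (and their values) stored in the knowledge base.
A1_PHRASES = {"canon powershot", "image sensor"}


def merge_combinational(words, known_phrases=A1_PHRASES):
    """Merge adjacent words w_i, w_(i+1) into one keyword when the pair is a known A1 phrase."""
    merged = []
    i = 0
    while i < len(words):
        pair = " ".join(words[i:i + 2])
        if i + 1 < len(words) and pair.lower() in known_phrases:
            merged.append(pair)
            i += 2  # both words are consumed by the combined keyword
        else:
            merged.append(words[i])
            i += 1
    return merged


W = ["Canon", "Powershot", "image", "sensor", "5.0", "minimum", "price"]
W_COMBINED = merge_combinational(W)
print(W_COMBINED)
```

Only pairs of adjacent words are considered here, which matches the (wi + wi+1) rule in the text; longer phrases such as a three-word model name would need an extended lookup.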

Fig. 5. Final keywords extraction process algorithm.
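Since the algorithm of Fig. 5 is not reproduced in the text, the following Python sketch reconstructs one plausible version from the prose of Section 5.2 and its worked example. Everything in it is an assumption made for illustration: the modifier list, the `SIM` table, and in particular the `RELATED` link from 'price' to 'Seller' (guessed from the OFD relation, a price is offered by a seller), which is included only because it reproduces the KWI given for question 'a'.

```python
import re

# Assumptions, not taken from Fig. 5: modifier words, similarity table, related concepts.
MODIFIERS = {"less", "over", "minimum", "maximum", "best"}
SIM = {"Seller": ["Vendor", "Carrier", "Store Name"]}
RELATED = {"price": ["Seller"]}  # guessed from the OFD relation
NUMBER = re.compile(r"^\$?\d+(\.\d+)?$")


def final_keywords(words):
    """Derive KWI from W: drop modifiers and numbers, then add related and similar terms."""
    kwi = [w for w in words if w.lower() not in MODIFIERS and not NUMBER.match(w)]
    for word in list(kwi):
        for concept in RELATED.get(word.lower(), []):
            for term in [concept] + SIM.get(concept, []):
                if term not in kwi:
                    kwi.append(term)
    return kwi


KWI = final_keywords(["Canon Powershot", "image sensor", "5.0", "minimum", "price"])
print(KWI)
```

The dropped modifiers and numbers are not lost in the overall process; they are the raw material for the conditions of the question rather than for its search keywords.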


Note that the keywords obtained through similarity in meaning must be combined with each other by OR, and the remaining keywords by AND.

5.3. Extraction of conditions

By analyzing the role and position of the words in the question's phrases and sentences, we can extract the conditions and constraints of the question [14]. For example, if the word 'minimum' occurs in the question phrase, we know that it acts as an adjective for the next word in the phrase and specifies that the next word must have the least possible value, in quality or quantity, in the specific domain. So we can extract the conditions and constraints of a question from its W (i.e. the initial keyword set of the question) by processing the order and roles of the words. For example, consider question 'c' in Fig. 3. We have W = {sony, price, $200, less}, where w1 = sony, w2 = price, w3 = $200, and w4 = less. By processing the order and roles of the words, it can be inferred that w4 is related to w2, because 'price' is the nearest non-numeric word to the left of 'less'. We can therefore conclude that the user means the condition 'price ≤ $200'. Fig. 6 shows the conditions extraction process, where the set C stores the question's conditions and W = {w1, w2, ...} denotes the initial keywords of the question.

6. Preliminary experiments

This section describes the experiments we conducted to verify the validity of our approach. First we describe how the underlying knowledge was created and implemented. Then we present the evaluation of the proposed approach.

6.1. Creation of the knowledge

Determining and defining the requisite knowledge for knowledge-based systems is a cumbersome task. Classical expert systems took years to be crafted by highly skilled knowledge engineers. In this system, we need mechanisms to represent the knowledge about the working domain effectively.
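One simple way to picture such a representation is a list of (subject, relation, object) triples that uses the relation names from Section 4. The sketch below is only an illustration: the paper stores its knowledge in a text format whose exact syntax is not shown, and the concrete triples, the `Triple` type, and the `objects_of` helper are all hypothetical.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str
    relation: str  # one of PRO, OFR, OFD, ATT, VAL, SIM, ISA
    obj: str


# Illustrative triples only; they are not taken from the paper's knowledge base.
KNOWLEDGE = [
    Triple("Canon", "PRO", "Canon Powershot"),           # manufacturer produces product
    Triple("Canon Powershot", "ISA", "Digital Camera"),  # kind-of relation
    Triple("Digital Camera", "ATT", "image sensor"),     # concept has attribute
    Triple("Seller", "SIM", "Vendor"),                   # semantic correspondence
    Triple("Seller", "SIM", "Carrier"),
]


def objects_of(subject, relation, knowledge=KNOWLEDGE):
    """Return every object linked to `subject` by `relation`, in insertion order."""
    return [t.obj for t in knowledge if t.subject == subject and t.relation == relation]


print(objects_of("Seller", "SIM"))
```

A flat list of triples keeps the templates easy for domain experts to fill in, at the cost of a linear scan per lookup; an index keyed by (subject, relation) would be a natural refinement for larger knowledge bases.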
To implement our system knowledge, several digital camera domain experts were asked to fill simple templates with the triple relations they were familiar with. The basics of the system's knowledge-conceptual model were explained to them in advance so that they understood what types of relations were needed. Finally, we implemented the concepts and their relationships (as defined by the domain experts) in a text format, a partial view of which is shown in Fig. 2.

Fig. 6. Conditions extraction process algorithm.

6.2. Evaluation of approach

We tested the proposed approach by developing a tool, all components of which were written in Java. First, we manually retrieved 140 Web pages (i.e. product and seller information) in the domain of digital cameras from pricescan.com and nextag.com. We then applied these pages to our system and asked 70 questions in natural language. We use the precision and recall measures to evaluate the performance of the system. Table 1 shows the statistical details of the system inputs and outputs.

Table 1
Statistical details of system inputs and outputs

Total number of questions (inputs)                        70
Number of digital camera domain related questions         68
Number of questions unrelated to the domain                2
Number of correctly processed questions (outputs #1)      55
Number of incorrectly processed questions (outputs #2)    13

Fig. 7. Typical output of QA system.

In a statistical classification task, the precision for a class is the number of true positives (i.e. the number of items correctly labeled as belonging to the class) divided by the total number of elements labeled as belonging to the class (i.e. the sum of true positives and false positives, the latter being items incorrectly labeled as belonging to the class). Recall in this context is defined as the number of true positives divided by the total number of elements that actually belong to the class. According to these definitions, system precision and recall are calculated as follows:


\[ \text{Precision} = \frac{\text{Number of correctly processed questions}}{\text{Total number of questions}} \approx 0.78 \]

\[ \text{Recall} = \frac{\text{Number of correctly processed questions}}{\text{Number of digital camera domain related questions}} \approx 0.81 \]

Therefore, precision and recall were about 0.78 and 0.81, respectively. Fig. 7 shows a typical output of our QA system.

7. Conclusion

In this paper we reported on a knowledge-based, domain-specific question answering system for B2C eCommerce. Although the problem has been studied by several researchers, existing techniques are limited to specific heuristics and databases. We proposed an effective method for decomposing the user's NL questions and extracting their keywords and conditions automatically. As a next step, we will extend the system to cover all formats of users' questions. We believe that, as the expectations of QA users rise, it will become inevitable to employ more sophisticated AI techniques in QA systems. Ultimately, an ideal QA system will converse with users to fully understand their information needs.

References

[1] EDI Forum, 2006. Available from: .
[2] R. Kalakota, A.B. Whinston, Electronic Commerce, A Manager's Guide, first ed., Addison Wesley Professional, 1997 (Chapter 1).
[3] Product Comparison Shopping in PriceSCAN.com, 2006. Available from: .
[4] E. Darrudi, F. Oroumchian, M. Rahgozar, M.S. Mirian, K. Neshatian, B.R. Ofoghi, TeLQAS: a realization of humanlike inferences for knowledge-based question answering systems, Journal of Computational Linguistics, Submitted for publication.
[5] J.M. Abasolo, M. Gómez, MELISA: an ontology based agent for information retrieval in medicine, ECDL 2000 Workshop on the Semantic Web, Lisbon, Portugal, 2000.
[6] D. Moldovan, S. Harabagiu, R. Gîrju, P. Morărescu, F. Lăcătuşu, A. Novischi, A. Bădulescu, O. Bolohan, LCC tools for question answering, in: Proceedings of the 11th Text REtrieval Conference (TREC-2002), Gaithersburg, MD, pp. 144–155.
[7] Hui Yang, Tat-Seng Chua, Shuguang Wang, Modeling web knowledge for answering event-based questions, in: 12th International World Wide Web Conference (WWW'03), May 2003, Hungary.
[8] B. Magnini, M. Negri, R. Prevete, H. Tanev, Mining knowledge from repeated co-occurrences: DIOGENE, in: Proceedings of the 11th Text REtrieval Conference (TREC-2002).
[9] N.S. Friedland et al., Project Halo: towards a digital Aristotle, AI Magazine 25 (4) (2004) 29–47.
[10] N.S. Friedland et al., Towards a quantitative, platform-independent analysis of knowledge systems, in: Proceedings of the Ninth International Conference of Knowledge Representation and Reasoning, AAAI Press, Menlo Park, CA, 2004, pp. 507–515.
[11] S. Harabagiu, D. Moldovan, C. Clark, M. Bowden, A. Hickl, P. Wang, Employing two question answering systems in TREC-2005, in: Proceedings of TREC Conference at NIST, 2005.
[12] C.D. Fellbaum (Ed.), WordNet: An Electronic Lexical Database, The MIT Press, Cambridge, London, England, 1998.
[13] E.M. Voorhees, Overview of the TREC 2002 question answering track, in: The Eleventh Text REtrieval Conference (TREC-2002), NIST Special Publication: SP 500-251.
[14] C.S. Lee, Y.F. Kao, Y.H. Kuo, M.H. Wang, Automated ontology construction for unstructured text documents, Journal of Data & Knowledge Engineering 60 (2007) 547–566.
[15] Ali Ghobadi Tapeh, Masoud Rahgozar, A virtual catalog generated from web pages of vendors for comparative shopping, in: Proceedings of 4th International Conference on Information Technology – New Generations (ITNG2007; Web Technology Track), April 2–4, IEEE Computer Society, Las Vegas, USA, 2007, pp. 463–468.
[16] Hoojung Chung et al., A practical QA system in restricted domains, in: Proceedings of the ACL 2004 Workshop on Question Answering in Restricted Domains, Spain, 2004.
[17] P. Clark, J. Thompson, B. Porter, A knowledge-based approach to question-answering, in: The AAAI'99 Fall Symposium on Question-Answering Systems, AAAI, CA, 1999, pp. 43–51.
