COMMA: A Result-oriented Composite ...

COMMA: A Result-oriented Composite Autocompletion Method for e-Marketplaces

Matteo Palmonari, Giuseppe Vizzari
DISCo, University of Milano-Bicocca
Viale Sarca, 336/14, 20126, Milan, Italy
{palmonari,vizzari}@disco.unimib.it

Andrea Broglia, Nicola Lamberti, Riccardo Porrini
7Pixel Srl
Via Copernico, 6, 20082, Binasco (MI)
{broglia,lamberti,porrini}@trovaprezzi.it

Abstract: Autocompletion systems support users in the formulation of queries in different computer systems, from development environments to the web. In this paper we describe Composite Match Autocompletion (COMMA), a lightweight approach to introducing semantics into an autocompletion matching algorithm over semi-structured data. The approach is formally described, then applied and evaluated with specific reference to the e-commerce context. The semantic extension to the matching algorithm exploits available information about product categories and distinguishing features of products to improve the handling of exploratory queries. COMMA supports a seamless management of both targeted/precise queries and exploratory/vague ones, combining different filtering and scoring techniques. The algorithm is evaluated with respect to both effectiveness and efficiency in a real-world scenario: the achieved improvement is significant and does not come with an appreciable increase in computational costs.

Keywords: autocompletion systems; semantic approaches;

I. INTRODUCTION

An ordinary e-marketplace typically performs a matchmaking activity between a set of products and search queries submitted by the users. Maximizing the accuracy and coverage of the matchmaking algorithm over search queries is a crucial aim that a successful e-marketplace must achieve. Providing an autocompletion interface is a recent but already established approach to improve product search functionalities, as testified by the fact that many successful search engines for e-commerce, like the ones implemented by Amazon¹ [1], Bing², Google³ and Yahoo⁴, provide autocompletion interfaces to guide users in their searches for products and offers. For price comparison engines like ShoppyDoo⁵ or PriceGrabber⁶ (or other e-marketplace aggregators), providing an autocompletion feature as an outsourced service to e-marketplaces with poor retrieval features represents an interesting business opportunity.

¹ http://www.amazon.com/
² http://www.bing.com/shopping
³ http://www.google.it/shopping
⁴ http://shopping.yahoo.com/
⁵ http://www.shoppydoo.com
⁶ http://www.pricegrabber.com

An autocompletion interface can perform two types of autocompletion operations. It can complete the query that the user is expressing, proposing a set of queries that are most relevant and reasonable with respect to the query fragment already typed by the user; this type of autocompletion can be called query-oriented [2], [3], [4], [5]. The autocompletion interface can also preview the results of the query fragment that is being typed by the user; this type of autocompletion can be called result-oriented [6], [7], [8], [9], [10].

Most of the autocompletion interfaces proposed so far in the e-commerce domain (e.g., Amazon, and Bing, Google and Yahoo Shopping) are based on query-oriented autocompletion. In fact, all the above systems integrate product offers coming from several e-marketplaces, achieving an almost complete coverage of the products available on the market. However, when a query is submitted to a single small/medium-sized e-marketplace, correctly completing a query that would not lead to any result, or that would lead to poorly ranked results, would not be effective. In these cases, result-oriented autocompletion can be highly effective, by previewing only product offers available in the e-marketplace, and by providing an insight into the internal working of the search functionality. In this paper we focus on the latter scenario, investigating matching and ranking techniques to support result-oriented autocompletion for small/medium e-marketplaces.

One of the main problems that a good autocompletion system should address is to seamlessly handle the different types of query that can be submitted by a user. Autocompletion techniques employing string-based matching algorithms over offers’ titles and descriptions can return accurate results when users submit queries targeted to specific products (e.g., “Samsung i7500”), and can be made error-tolerant by adopting approximate matching methods [11], [7], [8] over misspelled terms (e.g., “Samsong 7500”). However, sometimes user queries are not aimed at describing a specific product that is being searched; they are rather aimed at exploring the set of available products or offers, looking for products of a given category (e.g. mobile phones) or characterized by specific features (e.g. Android as operating system). Sometimes, more targeted and more generic keywords are even mixed in a hybrid type of query (e.g. “Samsung Android”). As an example, in the Lina24⁷ e-marketplace, 54.13% of the 242 most popular queries are targeted queries, whereas exploratory queries and hybrid queries cover respectively a significant 21.90% and 23.97% of the queries.

⁷ http://www.lina24.com

The hierarchies of product categories, the most significant product features and the systematic annotation of products with respect to these metadata schemes represent an invaluable source of information for handling many kinds of query. Indexing the terms contained in these annotations enables an autocompletion interface to provide results also for exploratory and hybrid queries, but yields a relatively poor ranking, as shown by the experiments discussed later on. For the sake of clarity, we will call semantic techniques any matching techniques aimed at handling these annotations (categories and product features) differently from pure textual descriptions, in the vein of research carried out in semantic search [12].

In this paper we present Composite Match Autocompletion (COMMA), a novel semantic approach to the realization of a result-oriented Autocompletion System (AS) over semi-structured data, based on the integration of syntactic and semantic matching techniques. COMMA has been implemented and tested in a real-world scenario as an outsourced service installed in an Italian e-marketplace. Experimental results show that the adopted approach is viable and scalable, and that it improves the accuracy and coverage of the autocompletion results when compared to the naive syntactic approach (i.e. indexing category and product feature terms), while respecting the strict time constraints of an AS.

The approach proposed in this paper is novel with respect to other autocompletion methods proposed so far. To the best of our knowledge, most of the approaches to result-oriented AS adopt only syntactic matching techniques (e.g., [11], [7], [8]); the few approaches considering semantics assume that data have been annotated with Web ontology terms [13], [10] and, more significantly, do not consider the problem of ranking the matches (in addition, the efficiency of these approaches has not been systematically evaluated).
Finally, Bing Shopping, which is the only system that explicitly considers categories and product features and uses them for autocompletion, is a query-oriented AS and works very differently from the approach proposed in this paper.

The paper is organized as follows: the autocompletion problem and the application domain are discussed in Section II; the matching algorithm, at the core of the COMMA approach, is described in Section III; the experiments carried out to evaluate the system are discussed in Section IV; the comparison with related work is discussed in Section V, and Conclusions end the paper.

II. PROBLEM DEFINITION

An e-marketplace publishes its product offers as semi-structured documents, which contain several descriptive fields such as summaries, free-text descriptions, and so on. We focus on three types of elements that occur in the offer descriptions published by most e-marketplaces (see, e.g., Lina24). An offer is associated with exactly one title that briefly describes the offer content (e.g. Samsung 7.000 16 GB Galaxy Tab with Android 2.2 Wi-Fi Only). It is also associated with exactly one category (e.g. Tablet PCs) taken from a category hierarchy. Finally, it is associated with a set of facets describing technical features of the offered item, represented by ⟨attribute name, value⟩ pairs (e.g. “O.S.: Android 2.2”, “battery capacity: 4000 mAh” and so on). An offer is thus represented by the triple $o = \langle t^o, c^o, F^o \rangle$, where $t^o$ is an offer name, $c^o$ is a category, and $F^o = \{f_1, ..., f_n\}$ is a set of facets.

Categories and facets are annotations organized according to a simple, yet semantic, structure. When an offer $o$ is associated with a category $c$ (or a facet $f$), we also say that $o$ is annotated with $c$ (or $f$); since each offer $o$ is associated with exactly one category $c$, we also say that $o$ belongs to $c$. Titles, categories and facets will also be called descriptive dimensions (dimensions, for short). Since one facet can occur in offers belonging to several categories (e.g. “screen size” can occur in offers describing monitors, mobile phones, laptops, and so on), categories and facets represent two different and orthogonal annotation schemes. Moreover, it must be observed that an offer is usually annotated with many facets, one facet attribute can be associated with several values, and one facet value can be associated with several attributes (e.g. “recording format: mp3” and “supported media: mp3”).

A. The Autocompletion Problem

A result-driven Autocompletion System (AS) presents a set of $k$ results, which are the most relevant to the keyword-based query typed by the user, by processing the query at each interaction of the user with the system. Observe that every query fragment (a query where the last keyword typed in by the user is a string representing a word fragment, e.g., “smartphone nok”) is considered an input query by the AS. In order to be effective and perceived as instantaneous, the autocompletion operation must be completed in a relatively short time span (i.e., at most 100 ms) [14]: this constraint seriously limits the possibility of exploiting complex techniques to improve the accuracy and coverage of the autocompletion results. Since there is an element of distraction caused by User Interface interrupts, which can overcome the user while typing the query, the high quality of the data retrieved by the AS must justify the distraction caused to the user by the AS [15]. An AS therefore has to fulfill both efficiency and effectiveness requirements, and since an improvement in the latter usually has a negative impact on the former, finding a good trade-off between efficiency and effectiveness is a major goal for matching algorithms developed in this context.

Targeted queries (e.g. “Samsung Galaxy Tab”) are submitted by users that look for a specific product; the keywords used in this case usually point to specific terms used in offer titles (e.g. brand and model). These queries can be handled using well-known exact (e.g. prefix matching) and approximate matching techniques [11], which support more flexible autocompletion, providing results also when query terms are misspelled [7], [8]. However, even approximate matching techniques would be unsuccessful when an exploratory query is submitted. Consider a query such as “Tablet PC with Office Mobile”, which refers to a class of products rather than to a specific product. In this query, neither the keywords nor any string approximately matching these keywords occur in the offers’ titles. Instead, many keywords used in exploratory queries aim at product categories (e.g. Tablet PC) and/or technical features. These pieces of information are often available from the annotations, i.e. categories and facets, associated with the offers. Furthermore, the distinction between targeted and exploratory queries is not sharp; in fact, queries such as “Samsung Tab Office Mobile” can contain terms referring to specific products as well as to generic product features; we will refer to these queries as hybrid queries.

Semantic matching can therefore play a crucial role in effectively producing results for exploratory (and hybrid) queries, by considering well-structured information associated with the offers. One of the greatest challenges for a result-driven AS that seamlessly processes all the above types of queries consists in combining syntactic and semantic matching so that the capability of handling more types of queries (improving the coverage of the AS) does not lower the quality of the results returned when targeted queries are submitted.

III. THE MATCHING ALGORITHM

The COMMA approach consists of three main phases:
• indexing (off-line): offer descriptions are indexed along three different dimensions: their title, their category and the facets they are annotated with;
• filtering (runtime): given a query fragment $q$ submitted by the user, the matching algorithm applies three filters, one for each descriptive dimension; one filter selects offers whose titles match the query (syntactic matching); one filter selects offers whose categories match the query (semantic matching); one filter selects offers whose facets match the query (semantic matching); the results of the three filters are combined by means of a set union, returning an overall set of matching offers;
• ranking (runtime): several scoring methods are applied to rank the offers selected by the filters; local scoring methods evaluate the relevance of each offer to the query by considering different criteria such as popularity, the strength of the syntactic match and the semantic relevance; the local scores are combined in a final relevance score, which is used to rank the offers and define the list of top-k relevant offers.

Ranking is applied after filtering because most of the scores applied in ranking reuse the results of matches discovered during filtering; moreover, this approach limits the number of items to which scoring functions are applied, achieving better efficiency.

A. Indexing

In order to index the titles, categories and facets associated with each offer, three inverted indexes are built: $I^T$, $I^C$ and $I^F$ respectively denote the title, category and facet indexes. In the facet index $I^F$ only facet values are indexed, under the assumption that the attribute names of facets are not very significant for searches (e.g. a user is more likely to search for “phone Samsung android” than for “phone brand Samsung operating system Android”).
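As an illustration of the three indexes, the following minimal Python sketch builds one inverted index per dimension. It is not the authors' implementation (COMMA is written in Java on top of Lucene): the offer dictionaries, the `tokenize` helper and the sample data are assumptions made for this example, stemming and stop-word removal are omitted, and category terms are mapped directly to offers rather than to categories for brevity.

```python
import re
from collections import defaultdict

def tokenize(text):
    # Simplified stand-in for the indexing pipeline: lowercase the string and
    # split on anything that is not a letter or a digit (no stemming here).
    return [t for t in re.split(r"[^0-9a-z]+", text.lower()) if t]

def build_indexes(offers):
    """Return (title_index, category_index, facet_index): term -> set of offer ids."""
    title_index = defaultdict(set)
    category_index = defaultdict(set)
    facet_index = defaultdict(set)
    for offer_id, offer in offers.items():
        for term in tokenize(offer["title"]):
            title_index[term].add(offer_id)
        for term in tokenize(offer["category"]):
            category_index[term].add(offer_id)
        for _attribute, value in offer["facets"]:
            # Only facet values are indexed; attribute names are ignored.
            for term in tokenize(value):
                facet_index[term].add(offer_id)
    return title_index, category_index, facet_index

# Made-up sample offers, for illustration only.
offers = {
    1: {"title": "Samsung Galaxy Tab 16 GB", "category": "Tablet PCs",
        "facets": [("O.S.", "Android 2.2")]},
    2: {"title": "Nokia c5-03", "category": "Mobile phones",
        "facets": [("brand", "Nokia")]},
}
title_index, category_index, facet_index = build_indexes(offers)
print(sorted(title_index["nokia"]))    # [2]
print(sorted(facet_index["android"]))  # [1]
print("brand" in facet_index)          # False: attribute names are not indexed
```

Keeping the three indexes separate is what later allows a different matching policy to be applied to each dimension.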

All the indexes are built by extracting every term occurring in the indexed string (respectively the name, the category identifier and the facets’ values); we tokenize the strings representing product codes (e.g. from “Nokia c5-03” we extract the three terms “nokia”, “c5” and “03”); terms are stemmed and stop-words are removed. We will call title terms, category terms and facet terms the elements belonging respectively to the sets $I^T$, $I^C$ and $I^F$.

B. Syntactic and Semantic Filtering

Let $O^q_{SYN}$, $O^q_{CAT}$, and $O^q_{FAC}$ be the sets of offers whose textual descriptions (titles in our case), categories and facets respectively match a query $q$. The set $O^q$ of offers matching a query $q$ can be defined as follows:

$$O^q = O^q_{SYN} \cup O^q_{CAT} \cup O^q_{FAC} \tag{1}$$

In other words, all the offers that match the input query along at least one dimension are selected in the filtering phase. The next sections explain how each matching set is obtained.

1) Syntactic Filter: The set $O^q_{SYN}$ of the offers that syntactically match a query fragment is computed by performing a set union of the results of two filters: a prefix-based string matcher, and an approximate string matcher (asm) based on normalized edit distance ($nedit$) and a filtering threshold $th_{asm}$ [11]. A title term $t \in I^T$ pref-matches an input keyword fragment $k$ iff $k$ is a prefix of $t$; a title term $t \in I^T$ asm-matches an input keyword fragment $k$ iff $nedit(k, t) \le th_{asm}$. An offer $o$ pref-matches (asm-matches, respectively) an input keyword fragment $k$ iff there is at least one title term $t$ of $o$ such that $t$ pref-matches (asm-matches) $k$. An offer $o$ pref-matches (asm-matches, respectively) a query $q = \{k_1, ..., k_n\}$ if and only if $o$ pref-matches (asm-matches) every keyword $k_i \in q$, with $1 \le i \le n$. The set $O^q_{SYN}$ of offers syntactically matching an input query $q$ is defined as the union of the set of all the offers pref-matching $q$ and the set of all the offers asm-matching $q$.
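The definitions above can be sketched in a few lines of Python. This is a hedged illustration rather than the paper's code: the normalization of the edit distance (division by the length of the longer string), the threshold value `TH_ASM = 0.25` and the linear scan over offers (instead of an index lookup) are all assumptions of this example. How the approximate matcher treats incomplete keyword fragments is not specified here, so the sketch applies $nedit$ to whole strings only.

```python
def edit_distance(a, b):
    # Classic dynamic-programming Levenshtein distance, one row at a time.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def nedit(a, b):
    # Assumption: normalize by the length of the longer string.
    longest = max(len(a), len(b))
    return edit_distance(a, b) / longest if longest else 0.0

TH_ASM = 0.25  # hypothetical threshold, chosen only for this example

def pref_match(term, keyword):
    return term.startswith(keyword)

def asm_match(term, keyword):
    return nedit(keyword, term) <= TH_ASM

def offer_matches(title_terms, query, matcher):
    # Every keyword must be matched by at least one title term.
    return all(any(matcher(t, k) for t in title_terms) for k in query)

def syntactic_filter(titles, query):
    """titles: offer id -> list of title terms; query: list of keywords."""
    return {oid for oid, terms in titles.items()
            if offer_matches(terms, query, pref_match)
            or offer_matches(terms, query, asm_match)}

titles = {1: ["nokia", "c5", "03"], 2: ["nokia", "5250", "blue"]}
print(syntactic_filter(titles, ["nok", "c"]))  # {1}: prefix match on both keywords
print(syntactic_filter(titles, ["nakia"]))     # {1, 2}: nedit("nakia", "nokia") = 0.2
```

Note that the union is taken at the level of whole queries: an offer enters the result either because all keywords pref-match or because all keywords asm-match.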
As an example, consider a query $q$ = “nakia c” and two offers named “Nokia c5-03” and “Nokia 5250 Blue”. “Nokia c5-03” syntactically matches $q$ because “nakia” asm-matches the keyword “Nokia” and “c” asm-matches the keyword fragment “c” (observe that this is not a pref-match, because “nakia” does not pref-match “Nokia”); “Nokia 5250 Blue” instead does not syntactically match $q$, because there is no term in the offer title pref-matching or asm-matching the keyword fragment “c”.

2) Semantic Filters: The goal of the semantic filters is to identify a set of offers that are potentially relevant to a query where a number of category and/or facet names occur. We define two semantic filters, one based on category matching (cat-matching), and one based on facet matching (fac-matching). Before applying semantic filters, stemming and stop-word removal are applied to the query submitted by the user, obtaining a normalized query to which the filters are applied; in the following definitions $q$ refers to this normalized query.

Intuitively, the cat-matching filter returns all the offers that belong to some category matching a query $q$. This set $O^q_{CAT}$ is defined as follows. A category $c$ matches a keyword $k$ iff there exists a category term $i \in I^C$ associated with $c$ such that $i = k$ (remember that category names can contain several terms, and that an indexed category term can be associated with several categories); an offer $o$ cat-matches a query $q = \{k_1, ..., k_n\}$ iff it belongs to a category $c$ that matches at least one keyword $k \in q$.

Intuitively, the fac-matching filter returns all the offers annotated with some facet matching a query $q$; this set $O^q_{FAC}$ is defined as follows. A facet $f$ matches a keyword $k$ iff there exists a facet term $i \in I^F$ associated with $f$ such that $i = k$ (remember that only facet values are considered in the matching process); an offer $o$ fac-matches a query $q = \{k_1, ..., k_n\}$ iff it is annotated with at least one facet $f$ that matches at least one keyword $k \in q$ (remember that an offer can be annotated with several facets). Given a query $q$, we also identify the sets $C^q$ and $F^q$, respectively representing the set of all the categories and the set of all the facets that match at least one keyword of $q$.

As an example, consider a query $q$ = “players with mp3”, which is transformed into the query “player mp3” after stemming and stop-word removal; since the category “mp3 and mp4 players” matches the query, all the offers belonging to this category are included in the $O^q_{CAT}$ set; since several facets such as “compression: mp3” (where “mp3” is the value of the attribute “compression”), “audio format: mp3” and “recording format: mp3” match $q$ as well, every offer annotated with these facets is included in the $O^q_{FAC}$ set.

Observe that while an offer is required to syntactically match all the terms in the query, an offer needs to semantically match only one keyword in order to be considered in the semantic matches for that query.
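The asymmetry just noted (all keywords for the syntactic filter, at least one keyword for the semantic filters) can be made concrete with a small Python sketch of the two semantic filters. The function name `semantic_filter`, the dictionaries and the sample offers are hypothetical, and the category and facet terms are assumed to be already stemmed and free of stop-words.

```python
def semantic_filter(query, annotation_terms, offer_annotations):
    """Return (matching annotations, offers carrying at least one of them).

    query: set of normalized keywords
    annotation_terms: annotation (category or facet) -> set of indexed terms
    offer_annotations: offer id -> set of annotations attached to the offer
    """
    # An annotation matches if at least one of its terms equals some keyword.
    matching = {a for a, terms in annotation_terms.items() if terms & query}
    offers = {oid for oid, anns in offer_annotations.items() if anns & matching}
    return matching, offers

# Made-up data: terms are assumed stemmed, only facet values are indexed.
category_terms = {
    "mp3 and mp4 players": {"mp3", "mp4", "player"},
    "mobile phones": {"mobile", "phone"},
}
facet_terms = {
    ("compression", "mp3"): {"mp3"},
    ("recording format", "mp3"): {"mp3"},
    ("o.s.", "android"): {"android"},
}
offer_category = {1: {"mp3 and mp4 players"}, 2: {"mobile phones"}, 3: {"mobile phones"}}
offer_facets = {
    1: {("compression", "mp3")},
    2: {("recording format", "mp3")},
    3: {("o.s.", "android")},
}

q = {"player", "mp3"}  # "players with mp3" after normalization
C_q, O_cat = semantic_filter(q, category_terms, offer_category)
F_q, O_fac = semantic_filter(q, facet_terms, offer_facets)
print(O_cat)  # {1}
print(O_fac)  # {1, 2}: offer 2 matches through the single keyword "mp3"
```

Offer 2 is a mobile phone that does not match the keyword “player” at all, yet it is retained; deciding how relevant it is becomes the job of the ranking phase.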
The reason for such a difference is that, while for precise queries targeted to retrieve offers based on their titles the user has more control over the precision of the results (under the assumption that the user knows what he/she is looking for), for exploratory queries containing generic category and facet terms we want to consider all the offers potentially relevant to these vaguer terms. Although the semantic match significantly expands the set of matching offers, the ranking process is able to discriminate the offers that are more relevant to the given query.

C. Ranking and Top-k Retrieval

Several criteria are used to rank the set of matching offers. Each criterion produces a local relevance score; all the local scores are combined by a linear weighted function which returns a global relevance score for every matching offer; the top-k offers according to the global score are finally selected to be presented to the user.

1) Basic Local Scores: Three basic local scores are introduced to assess the relevance of an offer with respect to a query submitted by a user.

Popularity: $pop(o)$ of an offer $o$ is a normalized value in the range [0, 1] that represents the popularity of the offer based on the number of views it received. An absolute offer popularity score is assessed off-line by analyzing log files and is frequently updated; the absolute popularity score is normalized at run-time by distributing the absolute popularity of the offers matching the input query into the range [0, 1].

Syntactic relevance: offers retrieved by exact (prefix) matching should be rewarded with respect to the offers approximately matched (e.g. the order of digits makes a difference in technical terms like product codes); we therefore define a syntactic relevance score that discriminates between the offers matched by the two methods. The syntactic relevance $syn_q(o)$ of an offer $o$ with respect to a query $q$ assumes a value $m$ if $o$ is a pref-match for $q$, $n$ if $o$ is an asm-match for $q$, and 0 otherwise, where $m > n$.

Semantic relevance: semantic relevance is based on the principle that the more matching facets occur in an offer, the more relevant this offer is with respect to the query. Semantic relevance does not consider categories because the number of offers belonging to a category is generally high and because an offer belongs to one category only. The semantic relevance $sem_q(o)$ of an offer $o$ to an input query $q$ is computed as the number of facets in $F^o$ (the facets associated with $o$) that match $q$, normalized over the total number of facets matching the query, $F^q$, according to the following formula:

$$sem_q(o) = \frac{|F^q \cap F^o|}{|F^q|} \tag{2}$$

2) Intersection Boost with Facet Salience: When a user submits a targeted query (see Section II), the basic local scores introduced above perform quite well. When the matching is driven mostly by syntactic criteria, the accuracy of the syntactic match and the popularity become key ranking factors. However, exploratory queries are usually more ambiguous (each single keyword can match several categories and several facets at the same time, and matches for each keyword are combined by means of set union); in these cases, the semantic relevance score (see equation 2) based on the matching facets can be too weak, with matching categories not even being considered in the score.
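To see why the semantic relevance alone can be too weak, consider the following minimal Python sketch of the syntactic and semantic local scores. The values `M_PREF` and `N_ASM` are hypothetical (the text only requires $m > n$), and the two offers are invented: both carry the facet “type: robot”, but only one of them is a vacuum cleaner.

```python
M_PREF, N_ASM = 1.0, 0.5  # hypothetical values; the only requirement is m > n

def syn(is_pref_match, is_asm_match):
    # Exact (prefix) matches are rewarded over approximate ones.
    if is_pref_match:
        return M_PREF
    if is_asm_match:
        return N_ASM
    return 0.0

def sem(offer_facets, query_facets):
    # Equation (2): share of the query's matching facets found in the offer.
    if not query_facets:
        return 0.0
    return len(query_facets & offer_facets) / len(query_facets)

F_q = {("type", "robot")}  # facets matching a query such as "robot vacuum cleaners"
robot_vacuum = {("type", "robot"), ("bag", "bagless")}   # category: vacuum cleaner
kitchen_robot = {("type", "robot"), ("power", "800 W")}  # category: kitchen tools

print(sem(robot_vacuum, F_q))   # 1.0
print(sem(kitchen_robot, F_q))  # 1.0: same score, the category plays no role
```

Both offers obtain the same semantic relevance, since equation (2) looks at facets only; this is the gap that the intersection boost is meant to close.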
As an example, all the offers belonging to the category “vacuum cleaner” or annotated with the facet “type: robot” match the query “robot vacuum cleaners”; however, some of these offers describe vacuum cleaners having “type: robot”, while some others describe other categories of kitchen tools having “type: robot”. The ranking scores defined so far do not discriminate between the two kinds of offers because they do not consider their categories; intuitively, offers describing vacuum cleaners of type robot should be better rewarded because they match on both the category and facet dimensions.

Considering the general case in which offers are matched along several dimensions, we introduce the general principle that offers that have been matched by more than one filter are rewarded. To capture this principle we add to the local scores discussed above an additional score, called intersection boost, which rewards the offers that have been matched by more filters.

Before formally defining the intersection boost function, we discuss the boosting factor that can be defined in order to further improve the distribution of relevance scores among matching offers, especially in the presence of exploratory queries. In order to better assess the relevance of the offers, and in particular of the offers that have been matched by semantic filters, we introduce a method based on the notion of facet salience. Intuitively, we define a special set of offers that will be considered in the boosting function; this set consists of offers annotated with facets particularly salient to some categories matching the query submitted by the user.

Intuitively, a facet is salient to a category if it is highly likely to occur in offers that belong to that category. Although a facet (e.g. “type: third-person shooter”) can occur in the description of offers belonging to a category (e.g. “games for ps3”), the number of offers belonging to such a category that are annotated with this facet can be limited; instead, other facets, which can have similar values (e.g. “type: first-person shooter”) and occur more frequently in offers belonging to the same category, can be considered more salient to this category. Therefore, when a user submits an ambiguous query like “shooter ps3”, the matching offers include the offers annotated with the facet “type: first-person shooter” and the offers annotated with “type: third-person shooter”, all of which belong to the category “games for ps3”; however, offers annotated with “type: first-person shooter” can be considered more relevant to the query because their facets are more salient to the category “games for ps3”.

Formally, the salience of a facet $f$ to a category $c$ is defined by a function $sal : F \times C \to [0, 1] \subset \mathbb{R}$ that represents the conditional probability of the occurrence of a facet $f$ in the set $O^c$ of offers belonging to the category $c$; given also the set $O^f$ of the offers annotated with the facet $f$, $sal$ is defined by the following formula:

$$sal(f, c) = \frac{|O^f \cap O^c|}{|O^c|} \tag{3}$$
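Equation (3) amounts to a simple ratio of counts, as the following Python sketch shows on the shooter example. The offer list and its proportions are invented for illustration; only the formula comes from the text.

```python
def salience(facet, category, offers):
    # Equation (3): fraction of the offers in `category` annotated with `facet`.
    in_category = [o for o in offers if o["category"] == category]
    if not in_category:
        return 0.0
    with_facet = sum(1 for o in in_category if facet in o["facets"])
    return with_facet / len(in_category)

FPS = ("type", "first-person shooter")
TPS = ("type", "third-person shooter")

# Made-up catalogue: four PS3 games and one PC game.
offers = [
    {"category": "games for ps3", "facets": {FPS}},
    {"category": "games for ps3", "facets": {FPS}},
    {"category": "games for ps3", "facets": {FPS}},
    {"category": "games for ps3", "facets": {TPS}},
    {"category": "games for pc", "facets": {FPS}},
]

print(salience(FPS, "games for ps3", offers))  # 0.75
print(salience(TPS, "games for ps3", offers))  # 0.25
```

Since the salience depends only on the catalogue and not on the query, it could plausibly be precomputed at indexing time; the text does not state when COMMA computes it.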

The set $\hat{F}^q$ of facets salient to a query $q$ is defined as the set of facets matching $q$ whose salience to at least one category matching $q$ reaches a given threshold $l$:

$$\hat{F}^q = \{ f \in F^q \mid \exists c \in C^q \text{ such that } sal(f, c) \ge l \} \tag{4}$$

Given a query $q$, the set $O^q_{SAL}$ of offers salient to $q$ is defined as the set of offers that fac-match some facet $f \in \hat{F}^q$. Now we can introduce the Boosting with Facet Salience (BFS) score, refining the function $boost_q(o)$ so that the salience of the facets matching the input query is taken into account. Let $\chi_A$ be the characteristic function of a set $A$ (i.e., $\chi_A(x) = 1$ iff $x \in A$); the $boost_q(o)$ function can be formally defined as follows:

$$boost_q(o) = \sum_{A \in M} \frac{\chi_A(o)}{|M|}, \qquad M = \{O^q_{SYN}, O^q_{CAT}, O^q_{FAC}, O^q_{SAL}\} \tag{5}$$
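Equations (4) and (5) can be sketched as follows. This is an illustrative Python fragment under stated assumptions: the salience values are given as a lookup table, the threshold `L_THRESHOLD = 0.5` is hypothetical, and the four filter result sets are written out by hand for two invented offers.

```python
FPS = ("type", "first-person shooter")
TPS = ("type", "third-person shooter")
PS3 = "games for ps3"

# Hypothetical precomputed salience values sal(f, c).
SAL = {(FPS, PS3): 0.75, (TPS, PS3): 0.25}
L_THRESHOLD = 0.5  # hypothetical value for the threshold l

def salient_facets(F_q, C_q, sal, threshold):
    # Equation (4): matching facets salient to at least one matching category.
    return {f for f in F_q
            if any(sal.get((f, c), 0.0) >= threshold for c in C_q)}

def boost(offer_id, o_syn, o_cat, o_fac, o_sal):
    # Equation (5): fraction of the four sets that contain the offer.
    filters = (o_syn, o_cat, o_fac, o_sal)
    return sum(offer_id in s for s in filters) / len(filters)

# Query "shooter ps3": offer 1 is a first-person shooter, offer 2 a third-person one.
offer_facets = {1: {FPS}, 2: {TPS}}
F_q, C_q = {FPS, TPS}, {PS3}
F_hat = salient_facets(F_q, C_q, SAL, L_THRESHOLD)
O_sal = {oid for oid, facets in offer_facets.items() if facets & F_hat}

O_syn, O_cat, O_fac = set(), {1, 2}, {1, 2}
print(boost(1, O_syn, O_cat, O_fac, O_sal))  # 0.75
print(boost(2, O_syn, O_cat, O_fac, O_sal))  # 0.5
```

Both offers are matched by the category and facet filters, but only the offer with the more salient facet also belongs to the salient set, which is what separates their boost values.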

Observe that when no category matches the query, $O^q_{SAL}$ is empty.

3) Global Matching Scores and Ranking: The global scoring function can therefore be modified considering also the intersection boost with facet salience. Formally, the global score can be defined by a function $score_q : O^q \to [0, 1] \subseteq \mathbb{R}$ as follows:

$$score_q(o) = w_{pop} \cdot pop(o) + w_{syn} \cdot syn_q(o) + w_{sem} \cdot sem_q(o) + w_{boost} \cdot boost_q(o) \tag{6}$$

where all $w_i$ with $i \in \{pop, syn, sem, boost\}$ are in the range [0, 1] and sum up to 1.

IV. COMMA EVALUATION

We conducted several experiments to evaluate the approach proposed in the paper. In order to investigate the effectiveness of COMMA, we evaluated the accuracy of the ranks returned by the system, comparing COMMA with a baseline, namely a syntactic approach based on the well-known TF-IDF weighting scheme [16], and analyzing the effect of several parameters on the results. In order to evaluate the efficiency of COMMA, we investigated whether the AS satisfies the strict time constraints required by the domain and can scale up to handle a significant amount of data. Finally, we conducted experiments in a real-world scenario, deploying COMMA on an Italian e-marketplace and measuring the impact on the conversion rate, i.e., the ratio of users that complete a purchase over the users that visit the e-marketplace.

COMMA has been implemented using the Java programming language, exploiting several functions of the Lucene search engine library⁸; further details on COMMA’s implementation are out of the scope of this paper. We only clarify that, as shown in Figure 1, the system accepts AJAX calls from the client web page and supplies the results of the matching operation on offers, exploiting category and facet information, in JSON format. This deployment allowed for the provisioning of the autocompletion functionality as a remote service instead of requiring a deployment on the e-marketplace server. It must be noted that, due to UI constraints, the list of results of the autocompletion operation is necessarily small: the size of this list was set to 7, a good trade-off between compactness and coverage of the suggested results [17].

[Figure: diagram; labels: Client, Server, Database, Autocompletion System, COMMA, Offers, Categories, Facets, AJAX + HTTP, JSON]
Fig. 1: COMMA evaluation deployment scenario.

The experiments were run in May 2012 on a laptop with an Intel Core2 2.54GHz processor and 4GB RAM, under the Xubuntu Linux Desktop 12.04 32bit operating system.
For the real-world experiments only, COMMA was deployed on a virtualized machine running Ubuntu Linux Server 10.04 LTS 64bit, with 2GB RAM and 2GB storage.

⁸ http://lucene.apache.org/
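To connect the global score of equation (6) with the result list of size 7 used in this deployment, the following Python sketch combines four local scores and selects the top-k offers. The equal weights and the local score values are hypothetical; the paper's weight settings are the subject of the experiments and are not assumed here.

```python
import heapq

# Hypothetical weights: each in [0, 1], summing to 1 as equation (6) requires.
WEIGHTS = {"pop": 0.25, "syn": 0.25, "sem": 0.25, "boost": 0.25}

def global_score(local_scores, weights=WEIGHTS):
    # Equation (6): linear weighted combination of the local scores.
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(w * local_scores[name] for name, w in weights.items())

def top_k(scores_by_offer, k=7):
    # Offers with the k highest global scores, best first.
    return heapq.nlargest(k, scores_by_offer, key=scores_by_offer.get)

# Made-up local scores for two offers matching an exploratory query.
local = {
    "robot vacuum": {"pop": 0.5, "syn": 0.0, "sem": 1.0, "boost": 0.75},
    "kitchen robot": {"pop": 0.75, "syn": 0.0, "sem": 1.0, "boost": 0.25},
}
scores = {name: global_score(s) for name, s in local.items()}
print(scores)         # {'robot vacuum': 0.5625, 'kitchen robot': 0.5}
print(top_k(scores))  # ['robot vacuum', 'kitchen robot']
```

In this invented case the kitchen robot is more popular, but the boost earned by matching along more dimensions moves the robot vacuum to the top.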

A. Effectiveness

In order to evaluate the effectiveness of our approach, we evaluate the accuracy of the results ranked by COMMA with respect to an ideal rank, adopting the Normalized Discounted Cumulative Gain (nDCG) measure [18]. Discounted Cumulative Gain (DCG) [19] is a well-known measure of effectiveness for ranking algorithms adopted in Information Retrieval. The rationale of the measurement is that highly relevant documents should appear in the top positions of the list of results and, more generally, that the list should be ordered according to the ideal relevance of the documents to a query. The nDCG measure is obtained by normalizing the DCG measure according to the ideal DCG measure of the result list. Adopting the notation used in the paper (documents are represented by offers), DCG can be defined as follows. Given a ranked set of offers $O$, and an ideal ordering $I$ of the same set, the DCG at a particular rank $r$ is defined as

$$DCG(O, r) = \sum_{i=1}^{r} \frac{2^{r(i)} - 1}{\log(1 + i)} \tag{7}$$

where $r(i)$ is the assessed graded relevance of the offer $i$ (adopting the scale 0=Bad, 1=Fair, 2=Good, 3=Excellent relevance). Consequently, the nDCG of $O$ at a particular rank $r$ is defined as

$$nDCG(O, r) = \frac{DCG(O, r)}{DCG(I, r)} \tag{8}$$

Intuitively, a higher nDCG value corresponds to a better agreement between the list proposed by the system and the human judgements (the ideal ordering).

1) Experimental Setup: To the best of our knowledge, there is no available benchmark or dataset defined to evaluate a result-oriented AS, in particular considering the types of queries addressed in this paper. Therefore, we used a dataset from a real e-marketplace, which contains 5594 offers belonging to 92 different categories and annotated with 463 different facets. The ideal ranks for 30 keyword-based queries (13 targeted, 13 exploratory and 4 hybrid queries) have been defined by collecting graded relevance judgements from 10 average users of e-marketplaces and aggregating the individual judgements into mean graded relevance scores. All the queries have been manually constructed, considering frequent types of queries made by Italian users of the e-marketplace. For example, the query “nakon coolpix” (notice the misspelling) is classified as targeted, the query “nokia wifi” is classified as exploratory, and the query “lcd samsung” is classified as hybrid because all the keywords can refer both to the name and to the facets of an offer. For each query, the user was asked to fill in a questionnaire assigning a graded relevance judgement (adopting the previously introduced scale) to a set of 20 distinct offers from the dataset. This set contained 15 offers selected by a pool of experts and was enriched with the top 5 ranked offers according to the TF-IDF weighting scheme⁹.

⁹ An archive containing the dataset used for our experiments, including the questionnaire and the user judgements, can be downloaded from http://www.lintar.disco.unimib.it/COMMA2012/query set extended.zip

[Figure: two line plots of nDCG (y-axis) against Rank (x-axis, 2 to 20) for TF-IDF-based, COMMA, COMMA-B and COMMA-BFS; panel (a) All query types; panel (b) Only exploratory queries.]
Fig. 2: Normalized DCG for the 3 different configurations of COMMA compared to the baseline at different rank thresholds.

2) Experiments: In the first experiment we consider as a baseline the naive solution to the problem of responding to both exploratory and targeted queries: titles, categories and facets are indexed, and the offers that have at least one term in common with the query (also considering misspellings) are selected and ranked according to the TF-IDF weighting scheme¹⁰. The baseline is compared to three different configurations of COMMA: simple scoring, that is, without the methods for ranking improvement (i.e., the intersection boost and facet salience); with intersection boost and without facet salience (COMMA-B); and with both intersection boost and facet salience (COMMA-BFS). Figure 2a shows that all the configurations of COMMA that include at least one method for ranking improvement perform better than the baseline at all the considered rank thresholds, while the TF-IDF-based approach outperforms the basic COMMA algorithm. Using both intersection boost and facet salience in the global ranking function leads to significant improvements, especially for exploratory queries, whose results are plotted in Figure 2b. The accuracy in ranking results in the top positions, which is noticeable for any type of query, is even more remarkable for exploratory queries; this behavior is particularly beneficial to an AS, which presents a limited number of results to the user. Rewarding the offers that are selected by more than one filter (intersection boost) and those that are particularly relevant because of the salience of their facets to the query (facet salience) makes COMMA selective enough to also perform better at higher rank thresholds.

In the second experiment, the effect of the weights $w_{boost}$, $w_{pop}$, $w_{syn}$ and $w_{sem}$ on the overall effectiveness of COMMA is evaluated.
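The weight configurations explored in the second experiment can be sketched as follows: the intersection boost weight is fixed and the remaining mass, 1 − w_boost, is shared equally by the popularity, syntactic and semantic weights. This is only an illustration under stated assumptions: the step of 0.1 between consecutive values of w_boost and the names `weights_for` and `sweep` are hypothetical, and the sketch does not show how the weights enter the global ranking function.

```python
from typing import Dict, List


def weights_for(w_boost: float) -> Dict[str, float]:
    """Fix w_boost and split the remaining 1 - w_boost equally."""
    if not 0.0 <= w_boost <= 1.0:
        raise ValueError("w_boost must lie in [0, 1]")
    share = (1.0 - w_boost) / 3.0
    return {"boost": w_boost, "pop": share, "syn": share, "sem": share}


def sweep(steps: int = 10) -> List[Dict[str, float]]:
    """Configurations for w_boost = 1/steps, 2/steps, ..., 1.

    With the default of 10 steps this gives 0.1, 0.2, ..., 1.0
    (the step size is an assumption).
    """
    return [weights_for(k / steps) for k in range(1, steps + 1)]


configurations = sweep()
```

Every configuration sums to 1, and the last one (w_boost = 1) assigns zero weight to the popularity, syntactic and semantic scores.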
Figure 3a presents the nDCG values obtained at fixed ranks (1, 10 and 20, respectively) for different configurations of COMMA, each one with a different assignment to $w_{boost}$ (ranging from 0.1 to 1) and with a uniform distribution of the other weights according to the formula $w_{pop} = w_{syn} = w_{sem} = \frac{1 - w_{boost}}{3}$. This experiment shows that the best results are obtained when $w_{boost}$ is assigned values between 0.6 and 0.9. In other words, the higher the weight of the intersection boost factor, the better COMMA ranks according to human judgement, with the exception of $w_{boost} = 1$. Moreover, the best performing configuration of COMMA is the one that emphasizes the syntactic score, as depicted in Figure 3b. Although several criteria have to be introduced to refine the overall ranking scores (especially for offers selected by semantic filters, which extend the functionality of the AS), the syntactic score still represents the most important ranking factor.

¹⁰ We also ran experiments using a BM25 weighting scheme [20], which achieved worse performance than the TF-IDF scheme reported in the paper.

Fig. 3: Normalized DCG for different configurations of COMMA with different weights at fixed rank values. (a) $w_{boost}$ varying: nDCG against the intersection boost value for COMMA at ranks 1, 10 and 20. (b) Local scores for varying weights: nDCG against rank for COMMA (syntactic), COMMA (semantic), COMMA (popularity) and COMMA (constant weights).

B. Efficiency

We recall that the autocompletion operation must take place in a relatively short time span (100 ms) in order to be perceived as instant by the user. We assess the efficiency of COMMA with respect to the time elapsed for the computation of an autocompletion operation. For these experiments we use a dataset of 30725 offers, belonging to 206 different categories, annotated with 936 different facets, and collected from a real e-marketplace (consider that this is one of the larger e-marketplaces considered by a price comparison engine like ShoppyDoo). In order to analyze the scalability of the COMMA approach, the execution times for completing a query are evaluated with sets of offers of different size. Test queries for this experiment are taken from a set of 200 queries of variable length (chosen among the most common queries actually submitted by users of ShoppyDoo). The results of the single query computations are averaged into a mean elapsed time measure. Results of this test are shown in Figure 4: COMMA is compared to the approach based on TF-IDF described in the previous experiment, and to other configurations of COMMA and TF-IDF that do not include any approximate string matching algorithm (referred to as “not robust” in the plot).

Fig. 4: Average execution times with sets of offers of different size. The plot reports milliseconds against the order of magnitude of the dataset (powers of 10) for COMMA, TF-IDF-based (robust), COMMA (not robust) and TF-IDF-based (not robust).

The results of this experiment show that, when dealing with a set of offers of this size, COMMA is able to respect the strict requirements of an AS, completing the autocompletion operation in less than 25 ms and leaving a reasonable time for network overhead. A second kind of analysis is related to the impact of the semantic filters and ranking scores adopted by COMMA on the overall execution times, which can be estimated by considering the difference between COMMA and the TF-IDF weighting approach. Within the range of dataset sizes tested, this difference remains quite limited and does not grow significantly with the size of the dataset. Finally, the computational costs of the approximate string matching algorithms dominate the overall costs of computation. This problem is due to a suboptimal implementation of the approximate string matching algorithm in the Lucene framework used in the experiments; however, it has been announced that an improved implementation will be included in Lucene soon, so the overall performance of COMMA is expected to improve significantly.
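The measurement procedure used for the efficiency experiments (mean elapsed time per query, evaluated on sets of offers of different size) can be sketched as follows. This is a hedged illustration rather than the benchmark actually run: `substring_completer` is a toy stand-in for the system under test and is not COMMA, the subsets are drawn by random sampling because the way the subsets were built is not specified, and the sizes are taken as powers of ten only because Figure 4 reports the dataset size by order of magnitude.

```python
import random
import time
from statistics import mean
from typing import Callable, Dict, List, Sequence

# A completer maps a query fragment and a set of offers to suggested offers.
Completer = Callable[[str, Sequence[str]], List[str]]


def mean_latency_ms(complete: Completer, queries: Sequence[str],
                    offers: Sequence[str]) -> float:
    """Mean wall-clock time, in milliseconds, of one completion call."""
    timings = []
    for query in queries:
        start = time.perf_counter()
        complete(query, offers)
        timings.append((time.perf_counter() - start) * 1000.0)
    return mean(timings)  # raises StatisticsError if `queries` is empty


def scalability_profile(complete: Completer, queries: Sequence[str],
                        offers: Sequence[str], sizes: Sequence[int],
                        seed: int = 0) -> Dict[int, float]:
    """Mean latency for random subsets of the offers of each given size."""
    rng = random.Random(seed)
    profile = {}
    for size in sizes:
        subset = rng.sample(list(offers), min(size, len(offers)))
        profile[size] = mean_latency_ms(complete, queries, subset)
    return profile


def substring_completer(query: str, offers: Sequence[str]) -> List[str]:
    """Toy stand-in for the system under test (hypothetical, not COMMA)."""
    needle = query.lower()
    return [offer for offer in offers if needle in offer.lower()][:10]


# Hypothetical data: synthetic offer titles and two query fragments.
offers = [f"offer {i}" for i in range(1000)]
profile = scalability_profile(substring_completer, ["offer 1", "lcd"],
                              offers, sizes=[10, 100, 1000])
```

Each value of `profile` can then be compared with the 100 ms budget recalled at the beginning of this section.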

C. Real World Experimentation

The experiments in a real world scenario were run to evaluate COMMA from a commercial perspective. Two different experiments were run considering two different configurations of COMMA deployed to an Italian e-marketplace dealing with consumer electronics and photography related goods. The results reported in this section are not as extensive as the ones previously described due to confidentiality agreements with the e-marketplace company.

The goal of the first experiment was to evaluate the impact of the addition of the autocompletion function to the e-marketplace web site, with particular reference to the effects on the usability of the web site search function. The autocompletion function is a form of site adaptation and we were concerned that it could reduce overall site usability. Therefore, we used Google Analytics¹¹ to gather data about the usage of the autocompletion function, adopting a purely syntactic configuration of COMMA (we ran the test in the early phases of the work): results showed that during the test period (50 days) users increasingly employed the autocompletion function (considering the frequency of activations of the autocompletion function over the overall number of page views) and also increasingly selected offers returned by the AS.

In a later phase, we evaluated the impact of the full autocompletion function on the (previously introduced) conversion rate by means of an A/B test¹² during a shorter time period (5 days, for a total of 7,339 visitors): the introduction of this function led to a 6.67% increase in the conversion rate.

¹¹ http://www.google.com/analytics/

V. RELATED WORKS

AS applications in the e-commerce area are rarely described in the scientific literature and are often protected by patents (see, e.g., [1]). Bing Shopping is the only system that explicitly considers categories and product features and uses them for autocompletion exploiting semantic matching, in a query-oriented way: when a query matches one or more category names, the system helps the user in submitting a more selective query by adding (i) categories (e.g. “mp3 players”), and (ii) constraints over facets, grouped by category (e.g. tablets by brand: apple, samsung, etc.). Our approach automatically performs this selection and suggests results based on categories and facets matching the query.

In the field of Data Management the autocompletion problem has been interpreted as the problem of matching a query fragment with searchable data represented by strings, and different techniques of both exact and approximate string matching [11], such as prefix matching [6], q-grams [7] and trie-based matching [8], have been proposed. Our approach can potentially leverage any of these techniques. However, the experimental results presented in Section IV show that an approach to autocompletion that employs only syntactic matching is not able to deal with exploratory and hybrid queries in a satisfactory way.

Autocompletion techniques have also been studied in the field of semantic search [12]. Hyvönen et al. [13] formalize the autocompletion problem as the problem of matching ontological concepts used for data categorization to a query. Autocompletion in this area has been introduced to suggest ontological concepts from is-a (or part-of) hierarchies [9], and to retrieve data categorized using different ontological categories [10]. However, these approaches do not deal with the problem of ranking the proposed results, and assume that annotations are based on Web ontologies.

Interactive Query Expansion (IQE) addresses the problem of retrieving those terms that, when added to the already expressed query, can lead to high precision and recall results if submitted to a search engine. The suggested terms can be retrieved adopting probabilistic models [5], exploiting semantic relations such as synonymy, hyponymy or hypernymy [4], or considering the behaviour of a user submitting multiple queries [2]. A more recent approach uses query logs to evaluate the similarity between query contexts [3]. A query context consists of all the query terms previously stated by the user plus the initial letters of the term that the user is currently typing. All of these approaches are intrinsically different from COMMA, since they all suggest queries (not results). Moreover, COMMA deals with semi-structured data and not with textual documents. However, we share the consideration made in [3], namely that the usage of multiple criteria can improve the performance of an AS in terms of effectiveness; our experimental results confirm this intuition.

¹² The A/B test was carried out using the Visual Website Optimizer A/B testing tool. More information can be found at http://visualwebsiteoptimizer.com/

VI. CONCLUSIONS

The paper introduced COMMA, an approach for exploiting category and facet-like annotations for result-oriented autocompletion over semi-structured data, leveraging a type of information that is generally already available in e-commerce information systems. This information is exploited to support the elaboration of exploratory queries, without decreasing the precision in the management of precise queries. The approach was formally described and evaluated according to both its effectiveness and its efficiency, in experimental and real world scenarios. The results show that COMMA outperforms purely syntactic approaches to autocompletion in the presence of exploratory queries without causing a significant increase in computational costs.

REFERENCES

[1] R. E. Ortega, J. W. Avery, and R. Frederick, “Search query autocompletion,” Patent, 05 2003, US 6564213.
[2] H. Cao, D. Jiang, J. Pei, Q. He, Z. Liao, E. Chen, and H. Li, “Context-aware query suggestion by mining click-through and session data,” in KDD, 2008, pp. 875–883.
[3] Z. Bar-Yossef and N. Kraus, “Context-sensitive query auto-completion,” in WWW, 2011, pp. 107–116.
[4] Z. Gong, C. W. Cheang, and L. Hou U, “Web query expansion by wordnet,” in DEXA, 2005, pp. 166–175.
[5] J. Fan, H. Wu, G. Li, and L. Zhou, “Suggesting topic-based query terms as you type,” in APWeb, 2010, pp. 61–67.
[6] H. Bast, C. W. Mortensen, and I. Weber, “Output-sensitive autocompletion search,” Inf. Retr., vol. 11, pp. 269–286, 2008.
[7] S. Chaudhuri and R. Kaushik, “Extending autocompletion to tolerate errors,” in SIGMOD, 2009, pp. 707–718.
[8] S. Ji, G. Li, C. Li, and J. Feng, “Efficient interactive fuzzy keyword search,” in WWW, 2009, pp. 371–380.
[9] R. Sinkkilä, E. Mäkelä, T. Kauppinen, and E. Hyvönen, “Combining context navigation with semantic autocompletion to solve problems in concept selection,” in SeMMA 2008, ser. CEUR Workshop Proc., vol. 346, 2008, pp. 61–68.
[10] E. Mäkelä, K. Viljanen, P. Lindgren, M. Laukkanen, and E. Hyvönen, “Semantic yellow page service discovery: The veturi portal,” in ISWC, 2005.
[11] G. Navarro and M. Raffinot, Flexible pattern matching in strings: practical on-line search algorithms for texts and biological sequences. Cambridge University Press, 2002.
[12] E. Mäkelä, “Survey of semantic search research,” in Proc. of the Seminar on Knowledge Management on the Semantic Web, 2005.
[13] E. Hyvönen and E. Mäkelä, “Semantic autocompletion,” in The Semantic Web / ASWC 2006, ser. Lect. Notes Comput. Sc. Springer Berlin Heidelberg, 2006, vol. 4185, pp. 739–751.
[14] R. B. Miller, “Response time in man-computer conversational transactions,” in AFIPS (Fall, part I), 1968, pp. 267–277.
[15] B. P. Bailey, J. A. Konstan, and J. V. Carlis, “The effects of interruptions on task performance, annoyance, and anxiety in the user interface,” in INTERACT, 2001, pp. 593–601.
[16] G. Salton and C. Buckley, “Term-weighting approaches in automatic text retrieval,” Inf. Process. Manage., vol. 24, no. 5, pp. 513–523, Aug. 1988.
[17] T. L. Saaty and M. S. Ozdemir, “Why the magic number seven plus or minus two,” Math. Comput. Model., vol. 38, no. 3-4, pp. 233–244, 2003.
[18] R. Agrawal, S. Gollapudi, A. Halverson, and S. Ieong, “Diversifying search results,” in WSDM, 2009, pp. 5–14.
[19] K. Järvelin and J. Kekäläinen, “Cumulated gain-based evaluation of IR techniques,” ACM Trans. on Inf. Sys., vol. 20, pp. 422–446, 2002.
[20] K. S. Jones, S. Walker, and S. E. Robertson, “A probabilistic model of information retrieval: development and comparative experiments,” Inf. Process. Manage., vol. 36, no. 6, pp. 779–808, Nov. 2000.