Query Subtopic Diversification based on Cluster Ranking and Semantic Features

Md Shajalal, Md Zia Ullah, Abu Nowshed Chy, and Masaki Aono
Department of Computer Science & Engineering
Toyohashi University of Technology
Toyohashi, Aichi, Japan
{shajalal, arif, nowshed}@kde.cs.tut.ac.jp and [email protected]

Abstract: Search queries are usually short, ambiguous, and have multiple interpretations. Identifying possible subtopics is one of the key strategies for disambiguating a search query. In recent years, researchers have investigated subtopic mining through a variety of approaches. This paper aims at mining a diversified list of subtopics underlying a query. In our approach, first, we apply soft clustering to the subtopic candidates based on frequent phrases to group subtopics of similar intents. Second, we introduce multiple semantic features to rank subtopics in each cluster. The clusters are then diversified by balancing relevance and novelty. The cluster relevance score is estimated by combining the cluster score and subtopic importance. We employ the Jaccard coefficient to estimate the novelty of a cluster based on word embeddings. Finally, a diversified list of subtopics is generated by selecting the top-ranked subtopic from each cluster. We conducted experiments on the NTCIR-10 INTENT-2 and NTCIR-12 IMINE-2 English subtopic mining test collections. The results show that our proposed method outperforms other methods in subtopic mining.

Key-Words: Query Intent; Subtopic Mining; Diversification; Clustering.

I. INTRODUCTION

Nowadays, a large number of Web users rely on search engines to discover and access information. Users are noticeably laconic in describing their information needs when they submit a query to a search engine. Since expressing a search intention through keywords is difficult, some users do not choose precise terms, and others omit important terms needed to clarify their intentions. Because of these issues, a large number of search queries are vague and ambiguous. For example, the ambiguous query "ginger" has different underlying subtopics, including "ginger plant," "ginger health benefits," "ginger grammar," "ginger software," etc. An ambiguous or vague query may mislead a search engine into returning a ranked list of highly redundant documents that cover very few of the subtopics underlying the query. Search result diversification can be used as one of the solutions to balance coverage and redundancy.

The study of search result diversification is divided into two approaches: implicit and explicit [1]-[3]. In the implicit approach, the information retrieval system retrieves various relevant documents that are related to a given query. The explicit approach, on the other hand, first finds subtopics of different intents for the given query and then retrieves relevant documents for every subtopic. As a result, the retrieved documents cover various underlying user intents, which increases diversity. Therefore, mining a diversified list of subtopics has become a key challenge for search result diversification.

978-1-5090-1636-5/16/$31.00 ©2016 IEEE

Recently, query subtopic mining has gained much attention in the research community [4]-[6]. Researchers have proposed several methods to mine a list of subtopics from different resources [7], [8]. For instance, Santos et al. [3] represent search engine suggestions as subtopic candidates to uncover query intents. Kim and Lee [7], [9] proposed a method to mine subtopics using simple patterns and a hierarchical structure of subtopics, exploiting a set of relevant documents.

In this paper, we propose a method for mining a diversified list of possible subtopics for a given query. We utilize the query suggestions and completions of major search engines as subtopic candidates. To group similar subtopics into clusters, we apply soft clustering to the subtopic candidates based on frequent phrases. We propose multiple semantic features to estimate the subtopic relevance score and rank the subtopics in each cluster. The cluster relevance is estimated by combining the cluster score and the relevance scores of its subtopics. We employ the Jaccard coefficient to estimate the novelty of a cluster based on word embeddings. The clusters are then diversified by balancing relevance and novelty. Finally, a list of subtopics covering all possible intents is generated by selecting the top-ranked subtopic from each cluster. The experimental results demonstrate the effectiveness of our approach in subtopic mining. Our contributions in this research are as follows:

1) A cluster-diversification-based ranking to diversify subtopics.
2) Multiple semantic features to estimate the relevance between a query and a subtopic.

The rest of the paper is structured as follows. In Section II, we summarize related work on query subtopic mining. In Section III, we introduce our proposed method. We present experiments and evaluation to show the effectiveness of our proposed method in Section IV. Concluding remarks and future directions are described in Section V.

II. RELATED WORK

Search queries are usually short, ambiguous, and have multiple interpretations [10]. Researchers have tried different techniques to identify the possible subtopics behind queries and to diversify them. Ren et al. [11] proposed a method to mine subtopics based on a heterogeneous graph and enhanced subtopic quality by taking advantage of Wikipedia concepts. They introduced heterogeneous graph-based soft clustering to derive an intent indicator for each object based on the constructed heterogeneous graph. Hu et al. [12] identified the intents of the input query by mapping the query into the Wikipedia representation space. Zheng et al. [13] integrated information from both structured and unstructured data to extract high-quality subtopics. To mine subtopic candidates, Jiyin et al. [14] suggested a random-walk-based approach to estimate the similarities of the explicit and implicit subtopics mined from a number of heterogeneous resources, including top result documents, click logs, anchor text, and web n-grams.

Wang et al. [8] proposed a method to mine subtopics by extracting text fragments from different parts of the documents. They grouped similar text fragments into clusters and generated a readable subtopic for each cluster. Recently, Kim and Lee [15] proposed a method to mine subtopics using simple patterns and a hierarchical structure of subtopic candidates. They extracted relevant phrases as subtopic candidates using simple patterns and constructed a hierarchical structure using a set of relevant documents from the Web document collection. Kim et al. [9] also proposed a subtopic mining method based on three-level hierarchical search intentions. Yu and Ren [16] proposed a modifier-graph-based approach by which the problem of subtopic mining reduces to graph clustering over the modifier graph. Zheng et al. [17] also proposed a pattern-based subtopic model for mining subtopics. They applied a maximal frequent pattern mining algorithm to extract patterns from the retrieval results of a query. Hu et al. [18] proposed a framework for search result diversification based on two-level hierarchical intents. They used Google suggestions as the source of hierarchical subtopics and adapted xQuAD [1] and PM2 [3] for hierarchical search result diversification.

III. OUR APPROACH

In this section, we present our proposed method for mining and diversifying query subtopics. We retrieve query suggestions from major search engines and use them as subtopic candidates. To group similar subtopics into clusters, we use a frequent-phrase-based clustering algorithm in which each cluster represents a different query aspect. We extract multiple query-dependent and query-independent features to estimate the importance of subtopics and rank them within each cluster. The relevance between the query and a cluster is estimated by combining the cluster score and subtopic importance. The clusters are then diversified by balancing relevance and novelty. We use the Jaccard coefficient to estimate the novelty of a cluster based on word embeddings. Finally, a diversified list of subtopics is generated by selecting the top-ranked subtopic from each cluster. The overview of our proposed approach is depicted in Fig. 1.

A. Subtopic Candidate Generation

Query suggestions are effective for retrieving relevant documents that maximize coverage [3], [7]. We retrieve query suggestions and query completions from major search engines (Google, Yahoo, Bing, etc.) for a given query. After aggregating all suggestions and completions, we remove duplicates and use the remaining entries as candidate subtopics of the user's query. For example, for the query "football," a generated subtopic list may include "football games," "football scores," "football results," "football videos," "football rules," etc.

Fig. 1: A Subtopic Diversification Framework (diagram labels include "Subtopic Diversification" and "Subtopic Diversification within Cluster")

B. Subtopic Feature Extraction

Let $q$ denote the user query and $S = \{s_1, s_2, s_3, \ldots, s_N\}$ denote the set of subtopics for query $q$. We estimate the relevance between query $q$ and subtopic $s$ by exploiting multiple query-dependent and query-independent features, which are classified into three categories: semantic features, lexical features, and popularity-based features. The lexical features are extracted from the content of the query and subtopic, and the popularity-based features are extracted by exploiting subtopic popularity on the Web. The semantic features are proposed to estimate the semantic relevance between query and subtopic.

1) Semantic features: We introduce a new semantic feature, average concept similarity (ACS), to estimate the semantic similarity between query $q$ and subtopic $s$, which is defined as follows:

(1)

where $t_i$ and $t_j$ denote terms in query $q$ and subtopic $s$, respectively, and $|q|$ and $|s|$ denote the number of terms in query $q$ and subtopic $s$. The conceptual similarity $ConSim(t_i, t_j)$ between two concepts $t_i$ and $t_j$ is defined as follows:

$$ConSim(t_i, t_j) = \frac{2 \cdot depth_1}{depth_2 + depth_3 + 2 \cdot depth_1} \qquad (2)$$

Let two concepts $t_i$ and $t_j$ be related to each other in a conceptual domain. Fig. 2 depicts a hierarchical structure describing how concepts are related to each other. $C$ is the least common superconcept of $t_i$ and $t_j$; $depth_1$ is the number of nodes on the path from the root $R$ to $C$, $depth_2$ is the number of nodes on the path from $t_i$ to $C$, and $depth_3$ is the number of nodes on the path from $t_j$ to $C$. We propose another semantic feature based on concept similarity, WordNet path similarity (WPS), defined as follows:

$$f_{WPS}(q, s) = \frac{b_q W b_s^{T}}{|b_q|\,|b_s|} \qquad (3)$$
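The conceptual similarity of Eq. (2) and the averaging idea behind ACS can be illustrated with a short Python sketch. This is not the authors' implementation. The body of Eq. (1) is not available in this text, so the `acs` function below assumes that ACS is the mean of ConSim over all (query term, subtopic term) pairs, which is what the normalizers |q| and |s| suggest. The function names, the `pair_sim` callback, and the toy depth table are hypothetical.

```python
from itertools import product


def con_sim(depth1, depth2, depth3):
    """Eq. (2): 2*depth1 / (depth2 + depth3 + 2*depth1).

    depth1: nodes on the path from the root R to the common superconcept C
    depth2: nodes on the path from t_i to C
    depth3: nodes on the path from t_j to C
    """
    denom = depth2 + depth3 + 2 * depth1
    return 2 * depth1 / denom if denom else 0.0


def acs(query_terms, subtopic_terms, pair_sim):
    """ASSUMED form of Eq. (1): mean of pair_sim over all term pairs.

    pair_sim(t_i, t_j) is any callable returning a similarity for two terms,
    for example ConSim computed from a concept hierarchy.
    """
    if not query_terms or not subtopic_terms:
        return 0.0
    total = sum(pair_sim(ti, tj)
                for ti, tj in product(query_terms, subtopic_terms))
    return total / (len(query_terms) * len(subtopic_terms))


# Hypothetical depth table standing in for a real concept hierarchy:
# (t_i, t_j) -> (depth1, depth2, depth3)
TOY_DEPTHS = {
    ("ginger", "ginger"): (4, 0, 0),
    ("ginger", "plant"): (3, 2, 1),
}


def toy_pair_sim(ti, tj):
    """Look up depths in the toy table; unknown pairs get similarity 0."""
    depths = TOY_DEPTHS.get((ti, tj))
    return con_sim(*depths) if depths else 0.0


if __name__ == "__main__":
    print(con_sim(3, 2, 1))
    print(acs(["ginger"], ["ginger", "plant"], toy_pair_sim))
```

Passing the pairwise similarity as a callback keeps the averaging step independent of where the depths come from, so a real taxonomy lookup could replace the toy table without touching `acs`.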

TABLE I: Lexical and popularity based features

Type         Feature Name          Feature Function
Lexical      Lexical Similarity    f_LS(q, s)
Lexical      Query Term Overlap    f_QTO(q, s)
Lexical      Synonym Overlap       f_SO(q, s)
Lexical      Exact Match           f_EM(q, s)
Lexical      Avg. Term Length      f_ATL(s)
Popularity   Hit Count             f_HC(q, s)
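Table I names the lexical features, but their defining formulas are not available in this text. The Python sketch below shows one plausible reading of three of them, based only on the feature names: every definition here (whitespace tokenization, overlap normalized by the number of query terms, substring-based exact match, term length measured in characters) is an assumption, not the authors' formula.

```python
def tokenize(text):
    """Lowercase and split on whitespace (an assumed, minimal tokenizer)."""
    return text.lower().split()


def query_term_overlap(query, subtopic):
    """ASSUMED f_QTO: fraction of distinct query terms found in the subtopic."""
    q_terms = set(tokenize(query))
    s_terms = set(tokenize(subtopic))
    if not q_terms:
        return 0.0
    return len(q_terms & s_terms) / len(q_terms)


def exact_match(query, subtopic):
    """ASSUMED f_EM: 1.0 if the lowercased query string occurs in the subtopic."""
    return 1.0 if query.lower() in subtopic.lower() else 0.0


def avg_term_length(subtopic):
    """ASSUMED f_ATL: mean number of characters per subtopic term."""
    terms = tokenize(subtopic)
    if not terms:
        return 0.0
    return sum(len(t) for t in terms) / len(terms)


if __name__ == "__main__":
    q, s = "football", "football rules"
    print(query_term_overlap(q, s), exact_match(q, s), avg_term_length(s))
```

For the query "football" and the subtopic "football rules" from the earlier example, all query terms appear in the subtopic and the query string matches exactly, so both overlap-style features take their maximum value under these assumed definitions.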
