Improving Entity Search over Linked Data by Modeling Latent Semantics

Nikita Zhiltsov*
Higher School of Information Technologies and Information Systems, Kazan Federal University, 1/37 Nuzhina Str., 420008 Kazan, Russia
[email protected]

Eugene Agichtein
Maths and Computer Science Department, Emory University, Atlanta, GA 30322, USA
[email protected]

*Work done while visiting Emory University.
ABSTRACT
Entity ranking has become increasingly important, both for retrieving structured entities and for use in general web search applications. The most common format for linked data, RDF graphs, provides extensive semantic structure via predicate links. While the semantic information is potentially valuable for effective search, the resulting adjacency matrices are often sparse, which introduces challenges for representation and ranking. In this paper, we propose a principled and scalable approach for integrating latent semantic information into a learning-to-rank model, by combining a compact representation of semantic similarity, obtained with a modified tensor factorization algorithm, with explicit entity information. Our experiments show that the resulting ranking model scales well to graphs with millions of entities, and outperforms the state-of-the-art baseline on realistic Yahoo! SemSearch Challenge data sets.
Categories and Subject Descriptors
H.3 [INFORMATION STORAGE AND RETRIEVAL]: Information Search and Retrieval—Retrieval models

Keywords
Entity Search, Learning to Rank, Tensor Factorization
1. INTRODUCTION AND BACKGROUND
Providing entity-oriented search tools is one of the common trends of the past few years in the search engine industry. Some examples of existing products are Google Knowledge Graph, Facebook Graph Search, Bing Snapshot, WolframAlpha, and Yandex Islands. They aim to resolve entity-centric queries and show the search results (or only the best one) in a form convenient for the user. The analysis of real query logs justifies the importance of the task.
Figure 1: A semantic graph of entities that are relevant to the query richmond virginia
For example, a recent empirical study [8] of a representative Yahoo! query log sample discovered that about 58% of queries have the clear user intent of soliciting information about certain entities, i.e., goods, people, organizations, locations, events and others. In this paper, we address the problem of accurate modeling of the underlying query intent, which is key for enabling a rich search experience.

In entity search over RDF graphs [8], we consider an entity description as an RDF subgraph that comprises the RDF triples including the entity URI. The subgraph may contain literals along with pairs of predicates and URIs of different entities from the graph. For example, Figure 1 shows an entity graph for the city of Richmond, VA. Given a query richmond virginia, we may expect the following search result list: fb:Richmond, Virginia, dbr:Richmond Virginia, and fb:East End, Richmond Virginia. The first two search results would be of excellent relevance, and the last one is considered to be of fair relevance.

Along with the growing commercial interest of the leading search companies, there is also active research into entity search methods in academia. Most previous works adapt the standard IR approach to the task. Given an entity, its literals and, possibly, the literals of other entities associated with the given entity are folded into a pseudo document with multiple fields. Then, one may apply BM25F [1, 2] or language model based [6] ranking functions. In [9], string similarity scores and the explicit semantics of owl:sameAs and DBpedia:redirect properties have been used for enhancing the search context and shown to be remarkable improvement factors. The authors of [5] have improved entity ranking by considering n-gram statistics along with a PageRank adaptation.

Unlike previous works, our method takes advantage of the supervised learning-to-rank paradigm and is the first to integrate information about the semantic graph into a ranking model in a compact representation. It represents queries, entities and query-entity pairs with a set of features that fall into two categories: term-based features and structural features. The features of the first category are derived from basic word statistics of queries and entities. The features of the second category capture the latent semantics of relations in the entity graph, inferring the distribution over latent factors for entities. To model this kind of semantics, our method relies on a tensor factorization based representation of the initial relational data. Finally, a machine learning algorithm (in our case, Gradient Boosted Regression Trees) discovers patterns of feature values and optimizes them with respect to one of the standard evaluation measures. In summary, the key contributions of our work include:
• A principled approach based on learning-to-rank to incorporate content and our novel structural features into the ranking model;

• A thorough evaluation of the proposed techniques by acquiring thousands of manual labels (to be shared with the research community) to augment the standard benchmark data set (Section 5).
In the next three sections, we describe the different components of our ranker: representing entity descriptions (Section 2), representing link information (Section 3), and combining them together using a learning-to-rank approach (Section 4).
2. REPRESENTING ENTITY DESCRIPTIONS

In this section, we describe our modification of a previous approach [6] to model entity descriptions. Then, we derive a set of term-based features for ranking. We follow the entity multi-fielded document paradigm. In particular, our model distinguishes three groups of predicate values that correspond to the document fields:

• names, e.g. literals of foaf:name and rdfs:label properties, along with tokens extracted from the relative parts of entity URIs;

• attributes, i.e., the remaining datatype properties;

• outgoing links, i.e., for the triples that contain the entity as a subject, URI resolution is utilized, and tokens from the name field are considered.

To model relevance between a query and entity descriptions, we come up with a set of four features:

• a mixture of language models;

• three distinct bigram relevance models for the document fields.

Mixture of language models (MLM). A distinct probabilistic multinomial language model $P(t|\theta_f)$ with its own Dirichlet prior $\mu_f$ is built for each field. Finally, a mixture of language models $P(t|\theta) = \sum_f w_f P(t|\theta_f)$ is applied, and the probability $P(q|\theta) = \prod_{t \in q} P(t|\theta)$ of the query given the entity is used as a ranking score. The usage of this function as a feature is motivated by its superior performance in the entity search scenario [6]. A minimal sketch of this scoring is shown at the end of this section.

Bigram relevance models. Most entity search queries are about named entities. Proper names are often fixed phrases, in which the words cannot be freely swapped or omitted without losing the context. The queries south dakota state university and brooklyn bridge are good examples. That is why we suggest using a plain bigram relevance model that considers the frequency of word sequences of length two from a given query. Since occurrences in the name field are more likely to lead to relevant entities, we consider a different score per field. Additionally, we introduce two query-related features.

Query features. The entity-independent features do not model relevance directly; however, they may boost some of the considered features. For example, query length may be helpful for a learning model to increase weights for bigram relevance scores on long queries. Besides, we exploit query clarity [3], i.e., a measure of query ambiguity. This feature aims to distinguish clear queries from vague queries that also occur in the test data, e.g. carolina. We expect the MLM score to perform better on queries of that kind.
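To make the term-based scoring concrete, below is a minimal Python sketch of the MLM feature under simplifying assumptions: field language models use Dirichlet smoothing with priors mu_f, the field weights w_f are fixed by hand, and the toy entity and collection statistics are hypothetical, not taken from the paper.

```python
def field_lm_prob(term, field_tf, field_len, coll_prob, mu):
    """Dirichlet-smoothed P(t | theta_f) for one document field."""
    return (field_tf.get(term, 0) + mu * coll_prob.get(term, 1e-9)) / (field_len + mu)

def mlm_score(query_terms, entity_fields, coll_probs, weights, mus):
    """P(q | theta) = prod_t sum_f w_f * P(t | theta_f).

    entity_fields: field -> {term: tf}; coll_probs: field -> {term: P(t | C_f)};
    weights: field -> w_f (summing to 1); mus: field -> Dirichlet prior mu_f.
    """
    score = 1.0
    for t in query_terms:
        p_t = sum(
            weights[f] * field_lm_prob(t, tf, sum(tf.values()), coll_probs[f], mus[f])
            for f, tf in entity_fields.items()
        )
        score *= p_t
    return score

# Hypothetical toy example for an entity like fb:Richmond, Virginia
entity = {
    "name": {"richmond": 1, "virginia": 1},
    "attributes": {"city": 1, "virginia": 2, "james": 1, "river": 1},
    "outgoing": {"usa": 1, "virginia": 1},
}
coll = {f: {t: 0.01 for t in tf} for f, tf in entity.items()}  # made-up background stats
w = {"name": 0.5, "attributes": 0.3, "outgoing": 0.2}
mu = {"name": 100.0, "attributes": 100.0, "outgoing": 100.0}
print(mlm_score(["richmond", "virginia"], entity, coll, w, mu))
```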
3. REPRESENTING ENTITY LINKS

In this section, we first introduce a model to represent entity links in a latent space. Second, we describe structural features that evaluate query-entity relevance using the latent representation.

A factorization of the data into a lower dimensional space introduces a way to represent the original data in a concise manner, preserving their most characteristic features while reducing the inherent noise. In this work, we rely on a recent algorithm for tensor factorization, RESCAL, which has shown decent performance in the link prediction scenario [7]. Next, we describe how the entity graph can be modelled as a tensor and entities as vectors.

Entity graph as a tensor. We introduce a tensor $\mathcal{X}$ of size $n \times n \times m$, where $\mathcal{X}_{ijk} = 1$ if the $k$-th predicate holds between the $i$-th entity and the $j$-th entity, and $\mathcal{X}_{ijk} = 0$ otherwise. Thus, each $k$-th frontal tensor slice $X_k$ is an adjacency matrix for the $k$-th predicate, which is usually very sparse. Given $r$, the number of latent factors, we factorize each $X_k$ into the matrix product

$$X_k = A R_k A^\top, \quad k = 1, \dots, m,$$

where $A$ is a dense $n \times r$ matrix of latent embeddings for the entities, and $R_k$ is an $r \times r$ matrix of latent factors. The matrices $A$ and $R_k$ are computed by solving an optimization problem with the objective function $f(A, R)$:

$$\min_{A, R} \; \frac{1}{2} \sum_k \|X_k - A R_k A^\top\|_F^2 \; + \; \lambda \left( \|A\|_F^2 + \sum_k \|R_k\|_F^2 \right).$$

The first summand is the sum of the squared Frobenius norms of the discrepancies between each given tensor slice and its fitting matrix; the second summand is a regularization term with a parameter $\lambda$.
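Before turning to how this objective is optimized, here is a minimal sketch of how the frontal slices $X_k$ can be assembled from RDF triples as sparse adjacency matrices; the triple list, entity index, and predicate index below are hypothetical placeholders, not data from the paper.

```python
import scipy.sparse as sp

def build_tensor_slices(triples, entity_ids, predicate_ids):
    """Build one sparse n x n adjacency matrix X_k per predicate.

    triples: iterable of (subject_uri, predicate_uri, object_uri);
    entity_ids / predicate_ids: uri -> contiguous integer index.
    """
    n, m = len(entity_ids), len(predicate_ids)
    rows = [[] for _ in range(m)]
    cols = [[] for _ in range(m)]
    for s, p, o in triples:
        if s in entity_ids and o in entity_ids and p in predicate_ids:
            k = predicate_ids[p]
            rows[k].append(entity_ids[s])
            cols[k].append(entity_ids[o])
    return [
        sp.csr_matrix(([1.0] * len(rows[k]), (rows[k], cols[k])), shape=(n, n))
        for k in range(m)
    ]

# Hypothetical toy graph: two entities linked by owl:sameAs
ents = {"dbr:Richmond_Virginia": 0, "fb:Richmond_Virginia": 1}
preds = {"owl:sameAs": 0}
slices = build_tensor_slices(
    [("dbr:Richmond_Virginia", "owl:sameAs", "fb:Richmond_Virginia")], ents, preds
)
print(slices[0].toarray())
```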
To solve the optimization problem, [7] presents an iterative alternating least squares algorithm that updates the $A$ and $R_k$ matrices until a convergence criterion is met. As a result, the $i$-th row of the matrix $A$ represents the $i$-th entity in the latent space. Finally, we apply sum normalization to the resulting matrix rows.

Top-k entity similarity. We exploit the resulting entity vectors in the following way: given a query, we retrieve the top-3 entities with respect to a baseline score and compute the distance between the closest top-3 entity vector and a given entity vector. Specifically, we experimented with cosine similarity, Euclidean distance, and the heat kernel¹ as distances.

¹ The heat kernel function $k(x, y) = e^{-\|x - y\|^2 / \gamma}$ tends to penalize very dissimilar vectors harder than the Euclidean distance; we fix $\gamma = 10^{-3}$ in our experiments.
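The structural features derived from the factorization (Table 1, features 7-9) can be computed as in the following sketch. The embedding matrix A, sum normalization, and gamma = 1e-3 follow the description above; the choice of the "closest" top-3 entity by Euclidean distance and the variable names are illustrative assumptions.

```python
import numpy as np

def structural_features(A, candidate_id, top3_ids, gamma=1e-3):
    """Cosine similarity, Euclidean distance and heat kernel between a candidate
    entity vector and the closest of the query's top-3 entity vectors."""
    e = A[candidate_id]
    # pick the top-3 entity whose vector is closest (here: by Euclidean distance)
    dists = [np.linalg.norm(e - A[t]) for t in top3_ids]
    e_top = A[top3_ids[int(np.argmin(dists))]]
    eucl = np.linalg.norm(e - e_top)
    cos = float(e @ e_top) / (np.linalg.norm(e) * np.linalg.norm(e_top) + 1e-12)
    heat = np.exp(-eucl ** 2 / gamma)
    return cos, eucl, heat

# Toy usage with a random, sum-normalized embedding matrix (illustrative only)
rng = np.random.default_rng(0)
A = rng.random((5, 4))
A = A / A.sum(axis=1, keepdims=True)  # sum normalization of the rows, as above
print(structural_features(A, candidate_id=4, top3_ids=[0, 1, 2]))
```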
4. COMBINING TERM AND LINK INFORMATION WITH LEARNING TO RANK

A learning-to-rank framework is a well-developed technology for combining multiple rankers. In our case, it is particularly beneficial, because it is challenging to come up with a hand-crafted formula for such a diverse feature set (see Table 1). In this section, we describe our procedure for incorporating the features into a learning-to-rank model.

Table 1: Feature set

Term-based features:
1. Query length
2. Query clarity
3. Uniformly weighted MLM score
4. Bigram relevance score for the "name" field
5. Bigram relevance score for the "attributes" field
6. Bigram relevance score for the "outgoing links" field

Structural features:
7. Top-3 entity cosine similarity, $\cos(e, e_{top})$
8. Top-3 entity Euclidean distance, $\|e - e_{top}\|$
9. Top-3 entity heat kernel, $e^{-\|e - e_{top}\|^2 / \gamma}$

As a preprocessing step, the values of all features except #1 and #2 are normalized per query by subtracting the minimum value and dividing by the difference between the maximum and minimum values. We treat relevance labels as continuous dependent values. In principle, our approach is agnostic to the actual learning model. However, we expect that the performance of the structural features may vary greatly depending on the available information about entity links, and that they have to be supplementary to the term-based features. Therefore, we use Gradient Boosted Regression Trees (GBRT) [4] as the learning model, which has proven to be very successful in similar scenarios. We optimize its hyperparameters (the number of trees, tree depth, learning rate, and subsample size) with respect to NDCG. Applying the trained model to all entities in the index is usually impractical, so we retrieve the top-1000 entities with respect to MLM scores beforehand, apply the trained model, and order the results by its predicted values.
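A minimal sketch of this learning step is shown below, assuming the per-query min-max normalization described above and the scikit-learn GBRT implementation mentioned in Section 5.2; the feature matrix, labels, and hyperparameter values are hypothetical, not the tuned ones from the paper.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def minmax_per_query(X, query_ids, skip_cols=(0, 1)):
    """Min-max normalize feature columns within each query (features #1, #2 excluded)."""
    X = X.astype(float).copy()
    for q in np.unique(query_ids):
        rows = query_ids == q
        for j in range(X.shape[1]):
            if j in skip_cols:
                continue
            lo, hi = X[rows, j].min(), X[rows, j].max()
            if hi > lo:
                X[rows, j] = (X[rows, j] - lo) / (hi - lo)
    return X

# Hypothetical training data: 9 features per query-entity pair, graded relevance labels
rng = np.random.default_rng(0)
X = rng.random((200, 9))
query_ids = np.repeat(np.arange(20), 10)
y = rng.integers(0, 3, size=200).astype(float)

model = GradientBoostingRegressor(
    n_estimators=200, max_depth=3, learning_rate=0.1, subsample=0.8
)
model.fit(minmax_per_query(X, query_ids), y)
# At query time: score the top-1000 MLM candidates and order them by predicted relevance
scores = model.predict(minmax_per_query(X[:10], query_ids[:10]))
print(np.argsort(-scores))
```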
5. EXPERIMENTAL EVALUATION

5.1 Data

We combine queries and labeled data from the Yahoo! SemSearch Challenge (YSC) 2010² and YSC 2011³ evaluation campaigns into a single set. In total, the query set contains 142 queries. Before applying the search methods we experimented with, we fixed misspellings in 9 queries using the "Search instead for" feature of Google Search. To form the training data, we use extended versions of the YSC assessment files with labels acquired in our additional evaluation campaign on Amazon Mechanical Turk (AMT)⁴. In total, we have 14,048 labeled examples. The Billion Triple Challenge (BTC) 2009 RDF data set⁵ is used as the data collection. The data set is diverse and covers various domains, including academic publications, geographical data, music, biomedicine and many others. The largest data sources include DBpedia, LiveJournal, GeoNames, and DBLP.

To make our tensor factorization experiments faster, we sample from the whole entity graph in the following way: given the query set, we take the top 1000 relevant entities with respect to MLM scores along with the labeled entities, and retrieve all adjacent entities, considering only the 10 most promising predicates from the entity graph (Table 2); a sketch of this sampling step is given below. Thus, the tensor eventually has 144,020 × 144,020 × 10 entries. Seeking a balance between the expressiveness of the latent space and running time, we set the following parameter configuration: the number of latent factors $r = 100$, regularization constant $\lambda = 0$, and convergence threshold $10^{-7}$. Our experiments do not show that the search results are sensitive to varying these parameters. The algorithm converged within 19 iterations and took 23 minutes on a powerful machine with 24 × 2.3GHz CPU cores and 256GB of RAM, although the computation required no more than 2GB of RAM at runtime. The tensor factorization is implemented in our fork of the RESCAL project⁶ in Python. Our implementation is memory efficient, relies on the SciPy sparse matrix module, and scales well to graphs with millions of nodes on affordable hardware.

² http://km.aifb.kit.edu/ws/semsearch10/
³ http://km.aifb.kit.edu/ws/semsearch11/
⁴ http://github.com/nzhiltsov/YSC-relevance-data
⁵ http://km.aifb.kit.edu/projects/btc-2009/
⁶ http://github.com/nzhiltsov/Ext-RESCAL
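For illustration, the sampling procedure described above can be sketched as follows; the MLM ranking function, the labeled-entity set, and the toy triples are hypothetical stand-ins for the actual pipeline.

```python
def sample_entity_graph(queries, mlm_rank, labeled_entities, triples,
                        kept_predicates, top_n=1000):
    """Collect seed entities (top-N by MLM score per query plus labeled entities),
    then add all entities adjacent to them via the kept predicates (Table 2)."""
    seeds = set(labeled_entities)
    for q in queries:
        seeds.update(mlm_rank(q)[:top_n])  # top-N entity URIs for the query
    sampled = set(seeds)
    for s, p, o in triples:
        if p in kept_predicates and (s in seeds or o in seeds):
            sampled.update((s, o))
    return sampled

# Hypothetical toy inputs
triples = [("dbr:Richmond_Virginia", "owl:sameAs", "fb:Richmond_Virginia"),
           ("dbr:Richmond_Virginia", "foaf:page", "dbr:Virginia")]
entities = sample_entity_graph(
    queries=["richmond virginia"],
    mlm_rank=lambda q: ["dbr:Richmond_Virginia"],
    labeled_entities={"fb:Richmond_Virginia"},
    triples=triples,
    kept_predicates={"owl:sameAs", "foaf:page"},
)
print(sorted(entities))
```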
Table 2: Predicates used as entity links in tensor factorization

http://dbpedia.org/property/disambiguates
http://dbpedia.org/property/hasPhotoCollection
http://dbpedia.org/property/redirect
http://swat.cse.lehigh.edu/resources/onto/university.owl#website
http://www.geonames.org/ontology#locationMap
http://www.geonames.org/ontology#wikipediaArticle
http://www.w3.org/2000/01/rdf-schema#seeAlso
http://www.w3.org/2002/07/owl#sameAs
http://xmlns.com/foaf/0.1/img
http://xmlns.com/foaf/0.1/page
5.2 Retrieval Performance Results

To measure the utility of the structural features, we performed 10-fold cross validation for two feature subsets (Table 3). We used the set of term-based features as a strong baseline, designed to capture the main ideas of the state-of-the-art entity ranking method presented in [6]. We performed extensive experiments optimizing the baseline, achieving performance comparable to the results reported in [6]. For the learning-to-rank task, we use the implementation of GBRT in scikit-learn⁷. Statistical significance (marked '*') is measured by the Wilcoxon signed-rank test with p < 0.05.

⁷ http://scikit-learn.org

Table 3: Results of 10-fold cross validation. "*" stands for statistical significance with p < 0.05

Performance   Term-based baseline   All features
NDCG          0.382                 0.401 (+5.0%)*
MAP           0.265                 0.276 (+4.2%)
P@10          0.539                 0.561 (+4.1%)*

According to our analysis of boosting scores, the most informative structural distance is the heat kernel (Table 1, feature #9). Adding the structural features significantly improves NDCG and P@10, and also yields better performance with respect to MAP. This finding supports our hypothesis that the semantic link information does extend the effective search context. For example, our method achieves the largest improvement in NDCG as well as MAP on the complex query shobana masala, which likely asks about the Indian actress Shobana Chandrakumar, who starred in movies of the Masala genre. Clearly, the query is not well treated by n-gram features and is also hard for MLM, because MLM favors other entities named Shobana as well. On the contrary, since the correct entity is included in the top-3, the structural features boost entities that are identical or very close to it. Similar explanations apply to the other best performing queries in the comparison: motorola bluetooth hs850, sagemont church houston tx, philadelphia neufchatel cheese, and bounce city humble tx. All these queries name a primary entity and contain refining terms. At the same time, the top-3 entities for these queries have rich link information (primarily, the ones from DBpedia).

After analyzing the worst performing queries, we found two main causes of failures. The first cause is the poor performance of the baseline on retrieving the top-3 similar entities needed for the structural features (e.g. for metropark clothing, sedona hiking trails), which leads to drifting towards other entities. The second cause is query ambiguity (e.g. emery). The baseline correctly favors entities with single-term names, as in the case of "the Emery", a rock band, whereas the structural features promote entities representing people with similar names, e.g. Emery Barns or Sid Emery, which assessors tend to label as irrelevant.

6. CONCLUSION AND FUTURE WORK

In this paper, we presented a novel, principled, and scalable approach to incorporating structural and term-based evidence for entity ranking. In particular, we have introduced a scalable application of tensor factorization to entity search, and developed new and effective features for entity ranking. Our method outperforms the previous state of the art on a large-scale evaluation over a standard benchmark data set. We complemented our experimental results with a thorough error analysis and discussion.

In the future, we plan to explore extending the entity structure representation by incorporating term information into the latent space, because this will enable us to infer a distribution of latent factors for entities with limited link information. It could be done by enhancing the tensor structure with an entity-term matrix. Yet another prospective research direction is the application of the method in the entity list search scenario.
7. REFERENCES
[1] R. Blanco, P. Mika, and H. Zaragoza. Entity Search Track Submission by Yahoo! Research Barcelona. In Proceedings of WWW '10. ACM, 2010.
[2] S. Campinas, R. Delbru, N. A. Rakhmawati, D. Ceccarelli, and G. Tummarello. Sindice BM25MF at SemSearch 2011. In Proceedings of the 20th International Conference on World Wide Web. ACM, 2011.
[3] S. Cronen-Townsend and W. B. Croft. Quantifying query ambiguity. In Proceedings of the Second International Conference on Human Language Technology Research, pages 104–109, 2002.
[4] J. Friedman. Greedy function approximation: A gradient boosting machine. The Annals of Statistics, 29(5), 2001.
[5] H. Hu and X. Du. Combining n-gram retrieval with weights propagation on massive RDF graphs. In International Conference on Fuzzy Systems and Knowledge Discovery, pages 1181–1185, 2012.
[6] R. Neumayer, K. Balog, and K. Nørvåg. On the Modeling of Entities for Ad-hoc Entity Search in the Web of Data. Advances in Information Retrieval, 7224:133–145, 2012.
[7] M. Nickel, V. Tresp, and H.-P. Kriegel. Factorizing YAGO: Scalable Machine Learning for Linked Data. In Proceedings of the 21st International Conference on World Wide Web, pages 271–280. ACM, 2012.
[8] J. Pound, P. Mika, and H. Zaragoza. Ad-hoc Object Retrieval in the Web of Data. In Proceedings of WWW '10, pages 771–780, 2010.
[9] A. Tonon, G. Demartini, and P. Cudré-Mauroux. Combining Inverted Indices and Structured Search for Ad-hoc Object Retrieval. In Proceedings of the 35th International ACM SIGIR Conference, pages 125–134. ACM, 2012.