Post-Search Query Modeling in Federated Web Scenario - IEEE Xplore

Post-Search Query Modeling in Federated Web Scenario

Jolanta Mizera-Pietraszko
Department of Computer Science and Management
Wroclaw University of Technology
Wroclaw, Poland
[email protected]

Abstract — As opposed to query reformulation, which is oriented towards changes made by a user to specify the information need more precisely, post-search query modeling is a technique that exploits the syntax variation of a gradually extended query which, depending on other factors such as the resource, the database or the keyword alignment, facilitates the searching process. The study of modeling queries submitted to search engines that utilize different translation semantic paradigms is motivated by the real-world challenge of retrieving heterogeneous textual documents from the Web. For a couple of language pairs, we develop a user-centered framework for Hidden Web traffic optimization. In the literature, the Hidden Web is the facet of the World Wide Web usually missed by standard information systems. Our data set contains a variety of query types submitted to translingual systems that perform a number of syntax-driven indexing techniques. These are evaluated by constructing a precision trend function, one that intensifies the relevance set of the system responses through a dramatic reduction of those outside the user’s interest.

In other words, we concatenate the indexing techniques of some multilingual retrieval systems by profiling the queries submitted by the user interacting with the system. The starting point of the proposed searching process is that, by submitting queries, the user systematically broadens his or her knowledge of the field of interest, which results in longer and increasingly precise subsequent queries.


The scheme in Fig. 1 presents the proposed modeling scenario of the human-language interaction. At stage 1, called step 1 in Fig. 1, the user enters a keyword of a general nature according to his or her knowledge of a particular topic, for example the word “information”. Whichever search engine is selected, the number of responses is going to be huge, so nobody is capable of reading even the snippets alone. Yet, amongst those that have been read, there must be some of interest to the user. The first step of the approach is to move forward by adding a word that makes the query more precise, e.g. “information retrieval”. The number of system responses (Web documents) drops dramatically, while those of interest to the user are simultaneously pushed upward, that is, towards the top ten visible on the screen. The second step, if the user still finds it necessary, gives a real chance to go into even more detail, in other words, to profile the human-computer interaction. For instance, the user may need to learn about “text information retrieval” or “image information retrieval”, which lowers the number of search results even more. Overall, going this way step by step, it is often possible to reach a stage at which very few results appear on the screen, but all of them are relevant, which is the purpose of every user’s interaction. To explain the phenomenon, we observed that while modeling the query using our approach, some of the keywords are removed from the search process and replaced by those producing much more relevant results from the perspective of the user’s need. In our recent research experiments, we

INTRODUCTION

Efficient query formulation is one of the formidable challenges for users, specifically for multilingual purposes, as it requires the user to predict the keywords that are matched by the documents. When accessing very large databases, any ambiguous query keyword produces a long list of irrelevant results that take a long time to sift through. Specifically, in the federated Web scenario, the data models utilized by the search engines differ from those that process local databases either in syntax or in the target language, so that the system is unable to index the data efficiently. As a result, a lot of information is missing. The motivation of our study is to generate information in the federated Web scenario, so we start from an analysis of the query language profile in some search engines working on different language models. The research methodology relies on identifying the factors that limit the searching process, and it is oriented towards the needs of the user, even one who has only a fuzzy idea of how to formulate the query.

978-1-4799-2259-14/$31.00©2014

Fig. 1. Post-Search Query Modeling Scenario (diagram labels: keyword 1, step 1, keyword 2, step 2, keyword 3)

Keywords—Query Modeling, Search Strategy, Hidden Web, Trans-lingual Information Retrieval, Human-Computer Interaction

I.


The Latent Semantic Indexing (LSI) technique improves the retrieval of multilingual information without implementing a translation component. A document is processed as a set of words of the semantic space, created from the source text in the form of a matrix whose values represent the co-occurrence of the query word in this particular document. A measure of the position in the ranking list is computed from the approximated coverage of the semantic synonymy of the document words, provided that the number of these words is close to the number of the query words [7].

Data consistency can be analyzed from the integration viewpoint when we consider updates of the digital documents as an extra factor responsible for inconsistent information. The data then come from a variety of resources, some of which are outdated while others are not. Integrating such documents makes the resulting list much longer, since the same keyword is linked to both kinds of information, and in addition it is misleading for the user [14].

From the perspective of worldwide research, a comparative analysis of the current indexing techniques in the federated Web scenario verifies a popular view of the human-computer interaction methods preferred by the user, e.g. entering queries in the form of keywords only or of simple phrases. Specifically, it supports the role of the user’s expertise about the system as well as about the optimization methodology of the multilingual retrieval process. Post-search query modeling relies on the user’s knowledge acquired from browsing the Web, in particular the Deep Web, that is, the portals routinely missed by the standard systems working in a federated environment such as ScienceResearch.com [15].
While searching, the user gathers the information indexed by the standard databases and, depending on the system’s efficiency, continues the process either by removing the key phrases found not relevant or by expanding the query to make it more specific for the system. The first attempts to define this technology took place when the Net Snippets project was launched [16].
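The gradual extension of a query described above can be illustrated with a minimal sketch. It is not the evaluation procedure of this study: the toy corpus, the set of relevant documents, the conjunctive matching rule and the function names `search`, `precision` and `precision_trend` are all assumptions made up for illustration. The sketch only shows how the share of relevant responses can grow while the result list shrinks as keywords are added.

```python
def search(corpus, query):
    """Toy conjunctive search: ids of documents that contain every query word."""
    terms = query.lower().split()
    return [doc_id for doc_id, text in corpus.items()
            if all(term in text.lower().split() for term in terms)]


def precision(results, relevant):
    """Share of returned documents that are relevant; 0.0 when nothing is returned."""
    if not results:
        return 0.0
    return sum(1 for doc_id in results if doc_id in relevant) / len(results)


def precision_trend(corpus, queries, relevant):
    """For each successive query, report (query, number of results, precision)."""
    trend = []
    for query in queries:
        results = search(corpus, query)
        trend.append((query, len(results), precision(results, relevant)))
    return trend


# Hypothetical five-document collection; the user is interested in d4 and d5.
corpus = {
    "d1": "information about train timetables",
    "d2": "information retrieval for legal archives",
    "d3": "image information retrieval with colour histograms",
    "d4": "text information retrieval in multilingual collections",
    "d5": "text information retrieval evaluation measures",
}
relevant = {"d4", "d5"}
queries = ["information", "information retrieval", "text information retrieval"]

trend = precision_trend(corpus, queries, relevant)
for query, hits, prec in trend:
    print(f"{query!r}: {hits} results, precision {prec:.2f}")
```

In this made-up collection the result list shrinks from five to four to two documents, while the precision rises from 0.40 to 0.50 to 1.00.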

observed that in this kind of interaction with the system, the proportion of relevant responses to all the items displayed in the ranking list systematically increases. However, once a certain number of query words is exceeded, the system produces no results.

The remainder of this paper is organized as follows. First, we introduce the general concept of our work, that is, the models used in the experiment; we present the state-of-the-art technology and describe the federated Web scenario and the post-search query modeling before moving on to the overall research framework. In section three, we review the semantic space structure of English-French bi-texts, in particular examples of the retrieval process. The next section introduces the federated Web scenario and latent semantic paradigms. In the last section, we discuss the results and draw conclusions for further research.

II. BACKGROUND

A. Retrieval Models

Pragmatically, different search engines produce different ranking lists for the same queries, depending on factors such as the language resources, the databases accessed or, specifically, the matching algorithms, which usually deploy dissimilar key phrase and document search technologies. The analysis of the factors of Panda, a 2012 successor of the PageRank algorithm, indicates that accounting for discrepancies in keyword positioning improves the ranking list of the system results tremendously. For instance, Google does not index pages larger than 101 KB, and a query consisting of more than thirty-two words gives no results [4], compared with the ten-word query limit in 2004. Likewise, the query search modes utilized by information systems implement retrieval models such as the Boolean, Vector, Probabilistic, Language and Fuzzy models, or Latent Semantic Indexing (LSI). For example, Google, as a semantic search engine, is one of those that employ classical LSA, which empowers searching by indexing synonyms of the query words. We test the technique using both the English and the French parts of the Canadian Hansard parallel corpora.
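Two of the retrieval models listed above, the Boolean and the Vector model, can be contrasted in a short sketch. The documents, the query, and the function names `boolean_match` and `cosine_score` are hypothetical, and raw term frequencies stand in for whatever weighting a real engine applies. The sketch also hints at why synonym-aware indexing such as LSA is attractive: neither model relates “buy” to “purchase”.

```python
import math
from collections import Counter


def boolean_match(query, document):
    """Boolean AND model: the document matches only if it contains every query term."""
    doc_terms = set(document.lower().split())
    return all(term in doc_terms for term in query.lower().split())


def cosine_score(query, document):
    """Vector model: cosine of the angle between raw term-frequency vectors."""
    q = Counter(query.lower().split())
    d = Counter(document.lower().split())
    dot = sum(q[term] * d[term] for term in q)
    norm = (math.sqrt(sum(v * v for v in q.values()))
            * math.sqrt(sum(v * v for v in d.values())))
    return dot / norm if norm else 0.0


# Made-up documents; the second one uses a synonym of the query word "buy".
documents = [
    "buy cheap train tickets",
    "purchase train tickets online",
    "train timetable information",
]
query = "buy train tickets"

boolean_hits = [doc for doc in documents if boolean_match(query, doc)]
ranking = sorted(documents, key=lambda doc: cosine_score(query, doc), reverse=True)
```

Under these assumptions the Boolean model returns only the first document, whereas the Vector model ranks all three and places the synonym-bearing document second, credited only for the two words it shares literally with the query.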

B. Literature Review

The federated Web scenario constitutes a formidable obstacle for the searching process. As a result, most unlinked pages, very large databases, the variety of information resources based on unstructured models and, in particular, multilingual documents are not indexed by conventional systems. Metadata search goes somewhat deeper than standard crawling. Still, it does not match most multilingual documents [5]. This is a primary reason why multilingual search is found less effective than monolingual search. An iterative algorithm of the Google link analysis produces a ranking list of the search results [6] using a transition matrix and a random walk through the linked pages. Therefore, any alignment of the query words impedes the system responsiveness.

III. SEMANTIC BI-TEXT SPACE STRUCTURE

Semantics in bi-text spaces relies on finding an alignment between particular phrases to infer the contextual meaning. Resolving ambiguity requires consideration of synonymy, in which a concept is expressed in many forms, as well as polysemy, in which different meanings apply to the same term.

Polysemy: WOOD¹
1. The secondary xylem of trees and shrubs, lying beneath the bark and consisting largely of cellulose and lignin.
2. A dense growth of trees or underbrush covering a relatively small or confined area.
Which meaning: 1 or 2?

Synonymy: buy ≡ purchase
The same result

Fig. 2. Polysemy versus Synonymy

¹ The FreeDictionary by Farlex, http://www.thefreedictionary.com/polysemy

and j different meanings of the word, respectively) can be expressed by a formula

Assuming that the same words create a limited number of grammatical structures, specifically latent ones, it is possible to achieve impressive retrieval results without machine translation, language resources, or any human interaction with the system, especially in relation to the false positive system responses (scored as irrelevant to the query) and the false negative ones, which represent the numerous missing documents. A keyword-document-driven space model such as the Latent Semantic Technique, called Latent Semantic Analysis (LSA) when used for analysis, Probabilistic LSA (PLSA) [9] when used for prediction models, and otherwise Latent Semantic Indexing (LSI), captures the semantic resemblance, in the target language, of the conceptual patterns across a database of documents and relates the statistically computed results to each of these documents. Simply put, it relates the structural patterns within a document to the whole collection. Each keyword is represented in the vector by its occurrences in the documents of the semantic space. Then both the query word and the document spaces are created; they represent the query word-document matrix, whose Singular Value Decomposition (SVD) is expressed as

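$$
X_{r,c} = U_{r,d} \times S_{d,d} \times V_{d,c}
$$

$$
\begin{bmatrix}
x_{11} & \cdots & x_{1c} \\
\vdots & \ddots & \vdots \\
x_{r1} & \cdots & x_{rc}
\end{bmatrix}
=
\begin{bmatrix}
u_{11} & \cdots & u_{1d} \\
\vdots & \ddots & \vdots \\
u_{r1} & \cdots & u_{rd}
\end{bmatrix}
\times
\begin{bmatrix}
s_{11} & \cdots & s_{1d} \\
\vdots & \ddots & \vdots \\
s_{d1} & \cdots & s_{dd}
\end{bmatrix}
\times
\begin{bmatrix}
v_{11} & \cdots & v_{1c} \\
\vdots & \ddots & \vdots \\
v_{d1} & \cdots & v_{dc}
\end{bmatrix}
\qquad (1)
$$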

with the r rows that are the word occurrences and the c columns being their matches in the document. The extra index d represents the number of dimensions of the reduced model, in which d ≤ min(r, c) [10].

| | Trouver | des informations | pour | les artistes |
| Trouver | 0.87 | 0.07 | 0.14 | 0.13 |
| des informations | 0.31 | 1 | 0.31 | 0.25 |
| sur | 0.22 | 0.23 | 0.45 | 0.18 |
| les comédiennes | 0.16 | 0.21 | 0.21 | 0.76 |

Definition 2. Let S be a set of s sentences, let S′ be the subset of S whose sentences contain the co-occurring words w1, w2, …, wn, and let s′ be the number of sentences in S′. The context of a word w is the set W(w) of words different from w, represented by the formula

$$
W(w) = \{\, w_k \neq w : w_k \text{ co-occurs with } w \text{ in at least one sentence of } S \,\}, \quad 1 \le k \le n \qquad (3)
$$

That is, the context of word w is the set of all the words different from w that co-occurred with w in at least one sentence of the set S.

Definition 3. The context of a word pair w(1) and w(2) is the set W(w(1), w(2)) defined by the formula

$$
W(w^{(1)}, w^{(2)}) = \{\, w^{(k)} : w^{(k)} \text{ co-occurs with both } w^{(1)} \text{ and } w^{(2)} \,\}
$$

The similarity of word w1 in meaning i and word w2 in meaning j is

$$
sim_{i,j}(w_1, w_2) = \frac{[\ldots]}{|W_i(w_1) \cup W_j(w_2)|} \qquad (2)
$$

Thus, word w1 in meaning Wi(w1) is a synonym of word w2 in meaning Wj(w2) if and only if simi,j(w1, w2) ≥ β, where β is the threshold. Otherwise, we say that these two words w1 and w2 are semantically similar, that is 0
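The synonym criterion above, in which the contexts of two words are compared against a threshold β, can be sketched as follows. The sketch rests on assumptions that are not confirmed by the text: the similarity is taken to be the overlap of the two contexts divided by their union (a Jaccard ratio), each word is given a single context rather than one per meaning, and the sentences, the value of β and the function names `context`, `similarity` and `is_synonym` are hypothetical.

```python
def context(word, sentences):
    """Words other than `word` that co-occur with it in at least one sentence."""
    found = set()
    for sentence in sentences:
        tokens = set(sentence.lower().split())
        if word in tokens:
            found |= tokens - {word}
    return found


def similarity(context_a, context_b):
    """Assumed Jaccard ratio |A ∩ B| / |A ∪ B|; 0.0 when both contexts are empty."""
    union = context_a | context_b
    return len(context_a & context_b) / len(union) if union else 0.0


def is_synonym(context_a, context_b, beta):
    """Synonym test: similarity of the two contexts reaches the threshold beta."""
    return similarity(context_a, context_b) >= beta


# Made-up sentences; "buy" and "purchase" share their whole context.
sentences = [
    "customers buy tickets online",
    "customers purchase tickets online",
    "trees grow in the wood",
]
buy_ctx = context("buy", sentences)
purchase_ctx = context("purchase", sentences)
wood_ctx = context("wood", sentences)

print(is_synonym(buy_ctx, purchase_ctx, beta=0.8))
print(is_synonym(buy_ctx, wood_ctx, beta=0.8))
```

With these toy sentences, “buy” and “purchase” have identical contexts and pass the test, while “buy” and “wood” share no context words and fail it.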
