Reformulation of Telugu Web Query using Word Semantic Relationships
Ramakrishna Kolikipogu
Padmaja Rani B
Vijayalakshmi Kakulapati
Department of CSE CMR College of Engg and Tech Hyderabad, India
Department of CSE JNTU College of Engineering Hyderabad, India
Department of CSE RRS College of Engg and Tech Muthangi, Medak,India
[email protected]
[email protected]
[email protected]
ABSTRACT
Use of the Internet to meet information needs is becoming more popular in India. Major areas of information browsing include education, medicine, agriculture, geography, business and other social domains. The availability of electronic documents in Indian languages is growing day by day. People living throughout India speak different languages, and the Government of India has given official status to 22 languages, the "languages of the 8th Schedule". Compared to European languages and other Indian languages, processing Telugu electronic documents is more difficult. This is due to the multiple encoding formats of the text: Indian languages are encoded using Unicode and ISCII. To speed up the retrieval process, Unicode or ISCII text needs to be converted into a simple, standard encoding, which makes Information Retrieval an easier task. Once an information processing system is built for a single language, it is the base for multilingual and cross-lingual information processing. In Information Retrieval, users expect exact results for a given query. This depends on the vocabulary expertise of the end user in building the root query. Word mismatch is a common problem for all languages in Information Retrieval. Query Expansion offers a solution to the word mismatch problem: the top-ranked documents are used to expand the query terms. Sometimes the user needs to judge the relevance of the expanded query to iterate the search. The user's relevance judgment depends on the user's knowledge (i.e., the language knowledge needed to describe the context of the query). If the concept hierarchy is properly defined, user involvement is unnecessary in this scenario. This can be easily tested on English, but applying Query Reformulation techniques directly to Indian languages does not work well, because Indian languages are not as simple as English.
This paper aims to reduce the mismatch between the user query and the retrieved documents by using semantic relationships between query terms and document terms. To test the proposed model, Telugu, one of the Indian languages, is taken as a case study. True translation from English to Telugu and vice versa is not possible due to high word conflation in Indian languages. This paper is an attempt to adopt a Semantic Network with semantic relationships between the terms of a query to reformulate the query and iterate the search. The Relevance Feedback method improves recall without compromising precision, but it works well only on a limited corpus. Reformulating the query by embedding WordNet and ConceptNet relationships gave better results, but a great fall in precision was observed. The initial query results and the reformulated query results are compared in the result analysis.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. ICACCI’12, August 3-5, 2012, Chennai, T Nadu, India. Copyright 2012 ACM 978-1-4503-1196-0/12/08…$10.00.
Categories and Subject Descriptors
[Information Retrieval]: Information Retrieval Systems, Semantic Web, Natural Language Processing, Machine Translation
General Terms
Design, Algorithms, Measurement, Experimentation, Performance.
Keywords
Information Retrieval, Web Query, Semantic Network, Query Reformulation, Synset, WordNet, Tokenization, Lemmatization, POS Tagging, Boolean Model, Vector Space Model, Indic Scripts.
1. INTRODUCTION
The Web is the best resource for gathering the information needed by end users. The ever-increasing abundance of knowledge sources in digital form has prompted a large amount of effort in developing road maps for information retrieval. Processing and managing such a vast amount of stored documents is a challenging task for researchers. Given this scale, it is difficult for users to find exact information in a big corpus. For European languages, including English, retrieval has reached levels that satisfy users gathering information against queries, but effort is still required to give the best results to naïve users. One of the major problems is vocabulary mismatch when retrieving exact information in the user's context. The problem has more impact on non-English languages. With increasingly higher numbers of non-English language Web searchers, the problems of efficiently handling non-English Web documents and user queries are becoming major issues for search engines [3]. Information retrieval techniques are being restructured to get the expected results. Naïve users often fail to write well-formed queries that specify the context of their interest. The following questions should be raised before results are returned:
• How does the system understand the user query?
• How is the bag of words collected after removing stop words?
• How is the conceptual meaning of the query obtained?
• How is the bag of words given semantic meaning?
• How is the query reformulated using word semantic relationships?
• Does query expansion preserve the meaning of the root query?
• How is the Pseudo Relevance Feedback model applied?
• How is the correctness of the results checked and maintained along with the semantic meanings?
All of the above questions are addressed in this paper. In general, users submit short queries that do not cover the variety of terms used to describe a topic, resulting in poor recall [15]. On the one hand, the vocabulary used in the root query may differ from the vocabulary of particular information resources; on the other hand, the user's vocabulary may not be discriminating enough to identify matching documents. These two scenarios lead to retrieval failure [5]. This paper proposes a novel model that expands the root query with the help of a semantic network to refine the outcome of the search engine. The model is tested on a Telugu corpus collected from www.te.wikipedia.org. This paper is organized into six sections. Section 1 introduces information retrieval systems, Web search engines, Indian languages, Telugu language characteristics, and WordNet/Semantic Network support, along with the necessity of query expansion in information retrieval. Related work on Semantic Networks and WordNet is surveyed in Section 2. Section 3 explores Query Expansion techniques and discusses their adaptation for Telugu language processing. The proposed framework is given in Section 4. The results are analyzed in Section 5, followed by the conclusion and future scope in Section 6.
2. RELATED WORK
An Information Retrieval system is not limited to gathering information; it also processes digital documents, stores them in the form of a corpus, and organizes and manages them efficiently. A text corpus is built from various resources and made available for use in various information retrieval processes. The number of digital documents available on the World Wide Web increases dramatically every day, and searching for needed information in this huge collection of heterogeneous sources is difficult. Using a Semantic Network for information retrieval is not a new concept, but applying word semantic relationships to a Telugu corpus is a challenging task. WordNet [1] plays a major role in the Information Retrieval process and improves recall. According to [8], query expansion improves recall, but a wrong selection of sense or context can have an adverse effect on precision, so selecting key terms to expand the root query is difficult. Using WordNet helps to disambiguate the senses of a keyword and improves retrieval performance [9]. Over the last few decades, research has produced rich methods for providing effective results for a user query [8]. In sense-based query expansion, Word Sense Disambiguation (WSD) resolves the sense of a word in a sentence when the word has more than one meaning [12]. Connecting to a Semantic Network to expand the query is addressed in the following sections. Indian languages are rich in morphology [4], so it is difficult to obtain the root form of words. Word inflection is high in Telugu, and multiple variants of a word are a major problem in expanding a query [9]. We perform query expansion by generating lexical paraphrases of queries. These paraphrases replace content terms in the topic words to return the stem [10]. In Semantic Web search, information interpretation and aggregation are handled by ontology-based semantic annotation.
The work in [19] examines semantic annotation, identifies a number of requirements, and reviews the current generation of semantic annotation systems. In [20], the researchers investigate the definition of an ontology-based IR model oriented to the exploitation of domain knowledge bases to support semantic search capabilities in large document repositories. Another work [21] focuses on a holistic architecture for semantic annotation, indexing, and retrieval of documents with regard to extensive semantic repositories. A semantically enhanced information extraction system has been proposed that provides automatic semantic annotation with references to classes in the ontology and to instances [25]. Using WordNet during the indexing phase proved to be more effective, by adding the synonyms and holonyms of the encountered geographical entities to each document's index terms [24]. A novel metric to measure the semantic relatedness between words has been proposed, based on ontologies represented using a general knowledge base for dynamically building a semantic network [22]. Obtaining an efficient approach to rank digital documents from the Internet according to the user's interest domain is difficult [24]. An ontology-based user model, called user ontology, for providing personalized information services in the Semantic Web uses concepts, taxonomic relations, and non-taxonomic relations in a given domain ontology to capture the users' interests [23]. Semantic relations have not been built for all languages; for Telugu, this work has yet to be initiated.
3. QUERY EXPANSION
Query Expansion is the process of reformulating the root query by adding an optimal set of terms that improves recall and precision. The motivation for query expansion is to reduce the mismatch between the query and the documents by expanding the query with words or phrases that are synonymous with the query terms or that share other statistical relationships with the terms contained in the set of relevant documents [13]. The goal of Query Expansion is to find representative words for describing the relevant documents [28]: related terms are added to the query to extend its coverage, i.e., to retrieve more relevant documents. Query expansion is one of the promising approaches to the word mismatch problem in information retrieval [18]. Thesaurus-, WordNet- or ontology-based query expansion approaches cause query drift, mainly because of the polysemy of words [2]. However, [17] argues that various features of a thesaurus influence retrieval performance. A fully described vocabulary in a thesaurus such as WordNet, which maps between different vocabularies, e.g., between concepts/senses and textual words, improves retrieval performance [17]. The expansion is usually done by using a thesaurus or a set of statistical association relations between terms [6]. A study of how query expansion improves search results was made in [8]. When the user gives the initial query, the system adds some new related terms and forms a new query, known as the expanded query. The new terms widen the scope of the original query, so that more relevant items are retrieved. This improves the recall ratio. The main problem with this approach is identifying the appropriate words to add. A wrong expansion is costly, causing a great fall in precision (i.e., the meaning of the query changes and more unrelated items are retrieved).
So we must consider what is to be added and how to link those terms to the initial query. We used statistical methods for matching term-term and item-term relationships in global analysis, which considers the corpus in a statistical manner (i.e., term co-occurrence analysis). Empirical analyses show that using statistical methods for information retrieval with a semantic network or thesaurus improves recall [14]. Query Expansion is broadly classified into two types, local analysis and global analysis, and aims to improve recall without compromising precision. Reformulating the query by local analysis of the initial results of the base query is called Relevance Feedback Query Expansion [18]. Global analysis examines the whole document set or corpus and, in this work, is combined with the Pseudo Relevance Feedback (PRF) approach. Relevance Feedback (RF) involves the user in judging the relevance of documents for the given query.
International Conference on Advances in Computing, Communications and Informatics (ICACCI-2012)
Sometimes users are not interested in giving feedback, or they do not have enough knowledge to judge. An alternative to RF is the Pseudo Relevance Feedback approach, in which the top-ranked documents are assumed to be relevant and are used to select expansion terms. Query Expansion is further supported by local analysis and global analysis [8]. In local analysis, the expansion terms are extracted from these top-ranked documents [8]. This work aims to use global analysis and Query Expansion techniques to improve recall without compromising precision on a Telugu corpus. The proposed framework can be adapted to many Indian languages with slight variations to account for their individual characteristics.
3.1 Local Analysis
Local analysis techniques analyze the initially retrieved documents and reweight the terms in the query. If the initially retrieved documents are relevant to the user query, the expansion greatly improves the results and returns a more relevant set of documents. If the initially retrieved documents are not relevant, the results drift out of the domain. Local analysis works best for a limited corpus consisting mostly of relevant documents. It can be implemented in two ways: Relevance Feedback and Pseudo Relevance Feedback.
3.1.1 Relevance Feedback
The idea of Relevance Feedback is to involve the user in the retrieval process so as to improve the final result set. The user issues an initial query, and the system returns an initial set of relevant documents. These documents are analyzed to determine the concepts for query expansion. Different techniques (e.g., the Vector Space Model) are used to find the co-occurrence of terms in the documents retrieved in the first iteration; the Vector Space Model is used to determine the concepts and reformulate the initial query. The steps involved in the Relevance Feedback model are:
1. Input the initial query Qi against the document set (D).
2. Identify the N top-ranked documents in the initial result set.
3. Find all the terms in the N top-ranked document set (Dt).
4. Select the feedback terms (Te) to add to the initial query.
5. Add the feedback terms to the initial query; the user judges whether to accept the result as the expanded query Qe.
6. Identify the top-ranked relevant documents (Dr) for the expanded query through relevance ranking.
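The six steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the system evaluated in this paper: documents are already tokenized, ranking uses raw query-term counts as a stand-in for a Vector Space Model score, and the names `top_ranked`, `feedback_terms`, `relevance_feedback` and the `user_accepts` callback are hypothetical.

```python
from collections import Counter


def top_ranked(query_terms, documents, n):
    """Steps 1-2: rank documents by total query-term occurrences, keep the top n."""
    scored = []
    for doc_id, tokens in documents.items():
        score = sum(tokens.count(term) for term in query_terms)
        if score > 0:
            scored.append((score, doc_id))
    # Highest score first; ties are broken by document id for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored[:n]]


def feedback_terms(query_terms, documents, top_docs, m):
    """Steps 3-4: collect terms of the top documents (Dt), pick the m most frequent (Te)."""
    counts = Counter()
    for doc_id in top_docs:
        counts.update(t for t in documents[doc_id] if t not in query_terms)
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:m]]


def relevance_feedback(query_terms, documents, n, m, user_accepts):
    """Steps 5-6: let the user judge the candidate query, then rank again."""
    top_docs = top_ranked(query_terms, documents, n)
    candidate = list(query_terms) + feedback_terms(query_terms, documents, top_docs, m)
    expanded = candidate if user_accepts(candidate) else list(query_terms)
    return expanded, top_ranked(expanded, documents, n)


# Toy corpus with hypothetical English tokens standing in for Telugu terms.
docs = {
    "d1": ["river", "water", "flow"],
    "d2": ["river", "water", "bank"],
    "d3": ["song", "music"],
}
expanded, ranked = relevance_feedback(["river"], docs, n=2, m=1,
                                      user_accepts=lambda query: True)
print(expanded, ranked)
```

The `user_accepts` callback marks the point where the user's judgment enters; replacing it with a function that always returns true turns the sketch into a blind (pseudo) feedback loop.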
Figure 1. Relevance Feedback Query Expansion.
The system computes a better representation of the information need based on the user feedback [8]. In this approach, the user must have some domain knowledge to judge the relevance of the reformulated queries and give feedback, which can make the process a burden for the user. The basic problem is that naïve users fail to use good vocabulary when building exact queries for the system.
3.1.2 Pseudo Relevance Feedback
Pseudo Relevance Feedback (PRF), also known as blind relevance feedback, is an automatic relevance feedback method for local analysis. It automates the manual part of relevance feedback, so the user gets faster, improved retrieval performance without an extended interaction. The method performs a normal retrieval to find an initial set of the most relevant documents, assumes that the top k ranked documents are relevant, and then performs relevance feedback as before under this assumption. PRF via query expansion has proven effective in many information retrieval (IR) tasks [16]. Term reweighting is used to increase the weight of terms in relevant documents and decrease the weight of terms in irrelevant documents.
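The term reweighting described above can be illustrated with a Rocchio-style update, in the spirit of [13]. The sketch below is an assumption-laden example rather than the procedure used in this paper: documents are given as term-weight dictionaries already sorted by rank, documents below rank k are treated as irrelevant, and the coefficients `alpha`, `beta` and `gamma` are illustrative values, not tuned ones.

```python
from collections import Counter


def centroid(doc_vectors):
    """Average term weights over a list of {term: weight} vectors."""
    if not doc_vectors:
        return {}
    total = Counter()
    for vector in doc_vectors:
        for term, weight in vector.items():
            total[term] += weight
    return {term: weight / len(doc_vectors) for term, weight in total.items()}


def rocchio_prf(query_vec, ranked_docs, k, alpha=1.0, beta=0.75, gamma=0.15):
    """Reweight query terms, assuming the top k ranked documents are relevant."""
    relevant = centroid(ranked_docs[:k])     # pseudo-relevant set (PRF assumption)
    irrelevant = centroid(ranked_docs[k:])   # remaining documents (assumption)
    new_query = {}
    for term in set(query_vec) | set(relevant) | set(irrelevant):
        weight = (alpha * query_vec.get(term, 0.0)
                  + beta * relevant.get(term, 0.0)      # raise terms of relevant docs
                  - gamma * irrelevant.get(term, 0.0))  # lower terms of irrelevant docs
        if weight > 0:
            new_query[term] = weight
    return new_query


# Toy ranked list with hypothetical terms.
ranked_docs = [{"river": 1.0, "water": 1.0}, {"song": 1.0}]
new_query = rocchio_prf({"river": 1.0}, ranked_docs, k=1)
print(sorted(new_query.items()))
```

Terms whose weight falls to zero or below are dropped, so a term that occurs only in the low-ranked documents never enters the expanded query.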
3.2 Global Analysis
Global analysis determines the similarity between concepts in a global context. Concepts are used to define the context of a query or of a sentence in a document. In the simplest definition, all words except common words are concepts. The context of a query is defined by the co-occurrence of the query terms with a given word in the documents. Global analysis is related to the representations generated by other dimensionality-reduction techniques [27]. The essential difference is that global analysis is used only for query expansion and does not replace the original word-based document representations; reducing dimensions in the document representation leads to problems with precision [26]. Global analysis techniques examine word occurrences and relationships in the corpus as a whole and use this information to expand any particular query [7]. Different types of structures are used to correlate terms, concepts, senses and other features.
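One simple way to picture corpus-wide co-occurrence analysis is a term-term count table built over all documents. The following sketch assumes document-level co-occurrence (two concepts co-occur if they appear in the same document) and a caller-supplied stop-word set; the function names and the toy corpus are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def cooccurrence(documents, stop_words=frozenset()):
    """Count, over the whole corpus, how many documents contain each concept pair."""
    counts = defaultdict(int)
    for tokens in documents:
        # Every word that is not a common (stop) word is treated as a concept.
        concepts = sorted(set(tokens) - stop_words)
        for a, b in combinations(concepts, 2):
            counts[(a, b)] += 1
    return counts


def related_terms(term, counts, top_n):
    """Return the top_n concepts that co-occur most often with the given term."""
    scores = {}
    for (a, b), count in counts.items():
        if a == term:
            scores[b] = count
        elif b == term:
            scores[a] = count
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [other for other, _ in ranked[:top_n]]


corpus = [
    ["river", "water", "the"],
    ["river", "water", "bank"],
    ["song", "music"],
]
counts = cooccurrence(corpus, stop_words=frozenset({"the"}))
print(related_terms("river", counts, top_n=2))
```

Because the table is computed once for the whole corpus, the same counts can be reused to expand any query, which is what distinguishes this from the per-query local analysis of Section 3.1.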
Word mismatch is the basis for query expansion. Several factors keep Relevance Feedback from being widely used in IR applications. Users may be reluctant to provide explicit feedback. The resulting long queries may require more computational time, while search engines process many queries and allow little time for each one. It is also harder to understand why a particular document was retrieved. Pseudo Relevance Feedback is an alternative that avoids user intervention in reformulating the initial query.

3.2.1 Thesaurus
The query is expanded using terms whose meanings are related to the terms of the initial query. Query expansion by thesaurus avoids manual interaction. Query expansion generally increases recall and is widely used in many science and engineering fields [29]. Knowledge stored in a thesaurus or another global information source is used to increase recall. Thesauri have frequently been incorporated in information retrieval systems as a device for recognizing synonymous expressions and linguistic entities that are semantically similar but superficially distinct [30]. Various methods have been proposed [31] to expand the query using a thesaurus to improve recall; they fall into three major categories: hand-crafted thesauri, co-occurrence-based thesauri and head-modifier-based thesauri. Query expansion based on hand-crafted thesauri is only successful if the thesaurus is domain-specific and corresponds closely to the domain-specific document collection being searched [32]. In general, however, query expansion with hand-crafted thesauri has not been very successful [32]. A hand-crafted thesaurus describes the synonymy relationships between words [11]. Words taken from a thesaurus are expected to improve precision, but they cause some problems in query expansion.
Example 1: “వింత [viMta] / curiosity / wonder / oddity / oddness / an odd thing / a rarity / a thing that causes wonder / a marvel” has many forms, such as “ఆశ్చర్యము [AScaryamu]” and “అద్భుతము [adbutamu]”.
Example 2: “బండి [baMDi] / truck” vs. “[…] / mode of transportation” / “a vehicle used to carry from one place to another”, “వాహనము [vAhanamu] / vehicle”.
Thesaurus-based query expansion gained only a small improvement in recall, about 10% to 20%, on a domain-specific corpus [9]. Thesaurus-based QE techniques have therefore not been successful in gaining recall without compromising precision. We also observed that short queries (i.e., one- or two-word queries) do not express the user's search interest, so the expansion fails to improve the relevance of the retrieved results.

3.2.2 WordNet
WordNet is a lexical semantic database that provides tremendous support for information retrieval. The English WordNet was developed at Princeton University to model the lexical knowledge of a native speaker of English [1]. It is the base for the WordNets of different languages around the world. WordNet is organized around the notion of sets of synonyms (synsets), which group words with the same meaning. These synsets are connected by different relations. The hypernymy/hyponymy relation (is-a relation) is the principal relation and creates a hierarchical structure. There are also meronymy/holonymy relations (part-of relation). WordNet is divided by type of word into nouns, verbs, adjectives and adverbs (i.e., parts of speech (POS); some parts of speech, such as the stop-list categories of articles, prepositions and conjunctions, are not useful in building an index). Query expansion terms are selected from the correct synset for each key term of the root query. The basic problem is that the number of synonyms for each word is excessive, which causes poor performance. To overcome this problem, we use the word sense frequency measure of the synset [9]. Basically, two approaches are used to address this problem: the Boolean model and the Vector Space Model (VSM). In the Boolean model, the added terms are put into disjunction with the original query terms. For instance, if t is a term in the original Boolean query and t1 is a term related to it, then t1 is put into disjunction with t in the new query. In some cases, the added term is assigned a weight equal to that of the original term t; thus, t is replaced by (t ∨ t1). In other cases, the added term is assigned lesser importance, so t is replaced by (t ∨ t1^a), where a ≤ 1. During evaluation, a acts as a multiplication factor: if a document's similarity to t1 is v, then its similarity to t1^a is (a*v).

3.2.2.1 Synset
A synset is formed by grouping the synonyms that share the same meaning for a word [11]. Each synset represents one sense. Associated with each synset of a term t are a definition and a frequency measure that indicates the extent to which t is used in this sense.
Example 3: The synsets of “బిడ్డ [biDDa] / శిశువు [SiSuvu] / baby”, in order of frequency, are as follows:
1) { బిడ్డ [biDDa] / baby → పాప [pApa] / కూతురు [kUturu] / baby girl / daughter }
2) { బిడ్డ [biDDa] / baby, బాబు [bAbu] / కొడుకు [koDuku] / baby boy / son }
The sense of the term “బిడ్డ [biDDa] / baby” varies with the context of usage. The noun “బిడ్డ [biDDa] / baby” is more likely to be used in the sense of “పాప [pApa] / baby girl” or “బాబు [bAbu] / baby boy” than in the sense of “కొడుకు [koDuku] / son” or “కూతురు [kUturu] / daughter”. The synsets in WordNet are linked to each other through different relations, including hyponymy, part-of and member-of. Whenever the sense of a given term is determined to be the synset S, its synonyms, words or phrases from its definition, its hyponyms and compound words of the given term are considered for possible addition to the query. The similarity between words w1 and w2 can be defined using the shortest path from each sense of w1 to each sense of w2 [35]:
sim(w1, w2) = max [-log(N / 2D)]
where N is the number of nodes on a path from w1 to w2 and D is the maximum depth of the taxonomy. Sometimes a concept is subsumed by other concepts; in such cases, the similarity can be measured as the probability of the concept, derived from relative frequencies in a document collection:
p(c) = f(c) / N
where p(c), the probability of a concept c, is defined as the ratio between the frequency of the concept c and the number of key terms N. The final similarity score is defined as the sum of the path-based similarity and the content-based similarity.
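As a worked illustration of the two measures, the sketch below computes the path-based score from a list of path lengths (one per sense pair) and adds the concept probability. It assumes that the content-based similarity is taken directly as p(c), as the text states, and that the path lengths and frequencies are supplied by the caller; all function names are hypothetical.

```python
import math


def path_similarity(path_lengths, max_depth):
    """sim(w1, w2) = max over sense pairs of -log(N / (2 * D)).

    path_lengths holds N, the number of nodes on the path, for each sense pair;
    max_depth is D, the maximum depth of the taxonomy.
    """
    return max(-math.log(n / (2 * max_depth)) for n in path_lengths)


def concept_probability(concept_frequency, num_key_terms):
    """p(c) = f(c) / N, where N here is the number of key terms."""
    return concept_frequency / num_key_terms


def final_similarity(path_lengths, max_depth, concept_frequency, num_key_terms):
    """Sum of the path-based and the content-based similarity."""
    return (path_similarity(path_lengths, max_depth)
            + concept_probability(concept_frequency, num_key_terms))


# Two sense pairs with paths of 2 and 4 nodes in a taxonomy of depth 8;
# the concept occurs 3 times among 12 key terms (illustrative numbers).
score = final_similarity([2, 4], max_depth=8, concept_frequency=3, num_key_terms=12)
print(round(score, 4))
```

The shortest path gives the largest value of -log(N / 2D), so taking the maximum over sense pairs is equivalent to choosing the closest pair of senses.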
3.2.2.2 Semantic Relations between Synonyms
i) Hypernymy – Hyponymy: This relation represents generalization (hypernymy) and specialization (hyponymy).
ii) Meronymy – Holonymy: This is a part-whole relationship between synonyms.
iii) Antonymy: Opposed meaning is represented by this relation, but it holds between words, not synonyms.
iv) Troponymy: This relation has temporal inclusion, e.g., నిద్ర [nidra] / sleep → గురక [guraka] / snore.
v) Entailment: It represents implication between two verbs.
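Since the framework assigns a weight to each relation type in order to decide which related terms are used first, a hand-crafted semantic network can be pictured as a small typed graph. The sketch below is only an illustration: the weight values, the threshold, the `field` entry and the function name are hypothetical, and the single agriculture/sAgu pair follows the hypernymy/hyponymy example used in this paper.

```python
# Hypothetical weights per relation type; a higher weight means the relation is preferred.
RELATION_WEIGHTS = {
    "synonym": 1.0,
    "hyponym": 0.8,
    "hypernym": 0.6,
    "meronym": 0.5,
    "antonym": 0.0,  # opposed meaning: never used for expansion
}

# Hand-crafted, domain-specific network: term -> list of (related term, relation).
SEMANTIC_NETWORK = {
    "agriculture": [("sAgu", "hyponym"), ("field", "meronym")],
    "sAgu": [("agriculture", "hypernym")],
}


def expansion_terms(term, network, weights, threshold):
    """Return (related term, weight) pairs at or above the threshold, best first."""
    candidates = []
    for related, relation in network.get(term, []):
        weight = weights.get(relation, 0.0)
        if weight >= threshold:
            candidates.append((related, weight))
    candidates.sort(key=lambda pair: (-pair[1], pair[0]))
    return candidates


print(expansion_terms("agriculture", SEMANTIC_NETWORK, RELATION_WEIGHTS, threshold=0.6))
```

Raising the threshold keeps only the strongest relations, which limits the number of added terms and therefore the risk of query drift.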
Owing to the lack of a Telugu WordNet, domain-specific hand-crafted synonyms are used to test the results. In Figure 1, the hypernymy/hyponymy relationship between వ్యవసాయం [agriculture] (hypernym) and సాగు [sAgu] (hyponym) is defined. Different weights are given to these relations to decide which one is used first to expand the terms of a query. All the relations considered for the domain-specific Telugu corpus are given in Table 1. Generating root words is important before applying the Morphological Analyzer [10] or TelMore [34]. A morphological analyzer is the best tool for obtaining the Table 1 relations for Telugu. The semantic relations are generated manually from the category-wise text corpora, and each word is given a weight based on the relationship. Building all the semantic relations between words is a tedious task. A few online thesauri or dictionaries are useful for generating such a list, but it involves a lot of human work. When developing such a list of alternate words for each word of a document, many issues come into the picture. All the words should be given to the morphological analyzer, and the sense of each word has to be identified. A Telugu-to-Telugu dictionary or WordNet is required to generate alternate words with the same meaning.
Figure 1. Synset relation between the terms వ్యవసాయం [agriculture] and సాగు [sAgu].
Table 1. Semantic Relations between types of Synonym. (Unicode encoding is used to represent the Telugu language.)

4. PROPOSED FRAMEWORK
Web users submit their queries through an interface. The query goes through preprocessing steps that divide it into tokens. Because Telugu has no fixed stop-word list, we use a morphological analyzer that performs all the preprocessing steps and resolves ambiguities.
Figure 2. Framework for the proposed system.

4.1 Algorithmic Approach
Input: Initial query by the user.
Output: Top k-ranked relevant documents from the whole corpus.
Procedure:
1. The user enters the initial query Qi through the user interface.
2. Call the pre-processing procedure to tokenize the query and remove stop words, generating the key terms of the input query Qi.
3. Apply the morphological analyzer to generate key-term variants and the different senses of the terms as a set Qi, using tokenization and normalization: Qi = {qw1, qw2, qw3, …, qwn}, where n is the number of keywords in the query q and w indicates a query word.
4. Find the semantic relations between query terms by connecting to the Telugu WordNet (since there is no Telugu WordNet, hand-crafted collections of domain-specific synsets are used) and reweight the relations. Set the threshold values for considering expansion terms.
5. Build term-term relations ti → tj from the items (Ii, Ij), where ti ∈ Ii and tj ∈ Ij; repeat this step for the entire corpus.
6. Measure the similarity and closeness of concepts. Take the N most related words as pre-expansion words and use them in the query q as expansion words. The resulting set is marked as Qe = {ti1, ti2, …, tiN}, where i = 1, 2, 3, …, N; then go to the next step.
7. Reformulate the expanded query as Qe = Qi + Qe, where the resulting Qe is taken as the new query and submitted to the search engine.
8. Use TF-IDF: the semantic relevance of a term t to a web page p is measured by tf(t) * idf(t), where tf(t) is the frequency of t in p and idf(t) is the inverse document frequency of t. Rank the items according to relevance.
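Steps 7 and 8 of the procedure can be sketched as follows. This is a minimal example, not the implementation evaluated here: it assumes tokenized documents, takes idf(t) as log(N / df(t)), which is one common form that the procedure does not specify, and uses hypothetical names and a toy corpus.

```python
import math
from collections import Counter


def reformulate(initial_terms, expansion_terms):
    """Step 7: Qe = Qi + Qe, keeping the original order and dropping duplicates."""
    merged, seen = [], set()
    for term in list(initial_terms) + list(expansion_terms):
        if term not in seen:
            seen.add(term)
            merged.append(term)
    return merged


def idf(term, documents):
    """Inverse document frequency, assumed here to be log(N / df)."""
    df = sum(1 for tokens in documents.values() if term in tokens)
    return math.log(len(documents) / df) if df else 0.0


def rank(query_terms, documents):
    """Step 8: score each document by the sum of tf(t) * idf(t) over the query terms."""
    scores = {}
    for doc_id, tokens in documents.items():
        tf = Counter(tokens)
        scores[doc_id] = sum(tf[term] * idf(term, documents) for term in query_terms)
    # Highest score first; ties are broken by document id.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


docs = {
    "d1": ["river", "bank"],
    "d2": ["river", "water", "water"],
    "d3": ["song"],
}
initial_ranking = rank(["river"], docs)
expanded_query = reformulate(["river"], ["water", "river"])
expanded_ranking = rank(expanded_query, docs)
print(initial_ranking, expanded_query, expanded_ranking)
```

In this toy corpus the expansion term moves d2 above d1, which shows how added terms change the ranking even when the set of matching documents stays the same.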
5. RESULTS ANALYSIS
A category-wise document collection is taken as the input corpus. The categories in the input corpus are కథలు [kathalu] / stories, నదులు [nadulu] / rivers, పాటలు [pAtalu] / songs, రాజకీయాలు [rAjakIyAlu] / politics, సాహిత్యం [sAhityaM] / poetry, శాస్త్రీయం [SAstrIyaM] / science, and ఆటలు [ATalu] / sports. Statistical analysis is used to find the frequency of words in the text corpus against thresholds from 1 to 15 documents. The category-wise word frequency is shown in Table 2. Because Telugu is highly inflected, it is very difficult to preprocess the query to find key terms. N-gram and stemming techniques are used to find the root words of the search query. The N-gram technique gives high precision and low recall. The sense of the query cannot be measured with N-gram techniques, so most items are identified as relevant to the search. An alternative to the N-gram technique is stemming, but stemming for Telugu is not as simple as for other languages. A morphological analyzer [10] is used to solve this problem for Telugu.
Table 2. Category wise word frequency.
Figure 3. Category wise word frequency graph.
All the documents are indexed using Inverse Document Frequency. The complete synset collection has to be built for the domain-specific text corpus. Using semantic relationships to expand the query gives a better response on the limited Telugu corpus. In the poetry category, precision increased along with recall; this is due to the limited document collection, which has a more closely related vocabulary. The results improved in terms of recall, but further investigation is needed that combines all the categories into a single repository and uses a complete or semi-complete knowledge base, i.e., a synset collection or WordNet.
6. CONCLUSION AND FUTURE WORK
This is an ad hoc attempt at a Telugu Information Retrieval system. N-gram techniques are not a good fit for Telugu. Because Telugu is highly inflected, it is difficult to build stemmers. However, a few attempts have been made to test Information Retrieval performance using stemmers (Named Entity Recognizers). A morphological analyzer is an alternative to Named Entity Recognizers, with which stemming, tagging and variants can be found together for further processing. We found positive results in recall, but we were disappointed with precision. We plan to continue this research and test on different Indian-language WordNets. The proposed framework has to be tested with different Telugu pre-processing tools on a standard corpus.
7. ACKNOWLEDGMENT I thank Prof.K.N.Murthy and Mr.B.Srinivas, Research Scholar, CIS, HCU for their support and guidelines to test the proposed system on resources available at HCU, Hyderabad. I also thank the Management of CMRCET, Hyderabad for providing financial support to attend the conference and present the work.
International Conference on Advances in Computing, Communications and Informatics (ICACCI-2012)
8. REFERENCES
[1] C. Fellbaum, WordNet: An Electronic Lexical Database, The MIT Press, Cambridge, London, England.
[2] Chris Buckley, Gerard Salton, James Allan, and Amit Singhal. Automatic query expansion using SMART: TREC 3. In TREC, 1994b.
[3] Current research issues and trends in non-English Web searching, Journal of IR, Vol 12, Issue 3.
[4] Dimple Patel and Devika P. Madalli, Information Retrieval in Indian Languages: A Case Study of Plural Resolution in Telugu Language, ICSD.
[5] Ingrid Zukerman, Bhavani Raskutti, Query Expansion and Query Reduction in Document Retrieval, Research Paper, and Australian Research Council grant DP0209565.
[6] Jian-Yun Nie, Query Expansion and Query Translation as Logical Inference.
[7] Jinxi Xu, Croft, Query Expansion using Local and Global Document Analysis.
[8] Kolikipogu Ramakrishna, B. Padmaja Rani, Information Retrieval in Indian Languages: Query Expansion models for Telugu language as a case study, IEEE IITA 2010, China.
[9] Kolikipogu Ramakrishna, B. Padmaja Rani, WordNet based Term Selection for PRF Query Expansion, ICCMS, Jan 2011, Vol 1, pp. 127-131.
[10] Prasad Pingali, Jagadeesh Jagarlamudi, A Dictionary Based Approach with Query Expansion to Cross Language Query Based Multi-Document Summarization: Experiments in Telugu – English, National Workshop on Artificial Intelligence, Mumbai, India.
[11] Reza Hemayati, Weiyi Meng and Clement Yu, Semantic-Based Grouping of Search Engine Results Using WordNet, Advances in Data and Web Management.
[12] Roberto Navigli and Paola Velardi, An Analysis of Ontology-based Query Expansion Strategies.
[13] Rocchio, J. Relevance feedback in information retrieval. In The Smart Retrieval System: Experiments in Automatic Document Processing, G. Salton, Ed. Prentice-Hall, Englewood Cliffs, NJ, 313–323.
[14] William R. Hersh and David Hickam. Information retrieval in medicine: The SAPHIRE experience. Journal of the American Society for Information Science, 46:743–747.
[15] Xiaoyun Wang, User Ontology and Word Sense Disambiguation for Query Expansion, ICCASM.
[16] Yang Xu, Gareth J. F. Jones, Query Dependent Pseudo Relevance Feedback based on Wikipedia, SIGIR’09.
[17] Yiming Yang and C. G. Chute. Words or concepts: the features of indexing units and their optimal use in information retrieval. In Proceedings of the Annual Symposium on Computer Application in Medical Care, pages 685–689, (1993).
[18] Zhiguo Gong, Chan Wa Cheang, and Leong Hou U, Web Query Expansion by WordNet, LNCS 3588, pp. 166–175.
[19] V. Uren et al., “Semantic annotation for knowledge management: Requirements and a survey of the state of the art,” Web Semantics: Science, Services and Agents on the World Wide Web, vol. 4(1), pp. 14-28.
[20] M. Fernández et al., “Semantically enhanced Information Retrieval: An ontology-based approach,” Web Semantics: Science, Services and Agents on the World Wide Web, in press.
[21] A. Kiryakov, B. Popov, I. Terziev, D. Manov, D. Ognyanoff, “Semantic annotation, indexing, and retrieval,” Web Semantics: Science, Services and Agents on the World Wide Web, vol. 2(1), pp. 49-79.
[22] A. M. Rinaldi, “An ontology-driven approach for semantic information retrieval on the Web,” ACM Transactions on Internet Technology (TOIT), vol. 9(3), article no. 10.
[23] Xing Jiang, A. H. Tan, “Learning and inferencing in user ontology for personalized Semantic Web search,” Information Sciences, vol. 179(16), pp. 2794-2808.
[24] Davide Buscaldi, Paolo Rosso, Emilio Sanchis Arnal, A WordNet-based Query Expansion method for Geographical Information Retrieval, August 31, 2005.
[25] Atanas Kiryakov, Borislav Popov, Damyan Ognyanoff, Dimitar Manov, Angel Kirilov, Miroslav Goranov, Semantic Annotation, Indexing, and Retrieval, ISWC, 2004.
[26] Jinxi Xu and W. Bruce Croft, Query Expansion Using Local and Global Document Analysis.
[27] Deerwester, S., Dumais, S., Furnas, G., Landauer, T., & Harshman, R. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41, 391-407.
[28] Jinxi Xu and W. Bruce Croft, Improving the Effectiveness of Information Retrieval with Local Context Analysis, SIGIR96.
[29] Relevance feedback and query expansion, DRAFT! © April 1, 2009 Cambridge University Press. Feedback welcome.
[30] F. A. Grootjen, Th. P. van der Weide, Conceptual Query Expansion, Preprint submitted to Data & Knowledge Engineering, 28 February 2005.
[31] R. Mandala, T. Tokunaga, H. Tanaka, Combining multiple evidence from different types of thesaurus for query expansion, in: SIGIR ’99: Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, August 15-19, 1999, Berkeley, CA, USA, ACM, 1999, pp. 191–197.
[32] E. Fox, Lexical relations enhancing effectiveness of information retrieval systems, SIGIR Forum 26 (5) 629–640.
[33] E. Voorhees, Query expansion using lexical-semantic relations, in: SIGIR ’94: Proceedings of the 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM, 1994, pp. 61–69.
[34] M. Ganapathiraju, TelMore: Morphological Generator for Telugu Nouns and Verbs, www.ulib.org/conference/2006/7.pdf.
[35] Resnik, P., Using Information Content to Evaluate Semantic Similarity in Taxonomy, in Proceedings of the 14th International Joint Conference on Artificial Intelligence (IJCAI-95), pp. 448–453.