International Conference on Electrical, Electronics, and Optimization Techniques (ICEEOT) - 2016
Bilingual Keyword Indexing and Searching Framework
Shashi Pal Singh*1, Ajai Kumar*2, Dr. Hemant Darbari*3, Srishti Gupta#1, Kanika#2
* AAI, Center for Development of Advanced Computing, Pune, India
*1 [email protected]
*2 [email protected]
*3 [email protected]
# Banasthali Vidyapith, Banasthali, India
#1 [email protected]
#2 [email protected]
Abstract— The rapid growth of digital data in medicine and health, social media, education and many other fields has amplified the demand for big data storage, which involves trillions of files holding exabytes of data. This growth raises the key question of how to manage and find data effectively in an emerging ocean of information. The demand for storage at petabyte and exabyte scale also puts pressure on organizing the file structure so that keyword retrieval keeps pace with the growth in stored data. As a result, demand for keyword indexing and searching of file systems is increasing. Implementing search directly in the file system has produced inefficient and inconsistent results. General-purpose indexes may not be suitable for file system searching because they rely on relational databases, which may limit scalability and performance. The proposed bilingual framework for English and Hindi addresses these problems through a novel approach to indexing and searching queries in large-scale file systems.
Keywords—stemming, searching, KWIC (Keyword In Context), KWOC (Keyword Out Of Context), KWAC (Keyword Augmented In Context).

I. INTRODUCTION
Keyword indexing is based on natural language processing of documents to obtain the index entries and vocabulary; parts of a document such as the title, abstract or text are essential to an indexing system for creating indexes. To implement searching over a large set of files, one must extract the textual content and index that text, converting it into a format that allows rapid searching and eliminates the slow process of sequential scanning. This conversion process is called indexing, and the generated output is called an index. Searching is a lookup procedure in an index to find keywords appearing in the documents.

Keyword indexing uses the natural language of the document to create the index entries, so the indexing system requires no vocabulary control. Entries are made for each keyword along with the line numbers in which it is present. Manual indexes (or human-prepared indexes) are not really indexes but concordances. A concordance is an alphabetical list of the keywords found in a document, each with a reference to the passage in which it occurs. During the 1960s, scholars started to automate the creation of concordances using computers, and thus the concept of KWIC (Key Word In Context) arose: the computerization of concordances is known as KWIC [3].

The framework is intended to provide better automatic indexing and searching over any number of files in bilingual file systems. It allows users to search for relevant queries in unstructured data. The user's search query can be a single word, multiple words or even a phrase. For efficient results, stop words are removed from the search query and its keywords are stemmed. The application displays the output in three formats: KWIC (Key Word In Context), KWAC (Key Word Augmented in Context) and KWOC (Key Word Out of Context). The framework output consists of:
a) Keyword: the significant words of the title or abstract, obtained by removing the stop words and the common words.
b) Context: the remaining terms of the line, which together with the keywords specify the context.
c) Identification or location code: the path of the document along with the line numbers, giving the full location of the keyword.
978-1-4673-9939-5/16/$31.00 ©2016 IEEE
In this perspective, keyword indexing and searching will address the following points:
• Indexing and retrieval rely on an efficient extraction process and interactive content search over large as well as small documents, and compete with manual indexing and search methods.
• Results are presented in KWIC, KWAC and KWOC formats for better readability.
• Sometimes the user receives no results for the specified search query, although the query word may have synonyms that can be searched instead. Thus, an English and Hindi synonym search facility is also included.

II. LITERATURE REVIEW
Making an index for a book is a demanding process that is best performed by a trained indexer who understands the subject matter. Indexing is the technique that helps in retrieving a keyword or in checking whether it is present in the documents. During indexing we rely only on information that is present in the document, without trying to add to it from our own knowledge or other sources [2]. This is derived indexing, that is, indexing derived directly from the document. Examples of derived indexing are title-based indexing and citation indexing.

A. Title-Based Indexing: The title is the main part of a document, as it defines what the document consists of. The title is in itself a one-line summary [2] of the document and serves as an index point; hence title indexes came into existence. Examples of title-based indexing are KWIC (Key Word In Context), KWOC (Key Word Out of Context) and KWAC (Key Word Augmented in Context).

B. Citation Indexing: A citation index is a well-ordered list of cited articles, each with a list of the citing articles or notes. The cited article acts as the reference and the citing article as the source. The index is prepared [2] by using the association of ideas that exists between the cited articles and the citing articles. Citation indexes have proved better than the other indexes and can be prepared without much difficulty.

C. Improved querying tools: Trained indexers have provided a list of words which describe the subject or title of the textual document, which also includes providing synonyms of words describing the subject.
i. Phrase search: A search that matches a combination of more than one word as an exact phrase, such as "Indexing and Searching".
ii. Concept search: A search based on multi-word concepts, for example the processing of compound terms. This type of search is becoming popular in many e-Discovery solutions [3].
iii. Concordance search: A concordance search produces an alphabetical list of all essential words, i.e. the keywords that occur in a text, with their immediate context [3].
iv. Proximity search: A search that matches only those documents containing two or more words that are separated by a specified number of words [3].
v. Fuzzy search: A search for documents that match the given terms and some variation around them.
vi. Wildcard search: A search in which one or more characters of the search query are replaced by a wildcard character such as an asterisk [3]. For example, the query "p*n" will find "pin", "pan", "pen", etc. in a text.

Keyword In Context (KWIC)
A KWIC index makes an entry under each significant word in the title and text, along with the remaining part, i.e. the context. The entries are derived by using the terms one by one as the lead term, along with the entire context for each entry.
Example: Computer Libraries in India (with identification code 12).
Computer Libraries in India        12
Libraries in India / Computer      12
India / Computer Libraries in      12

Keyword Out of Context (KWOC)
In the KWOC system, the keyword or access point is shifted to the extreme left, at its normal place at the beginning of the line. It is followed by the complete title to provide the full context. The keyword and the context are written on the same line [2].
Example- Title: Computer Libraries in India
FORMAT
Computer     Computer libraries in India     12
India        Computer libraries in India     12
Libraries    Computer libraries in India     12
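The KWIC and KWOC layouts described above can be illustrated with a short Python sketch. It is only an illustration under stated assumptions, not the framework's implementation: the stop-word list, the function names `kwic_entries` and `kwoc_entries`, and the " / " marker showing where a rotated title wraps around are all hypothetical choices made here.

```python
# Stop-word list is an assumption made for this sketch.
STOP_WORDS = {"a", "an", "and", "in", "of", "the"}


def kwic_entries(title, code):
    """One entry per significant word: the title rotated so that word leads."""
    words = title.split()
    entries = []
    for i, word in enumerate(words):
        if word.lower() in STOP_WORDS:
            continue
        rotated = " ".join(words[i:])
        if i > 0:
            # " / " marks where the original title wraps around (assumed notation).
            rotated += " / " + " ".join(words[:i])
        entries.append((rotated, code))
    return entries


def kwoc_entries(title, code):
    """One entry per significant word: keyword on the left, full title as context."""
    keywords = sorted(
        {w for w in title.split() if w.lower() not in STOP_WORDS},
        key=str.lower,
    )
    return [(keyword, title, code) for keyword in keywords]


if __name__ == "__main__":
    for text, code in kwic_entries("Computer Libraries in India", 12):
        print(f"{text:<32}{code}")
    for keyword, context, code in kwoc_entries("Computer Libraries in India", 12):
        print(f"{keyword:<12}{context:<30}{code}")
```

Both functions skip stop words, so "in" never becomes a lead term; a KWAC variant would additionally append significant words taken from the abstract or contents before generating the entries.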
Keyword Augmented In Context (KWAC)
The KWAC system provides for the enrichment of the keywords of the title with additional significant words taken either from the abstract of the document or from its contents [2].
Example- Title: Computer Libraries in India
FORMAT
Computer libraries in India     12
Libraries in India              12

III. PROPOSED SYSTEM
"Keyword Indexing and Searching" provides an efficient technique for searching a query in a large number of documents held in a directory or in a database. This bilingual searching tool has an important role in Natural Language Processing, as it supports both text retrieval and text extraction. Searching is applied to the indexes built from the text extracted from documents of different formats.

Architecture: The architecture supports searching either in a file system or in a database. It consists of three modules: an input module, a searching module and an output module.

The input module handles the user's chosen preferences. If files are chosen, the user can upload multiple files to search; if the database is chosen, a connection to the database is made and the content is fetched for searching. The database case is otherwise similar, except that it involves no uploading and the content is fetched directly from the database.

The searching module contains the main components of the framework. All uploaded files of the supported formats (.docx, .pdf, .pptx) are extracted and converted to text documents, and searching is performed on the indexes generated for each document for faster retrieval of results. The indexing approach used is the inverted index. For better results, the query is analysed to remove stop words and to stem its keywords. One more module is added for searching synonyms when no results are found for the query.

The output module displays the results in all three formats, namely KWIC, KWAC and KWOC. In addition, the user can open a file from the results, and its content is shown with the searched keywords highlighted. In this way the user gets complete information about the query.

Fig.1: Framework Architecture (blocks: input block, searching block, output block; labelled steps: input in English/Hindi, search from uploaded files or database (DB), input search query, extraction of text from files, stop words removal, stemming (Hindi/English), indexing of textual data, searching the query, synonym searching, output in KWIC/KWAC/KWOC format)
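The flow through the three modules, including the synonym step that is offered when a query returns nothing, might be sketched as follows. This is a minimal Python sketch under stated assumptions: the stop-word list, the synonym table and all function names are hypothetical, and a plain line-by-line scan stands in for the index lookup that the framework performs.

```python
import string

# All tables below are illustrative assumptions, not the paper's data.
STOP_WORDS = {"a", "an", "and", "in", "is", "of", "the"}
SYNONYMS = {"quick": ["fast", "rapid"]}
# ASCII punctuation plus the Devanagari danda, so Hindi sentences tokenize cleanly.
PUNCTUATION = string.punctuation + "\u0964"


def tokens(text):
    """Lower-case whitespace tokens with surrounding punctuation stripped."""
    stripped = (t.strip(PUNCTUATION).lower() for t in text.split())
    return [t for t in stripped if t]


def clean_query(query):
    """Input-side processing: drop stop words from the user's query."""
    return [t for t in tokens(query) if t not in STOP_WORDS]


def find_lines(keywords, documents):
    """Return (document, line number, line) for every line containing a keyword."""
    hits = []
    for name, text in documents.items():
        for line_no, line in enumerate(text.splitlines(), start=1):
            line_tokens = tokens(line)
            if any(k in line_tokens for k in keywords):
                hits.append((name, line_no, line))
    return hits


def search(query, documents):
    """Return (hits, synonym suggestions); suggestions are offered only on a miss."""
    keywords = clean_query(query)
    hits = find_lines(keywords, documents)
    if hits:
        return hits, []
    suggestions = [s for k in keywords for s in SYNONYMS.get(k, [])]
    return [], suggestions
```

With a document `{"notes.txt": "Indexing makes retrieval fast."}`, the query "the quick" finds nothing and returns the suggestions `["fast", "rapid"]`; the user would then pick one of them, and searching "fast" returns line 1 of `notes.txt`.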
A. Input Module: Upload multiple files or connect to the database and search
File uploading is done using a third-party API, Commons.IO, which provides a multiple-file uploading facility. For searching in the database, a connection is made with the database and the content is fetched. The user enters the query to search in the uploaded files or in the content of the database.

B. Removal of stop words and stemming
Stop words are removed from the user's query, and the query is checked for validity. The query is then passed to the search operation. Stemming the keywords present in the search query helps in searching for all the words that start with the stem of the keyword.
i. Performing stemming of English keywords for more efficient searching.
For example: Accelerator → remove "or" from the ending → Accelerat
Thus, the application will also search for all the words starting with "accelerat" in the uploaded files or in the database. For example, it will search for:
• Accelerate
• Accelerates
• Accelerated
• Accelerative
• Accelerating
• Acceleration, etc.
TABLE.1 SOME ENGLISH STEMMING RULES APPLIED ON KEYWORDS
Word          Ends With   Remove Ending   Recode Ending
Believes      Es          Believ          Belief
Consumption   Ion         Consumpt        Consum
Traceable     Able        Trace           Trace
Accountancy   Ancy        Account         Account
Intoxicants   Ants        Intoxic         Intoxic
Problematic   Atic        Problem         Problem
Unlimitedly   Edly        Unlimit         Unlimit
Enlightened   Ened        Enlight         Enlight
Rubbing       Ing         Rubb            Rub
Egyptians     Ians        Egypt           Egypt
Gaseous       Eous        Gas             Gas
Backward      Ward        Back            Back
Disgraceful   Ful         Disgrace        Disgrace
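The English rules of Table 1 can be approximated by a small suffix-stripping routine. The Python sketch below is one possible reading of the table, not the framework's stemmer: the longest-match-first ordering, the minimum stem length, the recode lookup and the consonant undoubling rule (which would also shorten stems such as "call") are assumptions, and the names `stem` and `expand` are hypothetical.

```python
# Suffixes come from Table 1 plus the "or" ending of the Accelerator example.
SUFFIXES = ["able", "ancy", "ants", "atic", "edly", "ened", "eous", "ians",
            "ward", "ful", "ing", "ion", "es", "or"]
# Recoding for the two rows of Table 1 whose stem changes (assumed to be a lookup).
RECODE = {"believ": "belief", "consumpt": "consum"}
MIN_STEM_LENGTH = 3  # assumption: never strip a word below three letters


def stem(word):
    """Strip the longest matching suffix, then recode the stem if a rule applies."""
    w = word.lower()
    for suffix in sorted(SUFFIXES, key=len, reverse=True):  # longest match first
        if w.endswith(suffix) and len(w) - len(suffix) >= MIN_STEM_LENGTH:
            w = w[: -len(suffix)]
            # Undouble a final consonant left by the suffix, e.g. "rubb" -> "rub".
            if w[-1] == w[-2] and w[-1] not in "aeiou":
                w = w[:-1]
            break
    return RECODE.get(w, w)


def expand(query_word, vocabulary):
    """Return every vocabulary word that starts with the stem of query_word."""
    root = stem(query_word)
    return [v for v in vocabulary if v.lower().startswith(root)]
```

For instance, `expand("Accelerator", ["Accelerate", "Accelerating", "Access"])` keeps the first two words and drops "Access", which mirrors the prefix search on "accelerat" described above.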
ii. Performing stemming of Hindi keywords for more efficient searching.
For example:
] ȡȢ → the suffix "ʜȢ" will be removed and the result will be → ] ȡ
ĤȪ× ȡ¡ → the suffix "ʜȡ¡" will be removed and the result will be → ĤȪ× ȡ¡

TABLE.2 SOME HINDI STEMMING RULES APPLIED ON KEYWORDS
Word     Ends With   Remove Suffix
ȡȯȡ      ʜȯȡ         ȡ
ȡȯȢ      ʜȯȢ         ȡ
ȡɅ ȯ     ʜȯʜȲȯ       ȡ

C. Conversion of uploaded files into textual format
To index the uploaded files or the content of the database, the textual information must first be extracted from documents of different formats and saved in text documents. All the keywords present in the uploaded documents are then indexed.

Fig.2: Conversion into textual format (components shown: PDF, MS Word, PPT, Parser, Analysis, Index)

D. Indexing of the textual data
Analysis is the process of converting the text data into the fundamental unit of searching, called a term. During analysis, the text data goes through multiple operations: extracting the words, removing common words, ignoring punctuation, changing words to lowercase, etc. Analysis happens just before indexing and query parsing. Analysis converts the text data into tokens, and these tokens are added as terms in the index [1].

E. Creating indexes of textual data
Once the text has been extracted from the input documents of different formats, it is ready for indexing. Indexing is the process of converting text data into a format that facilitates fast searching. A simple example is the index found at the end of a book: that index points to the locations of the topics that appear in the book. Indexing lets users perform fast keyword lookups and find the documents that match a given query.

F. Searching of keywords or phrases in indexed data
Searching is the process of looking up words in the index and finding the documents that contain those words.
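Analysis, inverted indexing and lookup, as described in subsections D to F, can be sketched in a few lines of Python. This is a plain-Python stand-in under stated assumptions rather than the framework's code: the stop-word list and all function names are hypothetical, postings are kept per line because the paper reports line numbers, and the ordering in `rank_documents` is one reading of the ranking rule the paper gives in subsection G (most distinct keywords first, then most matching lines).

```python
import string
from collections import Counter, defaultdict

STOP_WORDS = {"a", "an", "and", "are", "in", "is", "of", "on", "the"}  # assumed list
# ASCII punctuation plus the Devanagari danda, so Hindi text is handled as well.
PUNCTUATION = string.punctuation + "\u0964"


def analyze(text):
    """Split on whitespace, strip punctuation, lower-case and drop stop words."""
    stripped = (t.strip(PUNCTUATION).lower() for t in text.split())
    return [t for t in stripped if t and t not in STOP_WORDS]


def build_index(documents):
    """Inverted index: term -> list of (document name, line number) postings."""
    index = defaultdict(list)
    for name, text in documents.items():
        for line_no, line in enumerate(text.splitlines(), start=1):
            for term in set(analyze(line)):  # one posting per term per line
                index[term].append((name, line_no))
    return index


def search(index, query):
    """Look up each analysed query term; terms absent from the index are skipped."""
    return {t: index[t] for t in analyze(query) if t in index}


def rank_documents(hits):
    """Order documents by distinct keywords matched, then by matching lines."""
    matched = defaultdict(set)
    line_hits = Counter()
    for term, postings in hits.items():
        for name, _line_no in postings:
            matched[name].add(term)
            line_hits[name] += 1
    return sorted(matched, key=lambda n: (-len(matched[n]), -line_hits[n], n))
```

Because the query passes through the same `analyze` function as the documents, a query such as "the Search" matches lines containing "search" without any sequential scan of the files.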
Results will contain:
• the name of the document
• the words present in the textual data
• the line in which the words are present
• the line number

G. Ranking
The search method returns an ordered collection of documents, ranked by sorting. The document that contains the largest number of keywords, with the largest number of keyword occurrences, is ranked first.

H. Searching the synonyms of the keyword
If the search returns no results, or none of the documents matches the query, the user is offered a facility to search for synonyms of the keyword. The user is given a list of synonyms of the searched keywords, and clicking any of the synonyms starts a new search.

I. Output Module: Displaying search results in KWIC, KWAC and KWOC formats
After the lines containing the keywords are fetched from the index, they are displayed to the user, along with the document name and line number, in KWIC, KWAC and KWOC formats. A keyword-in-context index is a list of keywords displayed with their surrounding text (context). With a conventional index, the reader must examine the listed pages until he finds the particular reference he seeks. The keyword-in-context index makes it easy to find the desired reference by including the surrounding text as well as the keywords. For readability, it is formatted so that the keywords align in a vertical column.

IV. RESULT
Searching is performed on both Hindi and English documents, and both were tested. Unit testing was performed by recording the time taken by the extraction module and the searching module separately; system testing was then performed by recording the time taken by the extraction and searching process together. Sequential scanning was compared with the indexing and searching mechanism of this framework: sequential scanning of a 40 MB file took approximately 20 min, while this framework took about 70 s including the extraction and searching time, roughly a 17-fold reduction, which makes the process more efficient.

Testing in English Documents
Fig.3: Testing of English Documents

Testing in Hindi Documents
Fig.4: Testing of Hindi Documents
V. CHALLENGES
i. A more efficient extraction process could be included to improve the indexing and searching mechanism.
ii. The framework could be improved by applying a sentence segmenter directly to the files rather than after fetching their content.
iii. A larger and richer database of English synonyms could be developed, and a database of Hindi synonyms could also be developed.
iv. The application could be improved by supporting search in other languages (i.e. by making it multilingual).
VI. CONCLUSION
Keyword indexing and searching plays a major role in Natural Language Processing (NLP) for text retrieval. It works along with extraction to retrieve the search results from the text extracted from files of different formats (.pdf, .docx, .pptx, etc.), and it also supports a bilingual searching mechanism, which adds to its role in the NLP field. In this framework:
i. An efficient indexing algorithm is used for the searching process, resulting in faster retrieval of the results.
ii. Removal of stop words and stemming enhance the searching process.
iii. Documents are ranked to provide the best possible results, and results are displayed in KWIC, KWAC and KWOC formats.
Synonym fetching helps, to some extent, with the NLP problem of unmatched queries, as the user can choose the query to be searched from a displayed list of synonyms.

REFERENCES
[1] Erik Hatcher, foreword by Doug Cutting, "Lucene in Action", Manning Publications Co., Greenwich, CT, USA, 2004. http://www.geocities.ws/salman_mlisc/dissertation/chap3.htm
[2] https://en.wikipedia.org/wiki/Full_text_search
[3] http://www.janda.org/workshop/content%20analysis/kwic.htm
[4] Erik Hatcher, foreword by Doug Cutting, "Lucene in Action", Manning Publications Co., Greenwich, CT, USA, 2004. "KWIC Index of Computer Programs", Data Processing, Institute for Social Research, University of Michigan, July 1965.
[5] H.Y. Mahakuteshwar, "Altered Keyword in Context (AKWIC) Indexing", Annals of Library Science and Documentation, 1980.
[6] Robert Fugmann, "The Five-Axiom Theory of Indexing and Information Supply", A Wiley Company, 1985.
[7] Mika Kaki, "fKWIC: Frequency Based Keyword-in-Context Index for Filtering Web Search Results", Department of Computer Sciences, University of Tampere, Finland, 2005.
[8] Ambesh Negi, Mayur Bhirud, Dr. Suresh Jain and Mr. Amit Mittal, "Index Based Information Retrieval System", International Journal of Modern Engineering Research (IJMER), Vol. 2, Issue 3, May-June 2012, pp. 945-948.
[9] Olayinka Silas Akinwumi, "Indexing and abstracting services in libraries: A legal perspective", International Journal of Academic Library and Information Science, Vol. 1(1), pp. 1-9, August 2013.
[10] Mohammad Reza Falahati Qadimi Fumani, "Key Word versus Key Phrase in Manual Indexing", Regional Information Center for Science and Technology, Shiraz, Iran.