UNIVERSITY OF GONDAR FACULTY OF NATURAL AND COMPUTATIONAL SCIENCES DEPARTMENT OF INFORMATION TECHNOLOGY
APPLYING THESAURUS BASED SEMANTIC COMPRESSION FOR IMPROVING THE PERFORMANCE OF AMHARIC TEXT RETRIEVAL
A THESIS SUBMITTED TO THE DEPARTMENT OF INFORMATION TECHNOLOGY IN PARTIAL FULFILMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE IN INFORMATION TECHNOLOGY
BY
TEWODROS ABEBAW CHEKOL
Advisor: Million Meshesha (PhD)
June 2014
ABSTRACT
Information retrieval is a mechanism that enables finding relevant material of an unstructured nature that satisfies a user's information need from a large collection. Since there are usually many ways to express the same concept, the terms in the user's query may not appear in a relevant document. Conversely, many words have more than one meaning, which may confuse the retrieval system. This research applies semantic compression techniques to handle related words in documents and in users' queries. The Amharic text retrieval system developed in this study has indexing and searching subsystems. While indexing organizes index terms, searching matches query terms with index terms in order to retrieve relevant documents. For this study an Amharic text document corpus was prepared by the researcher, encompassing news articles, books and Amharic websites. Various text preprocessing techniques, including tokenization, normalization, stop word removal and stemming, were used to identify content-bearing words. Once content-bearing terms are identified, a semantic compression technique is applied to identify a semantic representation that is a more precise descriptor for each term. To achieve this aim, a thesaurus-based inverted indexing scheme is used for semantic compression of terms during indexing. On the searching side, a thesaurus-based query expansion technique is likewise employed. Experimental results show that the system registered on average 70.75 percent precision and 77.13 percent recall. The major challenges that affect the performance of the IR system include the lack of a standard thesaurus for the Amharic language and the ineffectiveness of the Amharic stemmer in conflating Amharic inflectional words into their stems. Therefore, in order to improve the performance of the system, there is a need to develop an effective Amharic thesaurus as well as an effective Amharic stemmer.
Contents

ACKNOWLEDGEMENT
DEDICATION
ABSTRACT
LIST OF TABLES
LIST OF FIGURES
LIST OF ALGORITHMS
LIST OF APPENDICES

CHAPTER ONE: INTRODUCTION
1.1 BACKGROUND
1.2 STATEMENT OF THE PROBLEM AND JUSTIFICATION
1.3 OBJECTIVES OF THE STUDY
    1.3.1 General Objective
    1.3.2 Specific Objectives
1.4 SCOPE AND LIMITATIONS OF THE STUDY
1.5 METHODOLOGIES
    1.5.1 Literature Review
    1.5.2 Dataset Collection
    1.5.3 Approaches and Programming Tools
    1.5.4 Testing Procedure
1.6 SIGNIFICANCE OF THE STUDY
1.7 ORGANIZATION OF THE THESIS

CHAPTER TWO: LITERATURE REVIEW
2.1 AMHARIC LANGUAGE
    2.1.1 The Amharic Writing System
    2.1.2 Amharic Numeric System
    2.1.3 Amharic Punctuation Marks
2.2 AMHARIC SCRIPT FEATURES
2.3 CHALLENGES OF THE AMHARIC WRITING SYSTEM
    2.3.1 Redundant Symbols
    2.3.2 Features of Synonymous and Polysemous Words
    2.3.3 Feature Similarity Among Characters
    2.3.4 Existence of Irregular Spelling
    2.3.5 Problems of Transliteration
    2.3.6 Formation of Compound Words
2.4 OVERVIEW OF INFORMATION RETRIEVAL
2.5 INDEXING
    2.5.1 Linguistic Preprocessing
    2.5.2 Document Representation and Term Weighting
2.6 RETRIEVAL MODELS
    2.6.1 Standard Boolean Model
    2.6.2 Statistical Model
        2.6.2.1 Vector Space Model
        2.6.2.2 Probabilistic Model
    2.6.3 Linguistic and Knowledge-based Approaches
2.7 QUERY
2.8 SEMANTIC COMPRESSION
2.9 APPROACHES TO SEMANTIC COMPRESSION
    2.9.1 Ontology-based Approach
    2.9.2 Frequency-based Approach
        2.9.2.1 Term Frequency Approach
        2.9.2.2 Term Frequency–Inverse Document Frequency
        2.9.2.3 N-gram-based Approach
        2.9.2.4 Thesaurus-based Approach
2.10 REVIEW OF RELATED WORKS

CHAPTER THREE: METHODS AND TECHNIQUES
3.1 ARCHITECTURE OF SEMANTIC COMPRESSION
3.2 DATASET PREPARATION AND DOCUMENT PRE-PROCESSING
    3.2.1 Tokenization
    3.2.2 Normalization
    3.2.3 Dropping Common Terms
    3.2.4 Stemming
3.3 TERM WEIGHTING
3.4 SEMANTIC COMPRESSION
3.5 INVERTED INDEX
3.6 EVALUATION TECHNIQUE

CHAPTER FOUR: EXPERIMENTATION AND DISCUSSION
4.1 CONSTRUCTION OF THE THESAURUS
4.2 INDEX CONSTRUCTION
4.3 STATISTICAL CO-OCCURRENCE BASED SEMANTIC COMPRESSION
4.4 THESAURUS-BASED SEMANTIC COMPRESSION
4.5 EXPERIMENTATION AND EVALUATION

CHAPTER FIVE: CONCLUSIONS AND RECOMMENDATIONS
5.1 CONCLUSIONS
5.2 RECOMMENDATIONS

REFERENCES
DECLARATION
LIST OF TABLES
Table 2.1 Sample orders of Amharic scripts
Table 2.2 Method of order formation in the Amharic writing system
Table 2.3 Ethiopic numerals
Table 2.4 The same sound for the first and fourth order alphabets
Table 2.5 Different alphabets that share the same sound
Table 3.2 Sample list of stop words
Table 4.1 Amharic synonymous terms and their representative key terms
Table 4.2 Sample raw term-by-document matrix using statistical co-occurrence based indexing
Table 4.3 Sample term-by-document matrix using thesaurus-based indexing
Table 4.4 Performance comparison of Amharic text retrieval with and without semantic compression
LIST OF FIGURES
Figure 2.1 General architecture of an information retrieval system
Figure 3.1 Design of the prototype semantic compression Amharic text retrieval system
Figure 4.5 Sample screenshot of the Amharic retrieval system retrieving information for a user's query
LIST OF ALGORITHMS
Algorithm 3.2.1 Tokenization
Algorithm 3.2.2 Normalization
Algorithm 3.2.3 Removing stop words
Algorithm 3.2.4 Stemming
Algorithm 3.4 Semantic compression
LIST OF APPENDICES
Appendix I: The Amharic character set
Appendix II: Sample of text documents used for this study
Appendix III: The queries used in the experiment
Appendix IV: Amharic stop words
CHAPTER ONE
INTRODUCTION
1.1 Background
As storage becomes more ample and less expensive, the amount of information retained by businesses and institutions is likely to increase. Searching that information and deriving useful facts, however, will become more awkward unless new techniques are developed to automatically manage the data [1]. The result of this phenomenon is that relevant information gets buried and is never revealed. Therefore, information retrieval systems, which can search large information collections and return only the information relevant to a user's information need, will become more and more important.

Users often describe their information need using a query consisting of a number of words. Information retrieval systems compare the query with the documents in the collection and return the documents that are likely to satisfy the information need. As most information is available in textual form, text retrieval is a critical area of study worldwide [2].

An information retrieval system is a system that stores and manages information on documents and enables users to find the information they need. It returns documents that contain an answer to the user's question rather than an explicit answer to the information need, and most of the time the retrieved documents satisfy the user's information need [3]. Documents which satisfy users' information needs are called relevant documents, whereas documents which do not are irrelevant. In fact, there is no perfect information retrieval system which retrieves all relevant documents and no irrelevant ones [3].

Information retrieval has two main subsystems: indexing and searching. Indexing is an offline process of representing and organizing a large document collection using an indexing structure such as an inverted file, sequential file or signature file, to save storage space and speed up searching. Searching is the process of matching index terms to query terms and returning relevant hits for a user's query. Indexing and searching are interrelated and depend on each other for enhancing effectiveness and efficiency [4].
Nowadays progress has been seen in the development of information retrieval systems. Retrieval systems are moving from classical retrieval models toward intelligent techniques, and commercial search engines are well organized; their performance in both efficiency and effectiveness is very high these days. But there are still challenges in implementing information retrieval systems. Information retrieval is a language-dependent process which needs to integrate knowledge of information retrieval techniques with knowledge of the natural language. Most information retrieval techniques are developed for English, and applying them to other languages is always a difficult task. Another issue is the tradeoff between efficiency and effectiveness in information retrieval system performance: a system should be both effective and efficient, but increasing one often decreases the other, so developing a system that is both is a basic but hard task.

Different types of information are uploaded to the Internet, and text documents are the most widely used; because of this, research in text document retrieval has received close attention in recent years. This has paved the way for a large number of new techniques and systems for text indexing and retrieval, which have been used to build many successful prototype systems. Modern information retrieval systems use a range of statistical and linguistic tools to maximize the effectiveness of searching textual documents [5, 6]. But there are still challenges in the underlying performance of information retrieval systems. We must remember that not only the quality of results but also processing time has a crucial role for the end user. Therefore, in contemporary IR systems based on the vector space model, it is desirable to reduce the number of dimensions (corresponding to recognized concepts) and thereby limit the amount of calculation. One of the main problems that characterizes natural language texts is that they are filled with many words and phrases that have similar meanings, and it is often impossible for users to provide all the words which might be relevant to the query.

Global information access and contextual retrieval are two main themes that have emerged among the long-term challenges of information retrieval [7]. Global information access aims to satisfy human information needs through natural, efficient interaction with an automated system that leverages world-wide structured and unstructured data in any language, whereas contextual retrieval combines technologies and knowledge about the query and the user context into a single framework in order to provide the most appropriate answer for a user's information needs.
Near-term challenges include models that incorporate the evolving information needs of users performing realistic tasks, and the utility of information as well as its topicality [2]; more advanced models; and the generalization of current techniques to new information sources, with models and tools for incorporating multiple sources of evidence. Overall, major challenges in information retrieval come from the tools and techniques we use to perform the analysis: we have not truly retrieved all the information we seek until we have an adequate set of tools to make sense of the data we collect.

Semantic compression is advantageous in information retrieval tasks, improving their effectiveness in terms of both precision and recall. This is due to more precise descriptors (a reduced effect of language diversity, limited language redundancy, a step towards a controlled dictionary) [8]. There are usually many ways to express the same concept, so the terms in the user's query may not appear in a relevant document. Conversely, many words have synonyms, and a word can have more than one meaning. To control such challenges, scholars recommend the application of semantic compression [9].

Semantic compression is a technique that transforms a text fragment so that it has a similar meaning while using less detailed terms, with minimization of information loss as an imperative [8]. The most important idea behind semantic compression is that it reduces the number of words used to describe an idea. As a consequence, semantic compression allows one to identify a common thought in seemingly different communication vessels, and it reduces the number of word-vector dimensions in the methods involved, making them more efficient. The reduction entails some information loss, but in general it should not degrade the quality of results; every possible improvement is considered in terms of its overall impact on quality. Dimension reduction is performed by introducing descriptors for viable terms. Descriptors are chosen to represent a set of synonyms or hyponyms in the processed passage, and the decision takes into account the relations among the terms and their frequency in the context domain. To do this we can use frequency-based or ontology-based semantic compression techniques [8]. Frequency-based semantic compression requires assembling word frequencies and information on semantic relationships, and uses a vector with one concept for every significant term that occurs in the document.
Ontology-based semantic compression relies on a general description of all concepts as well as their relationships. Ontologies play a key role in query expansion research; a common use of an ontology in query expansion is to enrich resources with well-defined meaning to enhance the search capabilities of existing web searching systems. Ontology-based information retrieval recognizes the relations among terms by referring to the ontology. Creating ontologies is not an easy task, and obviously there is no unique correct ontology for any domain. Ontology-based representation allows the system to use fixed-size document vectors, consisting of one component per base concept. However, some terms, in particular terms related to specific domains (biomedical, mechanical, business, etc.), are absent from the ontology and are not defined in the machine-readable dictionary used to build the concept-based version of the documents. In some cases this causes a loss of information that affects the final retrieval results [10]. Hence, designing an efficient Amharic text retrieval system using a semantic compression technique is one basic solution to overcome the limitations of previous Amharic document retrieval systems. We therefore aim at designing an Amharic text retrieval system using thesaurus-based semantic compression for indexing.
1.2 Statement of the Problem and Justification
Amharic is one of the Semitic languages of the world and the working language of the Federal Democratic Republic of Ethiopia. It has been the working language of governmental organizations, non-governmental organizations and private institutions throughout modern times. As a result, this language plays a great role in social, political and business life. There are many digital documents in the Amharic language available on the Internet and the World Wide Web which are accessible to users. This leads to the need for efficient and effective information storage and retrieval systems to represent, store and retrieve relevant Amharic documents from Amharic document collections. Several experiments on information processing of Amharic documents have been carried out, among them: retrieval techniques for Amharic documents [51, 18]; automatic classification of Amharic documents [58]; the application of OCR techniques to computer printouts, typewritten and handwritten documents [56, 53, 54, 55, 57]; Amharic word parsing [52]; the application of WEBSOM to Amharic text [27]; and Amharic text retrieval using latent semantic indexing (LSI) with singular value decomposition (SVD) [1]. These works propose Amharic text retrieval systems with the aim of enhancing the efficiency of such systems.
Recently, Haimanot [11] attempted to integrate semantic compression techniques into an Amharic information retrieval system in order to improve its effectiveness. The application of semantic compression is advantageous for defining more precise descriptors of a concept (reducing the effect of language diversity and thereby limiting language redundancy, a step towards a controlled dictionary). However, she used statistical co-occurrence based indexing to show the performance improvement of integrating semantic compression, which is not capable of controlling Amharic synonymous terms. As a result, her system may fail to retrieve relevant information for a user's query, because it does not handle synonymous terms. Another weakness is that she used a fixed set of user queries to retrieve information from the system, which does not demonstrate the performance improvement of the retrieval system. Therefore, the aim of this study is to apply a semantic compression approach to handle the Amharic synonymous terms that affect the performance of information retrieval systems, thereby retrieving relevant documents as per the user's query. Towards this end, the following research questions are explored and answered in the present work:
 How can semantic compression techniques be applied so as to control Amharic synonymous terms in an Amharic information retrieval system?
 To what extent does semantic compression reduce index terms and thereby enhance the performance of a text retrieval system?
 How does semantic compression improve searching for relevant Amharic documents in Amharic text retrieval?
1.3 Objectives of the Study
The general and specific objectives that the present study aims to achieve are given below.
1.3.1 General objective
The general objective of this research is to design a semantic compression technique that controls Amharic synonymous terms, and to integrate it into Amharic text retrieval in order to enhance the performance of the Amharic text retrieval system.
1.3.2 Specific objectives
In the course of attaining the general objective, the study specifically achieves the following objectives:
 To review the literature and previous works related to information retrieval systems;
 To identify the language-specific features of Amharic and explore natural language processing techniques for Amharic document retrieval;
 To prepare a dataset and build a corpus for training and testing the performance of the prototype;
 To develop a thesaurus for Amharic index and query terms;
 To integrate semantic compression with the Amharic IR system to retrieve relevant documents for users' queries; and
 To evaluate the effectiveness of the information retrieval system.
1.4 Scope and Limitations of the Study
This work attempts to integrate semantic compression into Amharic text document retrieval. Other data types, such as image, video and graphics, are out of the focus of this research, and only content-bearing words are considered. Semantic compression is used during indexing and also in searching, using a thesaurus-based indexing technique, but only the synonym relationship between terms is used to develop the thesaurus. Stemming is done by suffix and prefix detection and removal. Owing to time constraints, a limited corpus was used for evaluating the performance of the IR system developed in the study. This study also does not handle words with the same form but different meanings (for example, "ዋና", which means "main" but also means "swimming"). Additionally, infix detection and removal is not handled.
1.5 Methodologies
Methodology is an approach involving data collection, analysis and interpretation that shows how a researcher achieves the objectives and answers the research questions [11]. Hence, in order to achieve the specific and general objectives of the study and answer the research questions, the following methods are used.
1.5.1 Literature Review
To build conceptual understanding, identify the gaps left by previous studies and gain a detailed understanding of the present research, related literature, including journal articles, conference papers, books and Internet sources, has been reviewed. The review is mainly concerned with works that have a direct relation to the topic and objectives of the study, namely previous works on Amharic information retrieval systems and on query expansion.
1.5.2 Dataset Collection
To demonstrate the effectiveness of the proposed system, the researcher collected Amharic text documents to prepare a text document corpus for testing. The corpus was built from the official websites of the Ethiopian Reporter newspaper and the Ethiopian News Agency; other Internet-based sources were also used, such as የዳንኤል እይታዎች (Daniel Kibret's Views), as well as electronic books such as OROMAY by Be'alu Girma. The dataset mainly consists of news articles and other Internet resources. None of the selected documents is domain specific; rather, they cover different aspects of life such as sport, culture, socio-economic issues, politics, religion, education and development. The heterogeneity of the dataset makes the evaluation of the system more generic [12]. All the collected data were converted to the same font type and size and saved as text documents using Notepad. All the prepared datasets are used as input for testing the performance of the investigated Amharic text retrieval system.
1.5.3 Approaches and Programming Tools
The process of developing the semantic compression technique is broken down into the following phases [11]: pre-processing (punctuation removal, tokenization, normalization, stemming and stop word removal); term weighting using the term frequency-inverse document frequency approach (TF*IDF); and finally applying thesaurus-based semantic compression during indexing and searching.
Owing to its dynamic data types and its embeddability within applications as a scripting interface, the Python 3.1 programming language was used to develop all the programs that manipulate the files and to develop the prototype. Python is a dynamic programming language used in a wide variety of application domains; it is simple, powerful, modular and allows natural expression of procedural code. The prototype system was run in a Windows environment. Retrieval effectiveness was then compared using the conventional recall and precision measures, and conclusions and recommendations were made based on the results.
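To make the phases above concrete, here is a minimal sketch of such a pipeline. It is illustrative only: the stop words, suffixes and thesaurus entries shown are stand-ins, not the prototype's actual data or code.

```python
# A minimal sketch of the preprocessing and compression pipeline described above.
# STOP_WORDS, SUFFIXES and THESAURUS are illustrative placeholders, not the thesis data.
import re

STOP_WORDS = {"ነው", "እና", "ላይ"}                 # sample Amharic stop words
SUFFIXES = ["ኦች", "ዎች", "ም", "ን"]               # sample suffixes to strip
THESAURUS = {"ብርታት": "ኃይል", "ጥንካሬ": "ኃይል"}   # synonym -> representative descriptor

def tokenize(text):
    # Split on whitespace and Ethiopic/Latin punctuation.
    return [t for t in re.split(r"[\s፡።፣፤!?.,;:()\"']+", text) if t]

def stem(token):
    # Naive suffix stripping (the thesis also strips prefixes).
    for suf in SUFFIXES:
        if token.endswith(suf) and len(token) > len(suf) + 1:
            return token[: -len(suf)]
    return token

def compress(token):
    # Thesaurus-based semantic compression: replace a synonym
    # with its representative descriptor.
    return THESAURUS.get(token, token)

def preprocess(text):
    tokens = [stem(t) for t in tokenize(text) if t not in STOP_WORDS]
    return [compress(t) for t in tokens]
```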
1.5.4 Testing procedure
The experiment was conducted using 300 documents as the test set, and 15 queries were used to test the performance of the system. To measure the effectiveness of the Amharic information retrieval system, the classic recall and precision measures are used. Recall is the fraction of the relevant documents that are returned by the system; precision is the fraction of relevant documents in the set of returned documents [13].
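A minimal sketch of how these two measures can be computed from sets of retrieved and relevant document identifiers (the sample sets here are hypothetical):

```python
# Recall and precision over sets of document IDs (sample data is hypothetical).
def recall(retrieved, relevant):
    return len(retrieved & relevant) / len(relevant) if relevant else 0.0

def precision(retrieved, relevant):
    return len(retrieved & relevant) / len(retrieved) if retrieved else 0.0

retrieved = {"d1", "d2", "d3", "d4"}
relevant = {"d1", "d3", "d5"}
print(recall(retrieved, relevant))     # 2/3 ≈ 0.67
print(precision(retrieved, relevant))  # 2/4 = 0.50
```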
1.6 Significance of the Study
This research is significant for the following main reasons. It produces a result that shows the advantage of integrating thesaurus-based semantic compression for an effective information retrieval system. It can also be implemented in different organizations to provide a way of retrieving text documents in a timely manner. Finally, it can serve as motivation for other researchers to work in this area to further improve retrieval systems for Amharic and other local languages.
1.7 Organization of the Thesis
This thesis is organized into five chapters. The first chapter, presented above, is the introduction. The rest of the thesis is organized in the following way:
Chapter two is the literature review. General concepts of semantic compression and of the Amharic writing system applicable to the research are discussed, and approaches to semantic compression and related research on semantic compression are reviewed. Chapter three describes the design of the prototype system and discusses the techniques and algorithms used, including the major methods implemented for semantic compression in indexing and searching. The fourth chapter presents the experimentation and results of the study: dataset selection and preparation, implementation of the proposed work, the experiments, the findings and implementation issues are discussed in detail. Finally, in chapter five the major findings, including the challenges faced, are presented as conclusions, and work identified as future work that needs the attention of other researchers is listed in the recommendations section.
CHAPTER TWO
LITERATURE REVIEW
Nowadays journals, magazines, newspapers, news, online education, books, entertainment media, videos and pictures are available in electronic format, both on the Internet and in offline sources. A huge amount of information is being released in the Amharic language, since it is a language of education and research, of administration and political affairs, and of ritual activities and social interaction. As a result, retrieving the required information from these collections is becoming very necessary; many information retrieval systems have been developed, but building a system that retrieves information effectively and efficiently remains a challenging task. The problem can be tackled by developing an effective and precise information retrieval system that lets users search, browse and interact with the bulk of a document corpus in a timely manner [14]. The focus of this work is to design an information retrieval system which integrates a semantic compression technique for Amharic text documents.
2.1 Amharic Language
Amharic is a Semitic language with an influence from Cushitic languages that is most visible in its syntax. Verbs exhibit the typical Semitic non-linear word formation: the interdigitation of consonantal roots with vocalic patterns. This also applies to deverbal nouns and adjectives. The term root is used to refer to lexical morphemes consisting of consonants; radical for the consonant constituents of roots; and stem for intercalated forms. Amharic is a morphologically rich language in which up to 120 words can be conflated to a single stem. The word units of Amharic are the phoneme, root, stem and word.

The present Amharic writing system was adopted from the Ge'ez writing system. Ge'ez, which belongs to the class of Semitic languages, was the language of literature in Ethiopia in earlier times, and the ancient Sabaean script is in turn attributed as the source of the Ge'ez script. However, as Bender [17] explains, the number of symbols in the original Sabaean script, and their shapes, changed as they descended into Ge'ez and later into Amharic. Moreover, some new symbols have been added to Amharic. Amharic did not discriminate in adopting the Ge'ez Fidel: it took all of the symbols and added some of its own. Although Sabaean is not used currently, Ge'ez is still used, especially as a language of liturgy (mass) in the Ethiopian Orthodox and Catholic churches and in church literature [11].
2.1.1 The Amharic writing system
Since the 1st century, Amharic has undergone various modifications and additions. The current Amharic script has 33 basic characters. As shown in Table 2.1, there are six further orders derived from each basic form, representing syllable combinations consisting of a consonant and a vowel, except the 1st order, which may represent the consonant alone [15].

Table 2.1 Sample orders of Amharic scripts

1st   2nd   3rd   4th   5th   6th   7th
ሀ     ሁ     ሂ     ሃ     ሄ     ህ     ሆ
ለ     ሉ     ሊ     ላ     ሌ     ል     ሎ
ሐ     ሑ     ሒ     ሓ     ሔ     ሕ     ሖ
መ     ሙ     ሚ     ማ     ሜ     ም     ሞ
ሠ     ሡ     ሢ     ሣ     ሤ     ሥ     ሦ
ረ     ሩ     ሪ     ራ     ሬ     ር     ሮ
ሰ     ሱ     ሲ     ሳ     ሴ     ስ     ሶ
ቀ     ቁ     ቂ     ቃ     ቄ     ቅ     ቆ
From Table 2.1 it is easy to notice that there is regularity in the pattern of the characters, even though it is sometimes violated. Letters from the 1st order through the 5th order have a predictable structure, while the 6th and 7th orders are less systematic. This regularity is a very important feature, as the predictable patterns help new learners pick up the writing system easily.
Table 2.2 Method of order formation in the Amharic writing system

2nd order: Add a horizontal stroke at the middle of the right side of the base character (e.g. ሁ ሉ ሑ ሙ).
3rd order: Add a horizontal stroke at the bottom of the right leg of the base character (e.g. ሂ ሊ ሒ ሚ).
4th order: Elongate the right leg of a two- or three-leg base character (e.g. ሃ ላ ሓ ማ); add a diagonal stroke at the bottom of the leg of a one-leg base character (e.g. ጋ ታ ቻ ፓ).
5th order: Add a ring at the bottom of the right leg of the base character (e.g. ሄ ሌ ሔ ሜ).
6th order: Highly irregular; some characters bend their legs (e.g. ህ ሕ ጥ ት), and some looped characters add a horizontal stroke at their loops (e.g. ፅ ዝ ስ ብ).
7th order: Highly irregular; formed by shortening the last leg, or the last two legs for characters that have three legs (e.g. ሶ ቦ ጦ ሖ), or by adding a loop at the top-right of the character (e.g. ሆ ሎ ቆ ሮ); other examples: ው ድ ጅ ጽ ሞ ዎ ጎ ዖ.
Among the 33 base forms, two, /አ/ and /ዐ/, represent vowels; the rest are consonants, and the semivowels /ወ/ and /የ/ are classified as consonants. In addition to the (33 × 7 = 231) major symbols, there are 51 additional variants for labialized consonants (plus vowel), that is, syllables involving consonants with lip-rounding, for example ኳ /kwa/. There are also symbols that were introduced into the writing system to incorporate sounds which previously had no symbol. For instance, the system added two extra sets of symbols (i.e., 2 sets of 7), ፐ /p/ and ጰ /p'/. These were introduced in order to write words borrowed from Greek and Latin such as Paul (ጳውልስ), police (ፖሊስ) and Ethiopia (ኢትዮጵያ) [30]. In addition, the syllabic series ቨ /v/, originally introduced to transcribe Italian in Ethiopic writing, is still an element of the writing system [16].
2.1.2 Amharic numeric system
The Ethiopic writing system has its own numbering system, which is believed to have descended from the Greek system; however, others claim that it is derived from the letter elements of the writing system itself [16]. Generally, the Ethiopic numeral system has 20 symbols, which are shown in Table 2.3.
Table 2.3 Ethiopic numerals

1 ፩     6 ፮     20 ፳     70 ፸
2 ፪     7 ፯     30 ፴     80 ፹
3 ፫     8 ፰     40 ፵     90 ፺
4 ፬     9 ፱     50 ፶     100 ፻
5 ፭     10 ፲    60 ፷     10000 ፼
However, the numbering system does not have symbols for zero, negatives, the decimal point or mathematical operators; hence in the current Amharic writing system the Hindu-Arabic numerals and Latin mathematical operators are used. The Ethiopic numerals are, however, used heavily in early Amharic text documents.
2.1.3 Amharic punctuation marks
Amharic punctuation marks consist of different symbolic representations. The most commonly used marks are the basic word divider, ሁለት ነጥብ (hulet netib), two dots arranged like a colon (፡), and the sentence ending, አራት ነጥብ (arat netib), four dots arranged in a square pattern (።). Others are the equivalent of the comma, ነጠላ ሰረዝ (netela serez) (፣), and of the semicolon, ድርብ ሰረዝ (derib serez) (፤). The question mark, quotation marks, exclamation mark and sarcasm mark are represented as ጥያቄ ምልክት (tiyake milikit) (?), ትምህርተ ጥቅስ (timihrite tiks) (« »), ትምህርተ አንክሮ (timihrite ankro) (!) and ትምህርተ ስላቅ (timihrite silaki) (¡), respectively [17].
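For retrieval, these marks matter mostly as token delimiters. The following is a minimal sketch (not the thesis' tokenization algorithm) that splits Amharic text on whitespace and the Ethiopic punctuation marks; the sample sentence is made up:

```python
# Treat Ethiopic punctuation marks as token delimiters.
import re

ETHIOPIC_PUNCT = "፡።፣፤፥፦፧፨"  # word divider, full stop, comma, semicolon, ...

def tokenize_amharic(text):
    pattern = "[\\s" + ETHIOPIC_PUNCT + "?!«»]+"
    return [t for t in re.split(pattern, text) if t]

print(tokenize_amharic("ኢትዮጵያ፡በአፍሪካ፡ቀንድ፡ትገኛለች።"))
# ['ኢትዮጵያ', 'በአፍሪካ', 'ቀንድ', 'ትገኛለች']
```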
2.2 Amharic Script Features
Million [32] discusses specific features of the Amharic writing system, as summarized below:

 Shape similarities among characters: the shapes of many Amharic characters are similar, with few distinctions; for instance, ደ and ጀ, ተ and ቸ, and ኸ and ከ.

 Structural relation: some basic characters of the writing system are clearly related in graphical structure, such as ነ and ከ, ረ and ፈ.

 Shape differences: there are remarkable differences in shape among the basic characters. For example, consider ሀ and በ: these two letters are open on one side but in opposite directions. On the other hand, መ and ወ are formed from two loops but vary in the loop connection, and ሠ and ጠ have three legs which end in opposite directions.

 Size and width differences: Amharic characters can differ in size either vertically or horizontally. Some characters are very short (e.g. ፈ, ሠ, መ) and some others are very long (e.g. ች, ዥ, ኽ). Furthermore, there is a noticeable difference in width, for example between ነ, ሚ and ጪ.
2.3 Challenges of the Amharic Writing System
Bender et al. [17], Thomas [30] and Bethlehem [18] point out the following problems of the Amharic writing system.
2.3.1 Redundant symbols
The use of different symbols to represent the same sound results in considerable systemic redundancy, which creates a challenge in the Amharic writing system. For instance, four different symbols represent the same sound /h/ + vowel: ሀ, ኀ, ሐ and ሓ. There are additional such symbols, i.e. /s/: ሰ, ሠ and /s'/: ፀ, ጸ, which can be used interchangeably as one sound. Such redundancy may cause ambiguity for learners and users of the writing system [17]. For instance, using three of these symbols (ሀ, ኀ, ሐ) to represent the name 'Haragewayni' in Amharic yields at least six different Amharic spellings: ሀረገወይን, ሐረገወይን, ሓረገወይን, ሃረገወይን, ኀረገወይን and ኃረገወይን. Consequently, designing an information retrieval system is somewhat complex due to such redundancies (a sketch of how such redundant symbols can be collapsed in software is given after Table 2.5). The classes of symbols with the same sound fall into two groups:

 the same sound for the first and fourth order alphabets; and
 different alphabets that share the same sound.

The first class is listed below; letters in the same row share the same sound.
Table 2.4 The same sound for the first and fourth order alphabets

1st order    4th order
ሀ            ሃ
ሐ            ሓ
ኀ            ኃ
አ            ኣ
ዐ            ዓ

The following is a list of the alphabets that fall in the second class. All alphabets in the same column have the same sound.
Table 2.5 Different alphabets that share the same sound (all alphabets in the same column have the same sound)

ሀ  ሁ  ሂ  ሃ  ሄ  ህ  ሆ
ሐ  ሑ  ሒ  ሓ  ሔ  ሕ  ሖ
ኀ  ኁ  ኂ  ኃ  ኄ  ኅ  ኆ
ሰ  ሱ  ሲ  ሳ  ሴ  ስ  ሶ
ሠ  ሡ  ሢ  ሣ  ሤ  ሥ  ሦ
አ  ኡ  ኢ  ኣ  ኤ  እ  ኦ
ዐ  ዑ  ዒ  ዓ  ዔ  ዕ  ዖ
ጸ  ጹ  ጺ  ጻ  ጼ  ጽ  ጾ
ፀ  ፁ  ፂ  ፃ  ፄ  ፅ  ፆ
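The sketch below shows how such redundant symbols can be collapsed to one canonical form during normalization, assuming a hand-built mapping table. The canonical choices and the (partial) table are illustrative, not the thesis' actual normalization data:

```python
# Collapse redundant Amharic symbols to one canonical form.
# The mapping covers only a few characters as an illustration;
# a full table would map all seven orders of each redundant series.
NORMALIZE_MAP = str.maketrans({
    "ሐ": "ሀ", "ኀ": "ሀ", "ሃ": "ሀ", "ሓ": "ሀ", "ኃ": "ሀ",
    "ሠ": "ሰ",
    "ዐ": "አ", "ዓ": "አ", "ኣ": "አ",
    "ፀ": "ጸ",
})

def normalize(text):
    return text.translate(NORMALIZE_MAP)

# Two of the variant spellings of 'Haragewayni' reduce to a single form:
print(normalize("ሐረገወይን"), normalize("ኃረገወይን"))  # ሀረገወይን ሀረገወይን
```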
2.3.2 Features of synonymous and polysemous words
Polysemous words take different meanings depending on the context. For instance, the word "ዓለም" in Amharic can mean 'the world', but it can also mean 'everyone', 'happiness' or 'wealth'. Polysemy causes terms in the query to be matched with words in irrelevant documents, and the search results become too broad.

Words that are synonyms have equivalent meanings. Synonyms allow people to use different ways to refer to the same thing. This happens to the everyday searcher: even though different people have the same object in mind to search for, each issues a different query [17]. All the queries are valid, as they use synonyms. For example, in Amharic the words 'ኃይል', 'ብርታት', 'አቅም' and 'ጥንካሬ' can all be used to mean 'strength'.
2.3.3 Feature similarity among characters
Similarity among the features of different characters causes confusion, especially in designing retrieval systems; for example, a system may group the words ደመረ and ጀመረ in the same cluster. The first letters of the two words are similar with few distinctions, and the remaining two characters are identical. Such feature similarities among different characters greatly affect the performance of a retrieval system [30].
2.3.4 Existence of irregular spelling
A number of Amharic words can be written with different spellings [17]. For example, the word 'samtoal', which means 'he has heard', may be spelled ሰምቷል, ሰምትዋል or ሰምቶአል; likewise, 'butterfly' may be written "ቢራቢሮ" or "ብራቢሮ" [17]. Literal term-matching retrieval systems suffer from this irregularity, as the same word can have different spellings [11].
2.3.5 Problems of transliteration
When foreign terms are transliterated into Amharic, different spellings may be used, as varied as the number of possible pronunciations [17]. Transliteration of foreign words into the Amharic writing system is one of the main causes of irregular spelling. Amharic lacks some of the basic English sounds, so a native Amharic speaker may fail to correctly pronounce some English words. For example, both ቴሌቭዥን and ቴሌቪዥን mean 'television', and both አይሮፕላን and አውሮፕላን mean 'airplane'. A single word may also be written in different ways for other reasons; one is regional dialect, which can affect word formation at the basic level when words are written following their spoken form, e.g. ዓጼ vs. ዓጤ [11]. In all these cases, the use of n-grams can handle the variation by recognizing words that share a large proportion of their strings as equivalent.
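As a rough illustration of the n-gram idea just mentioned (a generic sketch, not the thesis' implementation), character bigrams can mark two variant spellings as near-duplicates:

```python
# Dice coefficient over character bigrams; a common way to score
# spelling variants as equivalent. The threshold choice is up to the system.
def ngrams(word, n=2):
    return {word[i:i + n] for i in range(len(word) - n + 1)}

def dice(a, b, n=2):
    ga, gb = ngrams(a, n), ngrams(b, n)
    return 2 * len(ga & gb) / (len(ga) + len(gb))

print(dice("ቴሌቭዥን", "ቴሌቪዥን"))  # 0.5: half the bigrams match despite the variant letter
```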
2.3.6 Formation of compound words
The Amharic writing system uses a multitude of ways to denote compound words, and there is no agreed-upon spelling standard for constructing them [19]. According to Bender et al. [17], compound words are sometimes written as a single word and other times as two separate words. For example, the word 'megnetabet', which means 'bedroom', can be written as 'መኝታቤት' or 'መኝታ ቤት', and the word 'bet mekides', which means 'temple', can be written as 'ቤተመቅደስ' or 'ቤተ መቅደስ' [10]. This variability in the writing system highly affects the performance of Amharic document retrieval systems.
2.4 Overview of Information Retrieval
Information retrieval is a very vast area of study whose main aim is searching for relevant documents in a large corpus to satisfy the information need of users; this aim has led to the development of many text search engines [18]. IR is concerned with searching documents for information, both in document corpora and on the World Wide Web, and it covers both structured and unstructured information. An IR system includes various processes and techniques. Figure 2.1 shows the components of an IR system and their interrelation.
Figure 2.1 General architecture of an information retrieval system [20]

The whole IR system includes two main subsystems [4]: indexing and searching. Indexing is the process of preparing index terms, either content-bearing terms or free text, extracted from the document corpus. Searching is the process of matching a user's information need, expressed via a query, to documents in the collection via the index terms. These searching and indexing processes are guided by a model, the information retrieval model. There are various IR models, including the vector space model (VSM), the probabilistic model and the Boolean model, to list a few [28]. Additionally, it is important to measure the performance of an IR system; performance evaluation techniques help us to know the accuracy of the IR system.
2.5 Indexing
Indexing is an offline process of extracting index terms from a document collection and organizing them using an indexing structure to speed up searching [4]. It is a language-dependent process which varies from language to language [36]. Three of the most commonly used file structures for information retrieval are lexicographical indices (indices that are sorted), clustered file structures, and indices based on hashing [35]. One type of lexicographical index is the inverted file; another is the suffix tree (PAT tree).
i. Suffix tree
A suffix tree [4] (also called a PAT tree) is a data structure that presents the suffixes of a given string in a way that allows a particularly fast implementation of operations such as search. The suffix tree for a string S is a tree whose edges are labeled with strings, such that each suffix of S corresponds to exactly one path from the tree's root to a leaf. Constructing such a tree for the string S takes time and space linear in the length of S. Once constructed, several operations can be performed quickly: locating a substring in S, locating a substring if a certain number of mistakes are allowed, locating matches for a regular expression pattern, and so on. Suffix trees also provided one of the first linear-time solutions for the longest common substring problem. These speedups come at a cost: storing a string's suffix tree typically requires significantly more space than storing the string itself [4].
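Linear-time suffix tree construction is fairly involved. As a simpler illustration of the same indexing idea, the sketch below uses a suffix array (a compact cousin of the suffix tree) with naive construction and binary search; it is didactic, not a PAT tree implementation:

```python
# Suffix array: all suffix start positions sorted lexicographically.
# Naive O(n^2 log n) construction; fine for illustration only.
def build_suffix_array(s):
    return sorted(range(len(s)), key=lambda i: s[i:])

def contains(s, sa, pattern):
    # Binary-search for the first suffix whose prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if s[sa[mid]:sa[mid] + len(pattern)] < pattern:
            lo = mid + 1
        else:
            hi = mid
    return lo < len(sa) and s[sa[lo]:].startswith(pattern)

text = "information retrieval"
sa = build_suffix_array(text)
print(contains(text, sa, "retrie"))  # True
print(contains(text, sa, "xyz"))     # False
```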
ii. Inverted index
An inverted index is an index data structure storing a mapping from content, such as words or numbers, to its locations in a document or a set of documents. The purpose of an inverted index is to allow fast full-text searches, at the cost of increased processing when a document is added to the corpus. The inverted file may be the database file itself, rather than its index. It is the most popular data structure used in document retrieval systems, used on a large scale, for example, in search engines [4].

There are two different kinds of inverted files [37]. One is the record-level inverted index (or inverted file index, or just inverted file), which contains a list of references to documents for each word. The other is the word-level inverted index (or full inverted index, or inverted list), which additionally contains the positions of each word within a document. The word-level inverted index enables more functionality (like phrase searches), but needs more time and space to be created [37].

The construction of an inverted index follows four critical steps [36]: collecting the documents to be indexed, tokenizing the text, linguistic preprocessing, and finally indexing the documents that each term occurs in by creating the inverted index (a sketch of these steps is given at the end of this subsection).

Tokenization is the process of chopping text on white spaces and throwing away punctuation characters. For example, if the original document is "I am Tewodros, who are you", the tokens will be 'I', 'am', 'Tewodros', 'who', 'are', 'you'. A token is thus an instance of a sequence of characters, and after further processing each token is a candidate for an index entry. One challenge of tokenization is differentiating single words from compound words, for instance Hewlett-Packard versus Hewlett and Packard, or Addis Ababa versus Addis and Ababa. Another challenge is identifying numerical values like dates, phone numbers and IP addresses. Additionally, Chinese and Japanese have no space between words, which makes space- and punctuation-based tokenization difficult. Arabic and Hebrew are basically written right to left, but with certain items, like numbers, written left to right; the challenge here is that the algorithm used for Latin-based languages cannot be applied to such languages as-is. That is why tokenization is called a natural-language-dependent technique [38].
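The following is a minimal sketch of the four steps for both index kinds, assuming whitespace/punctuation tokenization and lowercase folding (the sample documents are made up):

```python
# Build record-level and word-level inverted indexes over a tiny corpus.
# Sample documents are illustrative only.
import re
from collections import defaultdict

docs = {
    1: "I am Tewodros, who are you",
    2: "Tewodros studies information retrieval",
}

record_index = defaultdict(set)    # term -> {doc_id}
word_index = defaultdict(list)     # term -> [(doc_id, position)]

for doc_id, text in docs.items():
    tokens = [t.lower() for t in re.split(r"[\s.,;:!?'\"]+", text) if t]
    for pos, term in enumerate(tokens):
        record_index[term].add(doc_id)
        word_index[term].append((doc_id, pos))

print(record_index["tewodros"])   # {1, 2}
print(word_index["tewodros"])     # [(1, 2), (2, 0)]
```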
2.5.1 Linguistic preprocessing
Linguistic preprocessing includes case folding/normalization, stop word removal and stemming.

Case folding/normalization is the process of handling problems related to variation in case (UPPER CASE, lower case or Mixed Case). A good way to handle this problem is to convert the whole document to the same case. It is often best to lowercase everything, since most of the time users type queries in lowercase regardless of the 'correct' capitalization; examples include Republican vs. republican and John vs. john vs. JOHN. In a language like Amharic, which does not distinguish between upper and lower case, this may not matter much, but it is very important for languages using Latin characters [39] [40].
Stop words: stop words are the most frequent terms, common to every document, and have no power to discriminate one document from another, so they should not be considered in the indexing process. According to Zipf's law [36], a few terms occur very frequently, a medium number of terms occur with medium frequency, and many terms occur with very low frequency. This shows that writers use a limited vocabulary throughout a document, in which a few terms are used far more frequently than the others. Building on this, Luhn defined word significance across the document in 1958 [36]: he suggested that both extremely common and extremely uncommon terms are not very useful for indexing, so upper and lower cutoff points should be set. The upper cutoff removes the most frequent terms, whereas the lower cutoff controls the least frequent terms, which are believed to be non-content-bearing [34]. Removing stop words from English documents saves 25%-30% of disk space, but it causes problems if the system must handle phrase queries, song names and the like, for example "As we may think" or "to be or not to be" [4].

Stemming: stemming is a process used in most search engines and information retrieval systems, and a core natural language processing technique for an efficient and effective IR system. Stemming transforms inflected words into their most basic form. There are different stemming algorithms, but the most common is that of Porter, called the 'Porter stemmer' (see the sketch at the end of this subsection). Although stemming is very similar to lemmatization, stemming is what is used in most indexing processes. Like other natural language processing techniques, stemming is language dependent. It typically removes inflectional and derivational morphology; e.g. automate, automatic and automation are all reduced to automat. Stemming has both advantages and disadvantages. The advantage is that it handles problems related to inflectional and derivational morphology, so words with the same stem/root are retrieved together; this increases the effectiveness of the IR system. The disadvantage is that some terms may be over-stemmed, which changes their meaning in the document, and different terms may be reduced to the same stem, which forces the system to retrieve non-relevant documents [41].

The inverted index data structure is a core component of a typical search engine indexing algorithm. It improves the quality of search engines by increasing the speed of finding documents in which a term occurs. With the inverted index created, a query can be resolved by jumping directly to the term's entry (via random access) in the inverted index.
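As a small illustration of the two steps just described (stop-word filtering, then stemming), the sketch below uses the Porter stemmer from the nltk package; the token list and stop-word set are made up, and the exact stems depend on the Porter rules:

```python
# Stop-word removal followed by Porter stemming (requires the nltk package).
from nltk.stem import PorterStemmer

stop_words = {"the", "is", "a", "of", "and"}
stemmer = PorterStemmer()

tokens = ["the", "automation", "of", "automatic", "indexing", "is", "useful"]
content = [t for t in tokens if t not in stop_words]
print([stemmer.stem(t) for t in content])
# related forms conflate, e.g. automation/automatic -> autom/automat
```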
2.5.2 Document representation and term weighting
There are different ways of representing documents. Document representation allows us to give different weights to different terms with respect to a given query, and helps judge whether a document is relevant to a user's query [42]. There are several mechanisms for assigning weights to terms. The first is binary weighting, where only the presence (1) or absence (0) of a term is recorded in the vector. Another is raw term frequency, a non-binary weight that assigns the frequency of occurrence of a term in the document. Finally, there is the product of term frequency (TF) and inverse document frequency (IDF), commonly called TF*IDF. Of these three ways of calculating term weights, TF*IDF has been used in this study, because it is a normalized weighting technique and a standard way of calculating weights.

Term frequency*inverse document frequency, TF*IDF, is a well-known method of evaluating how important a word is in a document, used in information retrieval and text mining. It is a statistical measure which reflects how important a word/term is to a document in a collection/corpus. TF*IDF is also a very convenient way to convert the textual representation of information into a vector space model (VSM). The value of TF*IDF increases in proportion to the number of times a word appears in a document and decreases as the term occurs more frequently across the document corpus. Common terms which exist in almost all documents have a lower TF*IDF score, whereas terms occurring frequently in a single document but not in others score higher [37] [42]. Generally speaking, TF*IDF is used for term weighting; this weighting technique is often used by search engines and information retrieval systems, can be used for filtering stop words, and also has applications in text summarization and classification [42]. For example, if the user's query is "the brown cow", only documents that contain the terms "the", "brown" and "cow" should be considered relevant, and documents having none of these three terms are excluded from retrieval. In order to distinguish the importance of each term, its frequency in each document (term frequency) is counted [38].
In the above example the term ‘the’ is common word and obviously the most frequent term, if it is considered as content bearing it might be problem—because it is stop-word. The good thing about TF*IDF is that it relies on both term frequency (TF) and inverse document frequency (IDF). This makes simple to reduce rank of common terms throughout the whole document corpus, but increase in rank of terms exist in fewer documents more frequently [21]. The term frequency (TF) is simply the number of times a given term appears in that document. This count is usually normalized to prevent a bias towards longer documents (which may have a higher term count regardless of the actual importance of that term in the document) to give a measure of importance of the term within particular document.
$tf_{i,j} = \frac{n_{i,j}}{\sum_{k} n_{k,j}}$   (2.2)

where $n_{i,j}$ is the raw count of term $i$ in document $j$ and the denominator sums the counts of all terms in document $j$. The inverse document frequency (IDF) measures whether a term is common or rare across all documents. It is obtained by dividing the total number of documents by the number of documents containing the term, and then taking the logarithm of that quotient. A high idf value is obtained for rare terms, and a low value for common terms. It is mainly used to discriminate the importance of a term throughout the collection.
$idf_i = \log\frac{N}{df_i}$   (2.3)

where $N$ is the total number of documents in the collection and $df_i$, the document frequency, is the number of documents containing term $i$. The more a term $t$ occurs throughout all documents, the more poorly $t$ discriminates between documents; the less frequently a term appears in the whole collection, the more discriminating it is. The tf*idf weight is then the product of tf and idf:
$w_{i,j} = tf_{i,j} \times idf_i = tf_{i,j} \times \log\frac{N}{df_i}$   (2.4)
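As an illustration of equations 2.2-2.4, the following Python sketch (a simplified rendering, not the implementation used in this study; a base-10 logarithm is assumed) computes the TF*IDF weight of a term:

```python
import math

def tf(term, doc):
    # Equation 2.2: raw count normalized by document length
    return doc.count(term) / len(doc)

def idf(term, corpus):
    # Equation 2.3: log of (total documents / documents containing the term)
    df = sum(1 for doc in corpus if term in doc)
    return math.log10(len(corpus) / df) if df else 0.0

def tf_idf(term, doc, corpus):
    # Equation 2.4: the product of the two measures
    return tf(term, doc) * idf(term, corpus)

corpus = [["the", "brown", "cow"],
          ["the", "white", "cow"],
          ["the", "red", "house"]]
print(round(tf_idf("brown", corpus[0], corpus), 3))  # rare term: 0.159
print(tf_idf("the", corpus[0], corpus))              # in every document: 0.0
```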
2.6 Retrieval models
A retrieval model specifies the details of the document representation, the query representation, and the matching function. There are two main reasons for having retrieval models. Primarily, retrieval models guide research and provide the means for academic discussion. The second reason is that a retrieval model serves as a blueprint for implementing an actual retrieval system [43]. An information retrieval model predicts and explains what a user will find relevant given the user's query; evaluation and experiments with the model then verify the correctness of its predictions [3]. There are different categories of information retrieval models. The major categories include, but are not limited to: the Boolean model, the statistical models (the vector space model and the probabilistic model), and the linguistic and knowledge-based models. There is also Google's PageRank algorithm, which uses web mining techniques [3].
2.6.1. Standard Boolean Model
The Boolean model is the oldest model of information retrieval. In the Boolean model there are three basic logical operators: AND, OR and NOT. AND is the logical product, OR is the logical sum and NOT is the logical difference. AND is used to group a set of terms into a single query/statement. For example, 'Information AND Technology' is a two-term query combined by 'AND'; in this case only documents indexed with both terms will be retrieved. If the terms in the user query are linked by the operator OR, documents containing either of the terms, or all of them, will be retrieved. For example, if the query is 'Information OR Technology', documents containing 'Information', or 'Technology', or 'Information Technology' will be retrieved [36][3]. What makes the Boolean model a good model is that it gives the expert/user a sense of control over the system: it is the user who is in charge of deciding what should or should not be retrieved. Query reformulation is also simple for the same reason. In contrast, the Boolean model may retrieve nothing if there is no matching document, or retrieve all documents whose terms match the query; there is no relevance judgment or ranking mechanism [4].
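The Boolean operators map directly onto set operations over the postings of each term, as the following sketch shows (the postings sets are hypothetical):

```python
# Each term's postings is the set of documents indexed with that term.
postings = {
    "information": {1, 2, 4},
    "technology":  {2, 3, 4},
    "retrieval":   {1, 5},
}

# 'Information AND Technology' -> logical product (intersection)
print(postings["information"] & postings["technology"])  # {2, 4}
# 'Information OR Technology' -> logical sum (union)
print(postings["information"] | postings["technology"])  # {1, 2, 3, 4}
# 'Information AND NOT Retrieval' -> logical difference
print(postings["information"] - postings["retrieval"])   # {2, 4}
```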
2.6.2. Statistical Model The vector space and probabilistic models are the two major examples of the statistical retrieval approach. Both models use statistical information in the form of term frequencies to determine the relevance of documents with respect to a query. Although they differ in the way they use the term frequencies, both produce as their output a list of documents ranked by their estimated
relevance. The statistical retrieval models address some of the problems of Boolean retrieval methods, but they have disadvantages of their own.
2.6.2.1 Vector Space Model
The vector space model represents index terms and queries as vectors embedded in a high-dimensional Euclidean space, where each term is assigned a separate dimension:

$D_j = (w_{1j}, w_{2j}, \ldots, w_{tj})$,  $Q = (w_{1q}, w_{2q}, \ldots, w_{tq})$   (2.5)
The whole VSM involves three main procedures. The first is indexing of the document such that only content-bearing terms represent the document. The second is weighting the indexed terms to enhance the retrieval of relevant documents. The final step is ranking the documents to show the best matches with respect to the query provided by the user [4].
i. Document Representation
A single document has many words, but only a few of them describe the content of the specific document; prepositions and personal pronouns, for instance, are non-content-bearing terms. The main purpose of indexing is to represent documents with only content-bearing terms, which have the power to distinguish one document from another [36]. Indexing can be based on term frequency, where thresholds are used to limit both high- and low-frequency terms. To simplify this, stop words are first removed from the document and are not considered for representation.
ii. Term Weighting and Similarity Measurement
Term weighting is highly related to recall and precision. In the vector space model, term weighting is based on single-term statistics over the entire document collection. To calculate a term's weight we need its term frequency, its document frequency and a length normalization factor; the product of these three main factors gives the term weight. Term frequency is the number of occurrences of a term in a given document; it serves as a content descriptor for the document and is generally used as the basis of a weighted document vector [4].
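Documents are then ranked by the similarity of their weight vectors to the query vector, commonly by the cosine measure. A minimal sketch follows (assuming sparse vectors stored as term-to-weight dictionaries; the weights are hypothetical):

```python
import math

def cosine_similarity(doc_vec, query_vec):
    """Cosine of the angle between two sparse term-weight vectors."""
    shared = set(doc_vec) & set(query_vec)
    dot = sum(doc_vec[t] * query_vec[t] for t in shared)
    norm_d = math.sqrt(sum(w * w for w in doc_vec.values()))
    norm_q = math.sqrt(sum(w * w for w in query_vec.values()))
    return dot / (norm_d * norm_q) if norm_d and norm_q else 0.0

doc = {"information": 0.8, "retrieval": 0.6}
query = {"information": 1.0}
print(cosine_similarity(doc, query))  # 0.8
```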
2.6.2.2 Probabilistic Model
The probabilistic retrieval model is based on the Probability Ranking Principle, which states that an information retrieval system is supposed to rank the documents based on their probability of relevance to the query, given all the evidence available. The principle takes into account that there is uncertainty in the representation of the information need and the documents. There can be a variety of sources of evidence used by probabilistic retrieval methods; the most common one is the statistical distribution of the terms in both the relevant and non-relevant documents [37][43]. The most familiar probabilistic technique is the Bayesian inference network. Consider how it ranks documents by using multiple sources of evidence to compute the conditional probability P(Info need|document) that an information need is satisfied by a given document. An inference network consists of a directed acyclic dependency graph, where edges represent conditional dependency or causal relations between propositions represented by the nodes. The inference network consists of a document network, a concept representation network that represents the indexing vocabulary, and a query network representing the information need. The concept representation network is the interface between documents and queries. To compute the rank of a document, the inference network is instantiated and the resulting probabilities are propagated through the network to derive a probability associated with the node representing the information need. These probabilities are used to rank documents [43]. The statistical approaches have the following strengths. Firstly, they provide users with a relevance ranking of the retrieved documents; hence, they enable users to control the output by setting a relevance threshold or by specifying a certain number of documents to display. Secondly, queries can be easier to formulate because users do not have to learn a query language and can use natural language. Thirdly, the uncertainty inherent in the choice of query concepts can be represented. However, the statistical approaches also have shortcomings. They have limited expressive power: the NOT operation cannot be represented because only positive weights are used, and the very common and important Boolean query ((A AND B) OR (C AND D)) cannot be represented by a vector space query, so the statistical approaches do not have the expressive power of the Boolean approach. The statistical approach also lacks the structure to express important linguistic features such as phrases, and proximity constraints are difficult to express, a feature that is of great use for experienced searchers. The computation of
the relevance scores can also be computationally expensive. A ranked linear list provides users with a limited view of the information space, and it does not directly suggest how to modify a query if the need arises. Queries have to contain a large number of words to improve retrieval performance. As with the Boolean approach, users are faced with the problem of having to choose the appropriate words, namely those that are also used in the relevant documents [3][30].
2.6.3. Linguistic and Knowledge-based Approaches In the simplest form of automatic text retrieval, users enter a string of keywords that are used to search the inverted indexes of the document keywords. This approach retrieves documents based solely on the presence or absence of exact single word strings as specified by the logical representation of the query. Clearly this approach will miss many relevant documents because it does not capture the complete or deep meaning of the user's query. Linguistic and knowledge-based approaches have been developed to address this problem by performing a morphological, syntactic and semantic analysis to retrieve documents more effectively [30]. In a morphological analysis, roots and affixes are analyzed to determine the part of speech (noun, verb, adjective etc.) of the words. Next complete phrases have to be parsed using some form of syntactic analysis. Finally, the linguistic methods have to resolve word ambiguities and/or generate relevant synonyms or quasi-synonyms based on the semantic relationships between words. The development of a sophisticated linguistic retrieval system is difficult and it requires complex knowledge bases of semantic information and retrieval heuristics. Hence these systems often require techniques that are commonly referred to as artificial intelligence or expert systems techniques [30]. DR-LINK Retrieval System: DR-LINK system represents an exemplary linguistic retrieval system. DR-LINK is based on the principle that retrieval should take place at the conceptual level and not at the word level. It attempts to retrieve documents on the basis of what people mean in their query and not just what they say in their query. DR-LINK system employs sophisticated, linguistic text processing techniques to capture the conceptual information in documents [36].
2.7 Query
A query can be defined as the verbalized expression of a user's information need [18]. Queries may be real or artificial: real queries represent the real information needs of a user, while artificial queries are derived from titles and other parts of document text. Different IR models use different formats for queries. In the vector space model, for example, a query stated in the natural language that we use for communication may be used (e.g. 'information retrieval and computers'). In the Boolean model, on the other hand, queries must be formulated as keywords combined by Boolean operators (e.g. "information AND retrieval"; "(information AND retrieval) OR computer"). In models like the Boolean model, natural language queries must first be changed to the format required by the model [47][48]. Ideally, users of an IR system put forward their information requirements verbally or in written form; they then submit this query to the IR system, which presents them with materials of potential interest. Queries are also input to an automatic indexing process and are processed in a similar manner to documents: the encoded queries are matched with the encoded documents. In other words, queries are also mapped into the language of indexing [28]. Term weighting also applies to query terms. Salton and Buckley [48] present a number of weighting schemes used for documents and queries. The following weighting formula, established in experiments as the ideal formula for queries (ibid.), is used in this research:

$W_{iq} = \left(0.5 + 0.5\,\frac{tf_{iq}}{\max tf}\right) \cdot \log\frac{N}{n_i}$   (2.1)
Where:
Wiq is the weight of term i in query q
tfiq is the frequency of term i in query q (which usually is = 1)
max tf is the maximum frequency value of all query terms
N is the total number of documents in the collection (this does not include the queries)
ni is the number of documents in which the query term is found
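A sketch of this query weighting, under the reconstruction of equation 2.1 given above (the collection statistics in the example are invented for illustration):

```python
import math

def query_term_weight(tf_iq, max_tf, N, n_i):
    """Equation 2.1: augmented query term frequency scaled by idf."""
    return (0.5 + 0.5 * tf_iq / max_tf) * math.log10(N / n_i)

# A query term occurring once (so tf_iq = max_tf = 1) in a collection of
# 300 documents, 15 of which contain the term (hypothetical numbers).
print(round(query_term_weight(tf_iq=1, max_tf=1, N=300, n_i=15), 3))  # 1.301
```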
Massive and increasing volumes of electronic data are also available in Amharic, as observed in the growing number of online newspapers, websites, and digital archives in the language. The increase in the amount of electronic Amharic documents has created an increasing need for efficient information retrieval techniques. However, developing an efficient and effective Amharic retrieval system remains a challenge: there is no standard Amharic document corpus, there is no standard Amharic thesaurus to handle synonymous and related terms, and the language itself poses challenges (discussed in Section 2.3). To overcome these challenges, an information retrieval system should be developed by applying appropriate information retrieval techniques. The focus of this research is to design an Amharic IR system which integrates thesaurus based semantic compression.
2.8 Semantic compression
Semantic compression is motivated by the background performance of specialized information retrieval systems: not only the quality of results but also the processing time plays a crucial role for the end user. The latter is an effect of algorithm complexity and, naturally, of the amount of data being processed. Therefore, in contemporary IR systems based on the vector space model, it is desirable to reduce the number of dimensions (corresponding to recognized concepts) and thereby limit the amount of calculation. Semantic compression is a process through which this reduction of the dimension space occurs. The reduction entails some information loss, but in general it must not degrade the quality of results, so every possible improvement is considered in terms of its overall impact on quality. The dimension reduction is performed by introducing descriptors for viable terms; descriptors are chosen to represent a set of synonyms or hyponyms in the processed passage, and the decision is made taking into account the relations among the terms and their frequency in the context domain (Semantic compression for specialized Information Retrieval systems). Semantic compression is thus a technique that transforms a text fragment so that it has a similar meaning but uses less detailed terms, with minimization of information loss as an imperative; the idea of semantic compression was introduced by the authors in 2010 [08].
The most important idea behind semantic compression is that it reduces the number of words used to describe an idea. As a consequence, semantic compression allows one to identify a common thought in seemingly different communication vessels, and it reduces the number of word-vector dimensions involved in these methods, making them more efficient [08]. Semantic compression enables information retrieval tasks, such as text matching, to operate on a concept level rather than on the level of individual terms [21]. This can be achieved not only by gathering terms around their common meanings, but also by replacing larger phrases with more compact forms. Domain-based semantic compression yields better results than its general form, because domain frequency dictionaries better reflect language characteristics [21]. To generalize further, one can perceive semantic compression as an advanced process of creating descriptors for synsets, where more general terms are favored over less general ones. The most important features of semantic compression are listed below [11]:
Precise descriptors (reduced effect of language diversity, no language redundancy, and step towards controlled dictionary).
More compact lexicon (less computational complexity).
Synthetic output, with possibility to display as natural text.
Semantic compression is achieved in two steps [en.wikipedia.org/wiki/semantic compression], using frequency dictionaries and a semantic network (encyclopedia). The first step, determining cumulated term frequencies to identify the target lexicon, requires assembling word frequencies and information on semantic relationships, specifically hyponymy. Moving upwards in the word hierarchy, a cumulative concept frequency is calculated by adding the sum of the hyponyms' frequencies to the frequency of their hypernym:
$cf(k_i) = f(k_i) + \sum_{k_j \,\in\, hyponyms(k_i)} cf(k_j)$   (2.10)

where $k_i$ is a hypernym of $k_j$. Then, a desired number of words with the top cumulated frequencies are chosen to build the target lexicon.
In the second step, compression mapping rules are defined for the remaining words, so that every occurrence of a less frequent hyponym in the output text is handled as its hypernym. For example, the fragment of text below has been processed by semantic compression; words in bold have been replaced by their hypernym.

They are both nest building social insects, but paper wasps and honey bees organize their colonies

The procedure outputs the following text:

They are both facility building insect, but insect and honey insects arrange their biological groups
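A minimal sketch of this second step, using the wasp/bee example above (the compression rules shown are the hypernym mappings implied by that example, written out by hand):

```python
# Compression mapping rules: a less frequent hyponym -> its hypernym.
compression_rules = {
    "nest": "facility",
    "wasps": "insects",
    "bees": "insects",
    "organize": "arrange",
    "colonies": "biological groups",
}

def compress(text):
    """Replace every mapped hyponym with its hypernym descriptor,
    leaving unmapped words untouched."""
    return " ".join(compression_rules.get(w, w) for w in text.split())

print(compress("paper wasps and honey bees organize their colonies"))
# -> paper insects and honey insects arrange their biological groups
```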
2.9 Approaches to semantic compression
The approaches that have been employed for applying semantic compression in information retrieval systems are ontology-based approaches and frequency-based approaches [11]. An ontology is used to describe domain knowledge; logical reasoning and term frequencies are used to choose fitting expansion words [22].
2.9.1 Ontology Based Approach
An ontology is a means of representing knowledge characterized by multiple concepts, with the relationships between the concepts also represented, since the meaning of a text is not immediately obvious from its words or phrases [23]. Ontology-based information retrieval recognizes the relations among terms by referring to the ontology. The ontology definitions of concepts can be used to describe the concepts, and these concepts are defined as document classes. The real quality of an ontology can be assessed only through its use in a real application [22]. An ontology is a type of knowledge base that describes concepts through definitions that are sufficiently detailed to capture the semantics of a domain. An ontology captures a certain view of the world, supports intentional queries regarding the content of a database, and reflects the relevance of data by providing a declarative description of semantic information independent of the data representation [24].
Creating an ontology is not an easy task, and obviously there is no unique correct ontology for any domain [24]. Calculating term importance is a significant and fundamental aspect of representing documents in conventional information retrieval approaches; it is usually determined through term frequency-inverse document frequency (TF-IDF). When using an ontology-based representation, the usual definition of term frequency cannot be applied, because one does not operate on keywords but on concepts [25]. An ontology-based representation allows the system to use fixed-size document vectors, consisting of one component per base concept. Some terms, in particular terms related to specific domains (biomedical, mechanical, business, etc.), may be absent from the ontology because they are not defined in the machine-readable dictionary used to build the concept-based version of the documents. The general approaches to ontology-based information retrieval are the knowledge base approach and the vector space model driven approach [22]. Knowledge base approaches use reasoning mechanisms and ontological query languages to retrieve instances; these approaches focus on retrieving instances rather than documents. In the vector space model driven approach, the idea is to represent each document in a collection as a point in a space (a vector in a vector space). An ontology-based representation exploits the hierarchical is-a relation among concepts, i.e., the meanings of words [25]. For example, to describe documents containing the three words "animal", "dog", and "cat" with a term-based representation, a vector of three elements is needed; with an ontology-based representation, since "animal" subsumes both "dog" and "cat", it is possible to use a vector with only two elements, related to the "dog" and "cat" concepts, which implicitly contains the information given by the presence of the "animal" concept. Moreover, by defining an ontology base, which is a set of independent concepts that covers the whole ontology, an ontology-based representation allows the system to use fixed-size document vectors, consisting of one component per base concept. The two main improvements obtained by applying the ontology-based approach concern information redundancy and computational time [25].
Information redundancy: Approaches that expand documents and queries use correlated concepts to expand the original terms of the documents and queries. A problem with expansion is that the information is redundant, and there is no real improvement in the representation of the document (or query) content. With the ontology-based representation this redundancy is eliminated, because only independent concepts are taken into account to represent documents and queries. Another positive aspect is that the size of the vector representing document content by concepts is generally smaller than the size of the vector representing document content by terms [23]. Computational time: When IR approaches are applied in a real-world environment, the computational time needed to evaluate the match between documents and the submitted query has to be considered. It is known that systems using the vector space model have higher efficiency [23].
2.9.2 Frequency Based Approach
Frequency-based semantic compression requires assembling word frequencies and information on semantic relationships. Different techniques can be used in frequency-based semantic compression.
2.9.2.1 Term Frequency Approach
Term frequency (tf) is the relative frequency of the appearance of a keyword in a document, while document frequency (df) measures the number of documents containing the word. Term frequency measures the frequency of words appearing in a document: if a word appears with a higher frequency, the system considers the word more important than the other words in the document. Term frequency can be calculated using equation 2.2.
2.9.2.2 Term Frequency - Inverse Document Frequency
This approach uses a scoring function that scores the terms occurring in a document by their term frequency and inverse document frequency [25]. Term frequency is the relative frequency of
the appearance of the keyword in the document, while document frequency measures the number of documents containing the word. The TF-IDF approach is one of the most commonly used term weighting schemes; TF-IDF represents the word frequency in a document normalized by the domain frequency. In these methods, semantic proximity between words is computed statistically over the documents of a large collection. Document frequency (df) is the number of documents containing the given term: the more a term t occurs throughout all documents, the more poorly t discriminates between documents, and the less frequently a term appears in the whole collection, the more discriminating it is. The inverse document frequency (idf) measures whether a term is common or rare across all documents; it is obtained by dividing the total number of documents by the number of documents containing the term and taking the logarithm of that quotient. A high idf value is obtained for rare terms and a low value for common terms; it is mainly used to discriminate the importance of a term throughout the collection. The idf weight assigned to term i is given by equation 2.3, and the tf-idf weight assigned to term i in document j is given by equation 2.4.
2.9.2.3 N-gram Based Approach
N-grams are sequences of n units, or grams, where grams can be words, phonemes, characters, morphemes, syllables, etc. [18]. N-grams are formed by considering n adjacent or non-adjacent units extracted from the source. For example, in the sentence "this is the house that jack built", the word bigrams (2-words) are "this is", "is the", "the house", etc.; in the word "house" the character bigrams (2-characters) are "ho", "ou", "us", "se" and so on. The value of n can vary from 1 to many (usually not larger than 7 or 8). N-grams that are too long become almost equal to words (or to sentences, in the case of word n-grams) and hence fail to capture the similarity between different (in the morphological sense) but similar words. On the other hand, n-grams that are too short (e.g. unigrams) tend to find similarities between words that are due to factors other than semantic relatedness (e.g. the distribution of the alphabet) [18].
There are different types of n-grams, differing in method of formation and size [48]. Based on formation there are two types of n-grams: character-based n-grams and word-based n-grams. Character-based n-grams are generally used in measuring the similarity of character strings; spell checking, stemming, and OCR error correction are some of the applications which use character-based n-grams [18]. Word n-grams are sequences of n consecutive words extracted from text. Word-level n-gram models are quite robust for modeling language statistically, as well as for IR, without much dependency on the language [18]. With respect to size, we can have unigrams (1-gram), bigrams (2-grams, sometimes written as digrams in the literature), trigrams (3), tetragrams (4), pentagrams (5), etc. [47]. Unigram means the value of n is 1 (n=1), which means sentences are segmented into the single characters or words that are used for indexing. A unigram model used in information retrieval can be treated as the combination of several one-state finite automata. It splits the probabilities of the different terms in a context, e.g. from $P(t_1 t_2 t_3) = P(t_1)P(t_2 \mid t_1)P(t_3 \mid t_1 t_2)$ to $P_{uni}(t_1 t_2 t_3) = P(t_1)P(t_2)P(t_3)$.
In this model, the probability of hitting each word depends only on that word, so we have only one-state finite automata as units; for each automaton, there is only one way to hit its only state, assigned one probability. A bigram or digram is every sequence of two adjacent elements in a string of tokens, which are typically letters, syllables, or words; bigrams are n-grams for n=2. The frequency distribution of bigrams in a string is commonly used for simple statistical analysis of text in many applications, including computational linguistics, cryptography, and speech recognition. Bigrams provide the conditional probability of a token given the preceding token, when the relation of conditional probability is applied:

$P(W_n \mid W_{n-1}) = \frac{P(W_{n-1}, W_n)}{P(W_{n-1})}$
That is, the probability P of a token Wn given the preceding token Wn-1 is equal to the probability of their bigram, i.e. the co-occurrence of the two tokens P(Wn-1, Wn), divided by the probability of the preceding token. A trigram follows the same idea as a bigram, only with n = 3. Example: in a bigram (n=2) language model, the probability of the sentence "I saw the red house" is approximated as

P(I, saw, the, red, house) ≈ P(I|&lt;s&gt;) P(saw|I) P(the|saw) P(red|the) P(house|red) P(&lt;/s&gt;|house)

whereas in a trigram (n=3) language model, the approximation is

P(I, saw, the, red, house) ≈ P(I|&lt;s&gt;,&lt;s&gt;) P(saw|&lt;s&gt;,I) P(the|I,saw) P(red|saw,the) P(house|the,red) P(&lt;/s&gt;|red,house)
Note that the context of the first n-1 n-grams is filled with start-of-sentence markers, typically denoted &lt;s&gt;. For example, for the sentence "this is a sentence" the result will be:
Unigrams (n=1): this, is, a, sentence
Bigrams (n=2): this is, is a,
a sentence
Trigrams (n=3): this is a, is a sentence
N-grams are sequences of characters or words extracted from a text [26]. The main motivation behind this approach is that similar words have a high proportion of n-grams in common.
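A small sketch of n-gram formation over both characters and words (an illustrative helper, not taken from the study's implementation):

```python
def ngrams(units, n):
    """All n-grams over a sequence: substrings of a word when given a
    string, sub-sequences of words when given a word list."""
    return [units[i:i + n] for i in range(len(units) - n + 1)]

print(ngrams("house", 2))
# ['ho', 'ou', 'us', 'se'] -- character bigrams
print(ngrams("this is the house that jack built".split(), 2))
# [['this', 'is'], ['is', 'the'], ['the', 'house'], ...] -- word bigrams
```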
2.9.2.4 Thesaurus Based Approach
Thesauri are valuable structures for information retrieval systems. A thesaurus provides a precise and controlled vocabulary which serves to coordinate document indexing and document retrieval. In both indexing and retrieval, a thesaurus may be used to select the most appropriate terms. When used during retrieval and searching, thesauri are useful in bridging the gap that exists between the metadata provided by the indexer and the concepts presented by the searcher. The controlled vocabulary limits the terms available and increases the possibility that the query will use appropriate terms. The thesaurus is structured as a set of relationships which help the searcher navigate through the metadata and find an appropriate query expression.

Manual Thesaurus Construction: The process of manually constructing a thesaurus is both an art and a science; only a brief overview of this complex process is presented here. First, one has to define the boundaries of the subject area. (In automatic construction this step is simple, since the boundaries are taken to be those of the area covered by the document database.) Boundary definition includes identifying central subject areas and peripheral ones, since it is unlikely that all topics included are of equal importance. Once this is completed, the domain is generally partitioned into divisions or subareas. Once the domain, with its subareas, has been sufficiently defined, the desired characteristics of the thesaurus have to be identified. Then the collection of terms for each subarea may begin. A variety of sources may be used for this, including indexes, encyclopedias, handbooks, textbooks, journal titles and abstracts, catalogues, as well as any existing and relevant thesauri or vocabulary systems. Subject experts
and potential users of the thesaurus should also be included in this step. After the initial vocabulary has been identified, each term is analyzed for its related vocabulary, including synonyms, broader and narrower terms, and sometimes also definitions and scope notes. These terms and their relationships are then organized into structures such as hierarchies, possibly within each subarea. The process of organizing the vocabulary may reveal gaps which can lead to the addition of terms; identify the need for new levels in the hierarchies; bring together synonyms that were not previously recognized; suggest new relationships between terms; and reduce the vocabulary size. Once the initial organization has been completed, the entire thesaurus has to be reviewed (and refined) to check for consistency, such as in phrase form and word form. Special problems arise in incorporating terms from existing thesauri, which may, for instance, have different formats and construction rules. At this stage the hierarchically structured thesaurus has to be "inverted" to produce an alphabetical arrangement of entries, a more effective arrangement for use; typically both the alphabetical and hierarchical arrangements are provided in a thesaurus. Following this, the manually generated thesaurus is ready to be tested by subject experts and edited to incorporate their suggestions.

Automatic Thesaurus Construction: In selecting automatic thesaurus construction approaches for discussion here, the criteria used are that they should be quite different from each other, in addition to being interesting. Also, they should use purely statistical techniques (the alternative is to use linguistic methods). Consequently, the two major approaches selected here have not necessarily received equal attention in the literature. The first approach, designing thesauri from document collections, is a standard one. The second, merging existing thesauri, is better known using manual methods.
i. From a Collection of Document Items
Here the idea is to use a collection of documents as the source for thesaurus construction. This assumes that a representative body of text is available. The idea is to apply statistical procedures to identify important terms as well as their significant relationships. It is reiterated here that the central thesis in applying statistical methods is to use computationally simple methods to identify the more important semantic knowledge for thesauri, since it is semantic knowledge that is used by both indexer and searcher. Until more direct methods are known, statistical methods will continue to be used.
ii. By Merging Existing Thesauri
This second approach is appropriate when two or more thesauri exist for a given subject and need to be merged into a single unit. If a new database can indeed be served by merging two or more existing thesauri, then a merger is likely to be more efficient than producing the thesaurus from scratch. The challenge is that the merger should not violate the integrity of any component thesaurus.

Cross-references and relations between descriptors used to build a thesaurus: Seven types of cross-references are used [49]: Scope Note (SN), Use For (UF) and Use (USE) references, Narrower Terms (NT), Broader Terms (BT), Related Terms (RT) and Parenthetical Qualifiers.

Scope Note (SN) = a brief statement of the intended usage of a descriptor. It may be used to clarify an ambiguous term or to restrict the usage of a term. Example: INFORMATION RETRIEVAL SN Techniques used to recover specific information from large quantities of stored data.

Use For (UF) and USE = terms considered to be equivalent (equal or almost equal in meaning) are combined into an equivalence category, so that equivalent expressions match only one term. Equivalence relations direct synonyms and pseudo-synonyms of a specific term to the appropriate descriptor. The UF reference is employed generally to solve problems of synonymy occurring in natural language. Terms following the UF notation are not used in indexing; they most often represent either synonymous or variant forms of the main term, or specific terms that, for purposes of storage and retrieval, are indexed under a more general term. Years listed in parentheses indicate the time period during which the term was used in indexing; this provides useful information for searching older printed indexes, or computer files that have not been updated. Example: BIBLIOGRAPHIC DATABASES
UF Bibliographic Records (2004); Bibliographic Utilities (2004)

The USE reference, the mandatory reciprocal of the UF, refers an indexer or searcher from a non-usable (non-indexable) term to the preferred indexable term or terms. Example: KINESCOPES USE Films.

Narrower Terms (NT) and Broader Terms (BT) = these indicate the existence of a hierarchical relationship between a class and its subclasses. In a hierarchical relation, one term is viewed as being "above" another term because it is broader in scope. Narrower terms are included in the broader class represented by the main entry. The Broader Term (BT) is the mandatory reciprocal of the NT; broader terms include as a subclass the concept represented by the main (narrower) term. Example: SCHOOL CULTURE BT Culture; Organizational Culture. Example: RECREATIONAL ACTIVITIES NT Playground Activities; Recreational Reading.

Related Terms (RT) = associative relations express an analogy (not equivalence) between concepts. These relations are used for non-hierarchical semantic relations in the thesaurus. Example: ALCOHOLISM RT Addictive Behavior; Alcohol Education; Antisocial Behavior; Behavior Disorders; Drug Addiction; Fetal Alcohol Syndrome; Physical Health; Special Health Problems.

Parenthetical Qualifiers = a parenthetical qualifier is used to identify a particular indexable meaning of a homograph; in other words, it discriminates between terms (either descriptors or USE references) that might otherwise be confused with each other. Examples include LETTERS (ALPHABET) and LETTERS (CORRESPONDENCE). The qualifier is considered an integral part of the descriptor and must be used with the descriptor in indexing and searching.
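These cross-references suggest a straightforward data representation. The following sketch is a hypothetical structure; only the UF and USE values come from the examples above, and the scope note is a placeholder:

```python
# Hypothetical thesaurus entries mirroring the cross-reference types.
thesaurus = {
    "BIBLIOGRAPHIC DATABASES": {
        "SN": "placeholder scope note",
        "UF": ["Bibliographic Records (2004)", "Bibliographic Utilities (2004)"],
    },
    "KINESCOPES": {"USE": ["Films"]},
}

def preferred_term(term):
    """Follow a USE reference from a non-indexable term to its descriptor."""
    entry = thesaurus.get(term, {})
    return entry["USE"][0] if "USE" in entry else term

print(preferred_term("KINESCOPES"))  # Films
```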
In this research, a thesaurus-based approach is proposed to handle synonymous (related) terms and to retrieve relevant information by allowing users to enter any form of their queries, which ultimately improves searching.
2.10 Review of Related Works
Accuracy in information retrieval, that is, achieving both high recall and precision, is challenging because the relationship between natural language and semantic conceptual structure is not straightforward. To address this challenge, different researchers have conducted studies in this area.

Abu Salem (1992) studied IR in the Arabic language. His study was based on 120 documents he received from the Saudi Arabian National Computer Conference and on 32 queries. In his research he studied indexing using full words and using roots only, and he found that using roots is superior to the other approaches. He also built a manual thesaurus using the relations between expressions to test the possibility of supporting an IRS through this thesaurus, and he found that the thesaurus makes IR much better.

Kanaan and Wedyan (2006) based their study on 242 documents received from the Saudi Arabian National Computer Conference and on 24 queries. They studied indexing using full words and using roots, and found that using roots is superior. They also built an automatic thesaurus using the relations between expressions to test the possibility of supporting an IRS through a thesaurus, and found that the thesaurus improves IR by between 1% and 10%.

Mohammad, Basim and Adnan (2012) aimed at achieving Arabic information retrieval system effectiveness by applying a successful thesaurus. The study was based on 500 documents and 35 queries; the documents were given to a group of students with certain links to the subjects, who determined the relevant documents for each query. Based on the students' judgments, the results were analyzed using the criteria of precision and recall. The average recall-precision was calculated, and the results show that the use of an automatic thesaurus improves the information retrieval system by 10-20%.
Jose R, Perez A, Lourdes A (2006) proposed applying a combination of techniques: first a co-occurrence measure, and then its type (equivalence, hierarchy or associative), is used to detect the relationship between terms. First of all, the researchers perform a selection of terms, called the core set, from a text collection concerning the intended thesaurus domain. In this phase they apply linguistic pre-processing, which consists of POS tagging that allows selecting only words of the noun category, stemming, and elimination of stop words. They apply TF-IDF to the candidate words in order to obtain the initial list of thesaurus terms. Next, EUROVOC (which contains concepts on the activity of the European Union), SPINES (a controlled and structured vocabulary for information processing in the field of science and technology for development) and ISOC (a thesaurus aimed at the treatment of information on economy) are used in the process of generating a union thesaurus. Terms which appear both in the core set and in any source thesaurus form the term list of the union thesaurus. Furthermore, the relationships among the terms included in the new thesaurus are provided by the source thesauri: if a pair of terms to be related appears in some of the source thesauri, this indicates the kind of relationship; if the terms do not appear in the source thesauri, their possible relationship has to be investigated. The first step is to detect any kind of relationship, and in a second step the type of the detected relationship is identified. For the extraction of semantic relations between terms they chose the statistical method of the vector space model: a vector of features is defined for each term from the documents in which it appears, and the values of this vector are estimated by counting the co-occurrence of the terms in the documents. The researchers use the cosine similarity measure to determine the similarity of terms. Once the pairs of terms for which the semantic similarity is significant enough (above a threshold value of 0.3) are determined, the type of the relationship among these pairs of terms is determined. For the evaluation of the system the researchers use the trec-eval package, with the measures of precision and recall, and as a test set a total of 50 queries extracted from the batteries provided by CLEF in 2001. Finally, the researchers achieved a 9.47% improvement in precision and 9.99% in recall, and the index size also decreased (from 352,534 to 321,612), which led to a significant decrease in execution time. The researchers mention that they used highly frequent terms as stop words, but that by considering the degree of relationship with other words of the intended domain they could achieve a better result. Finally, the researchers plan to improve the different phases of this process and to
apply a more exhaustive linguistic analysis for the identification of semantic relationships, as well as using WordNet as another source of information for thesaurus generation.

Ceglarek (2010) conducted an experiment to check whether the number of vector space dimensions can be reduced significantly without deterioration of results. The experiment was based on SenecaNet, a semantic net, and the technique used was term frequency. Two sample sets of documents from different categories were subjected to a clustering procedure; to verify the results, all documents were initially labeled with a category. The clustering procedure was performed eight times: the first run was without semantic compression methods, with all identified concepts included, and then the semantic compression algorithm was used to gradually reduce the number of concepts, starting with 12,000 and proceeding with 10,000, 8,000, 6,000, 2,000 and 1,000. The experiment indicates that the semantic compression algorithm can be employed in classification tasks to significantly reduce the number of concepts and the corresponding vector dimensions.

Darius (2011) used domain-based semantic compression, which combines data from two sources, term frequencies from a frequency dictionary and a concept hierarchy from a semantic network, and studied the differences between it and its general version. Experiments conducted by the authors confirm that domain-based semantic compression yields better results than its general form, because a domain frequency dictionary better reflects language characteristics. Semantic compression can find application in intellectual property protection systems such as SEIPro2S, implemented for Polish; this kind of system focuses on the meaning of documents trespassing corporate boundaries, and in knowledge-oriented societies, where the majority of revenue is an outcome of knowledge application, such a system is an invaluable asset (Dariusz Ceglarek 2010). For semantic compression for English, one has to use a solution with capabilities similar to those of SenecaNet; building a new semantic net for English is a great effort surpassing the authors' abilities, so they turned to existing solutions. WordNet proved to be an excellent resource, having been applied by numerous research teams to a great number of tasks with good results.

Haimanot [11] conducted research on integrating semantic compression into Amharic information retrieval to enhance its performance. Since there are usually many ways to express the same concept, the terms in the user's query may not appear in a relevant document.
Alternatively, many words can have more than one meaning, which may confuse the retrieval system. The Amharic text retrieval system developed in that study has indexing and searching parts; the index file structure used is an inverted index. An Amharic text document corpus was prepared by the researcher, encompassing different news articles and other sources, and various text preprocessing techniques, including tokenization, normalization, stop word removal and stemming, were used to identify content-bearing words. A frequency-based semantic compression method was then applied to compress Amharic words before indexing. Term frequency (TF)-based term weighting was used to calculate the frequency of terms in the documents, and statistical co-occurrence was used to calculate the co-occurrence frequency of terms: if the co-occurrence of two terms is greater than the threshold value of 0.8, the more frequent term replaces the less frequent one. The main aim of Haimanot's study was to handle related words in the document in order to reduce the number of index terms and to find a semantic representation, that is, a more precise descriptor for each term. The experiment showed an average of 55.4% precision and 59.2% recall, meaning recall and precision improved by 0.112 and 0.109, respectively. However, a better precision and recall was not registered in this experiment; hence further study is required to achieve a better result. To the researcher's knowledge, however, no work has been done to apply thesaurus-based semantic compression to enhance the performance of Amharic information retrieval systems. Hence this study contributes by characterizing documents and queries with their semantic representation.
CHAPTER THREE
METHODS AND TECHNIQUES
In this study an attempt is made to apply semantic compression to enhance the performance of Amharic information retrieval. The development of a semantic compression based information retrieval system involves various techniques and methods. This chapter presents the proposed semantic compression based information retrieval system, the techniques and algorithms used to achieve the required results, and finally the evaluation techniques used in this study.
3.1 Architecture of Semantic Compression
The information retrieval system designed involves two main components: indexing and searching. The basic architecture of the proposed system is depicted below in Figure 3.1. Given an Amharic text corpus, the IR system organizes the corpus into an index file to enhance searching. The first step is tokenization of the text to identify a stream of tokens (or terms). This is followed by normalization in order to remove punctuation marks. Each normalized token is checked against the stop-word list, and the content-bearing terms (non-stop words) are stemmed. For every stemmed token the respective weight is calculated, the co-occurrence of terms is calculated so that the more frequent term replaces the less frequent one, and finally each term is replaced by its descriptor based on the thesaurus; then the inverted index file is constructed. On the searching side, the same text pre-processing (tokenization, normalization, stop word removal, stemming, and semantic compression) is applied as in the indexing part, and the query terms are mapped to their synonym descriptors. Then a similarity measurement technique is used to retrieve and rank relevant text documents.
[Figure 3.1: The design of the prototype semantic compression Amharic text retrieval system. The diagram shows two parallel pipelines: indexing (text document corpus, tokenization, normalization, stop word removal, stemming, thesaurus mapping of variant terms to preferred terms, index file) and searching (user's query, tokenization, normalization, stop word removal, stemming, thesaurus mapping of variant terms to preferred terms, similarity measurement, ranking), supported by a stop-word list, prefix and suffix lists, and an Amharic dictionary (የአማርኛ መዝገበ ቃላት).]
3.2 Dataset Preparation and Document Pre-processing
A corpus must be used in the evaluation of the IR system. The document corpus for this study was organized from different news articles and other online resources, and Haimanot's corpus was also used. The algorithms used for the indexing process are discussed next; all except the semantic compression algorithm are adapted from Haimanot's research [11].
3.2.1 Tokenization
Tokenization is the process of chopping character streams into tokens, while linguistic preprocessing then deals with building equivalence classes of tokens, which are the set of terms that are indexed. Tokenization in this work is also used for splitting a document into tokens and detaching certain characters such as punctuation marks. Given a character sequence and a defined document unit, tokenization is the task of chopping it up into pieces, called tokens, perhaps at the same time throwing away certain characters, such as punctuation. In this study, words are taken as tokens. All punctuation marks, control characters, numbers and special characters are removed from the text before the data is processed. All punctuation marks are converted to spaces, and the space is used as word demarcation; hence, if a sequence of characters is followed by a space, that sequence is identified as a word. A consecutive sequence of valid characters is recognized as a word in the tokenization process.
Algorithm 3.2.1: Tokenization
1. Initialize the variable that holds the current word
2. Read a character from the sentence (document)
3. Check whether the character is one of the Amharic delimiters (punctuation mark, space, tab, carriage return, or line feed)
4. If not, concatenate the character to the variable
5. Else, if the accumulated word is longer than one character, report the word and re-initialize the variable
6. If there is more data to process, go to step 2
The identified words are also checked for whether they contain a number, because words such as 1ኛ (1st), በ20ኛው (during the 20th), የ1977ቱ (the 1977), etc. are not considered for indexing. Besides identifying individual words, the algorithm was made to
produce the frequency count of each word in each document. Finally, unique words together with their frequency were stored in a text file.
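A compact Python rendering of Algorithm 3.2.1 follows (a sketch; the exact delimiter set of the actual program is not reproduced in the thesis, so the character class below is an assumption):

```python
import re
from collections import Counter

# Assumed delimiters: whitespace, Ethiopic punctuation and common symbols.
DELIMITERS = r"[\s\u1361-\u1368,.;:!?\"'()\[\]{}]+"

def tokenize(text):
    """Split text into words, dropping one-character fragments
    and tokens containing digits (e.g. '1ኛ', 'በ20ኛው')."""
    tokens = re.split(DELIMITERS, text)
    return [t for t in tokens
            if len(t) > 1 and not any(ch.isdigit() for ch in t)]

words = tokenize("ኢትዮጵያ ታሪክ፣ በ20ኛው ክፍለ ዘመን።")
print(Counter(words))  # unique words with their frequency counts
```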
3.2.2 Normalization
The aim of normalization is to come up with a set of tokens that reduces the overall number of items. The most common types of normalization are case folding and stemming (reducing inflected words to their stem or root form). Case folding is easy in English, but can be problematic in some languages. In the Visual Ge'ez font, as is also common with other Amharic fonts, the upper case and lower case of the same Latin letter represent two different symbols (orders) of the Amharic Fidel. For example, 'B' is the character used to represent "ብ" (sixth order), whereas 'b' is used to represent "በ" (first order). Therefore, no case conversion was done as part of preprocessing. The motivation for normalization is the observation that many different strings of characters often convey essentially identical meanings with different symbols. Given that we want to get at the meaning that underlies the words, it seems reasonable to normalize superficial variations by converting them to the same form. For example, ሠ and ሰ have the same sound (se), so the former is converted to ሰ. As pointed out in section 2.6.4, the presence of several Fidels (symbols) with the same sound in the Amharic writing system creates a problem in text retrieval: the same word written using different symbols (letters) with the same sound will be considered as different words by the retrieval system. In this research, choosing one letter for each group of letters with the same sound and replacing the remaining ones is taken as the solution to this problem. Therefore, if a character is any of ሓ, ኃ, ኻ, ሃ, ሐ or ኀ (all with the sound 'h'), it is replaced by ሀ, and the different orders of ሐ and ኀ are changed to their corresponding equivalent orders of ሀ. Similarly, all orders of ሠ (with the sound 's') are changed to their corresponding equivalent orders of ሰ, all orders of ዐ (with the sound 'a') are changed to their corresponding equivalent orders of አ, and all orders of ፀ (with the sound 'tse') are changed to their corresponding equivalent orders of ጸ.
Algorithm 3.2.2: Normalization
The algorithm is as follows (applied to each of the seven orders):
1. Read a character from the tokenized file
2. If the character is any of ሃ/ኅ/ሐ/ኃ/ሓ or any order thereof, change it to ሀ
   Else if it is ሠ or any order thereof, change it to ሰ
   Else if it is ዐ/ዓ/ኣ or any order thereof, change it to አ
3. If the character that follows is a diacritic marking, attach it to the changed base character
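In Python the substitution can be expressed as a simple character mapping. The sketch below lists only the first orders of each letter, whereas the actual normalizer covers all seven orders:

```python
# Canonical characters for homophonous Fidels (first orders only).
CANONICAL = {
    "ሐ": "ሀ", "ኀ": "ሀ", "ሃ": "ሀ", "ኃ": "ሀ", "ሓ": "ሀ",
    "ሠ": "ሰ",
    "ዐ": "አ", "ዓ": "አ", "ኣ": "አ",
    "ፀ": "ጸ",
}

def normalize(token):
    """Replace each character that has homophones with the chosen one."""
    return "".join(CANONICAL.get(ch, ch) for ch in token)

print(normalize("ሠላም"))  # ሰላም
```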
3.2.3 Dropping common terms
Not all terms in a text document are equally important for retrieving relevant documents from a collection. Some extremely common words, which appear to be of little value in helping select documents matching a user need, are excluded from the vocabulary entirely; these words are called stop words. The general strategy for determining a stop list is to sort the terms by collection frequency (the total number of times each term appears in the document collection), and then to take the most frequent terms, often hand-filtered for their semantic content relative to the domain of the documents being indexed, as a stop list, whose members are then discarded during indexing. It is common practice to remove non-content-bearing terms, which are not useful for identifying specific portions of the document collection, from the index term list. Mostly, non-content-bearing terms are removed either by removing high frequency terms from the index term list or by using a negative dictionary (stop word list) for the language [11]. We filtered the sense examples with a stop-word list to ensure that only content-bearing words are included. In this research the stop word list is taken from Haimanot [11]; see Table 3.2 for a sample list of Amharic stop-words. Stop word elimination thus reduces the size of the indexing structures.
Table 3.2 Sample list of stop words
ነዉ      ስለ      እንደገለጹት     ብቻ
ናቸዉ     ቢሆን     ገለጹ          ብዛት
ተገለጸ     ብለዋል    አስታዉቀዋል    ብዙ
አስታወቀ    በርካታ    ጠቅሰዉ        ቦታ
Algorithm 3.2.3: Removing stop words
1. Open the stop word list
2. While the end of the normalized file is not reached do
     Read terms
     For each term in the file
       If the term is in the stop word list then
         Remove the term
       End if
     End for
3. End while
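The corresponding Python sketch, seeded with a few entries from Table 3.2 (the example tokens are hypothetical):

```python
# Sample entries from the stop-word list in Table 3.2.
STOP_WORDS = {"ነዉ", "ስለ", "ብቻ", "ናቸዉ", "ቢሆን", "ብዙ", "በርካታ", "ቦታ"}

def remove_stop_words(tokens):
    """Drop every token that appears in the stop-word list."""
    return [t for t in tokens if t not in STOP_WORDS]

print(remove_stop_words(["ስለ", "ኢትዮጵያ", "ብዙ", "ታሪክ"]))
# ['ኢትዮጵያ', 'ታሪክ']
```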
3.2.4 Stemming
Stemming is the process of reducing inflected or derived words to their stem, base or root form. The stem need not be identical to the morphological root of the word; it is usually sufficient that related words map to the same stem, even if this stem is not in itself a valid root. The process of stemming, often called conflation, is useful in search engines for query expansion or indexing, and in other natural language processing problems. In information retrieval, the relationship between a query and a document is determined primarily by the number and frequency of the terms which they have in common. Unfortunately, words have many morphological variants which will not be recognized by term-matching algorithms without some form of natural language processing. In most cases, these variants have similar semantic interpretations and can be treated as equivalent for information retrieval (as opposed to linguistic) applications. For this study the following algorithm is used to stem terms:
Algorithm 3.2.4: Stemming
1. While the end of the stop-word-removed list is not reached do
     Read terms
     For each term in the file
       If the term starts with a prefix and the term is not in the exception list then
         Remove the prefix
       End if
       If the term ends with a suffix and the term is not in the exception list then
         Remove the suffix
       End if
     End for
2. End while
3. Close files
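A Python sketch of this affix-stripping logic follows; the prefix, suffix and exception lists shown are short illustrative samples, not the actual lists used by the adapted stemmer:

```python
# Illustrative affix lists (the real stemmer uses much longer ones).
PREFIXES = ["የ", "በ", "ለ", "እንደ"]
SUFFIXES = ["ዎች", "ም", "ን", "ኛ"]
EXCEPTIONS = set()  # words that must not be stripped

def stem(term):
    """Strip one matching prefix and one matching suffix,
    keeping at least two characters of the word."""
    if term in EXCEPTIONS:
        return term
    for p in PREFIXES:
        if term.startswith(p) and len(term) > len(p) + 1:
            term = term[len(p):]
            break
    for s in SUFFIXES:
        if term.endswith(s) and len(term) > len(s) + 1:
            term = term[:-len(s)]
            break
    return term

print(stem("የተማሪዎች"))  # ተማሪ (prefix የ and suffix ዎች removed)
```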
3.3 Term Weighting
Not all the terms included in the index list are equally important in reflecting the content of a specific text [05]; an importance indicator, or term weight, should therefore be associated with each term. Term weighting is important for selecting significant terms. One of the widely used term weighting techniques is TF*IDF, which is calculated using equation 3.1. The numerical weighting was done with a Python program: the term frequencies generated in the first phase were saved into a text file, and then the highest-frequency term list was selected. The basic rationale behind weighting is that a term has a high weight if it is frequent in the relevant documents but infrequent in the document collection as a whole [1].
$w_{ij} = TF_{ij} \times \log\frac{N}{DF_i}$   (3.1)

where TF is the frequency of each term in the respective document, N is the total number of documents in the collection, and DF is the number of documents that contain the given term. Since queries are short documents with few terms, the TF*IDF weighting formula of equation 2.1 is used for query terms [11].
3.4 Semantic Compression
A thesaurus is used to handle the problem of synonymy by representing a group of synonym terms by a single group identifier [28]. For example, if a query contains the word "በሽታ", documents that contain the words "ደዌ" and "ህመም" are likely to be relevant as well, and only one of these terms is used for indexing; hence, the query needs to be expanded by including the synonym words. A thesaurus is a valuable structure for information retrieval systems: it provides a precise and controlled vocabulary which serves to coordinate document indexing and document retrieval. In both indexing and retrieval, a thesaurus may be used to select the most appropriate terms; additionally, the thesaurus can assist the searcher in reformulating search strategies if required. Thesauri define semantic relationships between index terms [33]. The three main relationships are equivalence (equivalent terms), hierarchical (broader/narrower terms: BTs/NTs), and associative (related terms: RTs). In this study the researcher developed a thesaurus that handles only synonyms of terms, called an equivalent-term thesaurus.

Procedures in manual thesaurus construction:
1. Take terms from the text document corpus
2. Identify the desired characteristics of the terms: only synonyms, related terms (broader terms and narrower terms) and/or scope notes
3. Analyze each term for its related vocabulary, including synonyms, broader (BT) and narrower (NT) terms, and definitions and scope notes, based on [31]
4. Organize these terms and their relationships into structures such as hierarchies, equivalence relationships and associative relations (like BT and NT). This step helps in identifying additional related terms and new relationships between terms, and brings together synonyms that were not previously recognized.
5. Review the entire thesaurus for consistency of word forms.
6. “Invert” the hierarchical thesaurus structure into an alphabetical arrangement of entries.

After this, the following algorithm is executed to select the index terms used in this retrieval system.
Algorithm 3.4: Semantic compression
Open thesaurus
While not end of the inverted index file do
  Read terms
  For each term in the file
    If the term is a keyword (a descriptor) in the thesaurus then
      Keep the term as it is
    Else
      Find its keyword (descriptor) in the thesaurus and replace the term
      in the inverted index file with that keyword (descriptor)
    End if
  End for
End while
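A compact Python rendering of Algorithm 3.4, assuming the thesaurus is held in memory as a dictionary mapping each synonym to its descriptor; the sample entries follow Table 4.1.

    # Thesaurus stored as synonym -> descriptor (key term).
    synonym_to_key = {
        "ተግባር": "ስራ", "ክንውን": "ስራ", "ድርጊት": "ስራ",
        "በሽታ": "ህመም", "ደዌ": "ህመም", "ስቃይ": "ህመም",
    }

    def compress(term):
        # Descriptors and unknown terms pass through unchanged;
        # synonyms are replaced by their descriptor.
        return synonym_to_key.get(term, term)

    index_terms = ["ተግባር", "በሽታ", "መሬት"]
    print([compress(t) for t in index_terms])  # ['ስራ', 'ህመም', 'መሬት']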
3.5 Inverted Index

An inverted file is a data structure for efficiently indexing texts by their words. One can view an inverted file as a list of words where each word is followed by the identifier of every text that contains the word. The number of occurrences of each word in a text is also stored in this structure. Although developers can choose a data structure of their choice, what has to be included in the data structure is predetermined by the technique. In this study an inverted index is used for indexing.
The major steps in creating an inverted index are: first, collecting the documents to be indexed; second, applying text operations such as tokenization, normalization, stop word removal and stemming, and identifying the list of tokens to be indexed [11]. At this stage language preprocessing should be done. For each term, the term frequency, collection frequency and document frequency are calculated. Finally, the index file is created following the principle of inverted index file construction. The index file is further split into two files: a vocabulary file and a postings file. The inverted file allows an IR system to quickly determine which documents contain a given set of words and how often each word appears in each document. Other information can also be stored in the inverted file, such as the location of each word in the text.
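The following Python sketch shows one possible realization of these steps, building the vocabulary (term and document frequency) and the postings (document identifier and term frequency) in memory. The on-disk file layout is an assumption, not the study's actual format.

    from collections import Counter, defaultdict

    def build_inverted_index(docs):
        # docs: mapping of document id -> list of preprocessed tokens.
        postings = defaultdict(dict)  # term -> {doc_id: term frequency}
        for doc_id, tokens in docs.items():
            for term, tf in Counter(tokens).items():
                postings[term][doc_id] = tf
        # Vocabulary file contents: term -> document frequency.
        vocabulary = {term: len(plist) for term, plist in postings.items()}
        return vocabulary, postings

    docs = {"Doc1": ["ህመም", "ስራ", "ህመም"], "Doc2": ["ስራ"]}
    vocab, post = build_inverted_index(docs)
    print(vocab["ህመም"], post["ስራ"])  # 1 {'Doc1': 1, 'Doc2': 1}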
3.6 Evaluation Technique

There are two ways of measuring IR success: precision and recall [13]. Precision is the ratio of relevant items retrieved to all items retrieved, or the probability that a retrieved item is relevant. Recall, on the other hand, is the ratio of relevant items retrieved to all relevant items in the corpus, or the probability that a relevant item is retrieved. There is always a trade-off between precision and recall. If every document in the collection is retrieved, all relevant documents will obviously be retrieved, so recall will be high. On the contrary, when only a small proportion of the retrieved documents is relevant to a given query, retrieving everything reduces precision (even towards zero). Higher scores in both recall and precision mean higher performance of the system [13].

Precision is the number of relevant documents a search retrieves divided by the total number of documents retrieved; in other words, it is the fraction of the retrieved documents that are relevant to the user's information need. To illustrate these metrics, suppose we have a document collection D, a query Q and a retrieval system S. Out of the documents in the collection, let us assume that R of them are relevant to query Q; the relevance of a document to a query is often determined by domain experts. Finally, let us suppose that system S retrieves a set of n documents for Q and that r of the n retrieved documents are relevant to Q. The precision and recall for Q and D are then defined as:
Precision = r / n                                                   (3.2)

Recall is the number of relevant documents retrieved divided by the total number of existing relevant documents that should have been retrieved. It is the fraction of the documents relevant to the query that are successfully retrieved.
Recall = r / R                                                      (3.3)

Users look for their needs to be satisfied; they do not need to get every document in the corpus. A further issue is that relevant documents often exist redundantly. Besides the trade-off between precision and recall, some users prefer higher recall while others prefer higher precision; so, rather than considering precision and recall separately, it is better to specify the relative importance of both. When there is more than one query, an average recall (Ravg) and an average precision (Pavg) can be computed using equations 2.8 and 2.9 [28]. The above evaluation techniques are used for measuring the effectiveness of the semantic compression based information retrieval system in our experiment.
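A direct Python rendering of equations 3.2 and 3.3, together with the per-query averaging, might look as follows; the per-query result triples here are hypothetical.

    def precision(r, n):
        # r: relevant documents retrieved; n: documents retrieved.
        return r / n if n else 0.0

    def recall(r, R):
        # r: relevant documents retrieved; R: relevant documents in corpus.
        return r / R if R else 0.0

    # Hypothetical (r, n, R) triples for two queries.
    results = [(8, 10, 12), (5, 9, 8)]
    p_avg = sum(precision(r, n) for r, n, R in results) / len(results)
    r_avg = sum(recall(r, R) for r, n, R in results) / len(results)
    print(round(p_avg, 3), round(r_avg, 3))  # 0.678 0.646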
CHAPTER FOUR
EXPERIMENTATION AND DISCUSSION

In this study thesaurus based semantic compression is integrated and experimented with for the purpose of enhancing the performance of Amharic text retrieval. For the experiment, 300 Amharic news articles were used as the test set. 100 of the documents were also used by Haimanot [11]; the remaining 200 documents were collected from news articles, websites and books. Each document is kept in a separate text file written in the VG2 Main font under a common folder. In addition, 10 queries were set up by the researcher, and each document is marked as either relevant or irrelevant to each query to enable relevance evaluation. The main purpose of having identified queries is to evaluate the performance of the system. These queries were selected subjectively by the researcher after manually reviewing the content of each article. The reason is that, to the researcher's knowledge, there is no standard established test collection for Amharic information retrieval testing; experiments in Amharic IR therefore usually make use of sets of documents and queries set up by the researchers themselves.
4.1 Construction of the Thesaurus
In this research a manually constructed thesaurus is developed based on the Amharic Dictionary (የአማርኛ መዝገበ ቃላት) prepared by the Amharic language research center at Addis Ababa University in 2007 [31]. To construct the thesaurus, the frequency of each term in the document collection is automatically calculated; then, from each group of synonym terms, the one with the highest frequency is used as the key term in the thesaurus. Table 4.1 presents some of the synonym terms in Amharic language.
Table 4.1 Amharic synonymous terms and their representative key terms

Key term     Synonym terms
ስራ           “ተግባር”, “ክንውን”, “ድርጊት”
ክልክል         “ግዝት”, “እርም”, “ሀራም”
ህመም          “በሽታ”, “ደዌ”, “ስቃይ”
መንግስት        “ህዝብ”, “ማህበረሰብ”, “ህብረተሰብ”, “ወገን”, “ሀገር”
ድብቅ          “ስውር”, “የማይታይ”, “ሚስጥር”, “ህቡእ”
ሀሩር          “ቃጠሎ”, “በርሃ”, “ሙቀት”
ምክንያት        “መንስኤ”
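The frequency-based selection of a representative key term for each synonym group can be sketched in Python as follows; the corpus frequencies shown are hypothetical.

    def pick_key_terms(synonym_groups, corpus_freq):
        # The most frequent member of each group becomes the key term;
        # the remaining members map to it in the thesaurus.
        thesaurus = {}
        for group in synonym_groups:
            key = max(group, key=lambda t: corpus_freq.get(t, 0))
            for term in group:
                if term != key:
                    thesaurus[term] = key
        return thesaurus

    groups = [["ህመም", "በሽታ", "ደዌ", "ስቃይ"]]
    freq = {"ህመም": 42, "በሽታ": 30, "ደዌ": 5, "ስቃይ": 12}  # hypothetical counts
    print(pick_key_terms(groups, freq))
    # {'በሽታ': 'ህመም', 'ደዌ': 'ህመም', 'ስቃይ': 'ህመም'}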
4.2 Index Construction
During index construction there are various steps involved, including tokenization, stop word removal, normalization, stemming, weighting and indexing.

As discussed in section 3.2.1, tokenization of the text identifies the stream of tokens (or terms). As discussed in section 2.2.3, Amharic words in a text are separated by punctuation marks, spaces, tabs, carriage return and line feed characters. However, since the test documents in the collection as well as the queries are written with the VG2 Main Amharic writing software, the punctuation marks considered in this research are those supported by that software. Consequently, the codes that represent these punctuation marks in the implementation of the software were identified and used in the process of word identification.

Some decisions were made concerning the punctuation marks ‘-’ and ‘.’. In Amharic texts the punctuation mark ‘-’, the equivalent of the hyphen in English, is used to form compound words. However, in the test collection this punctuation mark was not used consistently: the same compound words were found written both as separate words without the hyphen and as compound words with the hyphen (for example ጸረ ኤድስ and ጸረ-ኤድስ). To keep consistency throughout the test collection, a decision was made to replace the hyphen with a single character space and split compound words into their constituent terms.
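A minimal Python sketch of this word-identification step is shown below; the separator set is illustrative, since the actual implementation works on the character codes of the VG2 Main writing software.

    import re

    # Amharic word space (፡), full stop (።), comma (፣) and related marks
    # are treated as separators; '-' is also split so that hyphenated
    # compounds break into their constituent terms.
    SEPARATORS = r'[\s፡።፣፤፥፦፧፨!?.,;:()\[\]"-]+'

    def tokenize(text):
        return [t for t in re.split(SEPARATORS, text) if t]

    print(tokenize("ጸረ-ኤድስ ዘመቻ ተጀመረ፡፡"))  # ['ጸረ', 'ኤድስ', 'ዘመቻ', 'ተጀመረ']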
Likewise, the symbol ‘.’ is used for abbreviation purposes. Through analysis of the document collection, it was found that the symbol was used only with two abbreviations: ዓ.ም and አ.አ, which are equivalent to A.D and A.A (for Addis Ababa). Because their frequencies in the collection were high, removing them from the text was believed to cause a significant difference in the performance of the models.

As discussed in section 3.2.2, the aim of normalization is to come up with a set of tokens that reduces the overall number of items. The most common types of normalization are case folding (converting all words to lower case) and stemming (reducing inflected words to their stem or root form). There is no case folding in the Amharic writing system. In this research, the problem of redundant characters is solved by choosing one letter for each group of letters with the same sound and replacing the remaining ones. A further challenge is that in the Amharic writing system some terms have the same written form but different meanings depending on whether they are pronounced with stress or lightly. For example, “ገና” means “Christmas holiday” and also “late to come or late to do something”.

In the case of stop word removal, non-content-bearing terms are removed by dropping high-frequency terms from the indexing term list and by using a negative dictionary (stop word list) for the language. However, there might be important terms among the most frequent ones which have an impact on retrieving relevant documents for users. Another problem is that some common terms may be used as a book title, song title or name, which has a significant impact on the retrieval of relevant documents for a user's query. For example, in our system, if we search for a book with the title or content “ብቸኝነት”, the system treats it as a common/stop word and no document is retrieved.

Each normalized token is checked to verify that it is not a stop word, and the remaining content-bearing terms (non-stop words) are then stemmed. The main challenge here is the inability of the stemmer to efficiently handle word variants: it does not remove infixes, and ambiguity of words in the language remains. In this study only prefix and suffix stemming is performed.
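For the redundant-character normalization described above, a minimal Python sketch could map every member of each same-sounding group to a single canonical form. The mapping below covers only a few illustrative first-order characters, not the full set of orders used in the study.

    # Canonical letter chosen for a few same-sounding groups.
    CANONICAL = {
        "ሐ": "ሀ", "ኀ": "ሀ",   # ha-series variants
        "ሠ": "ሰ",             # se-series variant
        "ዐ": "አ",             # a-series variant
        "ፀ": "ጸ",             # tse-series variant
    }

    def normalize(text):
        return "".join(CANONICAL.get(ch, ch) for ch in text)

    print(normalize("ሠላም"))  # -> ሰላም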
4.3 Statistical Co-occurrence Based Semantic Compression
The basic idea is simply that words with similar meanings tend to occur in similar contexts, and hence word co-occurrence statistics can provide a natural basis for semantic representations [50]. To calculate the co-occurrence of two terms wi and wj in a collection of n documents, the following formula is used:
co(wi, wj) = (n × nij) / (ni × nj)                                  (3.4)

where nij is the number of documents in which wi and wj occur together, and ni and nj are the numbers of documents containing wi and wj respectively, so that the formula expresses the ratio of observed to expected co-occurrences. In this study 0.8 is used as the threshold value: a co-occurrence value below 0.8 indicates fewer than the expected number of co-occurrences, which shows that the words are not semantically related, while a value above 0.8 indicates that the terms are semantically related, in which case the more frequent term replaces the less frequent one and is used as the descriptor. In this approach synonymy of terms is not handled. Moreover, selecting the threshold value is a challenging task, since there is no well-defined rule for choosing it.
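Assuming equation 3.4 is the observed-to-expected ratio given in the reconstruction above, the thresholding step could be sketched in Python as follows; the document counts are hypothetical.

    def cooccurrence(n_ij, n_i, n_j, n):
        # Ratio of observed to expected document co-occurrences:
        # co = (n * n_ij) / (n_i * n_j).
        return (n * n_ij) / (n_i * n_j) if n_i and n_j else 0.0

    def semantically_related(n_ij, n_i, n_j, n, threshold=0.8):
        # Terms are treated as related when the observed co-occurrence
        # reaches the threshold relative to expectation.
        return cooccurrence(n_ij, n_i, n_j, n) >= threshold

    # 12 joint documents, 40 and 90 single-term documents, n = 300:
    print(cooccurrence(12, 40, 90, 300))  # 1.0 -> related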
4.4 Thesaurus-Based Semantic Compression
For every stemmed token its respective weight is calculated, and the term is then compared against the thesaurus: if the term is used as a key term (descriptor) in the thesaurus, it is used directly as an indexing term; if it is instead a value, that is, a synonym of a key term in the thesaurus, it is replaced by its respective key term, and that key term is used as the indexing term. Since there is no standard thesaurus, developing the thesaurus is itself a challenging task. In this work the thesaurus was developed manually based on [31], which is a time-consuming task, and the researcher attempted to handle synonymous terms only. The other challenge, as mentioned in section 4.2, is that in the Amharic writing system some terms have the same written form but different meanings depending on whether they are pronounced with stress or lightly; for example, “ገና” means “Christmas holiday” and also “late to come or late to do something”. This can make the system retrieve irrelevant documents for users. As an illustration, Table 4.2 shows a sample raw term-by-document matrix using statistical co-occurrence based indexing.
Table 4.2 Sample raw term-by-document matrix using statistical co-occurrence based indexing

Index terms   Doc1  Doc2  Doc3  Doc4  Doc5  Doc6  Doc7  Doc8  Doc9
ዩኒቨርሲቲ        1     6     8     0     5     7     2     0     0
ኮሌጅ           2     2     2     2     1     4     2     0     0
ት/ቤት          5     1     2     11    1     0     1     0     1
ጠቃሚ           11    6     1     1     0     0     1     5     5
መሰረታዊ         3     5     6     1     0     0     0     1     0
አስፈላጊ         3     3     2     0     0     5     1     3     7
ዋና            0     2     1     1     0     1     7     0     1
አስተያየት        6     1     1     1     0     8     0     1     0
አመለካከት        0     23    5     1     0     0     0     0     1
ተግባር          8     5     8     4     8     5     1     5     2
ክንውን          0     1     1     1     1     1     1     0     5
ምድር           2     1     0     1     0     1     1     7     1
መሬት           1     1     5     1     1     5     1     0     0
ህመም           4     5     1     0     0     1     1     1     1
በሽታ           6     1     0     0     0     8     0     0     1
For the above table, the corresponding term-by-document matrix using thesaurus based indexing is shown in the next table. Each index term is looked up in the thesaurus and replaced by its key word (descriptor), and the frequencies of merged terms are summed.
Table 4.3 Sample term-by-document matrix using thesaurus based indexing

Index terms   Doc1  Doc2  Doc3  Doc4  Doc5  Doc6  Doc7  Doc8  Doc9
ዩኒቨርሲቲ        8     9     12    13    7     11    5     0     1
አስፈላጊ         17    16    10    3     0     6     9     9     13
አስተያየት        6     24    5     2     0     8     0     1     1
ተግባር          8     6     9     5     9     6     2     5     7
መሬት           3     2     5     2     1     6     2     7     1
በሽታ           10    6     1     0     0     9     1     1     2
4.5 Experimentation and Evaluation
The classic recall and precision measures are used to judge the retrieval performance of the indexing approaches. An attempt has been made to compare the results of the thesaurus based indexing method against the standard vector space approach and statistical co-occurrence based indexing. The vector space model and the statistical co-occurrence based indexing operate on the same preprocessed collection that serves as the starting point for the thesaurus based indexing method. The results are depicted in Table 4.4.
Table 4.4 Performance comparison of Amharic text retrieval with and without semantic compression

                                  Without semantic     Statistical co-occurrence      Thesaurus based
                                  compression          based semantic compression     semantic compression
Query                             Precision  Recall    Precision  Recall              Precision  Recall
የህዝብ ተሳትፎ                        0.625      0.699     0.697      0.702               0.746      0.822
በሀገሪቱ ተስፋፋ እየተባለ ያለው             0.622      0.642     0.686      0.703               0.744      0.811
ድብቅ ሚስጥር ያላቸው                    0.552      0.602     0.604      0.665               0.698      0.746
በወረዳው በሽታውን ለመከላከል               0.547      0.591     0.598      0.624               0.688      0.755
የዬኒቨርሲቲ የምርምር ስራዎች               0.502      0.582     0.612      0.626               0.711      0.823
ለአካባቢው ኑዋሪዎች ገንዘብ ሰጥቷል           0.601      0.666     0.651      0.666               0.714      0.779
ስፖርታዊ ልምምዶችን                     0.505      0.566     0.591      0.662               0.682      0.694
ህዝቡ በነቂስ ወጥቶ                      0.552      0.594     0.621      0.688               0.689      0.763
መንግስታዊ ድርጅቶች                     0.544      0.523     0.666      0.624               0.722      0.794
ውበቷ ማርኮት ተከተላት                   0.452      0.521     0.552      0.604               0.681      0.726
Average                           0.5502     0.5986    0.6268     0.6564              0.7075     0.7713
Table 4.4 shows that the semantic compression based IR system performs better than the system without semantic compression. The best result is registered by applying thesaurus based semantic compression for Amharic information retrieval, with an average precision and recall of 70.75% and 77.13% respectively. This is followed by statistical co-occurrence based semantic compression with 62.68% average precision and 65.64% average recall. Semantic compression using thesaurus based indexing performs better because synonymous terms are handled; in addition, thesaurus based indexing handles the existing irregular spelling and transliteration problems in the Amharic writing system by defining such variants as synonyms in the thesaurus. This reduces the number of index terms and at the same time enables users to get relevant documents for their queries. The best performance, however, is still below 80%. This is because of the ineffectiveness of the Amharic thesaurus (for example, it may retrieve irrelevant information for users who search for the term “ገና”, meaning Christmas, since the word has another meaning when pronounced with stress) and of the stemmer used in the current study (it is hard to stem the term ነጫጭ to ነጭ or ትልልቅ to ትልቅ). Better performance could be achieved if the effectiveness of the stemming algorithm were improved, and if there were a standard Amharic thesaurus covering not only synonymous words but also related and associated words. A sample screenshot of the program run for the fourth evaluation query is depicted in figure 4.5.
Figure 4.5 Sample screenshot of the Amharic retrieval system retrieving information for a user's query.
CHAPTER FIVE
CONCLUSIONS AND RECOMMENDATIONS
5.1 Conclusions
There are lots of digital documents in the Amharic language available on the Internet and the World Wide Web which are accessible to users. This leads to the need for efficient and effective information storage and retrieval systems to represent, store, search and retrieve relevant Amharic documents from large document collections as per users' information needs.

The objective of this research is therefore to design and integrate semantic compression techniques into Amharic text retrieval in order to enhance the performance of the retrieval system by controlling synonymous terms. To this end, a corpus of 300 documents was indexed and 10 queries were used for evaluation. Various text preprocessing techniques, such as tokenization, normalization, stop word removal and stemming, were used for identifying content-bearing terms for indexing and searching.

This research applies thesaurus based and statistical co-occurrence based semantic compression for identifying semantically related terms from an Amharic text corpus for indexing. The work has verified that semantic compression is valuable for Amharic text retrieval; it is applied both during indexing and during searching. According to the experimentation, thesaurus based semantic compression registered the best IR effectiveness, with 70.75% precision and 77.13% recall, and semantic compression achieved a 61% reduction in indexing terms. This is a promising result towards developing an efficient and effective Amharic text retrieval system that searches within a large corpus.

The main challenge in the study is the inability of the stemmer to efficiently handle word variants: the Amharic stemmer used does not remove infixes, and ambiguity of words in the language remains. Additionally, the nature of the Amharic language greatly affects retrieval performance, which shows the need for a standard thesaurus. Besides, there is no standard Amharic corpus that can be used for experimentation.
5.2 Recommendations
Based on the findings of this research, the following issues are recommended for further investigation.
1. Nowadays ontology is a main area of study in IR. Further research can investigate ontology based semantic compression to enhance the performance of Amharic IR systems.
2. In this study thesaurus based and statistical co-occurrence based semantic compression were attempted independently. To exploit the potential of both techniques, there is a need to test a hybrid of the two.
3. To improve system performance it is important to construct a standard Amharic thesaurus, like a WordNet for the Amharic language.
4. One of the challenges in this study was the lack of a standard corpus for experimentation. We therefore recommend building a standardized large corpus for experimentation on Amharic retrieval systems.
5. This research was conducted on text documents only; other types of documents such as video, audio, graphics and pictures were not studied. Integrating semantic compression into video, audio and image retrieval may be tested in further research.
References
[1] Tewodros H., Amharic Text Retrieval: An Experiment Using Latent Semantic Indexing (LSI) with Singular Value Decomposition (SVD), MSc Thesis, Addis Ababa University, Addis Ababa, Ethiopia, 2003.
[2] Challenges in Information Retrieval and Language Modeling, Report of a Workshop held at the Center for Intelligent Information Retrieval, University of Massachusetts Amherst, 2002.
[3] Djoerd Hiemstra, Information Retrieval Models, Wiley Online, New York: John Wiley & Sons, Inc., 2009.
[4] C. D. Manning, P. Raghavan and H. Schutze, An Introduction to Information Retrieval, Online Edition, Cambridge: Cambridge University Press, 2009.
[5] Salton G., Introduction to Modern Information Retrieval, New York: McGraw-Hill, 1983.
[6] A. W. Nega, Stemming of Amharic Words for Information Retrieval, Literary and Linguistic Computing, vol. 17, no. 1, pp. 1-17, 2002.
[7] Christopher D. Manning, Prabhakar Raghavan and Hinrich Schutze, Introduction to Information Retrieval, Cambridge University Press, New York, 2008.
[8] Ceglarek D., Haniewicz K., Rutkowski W., Quality of Semantic Compression in Classification, ICCCI'10 Proceedings of the Second International Conference, vol. 1, LNAI 6421, pp. 162-171, 2010.
[9] Ceglarek D., Haniewicz K., Rutkowski W., Semantic Compression for Specialized Information Retrieval, vol. 283, pp. 111-121, 2010.
[10] Adane L., Feature Extraction and Matching in Amharic Document Image Collections, MSc Thesis, Addis Ababa University, Addis Ababa, Ethiopia, June 2011.
[11] Himanot M., Semantic Compression for Enhancing the Efficiency of Amharic Text Retrieval, MSc Thesis, University of Gondar, Gondar, Ethiopia, January 2014.
[12] Gezehagn G., Afaan Oromo Text Retrieval System, MSc Thesis, Addis Ababa University, Addis Ababa, Ethiopia, June 2012.
[13] R. Baeza-Yates and B. Ribeiro-Neto, Modern Information Retrieval, Addison-Wesley, MA, 1999.
[14] Zoran P., Minh D., Serge A. and Martin V., New Methods for Image Retrieval, ICPS Congress on Exploring New Tracks in Imaging, Antwerp, Belgium, 1998.
[15] Yaregal A., Optical Character Recognition of Amharic Text: An Integrated Approach, MSc Thesis, Addis Ababa University, Addis Ababa, Ethiopia, 2002.
[16] Asteraye T., Berhanu B., Daniel A., Daniel Y., A Roadmap to the Extension of the Ethiopic Writing System Standard under Unicode and ISO-10646: Why Extended Ethiopic?, in Fifteenth International Unicode Conference, pp. 1-12, 1999.
[17] Marvin L. Bender, Sydney W. Head and Roger Cowley, Language in Ethiopia: The Ethiopian Writing System, Oxford University Press, 1976.
[18] Bethlehem M., N-gram-Based Automatic Indexing for Amharic Text, MSc Thesis, Addis Ababa University, Addis Ababa, Ethiopia, 2002.
[19] Seid H. and Gamback B., A Speaker Independent Continuous Speech Recognizer for Amharic, in Proc. of Interspeech, Lisbon, Portugal, 2005.
[20] Information Retrieval and Extraction 2008, University of Washington, 2008. [Online]. Available: http://nlg3.csie.ntu.edu.tw/courses/IR/2008midtermSolutions.htm. [Accessed: December 2013].
[21] Ceglarek D., Haniewicz K., Rutkowski W., Domain Based Semantic Compression for Automatic Text Comprehension Augmentation and Recommendation, Third International Conference, ICCCI 2011, Gdynia, Poland, September 21-23, 2011.
[22] Andreas Hotho, Steffen Staab, Alexander Maedche, Ontology Based Text Clustering, Institute AIFB, University of Karlsruhe, 76128 Karlsruhe, Germany, 2002.
[23] Sridevi U. K., Nagaveni N., Ontology Based Correlation Analysis in Information Retrieval, International Journal of Recent Trends in Engineering, vol. 2, no. 1, November 2009.
[24] Sridevi U. K., Nagaveni N., Ontology Based Similarity Measure in Document Ranking, International Journal of Computer Applications, vol. 1, no. 26, 2010.
[25] M. Dragoni, An Ontological Representation of Documents and Queries for Information Retrieval Systems, in Proceedings of the 1st Italian Information Retrieval Workshop (IIR'10), Padova, Italy, vol. 6097, pp. 555-564, January 27-28, 2010.
[26] P. Majumder, M. Mitra, B. B. Chaudhuri, N-gram: A Language Independent Approach to IR and NLP, 2002.
[27] Bizuneh M., The Application of WEBSOM for Amharic Text Retrieval, MSc Thesis, Addis Ababa University, Addis Ababa, Ethiopia, 2003.
[28] Salton G. and Michael J. McGill, Introduction to Modern Information Retrieval, New York: McGraw-Hill Book Company, 1983.
[29] K. Clarke, Frequency Estimates for Statistical Word Similarity Measures, in Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, vol. 1, pp. 165-172, 2003.
[30] Bloor T., The Ethiopic Writing System: A Profile [Internet], 1995. Available from: www.spellingsociety.org/journals/j19/ethiopic.php [Accessed: December 2013].
[31] የኢትዮጵያ ቋንቋዎች ጥናትና ምርምር ማእከል, አማርኛ መዝገበ ቃላት, አዲስ አበባ ዩኒቨርሲቲ, 1993 ዓ.ም.
[32] Million M. and Jawahar C. V., Indigenous Scripts of African Languages, African Journal of Indigenous Knowledge Systems, pp. 132-142, 2007.
[33] Jean Aitchison, Alan Gilchrist, Thesaurus Construction, 2nd Edition, Aslib, London, 1987.
[34] P. Ingwersen, Information Retrieval Interaction, 1st ed., London: Taylor Graham Publishing, 2002.
[35] R. Baeza-Yates, Information Retrieval: Data Structures & Algorithms, 1st ed., Waterloo: University of Waterloo, 2004, pp. 1-630.
[36] E. Greengrass, Information Retrieval: A Survey, Information Retrieval, vol. 163, pp. 141-163, November 2000.
[37] A. Singhal, Modern Information Retrieval: A Brief Overview, IEEE Data Engineering Bulletin, vol. 24, no. 3, pp. 35-43, 2001.
[38] A. Das and A. Jain, Indexing the World Wide Web: The Journey So Far, Google Inc., USA, vol. 1, no. 1, pp. 1-24, 2008.
[39] J. Etzold, A. Brousseau, P. Grimm and T. Steiner, Context-aware Querying for Multimodal Search Engines, Springer-Verlag Berlin Heidelberg, vol. 6, pp. 728-729, 2011.
[40] Y. Fang, N. Somasundaram, L. Si, J. Ko and A. P. Mathur, Analysis of an Expert Search Query Log: Categories and Subject Descriptors, Symposium A Quarterly Journal in Modern Foreign Literatures, vol. 64, no. 18, pp. 1189-1190, 2011.
[41] Y. Y. Yao, Information Retrieval Support Systems, Boca Raton: Taylor & Francis Group, pp. 1-778, 2012.
[42] P. Soucy, Beyond TFIDF Weighting for Text Categorization in the Vector Space Model, International Joint Conference on Artificial Intelligence, vol. 1, no. 1, pp. 1-6, 2005.
[43] N. Fuhr, Probabilistic Models in Information Retrieval, SpringerLink, vol. 11, no. 3, pp. 251-265, 2008.
[47] Salton G., Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer, Reading, Massachusetts: Addison-Wesley Publishing Company, 1989.
[48] Salton G., Buckley C., Term Weighting Approaches in Automatic Text Retrieval, Information Processing and Management, 24, pp. 513-523; reprinted in Sparck Jones K. and Peter Willett (Eds.), Readings in Information Retrieval, San Francisco, California: Morgan Kaufmann Publishers Inc., 1997.
[49] Kristina F., Thesauri Usage in Information Retrieval Systems: Example of LISTA and ERIC Database Thesaurus, 4th International Conference: The Future of Information Sciences (INFuture), Benja Publishing, Lokve, Republic of Croatia, Zagreb, 2009.
[50] Monz C. and M. de Rijke, Shallow Morphological Analysis in Monolingual Information Retrieval for Dutch, German and Italian, in C. Peters, M. Braschler, J. Gonzalo and M. Kluck (Eds.), Evaluation of Cross-Language Information Retrieval Systems, CLEF 2001, Volume 2406 of Lecture Notes in Computer Science, pp. 262-277, Springer, 2002.
[51] Saba A., The Application of Information Retrieval Techniques to Amharic Documents on the Web, MSc Thesis, School of Information Studies for Africa, Addis Ababa University, Addis Ababa, 2001.
[52] Abyot B., Design and Development of Amharic Word Parser, MSc Thesis, School of Information Studies for Africa, Addis Ababa University, Addis Ababa, 2000.
[53] Dereje T., Optical Character Recognition of Typewritten Amharic Text, MSc Thesis, School of Information Studies for Africa, Addis Ababa University, Addis Ababa, 1999.
[54] Million M., A Generalized Approach to Optical Character Recognition of Amharic Texts, MSc Thesis, School of Information Studies for Africa, Addis Ababa University, Addis Ababa, 2000.
[55] Nigussie T., Handwritten Amharic Text Recognition Applied to the Processing of Banking Cheques, MSc Thesis, School of Information Studies for Africa, Addis Ababa University, Addis Ababa, 2001.
[56] Worku A., The Application of OCR Techniques to the Amharic Script, MSc Thesis, School of Information Studies for Africa, Addis Ababa University, Addis Ababa, 1997.
[57] Yaregal A., Optical Character Recognition of Amharic Text: An Integrated Approach, MSc Thesis, School of Information Studies for Africa, Addis Ababa University, Addis Ababa, 2002.
[58] Zelalem S., Automatic Classification of Amharic News Items: The Case of Ethiopian News Agency, MSc Thesis, School of Information Studies for Africa, Addis Ababa University, Addis Ababa, 2001.
Appendix I: The Amharic Character Set

h  ሀ ሁ ሂ ሃ ሄ ህ ሆ
l  ለ ሉ ሊ ላ ሌ ል ሎ ሏ
ḥ  ሐ ሑ ሒ ሓ ሔ ሕ ሖ ሗ
m  መ ሙ ሚ ማ ሜ ም ሞ ሟ
ś  ሠ ሡ ሢ ሣ ሤ ሥ ሦ ሧ
r  ረ ሩ ሪ ራ ሬ ር ሮ ሯ
s  ሰ ሱ ሲ ሳ ሴ ስ ሶ ሷ
š  ሸ ሹ ሺ ሻ ሼ ሽ ሾ ሿ
ḳ  ቀ ቁ ቂ ቃ ቄ ቅ ቆ ቈ ቊ ቋ ቌ ቍ
b  በ ቡ ቢ ባ ቤ ብ ቦ ቧ
v  ቨ ቩ ቪ ቫ ቬ ቭ ቮ ቯ
t  ተ ቱ ቲ ታ ቴ ት ቶ ቷ
č  ቸ ቹ ቺ ቻ ቼ ች ቾ ቿ
ḫ  ኀ ኁ ኂ ኃ ኄ ኅ ኆ ኈ ኊ ኋ ኌ ኍ
n  ነ ኑ ኒ ና ኔ ን ኖ ኗ
ñ  ኘ ኙ ኚ ኛ ኜ ኝ ኞ ኟ
ʾ  አ ኡ ኢ ኣ ኤ እ ኦ ኧ
k  ከ ኩ ኪ ካ ኬ ክ ኮ ኰ ኲ ኳ ኴ ኵ
x  ኸ ኹ ኺ ኻ ኼ ኽ ኾ
w  ወ ዉ ዊ ዋ ዌ ው ዎ
ʿ  ዐ ዑ ዒ ዓ ዔ ዕ ዖ
z  ዘ ዙ ዚ ዛ ዜ ዝ ዞ ዟ
ž  ዠ ዡ ዢ ዣ ዤ ዥ ዦ ዧ
y  የ ዩ ዪ ያ ዬ ይ ዮ
d  ደ ዱ ዲ ዳ ዴ ድ ዶ ዷ
ǧ  ጀ ጁ ጂ ጃ ጄ ጅ ጆ ጇ
g  ገ ጉ ጊ ጋ ጌ ግ ጎ ጐ ጒ ጓ ጔ ጕ
ṭ  ጠ ጡ ጢ ጣ ጤ ጥ ጦ ጧ
č  ጨ ጩ ጪ ጫ ጬ ጭ ጮ ጯ
p̣  ጰ ጱ ጲ ጳ ጴ ጵ ጶ ጷ
ṣ  ጸ ጹ ጺ ጻ ጼ ጽ ጾ ጿ
ṣ  ፀ ፁ ፂ ፃ ፄ ፅ ፆ
f  ፈ ፉ ፊ ፋ ፌ ፍ ፎ ፏ
p  ፐ ፑ ፒ ፓ ፔ ፕ ፖ ፗ
Appendix II: Sample text document used for this study

የወላይታ ቀጠና ቃለ ሕይወት ቤተ ክርስቲያን በወላይታ ዞን በሁምቦ ወረዳ በወንዝ ሙላት ምክንያት ለተፈናቀሉ ወገኖች መልሶ ማቋቋሚያ የሚውል አንድ ነጥብ አምስት ሚሊዮን ብር መለገሷን የዞኑ ግብርናና ገጠር ልማት መምሪያ አስታወቀ፡፡ ቤተ ክርስቲያኗ የሰጠቺው ገንዘብ በብላቴ ወንዝ ሙላት ሳቢያ ከቤት ንብረታቸው ለተፈናቀሉ ከ5 ሺህ 300 በላይ ወገኖች ቀለብ፣አልሚ ምግብ፣የቤት ቁሳቁስ፣ መድሀኒት መግዢያና ለሌሎች ድጋፎች እንደሚውል በመምሪያው የምግብ ዋስትና፣አደጋ መከላከልና ዝግጁነት ዴስክ ሃለፊ አቶ ብዙነህ ገብረ መድህን ተናግረዋል፡፡ እንዲሁም የበቆሎ ዘርና የስኳር ድንች ቁርጥራጭ ገዝቶ በማከፋፈልና ተጎጂዎችን መልሶ በማቋቋም ቀድሞ ወደ ነበሩበት ሕይወት ለመመለስ ተከታታይ ስራ እንደሚከናወን ኃላፊው አሰታውቀዋል፡፡ በብላቴ ወንዝ ሙላት ሳቢያ በወረዳው ከአባያ ጨው ካሬ፣ ከአበያ ቢሳሬና ጉሩቾ ቀበሌዎች ባለፈው ወር መጀመሪያ ከቤት ንብረታቸው ለተፈናቀሉት ለእነዚሁ ወገኖች ከፌደራል አደጋ መከላከልና ዝግጁነት ኤጀንሲ የተለያየ ድጋፍ ሲደረግ መቆየቱን አቶ ብዙነህ ገልጸዋል፡፡
Appendix III: The queries used in the experiment

Query 1: ህዝብ ተሳትፎ
Query 2: በሀገሪቱ ተስፋፋ እየተባለ ያለው
Query 3: ድብቅ ሚስጥር ያላቸው
Query 4: በወረዳው በሽታውን ለመከላከል
Query 5: የዬኒቨርሲቲ የምርምር ስራዎች
Query 6: ውበቷ ማርኮት ተከተላት
Query 7: መንግስታዊ ድርጅቶች
Query 8: ህዝቡ በነቂስ ወጥቶ
Query 9: ስፖርታዊ ልምምዶችን
Query 10: ለአካባቢው ኑዋሪዎች ገንዘብ ሰጥቷል
Query 11: የአየር ጸባይ ለውጥና
Query 12: ጤና ያጣ ሰው
Query 13: ተያይዞ የመታሰቢያ ሥነ ስርአቱ
Query 14: በማድረግ የመቀሌ ዩኒቨርስቲና
Query 15: ልማዳዊ ድርጊቶች
Appendix IV: Amharic Stop Words

እዚሁ እና እንደ እንደገለጹት እንደተገለጸው እንደተናገሩት እንደአስረዱት እንደገና ወቅት እንዲሁም እንጂ እዚህ እዚያ እያንዳንዱ እያንዳንዳችው እያንዳንዷ ከ ከኋላ ከላይ ከመካከል ሁሉ ሁሉን የሁሉ የሁሉን ሁሉም የሁሉም በሁሉም ከሁሉም ለሁሉም ኋላ ሁኔታ ሆነ ሆኖም ሁሉንም ላይ ሌሎች ልዩ መሆኑ ማለት ማለቱ መካከል የሚገኙ የሚገኝ ማድረግ ማን ማንም ሰሞኑን ሲሆን ሲል ሲሉ ስለ ቢሆን ብለዋል ብቻ ብዛት ብዙ ቦታ በርካታ