Design and Implementation-Algorithms of Amharic Search Engine System for Amharic Web Contents

Hassen Redwan, Solomon Atnafu
Department of Computer Science
Addis Ababa University, Ethiopia
E-mail: [email protected]

Abstract—On the Web, the use of languages other than English (e.g. the Amharic language) has been growing exponentially. The number of web documents in Amharic, as well as the number of Internet users in Ethiopia, is growing dramatically. However, the major search engines have been lagging behind in providing indexing, stemming and search features to handle this language. Therefore, a web search engine designed and implemented around the typical characteristics of the Amharic language is needed. In this paper, we design an Amharic search engine system for Amharic-language web documents and briefly discuss the algorithms for implementing the engine. The Crawler, Indexer and Query Engine are the basic components of this search engine. Typical characteristics of the Amharic language were considered by testing the engine for support of morphological variants as well as Amharic aliases. For experimentation, two runs of the crawler were conducted using 10 threads that crawl in parallel.

Index Terms—Amharic Search Engine Algorithm, Information Retrieval, IR in Amharic language, Amharic on World Wide Web
I. INTRODUCTION

Recent technological advances in web search engines provide the easiest way to reach information resources available on the Web. Even so, existing search and retrieval engines provide limited assistance to users in locating the relevant information they need. Crawlers are programs that search engines use to gather information about web pages. Crawlers perform many useful services, including link maintenance, downloading, printing, and visualization [1, 2, 3]. Search engines use crawlers to find what is on the Web and then construct an index of the pages that were found.

Over the last decade, the WWW has become a major source of information. Although English is still the dominant language on the Web, information in other languages is steadily gaining prominence, and the use of languages other than English has been growing exponentially. Data provided in [4] indicate that in March 2009 English-language users made up only 29.1% of Internet users, compared with 45% in June 2001 [5]. These figures imply that there is a large amount of non-English content on the Internet and that non-English documents on the Web need attention. However, the major search engines have been lagging behind in providing indexing, stemming and search features to handle these languages. Thus, in this paper we design and implement a search engine system that considers the typical characteristics of web content in the Amharic language.
Amharic (አማርኛ) is a Semitic language spoken in Ethiopia. It is an Afro-Asiatic language of the Southwest Semitic group and is related to the Ge'ez language. It also has affinities with Tigré, Tigrinya, and the South Arabic [6]. It is the second most spoken Semitic language in the world, after Arabic. Ethiopia's Semitic languages are written in a unique script of more than two hundred thirty characters, which represent syllables and compound sounds rather than individual letters [7]. The script is now supported in Unicode (see Unicode table U+1200 – U+137F).

Amharic is the national and official language of the Federal Democratic Republic of Ethiopia. It has been the working language of government and the military throughout modern times. Outside Ethiopia, Amharic is the language of some 2.7 million emigrants [8] in the United States, Europe and Asia (notably in Egypt, Israel, and Sweden). Based on the World Internet Usage and Population Statistics [9] report of 2009, the number of Internet users in Ethiopia, a country with a total population of 82,544,838 and an area of 1,127,127 sq km, grew by 2,810%, from 10,000 in December 2000 to 291,000 in March 2009. Accordingly, our work explores the increasing quantity of Amharic collections available on the Web and proposes a novel detailed design and implementation of a search engine for Amharic web content.

II. STATEMENT OF THE PROBLEM

Nowadays, people around the world rely heavily on search engines to get information from the Web. The number of Web pages in non-English languages exceeds that of English pages [10, 11]. Since general search engines are optimized mainly for English, they do not search well documents written in non-English languages [10] such as Amharic. Despite the increasing availability of information in the Amharic language on the Web, there is no search engine for documents in the language.
The spoken Amharic language has multiple regional dialects, and words borrowed from foreign languages are transliterated into Amharic characters without a standardized spelling. It is common for different people to pronounce words with the same meaning differently. Since no standard reference designates one form as correct and another as incorrect, people in practice treat the different pronunciations as one and the same. These words are used frequently by Amharic language users, and their variations are generally accepted in practice. Without a standard reference, these differently pronounced words cannot be reduced to the same root; the trend is to accept and understand all of them as they are. Thus, until such a reference exists, search engines should also accept these differences as conceived by users of the language. This needs closer investigation so that such variants can be accommodated and made accessible in Amharic search engines (see Table 1 for illustration).

978-1-4244-6273-5/09/$26.00 ©2009 IEEE

Table 1: Different ways of writing some words with the same meaning

No  Different Amharic words with the same meaning
1   تƒ, Öªƒ, ×Dƒ
2   ÖK?’>¾U, T>K=’>¾U, T>K=”¾U
3   T>Á´Á, T>Á²=Á
4   cT”Á, cT’>Á
5   ‚¡•KAÍ=, •¡’>KAÍ=
6   ¢Uú¿}`, ¢Uú¨...

...K=’>¾U, T>K’>¾U, T>K=”¾U, T>K”¾U, T>K?”¾U.
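The alias handling motivated by Table 1 can be sketched as a normalization step applied both when indexing and when querying, so that every accepted spelling of a word lands on the same index key. The following is a minimal sketch under stated assumptions, not the engine's actual algorithm: the names `ALIAS_GROUPS`, `normalize_term`, `build_index` and `search` are hypothetical, and the two alias groups are illustrative spelling pairs chosen for this example rather than entries taken from the paper.

```python
# Minimal sketch of alias-aware indexing and search (illustrative only).
# ASSUMPTION: the alias groups below are example spelling variants picked
# for this sketch; a real engine would load a curated alias list.
ALIAS_GROUPS = [
    ["ሚያዝያ", "ሚያዚያ"],  # assumed variant spellings of "April"
    ["ሰማንያ", "ሰማኒያ"],  # assumed variant spellings of "eighty"
]


def build_alias_map(groups):
    """Map every variant in a group to the group's first spelling."""
    alias_map = {}
    for group in groups:
        canonical = group[0]
        for variant in group:
            alias_map[variant] = canonical
    return alias_map


ALIAS_MAP = build_alias_map(ALIAS_GROUPS)


def normalize_term(term):
    """Return the canonical spelling for a known alias, else the term itself."""
    return ALIAS_MAP.get(term, term)


def build_index(docs):
    """Build an inverted index {canonical term -> set of document ids}."""
    index = {}
    for doc_id, text in docs.items():
        for token in text.split():
            index.setdefault(normalize_term(token), set()).add(doc_id)
    return index


def search(index, query):
    """Return ids of documents that contain every (normalized) query term."""
    terms = [normalize_term(t) for t in query.split()]
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result
```

Because the same `normalize_term` function is used on both sides, a query in one spelling retrieves documents written in any spelling of the same group, which is the behaviour an exact-term matcher lacks.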
Figure 7: Screen shot result for the alias query "•••••"

As indicated in Table 2, we compared the results displayed by Google and by our Amharic search engine. The Google Amharic search returns different results for each of these terms because it performs exact term matching. Our engine, by contrast, displayed similar results for these words, which shows that it accounts for the aliases observed in the Amharic language.

C. Sample Result for Shorter and Longer Form Words

In the search interface, the query •/•/•••, which is the short form of the name ••• •••••• (Haile Gebresillassie), is entered. Amharic short forms with two forward slashes are supported. As a result (as shown in Figure 8), the document that contains the word ••• •••••• is displayed.
Figure 8: Screen shot Result for Short Form Query "•/•/•••"
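One way to make a short-form query match documents that contain the long form is to expand slash-separated tokens at query time. The sketch below assumes a lookup table of known short forms; the table `SHORT_FORMS`, the pattern `SLASH_TOKEN` and the function `expand_short_forms` are hypothetical names, and the two entries are common Amharic abbreviations chosen for illustration, not the query shown in Figure 8. The paper does not state how the engine resolves short forms, so this should be read as one possible approach.

```python
import re

# ASSUMPTION: a small hand-made table of short forms and their long forms.
# The entries are illustrative; they are not taken from the paper.
SHORT_FORMS = {
    "ት/ቤት": "ትምህርት ቤት",      # "school", one forward slash
    "ጠ/ሚ/ር": "ጠቅላይ ሚኒስትር",  # "prime minister", two forward slashes
}

# A token counts as a short form if it consists of two or more non-empty
# segments joined by forward slashes (one slash, two slashes, and so on).
SLASH_TOKEN = re.compile(r"^[^\s/]+(?:/[^\s/]+)+$")


def expand_short_forms(query):
    """Replace each known slash-separated short form with its long form.

    Unknown short forms and ordinary words are passed through unchanged.
    """
    expanded = []
    for token in query.split():
        if SLASH_TOKEN.match(token):
            expanded.append(SHORT_FORMS.get(token, token))
        else:
            expanded.append(token)
    return " ".join(expanded)
```

After expansion, the short-form query and the long-form query are the same string, so they reach the index as identical terms and return the same documents.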
Table 2: Google versus Our engine result for variants of same word
As shown in Figure 9, the result of the shorter form query is the same as that of the longer form query ••• ••••••.
Figure 9: Screen shot result for the query "••• ••••••"

VI. CONCLUSION AND FUTURE WORKS

We designed the Amharic Search Engine and briefly discussed our algorithms for implementing the engine. This search engine allows indexing and searching of Amharic documents written in multiple character set representations of the Amharic language. We also designed the engine so that non-Unicode Amharic documents can be recognized by its crawler component. A comparison of searches for variants of the same Amharic alias word on Google for Amharic and on our search engine showed that our engine displayed similar results for these words, while Google displayed different results. Some of the contributions of our work are: (1) it is the first work to design and develop a search engine for Amharic web documents; (2) it demonstrates the possibility of incorporating Amharic aliases in an Amharic search engine; (3) it demonstrates how Amharic abbreviations can be incorporated in an Amharic search engine; (4) it develops a method for searching Amharic web documents in multiple encodings within a single Amharic search engine.

REFERENCES

[1] Fielding, R.T. "Maintaining Distributed Hypertext Infostructures: Welcome to MOMspider's Web", Proceedings of the First International Conference on the World-Wide Web, Geneva, Switzerland, May 1994.
[2] Blue Squirrel Software. "WebWhacker". http://www.bluesquirrel.com/whacker/ (Accessed: July 1, 2009)
[3] Maarek, Y. S., et al. "WebCutter: A System for Dynamic and Tailorable Site Mapping", Proceedings of WWW6, Santa Clara, USA, 1997.
[4] Internet Usage World Stats - Internet and Population Statistics 2009. http://www.internetworldstats.com/stats7.htm/ (Accessed: Jul 1, 2009)
[5] Liu, Chun-chou, et al. "User Behavior and the Globalness of Internet: From a Taiwan Users' Perspective", Journal of Computer-Mediated Communication, 7(2): 1-16, 2002.
[6] Amharic language. In Encyclopedia Britannica. http://www.britannica.com/EBchecked/topic/20500/Amharic-language (Accessed: June 03, 2009)
[7] Ethiopic Languages. http://www.imperialethiopia.org/languages.htm (Accessed: June 03, 2009)
[8] Amharic Language. In Wikipedia, the free encyclopedia. http://en.wikipedia.org/wiki/Amharic_language/ (Accessed: June 15, 2009)
[9] Internet Usage World Stats - Internet and Population Statistics 2009. http://www.internetworldstats.com/stats1.htm/ (Accessed: Jul 3, 2009)
[10] Bar-Ilan, J. and Gutman, T. "How do Search Engines handle non-English queries? - A case study". The Twelfth International World Wide Web Conference, Budapest, Hungary, May 2003.
[11] Language Definition. http://www.unixl.com/dir/education/languages/language_definition/ (Accessed: July 10, 2009)
[12] Grefenstette, G. and Nioche, J. "Estimation of English and non-English Language Use on the WWW", Proceedings of RIAO'2000, Paris, 237-246, 2000.
[13] Moukdad, H. "Lost In Cyberspace: How Do Search Engines Handle Arabic Queries?" The Twelfth International World Wide Web Conference, Budapest, Hungary, 20-24 May 2003.
[14] Moukdad, H. and Large, A. "Information retrieval from full-text Arabic databases: Can search engines designed for English do the job?" Libri 51, 63-74, 2001.
[15] Moukdad, H. and Cui, H. "How Do Search Engines Handle Chinese Queries?" Webology, 2(3), 2005.
[16] Nega A. and Willet, P. "Stemming of Amharic Words for Information Retrieval". Literary and Linguistic Computing, 17(1): 1-17, 2002.
[17] Nega A. and Willet, P. "The Effectiveness of Stemming for Information Retrieval in Amharic". Program: Electronic Library and Information Systems, 37(4), 254-259, 2003.
[18] Pingali, P., Jagarlamudi, J. and Varma, V. "WebKhoj: Indian language IR from Multiple Character Encodings", WWW2006, 2006.
[19] Suzuki, I., Mikami, Y. and Ohsato, A. "A Language and Character Set Determination Method Based on N-gram Statistics". ACM Transactions on Asian Language Information Processing, 1(3), 269-278, September 2002.