Automatic Building of Arabic Multi Dialect Text Corpora by Bootstrapping Dialect Words

Khalid Almeman and Mark Lee
School of Computer Science
University of Birmingham
Birmingham, UK
Email: {kaa846;m.g.lee}@cs.bham.ac.uk

Abstract—The principal objective of this work is to build multi dialect Arabic text corpora using the web as a corpus. A survey was conducted to categorise distinct words and phrases that are common to one specific dialect only and not used in the others, the purpose being to download a text corpus for each dialect. From this experiment we obtained 48M tokens across the Arabic dialects. These were categorised into four main dialects, Gulf, Levantine, Egyptian and North African, yielding 14.5M, 10.4M, 13M and 10.1M tokens respectively. The total number of distinct types across all the corpora is 2M. In this paper we describe how the corpora were constructed using these distinct words.

Keywords—Multi Dialect; Text Corpora; Automatic Building

I. INTRODUCTION

In Arabic there is a shortage of available tools and corpora for NLP tasks. This shortage becomes acute when discussing dialect resources, especially written dialect corpora, since most Arabic texts are written in MSA. This is a major obstacle for Arabic researchers in the area of NLP.

The problem of the shortage of resources in Arabic has several aspects. All researchers in Arabic NLP have to construct some or most of their own resources [1]; each researcher starts from zero and then spends a long time compiling sufficient resources. The lack of standardisation in NLP tasks is another aspect to consider. Finally, the scarcity of Arabic resources means that Arabic research in NLP does not yet involve the in-depth work carried out for other languages such as English, Chinese or the major European languages.

For the purposes of this work we built Arabic dialect corpora automatically by exploiting the web as a corpus. A key contribution of this paper is accent identification through the creation of word lists for each of the four main dialects. This involved conducting a survey with a group of Arabic speakers from different countries to assist in categorising the words. Once the lists were ready, we bootstrapped each word by using a web search engine to download pages that were likely to be written in the same dialect as the word being searched for.

The paper is organised as follows: Section II gives a description of the relationship between MSA and the other varieties of Arabic; Section III highlights the benefits of using a web corpus; Section IV presents the methodology


that was used, including a description of the survey that was conducted; Section V describes the implementation; Section VI presents an evaluation of the work; Section VII discusses related work; conclusions and future work are presented in Section VIII.

II. ARABIC DIALECTS

A. MSA vs. Dialects

Arabic can be categorised into MSA and dialects. MSA is the formal language of communication, understood by the majority of Arabic speakers. It is used for communication between people of different dialects, for broadcasting news, and so on. Conversely, the dialects are used in daily conversation and, in recent times, have begun to appear on television and radio and in written form on the Internet. Thus, unless there is a requirement for MSA, Arabic speakers prefer to use their local dialect.

B. Dialectal Variation

Each Arab country has its own particular main dialect, which can itself be divided into a group of sub-dialects; e.g. the Saudi dialect includes the Najdi (Central) dialect, the Hejazi (Western) dialect, the Southern dialect, etc.

The majority of dialect words either originate from MSA, e.g. ألت له /Aulitluh/ (Levantine) has the origin قلت له /Qultulahu/ 'I said to him'¹ in MSA, or they are loanwords, e.g. ساندويشه /sAndawyšah/ (Gulf) 'sandwich'. In both cases there have been significant transformations that distinguish the original words from the current expressions. The gap between MSA and the Arabic dialects affects morphology, word order and vocabulary [2]. An interesting experiment conducted by Kirchhoff and Vergyri [ibid.] computed dialect vocabulary overlap as a percentage of shared unigrams, bigrams and trigrams, and compared these numbers with equivalent statistics for two varieties of English. In Arabic, the inventory of words (unigrams) overlaps by only 10%, and there is little overlap of bigrams and trigrams, as Table I shows. This can be compared to English, where the percentage of shared unigrams was found to be 44%, with approximately 20% and 5% for bigrams and trigrams, respectively.

¹ In this paper, Arabic MSA and dialect words are represented in some or all of four variants according to context: (Arabic word /HSB transliteration scheme [3]/ (dialect) 'English translation').

The bigram and trigram percentages show how much greater the variation between MSA and the dialects is, compared with the two varieties of English.

The dialect words that originate from MSA have undergone changes in their affixes or stems. Many Arabic dialects change the phonetics of prefixes and suffixes; for example, in the Egyptian dialect the prefix س /s/ meaning 'will' is converted to ح /H/, and in North Africa the suffix ـوا /wA/ 'they' is replaced with ـوش /wš/. Some changes also occur in stems, e.g. the letter qaf (حرف القاف) /q/ is changed to ء /A/ in Levantine, and Gulf speakers use تس /ts/ or تش /tš/ instead of ك /k/ for some words. Table II shows some of these changes, comparing MSA with some of the main dialects.

These developments in phonetics are reflected in the writing system. As a result of new phonetics, which have developed and been absorbed into Arabic, additional symbols have been introduced for certain phonemes that did not previously exist in the MSA alphabet; one example is ﭪ /v/, which is primarily used in loanwords [4]. The point here is that the usage of these representations of new phonetics is not standardised, nor are all the new phonetics represented as yet.

C. The need for multi dialect written text corpora

The majority of the Arabic dialects occur in spoken rather than written form. Written dialect resources are nevertheless needed in order to deal with spoken dialects in many NLP applications, such as Automatic Speech Recognition (ASR) systems that are sensitive to dialect. ASR and TTS systems, for example, rely on a language model, and that language model should be in the same dialect as the speech being processed. Kirchhoff et al. [4] tried to use MSA text in speech recognition for the Egyptian dialect; however, their experiment did not yield any improvements. The need for written forms of the Arabic dialects is therefore crucial for dialect NLP tasks, especially those that involve speech. The differences in vocabulary, syntax and phonetics between MSA and the dialects, and among the dialects themselves, make it very important to use dialect-based corpora rather than MSA speech or text corpora.

TABLE I. PERCENTAGE OF SHARED UNIGRAMS, BIGRAMS AND TRIGRAMS IN THE EGYPTIAN CORPUS AND THE MSA CORPUS, AND FOR THE CONVERSATIONAL BRITISH ENGLISH CORPUS AND AMERICAN ENGLISH CORPUS [2]

Shared          ECA-MSA    BE-AmE
Unigrams (%)    10.3%      44.5%
Bigrams (%)     1%         19.2%
Trigrams (%)    < 1%       5.3%
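As an illustration of how overlap figures of this kind can be computed, the following Python sketch counts the percentage of shared n-gram types between two tokenised corpora. The file names and the whitespace tokenisation are assumptions for the example, not the setup used in [2].

def ngrams(tokens, n):
    """Return the set of n-grams (as tuples) in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def shared_percentage(tokens_a, tokens_b, n):
    """Percentage of n-gram types in corpus A that also occur in corpus B."""
    a, b = ngrams(tokens_a, n), ngrams(tokens_b, n)
    return 100.0 * len(a & b) / len(a) if a else 0.0

if __name__ == "__main__":
    # Hypothetical file names; each file is assumed to hold plain,
    # whitespace-tokenisable text in one variety (e.g. ECA vs. MSA).
    with open("eca.txt", encoding="utf-8") as f:
        eca = f.read().split()
    with open("msa.txt", encoding="utf-8") as f:
        msa = f.read().split()
    for n, label in [(1, "unigrams"), (2, "bigrams"), (3, "trigrams")]:
        print(f"shared {label}: {shared_percentage(eca, msa, n):.1f}%")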

TABLE II. SOME CHANGES IN PHONETICS BETWEEN ARABIC DIALECTS COMPARED WITH MSA

MSA phoneme    Egyptian    Levantine    Gulf        North Africa
θ              s (or) t    s (or) t     θ           t
ð              z (or) d    z            ð           ð
q              A           A            g           g
j              g           j            j (or) g    j

There is currently no Arabic multi dialect text corpus that covers the requirements outlined above. Building large multi dialect corpora to address this shortage in Arabic resources would therefore be very valuable, as it would give researchers the opportunity to carry out much more work in the area of Arabic NLP.

III. MAKING USE OF THE WEB TO BUILD TEXT CORPORA: THE WEB AS A CORPUS

Kilgarriff [5] stated many benefits to be derived from using the web as a corpus, and noted that a considerable amount of work had already been done making use of the web in this way. The Arabic content on the Internet has increased dramatically in recent years and now exceeds 2 billion pages [6]. This content is therefore deemed comprehensive enough to exploit for building dialect corpora. The huge number and variety of Arabic texts on the web mean that a good selection of different text corpora is available. Although most of the Arabic texts on the web are written in MSA, such as newswire, commercial and government sites, there is still enough dialectal material available in the form of blogs, comments, forums, Facebook, Twitter, etc. If we can make use of these resources, we can compile useful Arabic dialect text corpora.

IV. METHODOLOGY

The methodology involves five main steps in the process of building the multi dialect corpora: collect multi dialect words and phrases, conduct a survey, estimate the counts, download the pages, and perform cleaning and normalisation.

A. Step 1: Collect multi dialect words and phrases

The first step is to collect dialect words and phrases that are used in different Arab countries. Each word is assigned to one of the four main dialects: Gulf, Egyptian, North African and Levantine. The words were collected by exploring the web to extract dialect words; some of these words were then derived to obtain a greater number of dialect forms. At the end of this step about 1500 dialect words had been collected from the web corpus and categorised into the four main dialects. We could be sure that the words in each dialect list were actually used in that dialect; however, we could not be certain that these words do not also appear in other dialects. Therefore, a survey was conducted with a group of Arabic speakers from different countries to confirm that they do not use any words from lists other than their own.

B. Step 2: Survey dialect speakers

This step is the key stage of the work: distinguishing the words that are used in each main dialect only. To perform this accent identification, six people, three males and three females, were chosen to participate in the survey. The job of each participant was to confirm that none of the words in the other dialects' lists are used in his or her country. Some of the words in a list are used in other dialects with minor changes; this is acceptable because we search for the exact word. Some words are rarely used, which is also

acceptable because we were looking for majority usage. Once this step is completed, we have pure words and phrases for each dialect, i.e. words that are not used in any other dialect. Table III shows some examples. By the end of the survey, the four lists of multi dialect words were ready for bootstrapping in the web corpus. Table IV shows the number of words for each dialect and the total after the survey, which is about 1000 words and phrases. As can be seen from Table IV, Egyptian was identified as having the lowest number of words. The reason for this is the location of Egypt in the centre of the Arab world: it shares words with many other Arab countries, unlike the Gulf, which lies in the east of the Arab world, and the Levant and North Africa, which lie to the north and west respectively.

C. Step 3: Estimate the counts

Before the downloading stage took place we needed to estimate how many tokens would be produced per link. Table V shows the results of the estimation. In this step we intend to make sure that each dialect corpus will contain more than 10M tokens. To simplify the process we chose either 50 or 100 pages per word for each dialect according to the estimation. Based on the results in Table V, it is possible to determine how many pages are needed for each dialect to obtain comparable results. We set the page count to 50 pages for Gulf and Levantine, and 100 pages for North Africa and Egyptian; more links than this may be retrieved due to repeated results. This gives an average corpus of around 12M tokens for each dialect, i.e. each dialect corpus exceeds 10M tokens. A worked version of this estimate is sketched after Table V below.

TABLE III. EXAMPLES OF CATEGORISED WORDS AND PHRASES

Gulf: باتسر، ابتل، ابخص، اتربع
Egyptian: اتلم، مسحول، اشتغالة، الباشكاتب
Levantine: تؤبرني، ابو شريك، ابيش، اجبد
North Africa: اتاي، البغرير، طوبيس، الدسة

TABLE IV. TOTAL NUMBER OF WORDS

Dialect         Total of words
Gulf            430
North Africa    200
Levantine       274
Egyptian        139
Total           1043

TABLE V. ESTIMATION OF HOW MANY PAGES WE NEED PER DIALECT

Dialect         Word count    1 word/link produced "Average"    We need
Gulf            430           711 words                         50 pages
North Africa    200           528 words                         100 pages
Levantine       274           794 words                         50 pages
Egyptian        139           979 words                         100 pages
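The page-count decision in Table V reduces to simple arithmetic: expected tokens = seed words × pages per word × average tokens per page. A minimal sketch with the Table V averages hard-coded is shown below; the 10M-token threshold is the one stated in Step 3, and everything else is an illustrative assumption.

# Averages taken from Table V; the 10M-token target per dialect is the
# threshold stated in Step 3.
TARGET_TOKENS = 10_000_000

dialects = {
    # dialect: (number of seed words, average tokens produced per link)
    "Gulf":         (430, 711),
    "North Africa": (200, 528),
    "Levantine":    (274, 794),
    "Egyptian":     (139, 979),
}

for name, (words, tokens_per_link) in dialects.items():
    # Choose 50 or 100 links (pages) per word, whichever first exceeds the target.
    pages = next(p for p in (50, 100) if words * p * tokens_per_link >= TARGET_TOKENS)
    expected = words * pages * tokens_per_link
    print(f"{name}: {pages} pages/word -> ~{expected / 1e6:.1f}M tokens")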

D. Step 4: Start the downloading

After estimating the required pages for each dialect, we moved on to the next step, which involved collecting links for the words of each dialect using the Bing search API [7]. We then downloaded the web pages for each dialect and saved them as HTML files. A sketch of this step is given below.
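The following sketch illustrates the collect-and-download step. It uses the current Bing Web Search v7 REST endpoint and the Python requests library purely for illustration; the original work used the 2011 Bing search API [7], and the environment variable, file naming and query parameters are assumptions.

import os
import requests

# Assumption: the current Bing Web Search v7 endpoint is used here for
# illustration; the paper used the 2011 Bing search API [7].
SEARCH_URL = "https://api.bing.microsoft.com/v7.0/search"
API_KEY = os.environ["BING_API_KEY"]  # hypothetical environment variable

def collect_links(query, count=50):
    """Return up to `count` result URLs for an exact-phrase dialect query.
    Retrieving more than 50 results would require paging with the offset parameter."""
    headers = {"Ocp-Apim-Subscription-Key": API_KEY}
    params = {"q": f'"{query}"', "count": count}
    response = requests.get(SEARCH_URL, headers=headers, params=params, timeout=30)
    response.raise_for_status()
    results = response.json().get("webPages", {}).get("value", [])
    return [item["url"] for item in results]

def download_pages(urls, out_dir="pages"):
    """Save each page as a raw HTML file, skipping any that fail to download."""
    os.makedirs(out_dir, exist_ok=True)
    for i, url in enumerate(urls):
        try:
            html = requests.get(url, timeout=30).content
        except requests.RequestException:
            continue
        with open(os.path.join(out_dir, f"page_{i:05d}.html"), "wb") as f:
            f.write(html)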

E. Step 5: Cleaning and Normalisation

Cleaning the resulting corpora is an important step, as the biggest drawback of a web corpus is noise. Collecting web pages from different sources such as forums, blogs and comment sections makes it very hard to formulate general cleaning rules. Each cleaning step includes removing unwanted symbols, tags, underscores, runs of more than one space, and so on. We also needed to normalise the resulting texts. The normalisation includes removing words and phrases that are repeated across web pages, especially the navigation words used in forums, such as الصفحة الرئيسية 'home page', عضو 'member', التسجيل 'register', etc. There is also a need to remove repeated sentences or paragraphs that appear in sequence. However, if a sentence is repeated but not in sequence, we cannot remove it, as it might come from a different context. A minimal sketch of this step follows.
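The sketch below illustrates the kind of cleaning and consecutive-duplicate removal described above. The boilerplate word list contains only the three examples mentioned in the text, and the HTML stripping and sentence handling are simplified assumptions rather than the exact rules used in this work.

import re
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect the text content of an HTML page, ignoring tags, scripts and styles."""
    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = False
    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip = True
    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self._skip = False
    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)

# A few of the recurring forum words mentioned in Step 5; a real list would be longer.
BOILERPLATE = ["الصفحة الرئيسية", "عضو", "التسجيل"]

def clean(html):
    """Strip markup and simple noise from one downloaded page."""
    parser = TextExtractor()
    parser.feed(html)
    text = " ".join(parser.parts)
    for word in BOILERPLATE:
        text = text.replace(word, " ")
    text = re.sub(r"[_\t]+", " ", text)          # underscores, tabs
    text = re.sub(r"\.{2,}", ".", text)          # runs of dots
    return re.sub(r"\s{2,}", " ", text).strip()  # repeated spaces

def drop_consecutive_repeats(sentences):
    """Remove sentences repeated in immediate sequence, keeping the first copy."""
    kept = []
    for s in sentences:
        if not kept or s != kept[-1]:
            kept.append(s)
    return kept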

V. IMPLEMENTATION

Three main steps are necessary to implement this work: the first is to collect the required links and download them; the last is to clean and normalise all of the resulting texts. After this we extract the main features in order to measure the resulting corpora.

For the first step, we retrieved the first 50-100 links for each search using the Bing API [7] and then downloaded those pages, which were likely to be written in the same dialect as the query word. For all four corpora, we downloaded more than 55,000 web pages from different web resources.

Although most Arabic web pages use cp1256 encoding, other encodings are also used, such as UTF-8 or ISO 8859-6. In this work we use cp1256 when converting pages to text, as it is the encoding used by most Arabic web pages.

The cleaning stage is important: we want to reduce the pages and files to pure Arabic text. Some of the noise that had to be cleaned consisted of HTML tags, underscores, tabs, blank lines, runs of more than one space, runs of more than one dot, etc.
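Since most pages are cp1256 but some are UTF-8 or ISO 8859-6, a defensive decoding step can be useful. The fallback order below is an assumption for illustration; the work itself simply converted pages using cp1256.

def decode_arabic_page(raw_bytes, declared=None):
    """Decode raw page bytes to text. UTF-8 is tried early because it fails
    loudly on non-UTF-8 input; cp1256, the default used in this work, is the
    fallback. The ordering is an illustrative choice, not the paper's."""
    candidates = ([declared] if declared else []) + ["utf-8", "cp1256", "iso-8859-6"]
    for encoding in candidates:
        try:
            return raw_bytes.decode(encoding)
        except (UnicodeDecodeError, LookupError):
            continue
    # Last resort: decode as cp1256 and replace undecodable bytes.
    return raw_bytes.decode("cp1256", errors="replace")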


Table VI shows the results of the experiment after cleaning and normalisation. The Gulf, North Africa, Levantine and Egyptian corpora contain 14.5M, 10.1M, 10.4M and 13M tokens respectively. Table VI also shows the distinct types in each dialect, which number 920K, 720K, 770K and 770K respectively. As the corpora have been built from the web, they have a high percentage of repeated sentences, 39% on average, as Table VI shows. Table VII shows the total results for all four corpora, which together contain more than 50 million tokens; across these there are more than 2 million distinct types.

TABLE VI. RESULTS

Dialect         Size in mega tokens    % of repeated sentences    Distinct types
Gulf            14.5                   41%                        920K
North Africa    10.1                   41%                        720K
Levantine       10.4                   33%                        770K
Egyptian        13                     42%                        770K

TABLE VII. TOTAL RESULTS

Size in mega tokens    Distinct types for all dialects "not a total"    1 word/link produced "Average for all"    Average % of repeated sentences
48                     2M                                               753K                                      39%

TABLE VIII. SENTENCE COUNTS AND AVERAGE LENGTH

Dialect         Exact size (tokens)    Number of sentences    Average sentence length
Gulf            15,291,500             1,729,545              8.84
North Africa    10,570,195             1,044,495              10.12
Levantine       10,885,829             1,143,354              9.52
Egyptian        13,611,341             1,265,921              10.75
Total           50,358,865             5,183,315              9.8

TABLE IX. COMPARING UNKNOWN WORDS

Corpus                 % of unknown words    Dialect(s)
Arabic Gigaword        18%                   MSA
CCA                    15%                   Contemporary
Our dialect corpora    39%                   Multi dialect
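The figures in Tables VI-VIII can be reproduced by straightforward counting over the cleaned text. The sketch below assumes whitespace tokenisation and a crude sentence split on full stops and newlines, which are simplifications of whatever segmentation was actually used in this work.

import re
from collections import Counter

def corpus_stats(text):
    """Token count, distinct types, sentence count, average sentence length,
    and percentage of sentences that occur more than once in the corpus."""
    tokens = text.split()
    sentences = [s.strip() for s in re.split(r"[.\n]+", text) if s.strip()]
    counts = Counter(sentences)
    repeated = sum(c for c in counts.values() if c > 1)
    return {
        "tokens": len(tokens),
        "distinct_types": len(set(tokens)),
        "sentences": len(sentences),
        "avg_sentence_length": len(tokens) / len(sentences) if sentences else 0,
        "repeated_sentence_pct": 100.0 * repeated / len(sentences) if sentences else 0,
    }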

There are more than 5 million sentences across the four corpora, as Table VIII shows. The average sentence count is 1.3 million per corpus and the average sentence length is about 10 words.

VI. EVALUATION

To evaluate this work we compare it with the MSA Gigaword corpus [8] and the Corpus of Contemporary Arabic (CCA) [9]; as yet, there is no available dialect corpus to compare it with.

A. Comparing results

1) OOV comparison

After extracting 50K distinct types and analysing them with the MSA morphological analyser Alkhalil [10], the percentage of unknown words in each of several corpora was determined, as shown in Table IX. For the Arabic Gigaword [8] and the CCA [9] the results were 18% and 15% respectively, whereas the proportion of unknown words for our dialect corpora was 39%. It is clear that there is a significant difference between the tokens used in MSA and those used in the dialects. Dialect words differ in their affixes, in new loanwords, in the affixes attached to those loanwords, and in diacritisation. The sketch below shows how such an OOV rate can be computed.
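Computing an OOV rate of this kind amounts to checking a sample of distinct types against a morphological analyser. Alkhalil [10] is a separate tool, so the `analyser` callable below is a hypothetical stand-in for whatever wrapper is available; the 50K sample size is the one used in the evaluation.

import random

def unknown_word_rate(types, analyser, sample_size=50_000, seed=0):
    """Percentage of a random sample of distinct types that the analyser
    cannot analyse. `analyser(word)` is assumed to return a (possibly empty)
    list of analyses; it is a stand-in for Alkhalil [10], not its real API."""
    random.seed(seed)
    sample = random.sample(list(types), min(sample_size, len(types)))
    unknown = sum(1 for word in sample if not analyser(word))
    return 100.0 * unknown / len(sample)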

2) Size comparison

We can compare the size of this work with the CCA [9], which contains some dialectal (contemporary) words but is still mostly MSA in character. Table X shows the result of this comparison. As can be seen from Table X, there is a large difference in size between our four corpora, which average 12M tokens each, and the CCA [9], which contains fewer than 1M tokens. The size of the CCA [9] is also reflected in its count of distinct types, 125K, whereas the average number of distinct types in our dialect corpora is 800K. This large difference in distinct types matters for NLP research: the CCA [9] offers only about 15% of the word variety of an average corpus in this work, and only about 6% of the distinct words found across all four corpora.

3) Syntax comparison

It is very difficult to measure syntactic differences between MSA and the dialect corpora, as no dialect parser or dialect POS tagger is available. However, when looking at a dialect corpus it is clear that its syntax differs from MSA syntax in general structure, sentence form and the positions of parts of speech. A further issue is that the dialects also differ from one another in syntax.

B. Analysis

Table XI shows the frequencies of token types in four groups: tokens that occur once (known from Greek as 'hapax legomena'), tokens that occur 2 to 9 times, tokens that occur 10 to 100 times, and tokens that occur more than 100 times. It is very common for the majority of types to have a frequency of less than ten [11], and it is equally common for a very few tokens to have a very high frequency [ibid.].

Table XII shows the 10 most frequent unigrams, with their frequencies, across all four corpora. The unigrams that appear in Table XII are originally MSA words; when they are used in the dialects there are some changes in diacritisation, e.g. كل /kul/ (MSA) becomes كل /kil/ (Gulf) 'eat'. The total of the 10 most frequent unigrams is about 9% of the corpus size. As can be seen from Table XII, some of the words have more than one meaning according to context. Most of the tokens in Table XII are function words; if we look at the most frequent tokens excluding function words, we obtain the results shown in Table XIII.

We applied Zipf's law to all four corpora to measure how balanced they are. Zipf's law suggests that there is a constant k such that:



f.r = k Where f is the frequency and r is the rank TABLE X.

SIZE COMPARING

Corpus

All tokens

Distinct types

CCA

600K

125K

Gulf

14.5M

920K

North Africa

10.1M

720K

Levantine

10.4M

770K

Egyptian

13M

770K

All four corpora

48M

2M



TABLE XI. FREQUENCY OF FREQUENCIES OF TOKEN TYPES

Token frequency    Percentage of types
1                  49%
2-9                39%
10-100             10%
>100               2%
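The "frequency of frequencies" in Table XI can be computed by bucketing type frequencies; the bucket boundaries below are those used in the table, and whitespace-tokenised input is assumed.

from collections import Counter

def frequency_of_frequencies(tokens):
    """Percentage of distinct types whose corpus frequency falls in each bucket."""
    counts = Counter(tokens)
    buckets = {"1 (hapax legomena)": 0, "2-9": 0, "10-100": 0, ">100": 0}
    for freq in counts.values():
        if freq == 1:
            buckets["1 (hapax legomena)"] += 1
        elif freq <= 9:
            buckets["2-9"] += 1
        elif freq <= 100:
            buckets["10-100"] += 1
        else:
            buckets[">100"] += 1
    total = len(counts)
    return {bucket: 100.0 * count / total for bucket, count in buckets.items()}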

According to Zipf's law, the rank-frequency plot should be a line with slope -1 if the corpus is balanced. The fitted slopes for all four corpora were close to -1, which is what Zipf's law predicts for a balanced language; Figures 1, 2, 3 and 4 show the results. A sketch of one way to estimate such a slope is given below.

Table XIV shows the resulting bigram, trigram and five-gram counts. Our corpora contain more than 43 mega bigrams, an estimated 38 mega trigrams and an estimated 30 mega five-grams. Remarkably, as Table XIV shows, bigrams have the greatest total count but the lowest distinct count, at about 15 mega distinct bigrams, whereas trigrams have the greatest distinct count. The extraction of n-grams from written corpora is very useful, since n-grams highlight the main features of a language or dialect.
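One way to estimate the slope of the rank-frequency line is a least-squares fit in log-log space, as sketched below; this is an illustrative fit and not necessarily the procedure used to produce Figures 1-4.

import math
from collections import Counter

def zipf_slope(tokens):
    """Least-squares slope of log(frequency) against log(rank);
    a balanced corpus is expected to give a value close to -1."""
    freqs = sorted(Counter(tokens).values(), reverse=True)
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(freq) for freq in freqs]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    return covariance / variance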


TABLE XII. 10 GREATEST UNIGRAM TOKEN FREQUENCIES (INCLUDING FUNCTION WORDS)

Token    Frequency    Some meaning(s) according to diacritisation
من        953440       From, that, who?
في        771032       In, at, on, within, during, by, mouth (very rarely used)
و         527566       And, with
على       519358       To, at, in, upon, onto
ما        400466       That, not, what
الله       349776       Allah
لا         264736       No
يا         228592       O
عن        182279       From, about
كل        163140       All, each, every, eat, tired (very rarely used)

Figure 1. Zipf law of Egyptian, slope = - 0.9761. log of rank on the X-axis versus frequency on the Y-axis

Figure 2. Zipf law of Gulf, slope = -0.9759. log of rank on the X-axis versus frequency on the Y-axis

TABLE XIII. GREATEST UNIGRAM TOKEN FREQUENCIES (NON-FUNCTION WORDS)

Token    Frequency    Some meaning(s) according to diacritisation
الله       349776       Allah
محمد      100041       Mohammad
والله      90579        And Allah
يوم       71986        Day
علي       70455        Ali

Figure 3. Zipf law of Levantine, slope= -0.972. log of rank on the X-axis versus frequency on the Y-axis

C. Error evaluation

The rich morphology of Arabic yields a large number of derived forms for each MSA word. When the new prefixes and suffixes that appear in the dialects are added, the number of possible forms per word becomes huge, which in turn invites many spelling mistakes. Spelling mistakes are likely when writing dialect words because, unlike MSA words, which have a writing standard, the dialects have no standardised orthography as yet.

Many Arabic words differ only in النقاط 'the marks'; such words are more likely than others to be misspelled, e.g. على 'upon' versus علي 'Ali (proper noun)', where the difference lies only in the marking of the last letter.

Another issue is that some web pages use MSA as their main register but contain some dialect words. Because of this there are MSA pages in the corpus, and these are very difficult to remove automatically. Generally, it is better for a corpus to have MSA syntax in addition to the main dialect, since every dialect has MSA sentences within it. Finally, we found that many web pages included mixed dialects and so repeated the same information in more than one dialect; this might cause noise when working with more than one dialect concurrently.

Figure 4. Zipf law of North Africa, slope = -0.9793. log of rank on the Xaxis versus frequency on the Y-axis

TABLE XIV. BIGRAM, TRIGRAM AND FIVE-GRAM COUNTS (IN MEGA)

Dialect         bigram    distinct bigram    trigram    distinct trigram    5-gram    distinct 5-gram
Egyptian        11.8      4.8                10.6       6.3                 8.4       5.6
Gulf            12.9      5.3                11.3       6.8                 8.4       5.5
Levantine       9.3       4.3                8.2        5.3                 6.2       4.3
North Africa    9.1       3.9                8.1        4.9                 6.3       4.1
4 corpora       43        15.1               38.1       21.5                29.3      18.8
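The entries in Table XIV are total and distinct n-gram counts per corpus; a direct way to obtain them for one tokenised corpus is sketched below, with whitespace tokenisation assumed.

from collections import Counter

def ngram_counts(tokens, n):
    """Return (total n-grams, distinct n-grams) for a token list."""
    grams = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return sum(grams.values()), len(grams)

# Example: report bigram, trigram and 5-gram counts, assuming `tokens`
# holds one tokenised dialect corpus.
# for n in (2, 3, 5):
#     total, distinct = ngram_counts(tokens, n)
#     print(f"{n}-grams: {total / 1e6:.1f}M total, {distinct / 1e6:.1f}M distinct")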

VII. RELATED WORK

Some written corpora are built for specific tasks, such as machine translation, speech recognition or educational purposes; others are created for general NLP tasks. The majority of the current Arabic written corpora come from newswire sources, including the Al-Hayat corpus [1], the Gigaword corpus [8] and the LDC Arabic Newswire Corpus [12]. These corpora are written in MSA rather than in the dialects and, as pointed out previously, there is a large difference between MSA and the dialects in most aspects of the language [2]. As Table XV shows, most of the MSA text corpora are also not free. The challenge is that the available Arabic corpora mainly use MSA, since the majority are drawn from newswire. The main corpus available to an Arabic NLP researcher looking for dialect text is the Corpus of Contemporary Arabic [9], which includes some dialect sentences. Another option is to use the transcripts of dialect speech corpora such as [13], [14], [15] and [16]; however, these are not large enough for most NLP tasks, and none of them is free.

TABLE XV. MSA ARABIC CORPORA

Name of Corpus            Source                    Size          Free    Ref.
Arabic Newswire Corpus    LDC                       80M tokens    No      [12]
CLARA                     Charles University        50M tokens    No      [17]
Al-Hayat Corpus           ELRA                      18.6M         No      [1]
Arabic Gigaword           LDC                       400M          No      [8]
Watan                     http://sourceforge.net    ----          Yes     [18]
Khaleej                   http://sourceforge.net    3M            Yes     [19]
A MSA Corpus              Building a MSA corpus     21M           No      [20]


VIII. CONCLUSIONS

This paper has described Arabic multi dialect written corpora created by exploiting the web as a corpus. These corpora will be made available for public use.

The first step of the work was to collect and group dialect words from different Arab websites. Once the four lists were ready, a survey was conducted with a group of people from different Arab countries to categorise the words into the four main dialects: Levantine, Egyptian, Gulf and North African. The resulting words were then used to obtain links through a search engine API, and the web pages likely to be written in the same dialect were downloaded; more than 55,000 web pages were collected in this way. After the downloading stage, we cleaned and normalised the downloaded pages to obtain clean corpora. These corpora cover the four main dialects, Gulf, Levantine, Egyptian and North African, with 14.5M, 10.4M, 13M and 10.1M tokens respectively, and the total number of distinct types across all corpora is more than 2 million.

In future work we will endeavour to make the corpora cleaner by working on the current version to produce a new version with less noise, and will then perform an in-depth analysis of the resulting dialect corpora.


REFERENCES

[1] A. Goweder and A. De Roeck, "Assessment of a significant Arabic corpus", presented at the Arabic NLP Workshop at ACL/EACL, 2001.
[2] K. Kirchhoff and D. Vergyri, "Cross-dialectal data sharing for acoustic modeling in Arabic speech recognition", Speech Communication 46, 1 (2005), pp. 37-51.
[3] N. Habash, A. Soudi and T. Buckwalter, "On Arabic transliteration", Arabic Computational Morphology, 38:15-22, 2007.
[4] K. Kirchhoff, J. Bilmes, S. Das, N. Duta, M. Egan, J. Gang, H. Feng, J. Henderson, L. Daben, M. Noamany, P. Schone, R. Schwartz and D. Vergyri, "Novel approaches to Arabic speech recognition: report from the 2002 Johns-Hopkins Summer Workshop", in ICASSP '03, Missouri, USA, 2003, pp. 344-347.
[5] A. Kilgarriff, "Web as corpus", in Proceedings of Corpus Linguistics, Lancaster, England, 2001, pp. 342-344.
[6] A. Alari, M. Alghamdi, M. Zarour, B. Aloqail, H. Alraqibah, K. Alsadhan and L. Alkwai, "Estimating the size of Arabic indexed web content", Scientific Research and Essays, 7(28):2472-2483, 2012.
[7] Microsoft, "Bing Search API", 2011.
[8] R. Parker, D. Graff, K. Chen, J. Kong and K. Maeda, "Arabic Gigaword", LDC, University of Pennsylvania, Philadelphia, 2009.
[9] L. Al-Sulaiti and E. Atwell, "The design of a corpus of contemporary Arabic", International Journal of Corpus Linguistics 11, 2 (2006), pp. 135-171.
[10] A. Boudlal, A. Lakhouaja, A. Mazroui and A. Meziane, "Alkhalil Morpho Sys1: a morphosyntactic analysis system for Arabic texts", in ACIT'2010 Proceedings, Riyadh, Saudi Arabia, 2011.
[11] C. D. Manning and H. Schütze, Foundations of Statistical Natural Language Processing, MIT Press, 1999.
[12] A. Cole, D. Graff and K. Walker, "Arabic Newswire Part 1 Corpus (1-58563-190-6)", LDC, 2001.
[13] Appen Pty Ltd (Gulf), Sydney, "Gulf Arabic Conversational Telephone Speech, Transcripts", Linguistic Data Consortium, Philadelphia, 2006.
[14] H. Gadalla, H. Kilany, H. Arram, A. Yacoub, A. El-Habashi, A. Shalaby, K. Karins, E. Rowson, R. MacIntyre, P. Kingsbury, D. Graff and C. McLemore, "CALLHOME Egyptian Arabic Transcripts", LDC, Philadelphia, 1997.
[15] M. Maamouri, T. Buckwalter, D. Graff and H. Jin, "Fisher Levantine Arabic Conversational Telephone Speech", LDC, 2007.
[16] Appen Pty Ltd (Iraqi), Sydney, "Iraqi Arabic Conversational Telephone Speech, Transcripts", LDC, 2006.
[17] P. Zemanek, "CLARA (Corpus Linguae Arabicae): an overview", in Proceedings of the ACL/EACL Workshop on Arabic Language, 2001.
[18] M. Abbas, K. Smaili and D. Berkani, "Evaluation of topic identification methods on Arabic corpora", Journal of Digital Information Management 9 (2011), pp. 185-192.
[19] M. Abbas and D. Berkani, "Topic identification by statistical methods for Arabic language", WSEAS Transactions on Computers 5, 9 (2006), pp. 1908-1913.
[20] A. Abdelali, J. Cowie and H. Soliman, "Building a modern standard Arabic corpus", in Workshop on Computational Modeling of Lexical Acquisition, Split, Croatia, 2005.