Comparative Evaluation of Two Arabic Speech Corpora - IEEE Xplore

23 downloads 0 Views 187KB Size Report
Standard Arabic has 34 basic phonemes, of which six are vowels, and 28 ... there are as many syllables in a word as vowels in it [1], [4]. ... corpora about Colloquial Arabic, The Saudi Accented Arabic ... phonetically rich words, sentences and pronunciation of the ... refugee/medical questions (such as "Where is your pain?",.
Comparative Evaluation of Two Arabic Speech Corpora [email protected]

Ali Hamid Meftah

Yousef Ajami Alotaibi

College of Computer & Information Sciences,

College of Computer & Information Sciences,

Computer Engineering Dept.! King Saud University.

Computer Engineering Dept.! King Saud University.

Riyadh, Saudi Arabia

Riyadh, Saudi Arabia

[email protected]

Abstract: The aim of this paper is to conduct a constructive and comparative evaluation between two important Arabic corpora for two different Arabic dialects, namely, Saudi dialect corpus that was collected by King Abdulaziz City for Science and Technology (KACST), and a Levantine Arabic dialect corpus. Levantine dialect is spoken by ordinary Lebanese, Jordanian, Syrian, and Palestinian people. The later one was produced by the

Linguistic

disadvantages discussed.

This

Data of

Consortium

these

two

discussion

is

(LDC).

corpora

were

aiming to

Advantages

and

presented

and

help digital speech

processing researchers to figure out the weakness and strength sides of these important corpora before considering them in their experiments. Moreover, this paper can motivate in designing, maintaining, distributing, and upgrading Arabic corpora to help Arabic language speech research communities.

Keywords: Arabic; SAAVB; BBN/AUB; Speech; MSA; Levantine.

1.

Introduction

Arabic is a Semitic language, and it is one of the oldest languages in the world. The estimated number of Arabic speakers is 250 million, of whom roughly 195 million are first-language speakers and 55 million are second-language speakers. Currently, Arabic is the fifth most widely spoken language in the world [1], and it is the second language in terms of number of speaker [2]. It is also the language of the Islamic tradition instructions. The Arabic language is typically classified into three forms: classical Arabic, Modem Standard Arabic (MSA), and colloquial Arabic. Classical Arabic is the language of the Quran, the religious instruction in Islam and the great writers and poets. It is also known as Qur'anic Arabic, it is used in numerous literary texts were created during Umayyad and Abbasid times (7th to 9th centuries) [3]. Classical Arabic is often considered to be the parent language of all the spoken varieties of Arabic [3]. Modem Standard Arabic is the form of 978-1-4244-6899-7/10/$26.00 ©2010 IEEE

Arabic that is taught in schools and is used in formal talks, most printed matters in the Arab World including most books, newspapers, magazines, official documents, and reading primers for small children, and most radio and TV broadcasts. MSA has been generally adopted as the common medium of communication throughout the Arab world. Arabic is an official language in more than 22 countries [1]. Colloquial Arabic (or al-'ammiyya; i.e., the dialect of ordinary people or street communication language) is what is spoken in everyday conversation, or which is the form of Arabic used in everyday oral communication and varies significantly not only across regions and countries, but also within a single country. Arabic writing is performed from right to left, and letters take different forms depending on their position in the word, so it is different from English. Standard Arabic has 34 basic phonemes, of which six are vowels, and 28 are consonants [2], [4]. Arabic has fewer vowels than English. It has three long and three short vowels, while American English has at least 12 vowels [1], [2], [4]. Arabic phonemes contain two distinctive classes, which are named pharyngeal and emphatic phonemes. These two classes can be found only in Semitic languages like Arabic and Hebrew [2], [4]. One of the features of Arabic is the vowel lengthening in Arabic phonemes. Some Arabic dialects may have additional or fewer consonant phonemes. For example, Egyptian Arabic lacks the phonemes /THI ( �) and /thi ( �), and it replaces /3/ with /gi [1]. The syllables allowed in the Arabic language have the form: CV, CVC, and CVCC where V indicates a (long or short) vowel while C indicates a consonant. All Arabic syllables must contain at least one vowel. Arabic vowels cannot appear in syllable-initial position and can occur either between two consonants or in final position in a syllable, hence, a vowel cannot be the start phoneme in any Arabic word or Arabic utterances can only start with a consonant [1], [2], [4]. Arabic syllables can be classified as short or long. The CV type is a short one while all others are long. Syllables can also be classified as open or closed. An open syllable ends with a vowel, while a closed syllable ends with a consonant.

For Arabic, a vowel always forms a syllable nucleus, and there are as many syllables in a word as vowels in it [1], [4]. In this paper, we conducted a comparative study to two corpora about Colloquial Arabic, The Saudi Accented Arabic Voice Bank (SAAVB) that was collected by King Abdulaziz City for Science and Technology (KACST) during 2002 and 2003, and BBN/AUB DARPA Babylon Levantine Arabic Corpus that is developed with funding from the Defense Advanced Research Project Agency (DARPA), as part of the Babylon program, and produced by the Linguistic Data Consortium (LDC). This paper is organized as follows; Section 2 is a background for the two corpora. In Section 3, we present a comparative analysis survey. The conclusion follows in Section 4. 2.

Database

2.1 Background Speech corpus is usually collected spontaneously, canonically, or both [5]. A spontaneous speech corpus is collected from real world human-humanlhuman- machine communication. In this case, the speaker does not read prompts, rather he or she speaks naturally to convey a message and/or inquire for information. The speech is expected to be natural and not affected by the reading habits. A canonical speech corpus, on the other hand, is designed for speakers to follow certain procedures, including the reading of prompts, for the purpose of collecting specific speech sounds. The read speech content can be words, sentences such as application phrases and phonetically rich sentences or text such as news, lectures, articles and stories. Although the former is suitable for language understanding and dialogue design, it tends to include unneeded frequently repeated words and utterances but does not necessarily include all the sounds of the language under investigation. However, a canonical speech corpus tends to be phonetically rich, i.e. all the sounds of the language are presented in various phonetic positions. Most of the recently database belong to the third type, which contains both spontaneous and read speech [5], [6], [7]. Some speech databases are recorded directly from the speakers or through a broadcasting medium such as radio or television, this usually recorded at a high sampling rate: 16 kHz, 22 kHz, 48 kHz. The other speech databases are recorded through a telephone system, and it has a lower sampling rate, 8 kHz, due to the telephone system bandwidth. Each of the above categories has its own objectives and applications where the speech content and the recording procedures are different [5]. 2.2 SAAVB.

The Kingdom of Saudi Arabia area is about 1,960,582 square kilometers. It is located in south-west Asia and is surrounded in the north, east and south by other Arab

countries. Saudi Arabia has about 20 million inhabitants and four-fifths of them are native Saudis [5]. SAAVB is a telephony speech database that was collected by KACST during 2002 and 2003 [5]. A summary of the statistics of this database is given in Table 1. This data was acquired from 1033 native speakers of Arabic with Saudi accent (51% males and 49% females). Around 51% of those speakers are aged 16 to 30 years and 49% of them are above 30 years old. 70% of the database was recorded over the Saudi mobile network while the remaining 30% was acquired through the fixed-line network [10]. Each speaker read 59 prompts that consist of: numbers, phonetically rich words, sentences and pronunciation of the Arabic and English alphabets. The average number of words in each prompt is 5 words. The duration of the total recorded speech in SAAVB is 96.37 hours distributed among 60947 audio files (1033 speakers x 59 audio files). This means that the average duration for each speaker is 5.60 minutes and the average duration of each audio file is 5.70 seconds [5]. The database was digitized at 8000 Hz sampling rate with AID conversion precision of 8 bits. The total size of SAAVB is 2.59 GB. It consists of 1033 directories with 183,518 files. Every directory represents a speaker and contains 178 files distributed as follows [5], [10]: First is text document that has all the information needed about the speaker (gender, telephone type, age and acoustic environment). Second are 59 text files of the prompts. Finally is 59 text files of the speech transcription. 2.3 BBN/AUB DARPA Babylon Levantine

Levantine Arabic is the dialect of Arabic spoken by ordinary people in Lebanon, Jordan, Syria, and Palestine. It is significantly different from MSA Arabic, in that it is a spoken rather than a written language. It includes different word pronunciations, and even different words, from MSA Arabic, the written and "official" form of Arabic. This corpus was developed with funding from the DARPA, as part of the Babylon program. Approximately 20% of the corpus was recorded by BBN using paid subjects recruited in the Boston area from May 2002 to September 2002. This portion of the corpus was the first to be collected. Subsequently, the remaining 80% was recorded by the American University of Beirut (AUB), under subcontract to BBN, from July 2002 to November 2002. AUB students and staff served as both experimenters and subjects. This portion of the corpus was recorded in Beirut, Lebanon, on the AUB campus [6]. The subjects in the corpus were responding to refugee/medical questions (such as "Where is your pain?", How old are you? and were playing the part of refugees. Each subject was given a part to play, that prescribed what information, they were to give in response to the questions, but were told to express themselves naturally, in their own way, in Arabic.

Table I . The SAAVB Corpus S ummary [ 5 ]

KACST

Produced by

Arabic.

Language

Suadi Arabic.

Dialect

Sep 2002- jun2003 (10

Data collect

months)

Prompt sheet

Total number

1059.

Total number

270,828.

of words Avg. word

263 words

To avoid priming subjects to give their answer (all subjects were thus bilingual.) The following is part of an example scenario: "You are Maraam Samiir Shamali. You were born on 81711971 in Kuwait. You are now 31 years old." The BBN/AVB dataset consists of 164 speakers, 101 males and 63 females. It is a set of spontaneous speech sentences, recorded from 164 subjects speaking in Levantine colloquial Arabic.

number per

Table 2 The BBN/AUB Corpus Summary

prompt sheet

Speakers

Transcription

Total

1033

Male

523 speakers (50.63%)

Female

510 speakers (49.37%)

Age

(16-60)

Native

1033

Non-native

Non

Done by

Arabic orthography.(AO)

No. of word

302,107

No. of text

60,947

files Avg. of word

5

Produced by number Dialect

Jordan, Syria, and Palestine) May2002- November2002

Applications

speech translation, speech recognition, spoken dialog systems

Speaker

. 19,941

are unique No. of

1,843,282

No. of

Transcription

6.1

phonemes per word

Speech

No. of hours

96.37

No. of audio

60,947

files

Levantine colloquial Arabic(Lebanon,

Data collect

34,961

phonemes

Arabic.

Language

in dictionary No. of words

LDC2005S08

Catalogue

per file No. of word

Total

164

Male

101

Female

63

Age

-

Native

-

Non-native

-

Directory name

Idataltext

No. of text files

75900

Vocabulary

15K words

Total words

336K words

Total size

3.1 MB

Format

UTF-8 Unicode Arabic text

Directory name

Idatalaudio

for each

No. of audio files

75900

speaker

No. of hours

45

Format

MS WAV (signed

A vg. duration

Avg. duration

5.60 minutes

Speech

5.70 seconds

PCM)

of each audio file Total duration

48.81 hours

for males Total duration

47.56 hours

Channel Count

1

Sampling rate

16Khz

Resolution

16-bit

Total size

6.5 GB Microphone

Data sources

for females

SAA VB content

[6 ]

Linguistic Data Consortium (LDC)

Modulation

PCM format

Bit rate

8-bit Ii-law

Sampling rate

8 KHZ

Data Type: XML

No. of files

1 file containing XML markup for all utterances

Vocabulary

15K words

Total words

336K words

size

2.59GB

Total size

16 MB

No. of

1033

Format

XML with UTF-8 Unicode Arabic text.

directories No. of files

183,518

No. of Audio

178 files

files in each directory

Database recorded by

Telephone system

BBN/AVB Speech data has been recorded using a close­ talking, noise-cancelling, headset microphone (the Andrea Electronics NC-65). A Java-based data-collection tool, developed by BBN, was used to do the collection of speech. The audio was recorded in MS WAV, signed PCM. Sampling rate was 16 KHz, with 16 bits resolution. The duration of the

total recorded speech in BBL is 45 hours distributed among 75900 audio, the total audio size: 6.5 GB. The Total text size is 3.1 MB, Vocabulary is 15K words Total and words are 336K words [6], [7]. A summary of the statistics of this database is given in Table 2. 3.

Comparative And Analysis

In this section we will try to do a comparative analysis study of these two corpora above through specific important aspects. 3. 1 Speakers

Speakers' attribute are one of the most important parameters at any speech corpus in order to make the dealing with corpus easier and based on correct foundation. Number of speakers, gender, ages, and distribution them across the targeted geographical area of dialect are very important examples of these attributes. There are several options available to approach the target speakers and record their voices. One of them is through advertisements which can be on television, the Internet or in newspapers or other available media [5]. Speech databases are different in the number of speakers. Table 3 presents different examples (i.e., depending in their sizes of number of speakers) of corpora for different languages. For our investigated corpora, number of speakers in SAAVB is 1033 speakers; while in BBN/AUB database is 164 speakers, so the range of difference between these two corpora is highly noticeable. Table 3.

Number Of Speaker In Some Databases [51-

Number of

Speech Database

Speakers 90

VOYAGER

100

Global Phone,

140

WSJCAMO

500

VAHA

Cantonese

630

TIMIT

1253

FRESCO

1300

Phone Book

2000

SALA

In the Table 3 speaker number is ranging from 2000 down to 90 speakers. It has been claimed that there is little evidence that having a very large number of speakers is significantly better than a relatively small number of speakers, but others claimed that this is not a precise argument [8]. Regarding speakers' gender of speakers in SAAVB corpus this issue is clearly distributed and explained in their catalog and it is as following: First, male count 523 speakers (50.63%); second female count 510 speakers (49.37%). On the other side, BBN/AUB corpus implicitly reported speaker gender distribution as flowing: First, male count 101 speakers (61.58%); second female count 63 speakers (38.41%).

Regarding speakers' ages, SAAVB database distributed and explained it in a well-manner and straight way in their catalog which is as following: First, 16--30 years old there are 512 speakers (50.53%); Second, 31-45 years old there are 364 speakers (35.24%); finally, 46-60 years old there are 147 speakers (14.23%). Regarding ages of the speakers of BBN/AUB corpus, unfortunately no explicit or implicit information was reported in the catalog and they completely ignored this kind of important attribute to their speakers. All speakers of SAAVB database are native Arabic speakers (i.e., their mother tong is Arabic), while BBN/AUB database no word was said about this very important information which may help digital speech and language investigators [5], [6]. One of the most obstacles that face any team in selecting the right sample of speakers in Arabic country is the lack of dialect map. A dialect map is necessary to get a right sample of speakers that cover almost dialects in the targeted country or region. SAAVB database determine a model for the number of speakers per city (118 cities in Saudi Arabia) as a function of the population of the city as following [5]: s = a +

{Jlog(CP)

CP

>=

50,000.

where, s is number of speakers, CP is city population, ex and � are determined based on two constraints. First, the number of speakers equals eight for a single city with a population of 50,000. Second, the total number of speakers is 1056. ( s = 8 for CP = 50,000 and s = 4 for CP = 5,000.) On the other side Levantine dialect distributed among four countries, namely, Syria, Lebanon, Palestine, and Jordan. Levantine Arabic can be divided into two major branches: First one is north Levantine (Syria, Lebanon); second, is south Levantine (Palestine and Jordan) [9]. South Levantine shows closer relationship with Egyptian Arabic (derived primarily from classical Yemeni Arabic), whereas North Levantine, though also rooted in classical Yemeni Arabic, shows more relations with classical Najdi Arabic. Northern Levantine can be sub-divided into the following branches [9]: Initially is north Syrian (Aleppo); second is west (coastal) Syrian (Latakia to Tripoli); third is central Syrian (Rama to Damascus); finally is Lebanese (Mount Lebanon and Beirut). The developers of BBN/AUB not shown how they chose the sample of speakers form this countries, only refer to that the 20% of the corpus was recorded in the Boston area. This portion of the corpus was the first to be collected. Subsequently, the remaining 80% was recorded by the American University of Beirut (AUB), AUB students and staff served as both experimenters and subjects. 3.2 Collected speech format

SAAVB database belong to the third type of collected speech, which contains spontaneous and read speech, while BBN/AUB corpus is a spontaneous speech only [5], [6], [7]. Spontaneous speech is harder to work with than read speech and has its own characteristics and problems [5]. The use of spontaneous speech is more suitable in many digital

speech and language applications such as forensic environmental applications and recognition. Also it conveys the actual speech features that produced and, then, perceived by humans' vocal and auditory organs. This is going to alleviate the process of building and realizing the artificial speech and language system that allow communication by speech between humans and machines

noisy than SAAVB database and it is easier when you view and brows this corpus files from CD-RaM's discs. Finally, we must not forget that SAAVB, and BBN/AUB corpora are considered valuable resources for Arabic language researchers due to the lack of similar speech corpora especially for Classical, MSA, Saudi, Colloquial, Levantine other regional dialects.

3.3 Data source

References

SAAVB database recorded through a telephone system, and it has a lower sampling rate, 8 kHz, due to the use of telephone system bandwidth, while BBN/AUB database recorded at a high sampling rate: 16 kHz. Tables 4 and 5 Show recording environments for mobile network and fixed­ line network on SAAVB database [10].

[1] Y. A. Alotaiby, and S. A. Selouani, "Evaluating the MSA West Point Speech Corpus", International Journal of Computer Processing of Oriental Languages, Vol. 22, No. 4 (2009) pp. 1-20. [2] Y. A. Alotaiby, M. Alghamdi, and F. Alotaiby, "Using A Telephony Saudi Accented Arabic Corpus in Automatic Recognition of Spoken Arabic Digits". 4th International Symposium on ImageNideo Communications over fixed and mobile networks (lSIVC08), Bilbao, Spain. [3] The Wikipedia website. [Online]. Available: http://en.wikipedia.orglwiki/Modern_Standard_Ara bic [4] H. Satori, M. Harti, and N. Chenfour , "Introduction to Arabic Speech Recognition Using CMUSphinx System". Baywood Publishing Company, Inc ,2007., Pages: 4, arXiv ID: 0704.2083. [5] M. Alghamdi, F. Alhargan, M. Alkanhal,A. Alkhairy, M. Eldesouki and A. Alenazi, "Saudi Accented Arabic Voice Bank". Journal of King Saud University: Computer Sciences and Information. Vol, 20.2008. : 4358. [6] Linguistic Data Consortium (LDC) Catalog Number LDC2005S08, http://www.ldc.upenn.edul 2005 [7] M. Alsulaiman, Y. Alotaibi, M. Ghulam and M. A. Bencherif, "Arabic Speaker Recognition: Babylon Levantine Subset Case Study", American Journal of Applied Sciences 7 (3): 408-412, 2010 [8] R. Schwartz, T. Anastasakos, F. Kubala, J. Makhoul, L. Nguyen, G. Zavaliagkos, BBN Systems & Technologies, "Comparative Experiment on Large Vocabulary Speech Recognition" , IEEE International Conference on, 1994. ICASSP-94., 1994 [9] The Wikipedia website. [Online]. Available: http://en.wikipedia.orglwikilLevantine_Arabic [10] M. Alkanhal, M. Alghamdi and Z. Muzaffar, "Speaker Verification Based on Saudi Accented Arabic Database", International Symposium on Signal Processing and its Applications in conjunction with the International Conference on Information Sciences, Signal Processing and its Applications. Sharjah, United Arab Emirates. 1215 February 2007.

Table 4. SAAVBRecording Environments For Mobile Network

Recording

%

environment Quite

34.76

Noisy

34.76

Moving vehicle

30.48

Table 5. SAAVBRecord·mg EnVlronments For F"Ixed - L"me Network

Recording

%

environment Quite

75.32

Noisy

24.68

BBN/AUB database used a close-talking, noise-cancelling, headset microphone (the Andrea Electronics NC-65). A Java­ based data-collection tool was presented in [6]. It was recorded through a microphone and using a kind of switch (or mouse click) which induced bursts or spikes in all the wave files, The recorded speech amount was not of the same size for all the speakers, as some speakers spoke for a long time compared to others, a ratio of 1-5 was sometimes observed. Some speakers had many 'euh' at each answer, making a non­ spontaneous manner of talking, There was too much silence at the beginning and at the end of the files, in the whole dataset. Silence represented a ratio of approximately 113 from the full dataset, and some speakers were not well understood as others, due perhaps to the long distance from the microphone [7].

4

Conclusion

After our study to these two corpora, we can conclude that the SAAVB database is presented and documented in a more clear manner than BBN/AUB database regarding speakers' distribution (i.e., gender, ages, native or non native, and region among countries or provinces). SAAVB and BBN/AUB are differing in data source, but BBN/AUB is less