USING SYLLABLES AS INDEXING TERMS IN FULL-TEXT INFORMATION RETRIEVAL

Kimmo Kettunen, Paul McNamee and Feza Baskaya
Kymenlaakso University of Applied Sciences, Kouvola, Finland; Johns Hopkins University Applied Physics Laboratory, Laurel, USA; University of Tampere, Tampere, Finland

Abstract

This paper describes empirical results of information retrieval in 13 languages of the Cross Language Evaluation Forum (CLEF) collection, augmented with results for Turkish, using syllables as a means to manage morphological variation in the languages. This kind of approach has been used in speech retrieval (e.g. Larson and Eickeler 2003), but for some reason it has not been tried out much in text-based IR, although it has several clear advantages. Firstly, a reasonably well working version of it can be implemented with a very simple syllabification algorithm, consisting only of variants of one syllable structure rule, CV (consonant-vowel). Secondly, although syllable-based word form variation management resembles n-gramming (McNamee and Mayfield 2004), it has the advantage that the number of grams produced with syllables is more restricted, which keeps the size of the text index smaller and retrieval faster. Thirdly, a syllable-based approach makes it possible to use different types of syllabification procedures, which can be either very fine-grained, i.e. language specific, or very coarse, i.e. more language independent. Fourthly, syllable-based methods work for both speech and text retrieval. Our results show that the two different CV syllabification procedures produced good results with four morphologically complex languages of the CLEF collection. For Turkish they also produced good results. For three of the languages that got good results with the CV syllabification (DE, FI and TU), we also tried language-specific, accurate syllabification procedures. Accurate syllabification was not able to produce as good IR results as the CV procedures.
Keywords: full-text information retrieval, syllables, management of word form variation, syllables as index terms

1. Introduction

Variation of word forms in natural languages has been one of the problems of full-text retrieval since the beginning of computerized textual information retrieval. Several different methods for managing this variation have been proposed, including stemming, lemmatization, n-gramming, truncation and various phonetic transformations, such as Soundex (cf. Kettunen 2009; McNamee et al. 2009). Lately, automatic morphological methods (i.e. unsupervised or semi-supervised) have been tried out with some success, but the older methods, such as stemming with hand-written rules, are still in full and effective use in text IR.

Most of the word form variation management methods are based on character-level manipulation of words. Simple stemmers prune the word forms more or less well, while lemmatizers use a sophisticated combination of linguistic rules and dictionary representations to deduce the base forms of inflected words. The crudest systems may only truncate words to beginnings or endings of a certain length, and n-gramming techniques reduce words to overlapping character strings. Still, these crude, semantically unaware character-level methods work amazingly well, as shown e.g. in McNamee, Nicholas and Mayfield (2009). In this study we focus on a rarer character-level (or sub-word unit) method for word form variation management that has a linguistic basis, namely syllabification of words. Syllables can be seen as the next level of word structure after sounds, which are represented as alphabet characters in written language [1]. Syllables also give a suitable handle on the structure of written words: the number of allowed syllables in each language is limited to tens of abstract syllable types and from hundreds to thousands of concrete syllable tokens (Pellegrino et al. 2008). Although the rules for syllabification are quite limited in many languages, automatic syllabification of words is challenging, mainly because the syllable is not easy to define precisely in linguistic terms. Consequently, no accepted standard algorithm for automatic syllabification exists, and syllabifications of the same word may vary. There are two approaches to the problem of automatic syllabification: rule-based and data-driven. A rule-based method implements some theoretical position regarding the syllable, whereas the data-driven paradigm tries to infer new syllabifications from examples assumed to be correctly syllabified already. A typical example of a rule-based orthographic syllabification algorithm is the Finnish hyphenation algorithm described in Karlsson (1985).
It consists of 8 abstract syllable structure rules, which as an implementation produce about 95% recall and over 99% precision for syllabification of a Finnish test corpus. Bouma (2003) reports on a trial of Dutch hyphenation using finite state transducers. The simplest method achieves an accuracy of 94.5%, and two others, TEX and TBL, 99.8% and 99.1%. Adsett and Marchand (2009) discuss the data-driven approach to automatic syllabification. They compare five different data-driven syllabification procedures across nine European languages. All the algorithms achieve a mean word accuracy of over 90% across all lexicons; the best algorithms achieve mean accuracies of 95–96.8%. A detailed analysis of data-driven vs. rule-based syllabification of one language, English, is given in Marchand, Adsett and Damper (2007, 2009). Their results imply that data-driven syllabification works better than rule-based syllabification at least for English, in both the pronunciation and spelling domains. Syllabification by analogy is the best data-driven method in both domains. Bartlett, Kondrak and Cherry (2008), however, show that syllabification with a structured support vector machine (SVM) performs better than syllabification by analogy. SVM syllabification achieves word accuracies of 86.7–90% when compared to the CELEX data for English. SVMs for Dutch and German achieve word accuracies of 98.2–99.8%. It should be emphasized that although automatic syllabification looks like a simple procedure, it is not, as can be deduced from the accuracy figures. For many other levels of linguistic analysis (e.g. morphology, syntax) there are so-called gold standards against which an automatic procedure can be tested, but syllabification lacks such resources (Adsett and Marchand 2009; Bartlett et al. 2008). In our case the problem of highly correct syllabification is not that severe, since our application domain, information retrieval, itself tolerates quite a lot of noise and presents an application field where a "good enough" result can be very useful.

[1] Strictly speaking, syllables are phonological, not orthographic, units, but we are concerned here only with orthographic syllabification, which in many publications is also referred to as hyphenation (Bartlett et al. 2008).

Syllable-based retrieval has been used extensively in speech retrieval. According to various publications, it has worked quite well in spoken document retrieval (e.g. Ng and Zue 2000; Wang 2000). Larson and Eickeler (2003) have used syllable-based indexing and language models in German spoken document retrieval. They also test the approach with text documents, and the best performance for both types of documents is achieved with syllable 2-grams. A somewhat similar approach is presented in Gouvea and Raj (2009). They introduce a search system where indexing is based on "particles", which are phonetic in nature and "comprise sequences of phonemes that together compose the actual or putative pronunciation of documents and queries". The idea of a particle can be applied both to written and spoken documents. Although the idea resembles syllable-based retrieval, the authors emphasize that particles are not syllables. The main goal of our research is to test whether orthographic syllabification can work as an effective means of managing morphological variation in a number of different languages and their full-text IR collections. We proceed in a two-fold way: first we test how two variants of a simple and naïve syllabification approach, consisting of only one syllable rule, work with the languages. After that we test more elaborate language-specific syllabifiers with a smaller subset of the same languages.

2. Empirical data

Our empirical test material for the IR runs consists of materials for 14 languages. The Cross Language Evaluation Forum (CLEF) has IR collections available for 13 European languages: Bulgarian, Czech, Dutch, English, Finnish, French, German, Hungarian, Italian, Portuguese, Russian, Spanish and Swedish. The sizes of the collections vary from ~17,000 to 450,000 documents, and the number of topics for each collection is between 50 and 367 (McNamee et al. 2009). Retrieval experiments in the 13 CLEF languages were conducted using the HAIRCUT text retrieval engine (Mayfield and McNamee 2004), which adopts a statistical language model of retrieval and supports a variety of tokenization choices. HAIRCUT has previously been used to achieve state-of-the-art results on multiple CLEF test sets. For Turkish one IR collection is available, the so-called Milliyet material, which consists of newspaper articles published in the Turkish newspaper Milliyet. The size of the collection is 408,305 documents, and it has 72 topics (Can et al. 2008). For the Turkish collection our search engine was Lemur.

3. Results

To get a starting point and a baseline for our syllable approach, we used two very simple rules for splitting words into sequences of "syllables" [2] for all the languages: (1) scan left to right and insert a split after a vowel that immediately follows a consonant; (2) put a syllable juncture before every CV sequence. These two algorithms produce the following syllabifications:

CV_1: ca+rbo+hy+dra+te+s; do+gs; go+es
CV_2: car+bo+hyd+ra+tes; dogs; goes

[2] "Syllables" is in quotes because the CV procedures produce both valid and invalid syllable sequences. Perhaps a proper name for these entities would be syllagrams, i.e. syllable-like sequences of characters. However, some linguists have adopted a so-called strict CV theory, which claims that all languages have only CV syllables (van der Hulst and Ritter 1999, 431).

The CV syllable structure is a basic one in many languages, and even the only one in some (Maddieson 2008). Thus the CV procedures were a natural starting point for testing whether syllabification can work as a basis for word form variation management. We applied the CV_1 and CV_2 syllabification algorithms to all 13 languages of the CLEF collection and to the Turkish material. From the sequence of syllables for each word, we created indexing terms based on (1) single syllables, (2) bigrams of syllables, and (3) trigrams of syllables. Keywords in the queries were handled in the same manner when the queries were run. Table 1 shows the results of the CV_1 and CV_2 runs for the 13 CLEF languages and Turkish. The best result for each language is marked with an asterisk.

Table 1. Results of CV_1 and CV_2 syllabification runs for 14 languages, title and description queries, mean average precisions (MAPs)

        words    snow     4        syl1_CV1 syl2_CV1 syl3_CV1 syl1_CV2 syl2_CV2 syl3_CV2
BG      0,216    N/A      0,311*   0,207    0,215    0,101    0,211    0,20     0,097
CS      0,227    N/A      0,329*   0,182    0,258    0,164    0,191    0,271    0,185
DE      0,330    0,370    0,410*   0,279    0,391    0,286    0,297    0,375    0,24
EN      0,406    0,437*   0,399    0,212    0,377    0,267    0,233    0,349    0,20
ES      0,440    0,485*   0,460    0,242    0,445    0,310    0,22     0,429    0,288
FI      0,341    0,430    0,499*   0,302    0,459    0,376    0,272    0,433    0,312
FR      0,364    0,402*   0,384    0,200    0,366    0,248    0,227    0,335    0,215
HU      0,198    N/A      0,375*   0,202    0,317    0,230    0,175    0,286    0,182
IT      0,375    0,418*   0,374    0,182    0,388    0,263    0,17     0,374    0,259
NL      0,381    0,400    0,422*   0,264    0,378    0,251    0,294    0,357    0,231
PT      0,316    N/A      0,336*   0,166    0,326    0,204    0,167    0,295    0,16
RU      0,267    N/A      0,341*   0,276    0,239    0,125    0,257    0,263    0,152
SV      0,339    0,376    0,424*   0,262    0,406    0,314    0,246    0,371    0,257
TU      0,186    0,218    0,305*   0,169    0,303    0,215    0,207    0,264    0,196
Avg-8   0,372    0,415    0,421    0,243    0,401    0,289    0,245    0,378    0,250
Chg-8   N/A      11.47%   13.31%   -34.69%  7.89%    -22.18%  –        –        –
Avg-A   0,323    N/A      0,389    0,229    0,351    0,242    0,228    0,334    0,214
Chg-A   N/A      N/A      20.54%   -29.15%  8.69%    -25.25%  –        –        –

Table legend: words = surface forms (lower-cased); snow = Snowball stemmer; 4 = overlapping, word-spanning character 4-grams; syl1 = single syllables; syl2 = syllable bigrams; syl3 = syllable trigrams; Avg-8 = average over the 8 'Snowball' languages, i.e. the languages for which a Snowball stemmer was available; Chg-8 = change over plain words with the Snowball-language average; Avg-A = average over the CLEF data; Chg-A = change over plain words with the CLEF data average.

Firstly, the results of the CV_1 runs show that single syllables and trigrams were both ineffective, except in the languages that were morphologically most complex or had significant compounding. Syllable bigrams, however, seemed to work quite well for many languages. We measured mean average precision and found statistically significant relative gains vs. surface forms in four languages using syllable bigrams with the CV_1 procedure: German (+18.5%), Finnish (+34.8%), Hungarian (+60.4%) and Swedish (+19.9%). With Turkish the CV_1 procedure with syl2 performed at the same level as 4-grams, which is interesting. There were also several cases where syllable bigrams in the CV_1 runs showed a slight improvement over inflected word forms that was not statistically significant (CS, ES, FR, IT and PT). Of the four languages that achieved statistically significantly better results with bigram syllables, three (FI, SV and DE) also slightly outperformed Snowball stemming. This can be considered interesting, as stemming with a Snowball stemmer has often been shown to perform very well with morphologically more complex languages (Airio 2006; Kettunen 2009). Simple syllabification was not able to outperform 4-grams, which were overall the most effective method of keyword variation management for all the languages except Italian. This is consistent with the results of McNamee and Mayfield (2004), McNamee (2008) and McNamee, Nicholas and Mayfield (2009).
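The two CV splitting rules and the syllable n-gram indexing terms described above can be sketched as follows. This is a minimal illustration, not the authors' implementation; in particular, the vowel inventory (here a, e, i, o, u, y) is our assumption, chosen so that the rules reproduce the example syllabifications given for carbohydrates, dogs and goes.

```python
VOWELS = set("aeiouy")  # assumed vowel inventory; y is needed for "hy" in carbohydrates

def cv1(word):
    """CV_1: insert a split after every vowel that immediately follows a consonant."""
    parts, start = [], 0
    for i, ch in enumerate(word):
        if ch in VOWELS and i > 0 and word[i - 1] not in VOWELS:
            parts.append(word[start:i + 1])
            start = i + 1
    if start < len(word):
        parts.append(word[start:])
    return parts

def cv2(word):
    """CV_2: insert a split before every consonant-vowel (CV) sequence."""
    parts, start = [], 0
    for i in range(1, len(word) - 1):
        if word[i] not in VOWELS and word[i + 1] in VOWELS:
            parts.append(word[start:i])
            start = i
    if start < len(word):
        parts.append(word[start:])
    return parts

def syllable_ngram_terms(word, n, syllabify=cv1):
    """Build indexing terms from overlapping n-grams of adjacent syllables (syl1/syl2/syl3)."""
    syls = syllabify(word)
    return ["".join(syls[i:i + n]) for i in range(len(syls) - n + 1)]

print(cv1("carbohydrates"))                     # ['ca', 'rbo', 'hy', 'dra', 'te', 's']
print(cv2("carbohydrates"))                     # ['car', 'bo', 'hyd', 'ra', 'tes']
print(syllable_ngram_terms("carbohydrates", 2))  # ['carbo', 'rbohy', 'hydra', 'drate', 'tes']
```

Whether the bigram and trigram terms overlapped in exactly this way, and how words with fewer syllables than n were indexed, is not specified in the paper; the sketch simply emits all overlapping n-grams.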
The effectiveness of n-gramming, however, is achieved at the cost of huge text indexes, which often makes the use of n-gramming impractical. The results of McNamee, Nicholas and Mayfield (2009) show that n-gram indexing with 4-grams can consume up to three times as much index storage, and queries can take seven times as long to execute, when compared to plain words (these figures are for English data). The results of the CV_2 runs for the 13 CLEF languages and Turkish showed that the CV_2 procedure is not as good an option as CV_1; it usually performed slightly below CV_1. The syl1 results of the CV_2 runs often outperform the syl1 results of the CV_1 runs, but as syl1 does not perform well in general, this is not very interesting. For Czech and Russian the syl2 results of the CV_2 runs were slightly better than those of the CV_1 runs, but for the other languages they were lower. Overall, the syl2 results of the CV_2 runs perform the best, just as with CV_1. For morphologically complex languages the CV_2 results with syl2 also clearly outperform plain words. This further confirms that syllable bigrams seem to offer the most optimal solution if syllables are used as a means of managing word form variation. After these initial results we decided to concentrate on those languages that got good results with the CV procedures and were also morphologically interesting. We had more fine-grained, proper syllabifiers available for three languages, DE, FI and TU [3], and we tried out how proper syllabification works for these languages.

[3] For German and Turkish the syllabifiers were implemented by the third author, for Finnish by the first.
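The index-size contrast between character 4-grams and syllable bigrams noted above can be illustrated by counting the terms each method generates per word. This is our own sketch (the cv1 function approximates the paper's CV_1 rule under an assumed vowel set); actual index size also depends on the number of distinct terms and posting-list lengths, not just per-word term counts.

```python
VOWELS = set("aeiouy")  # assumed vowel inventory

def cv1(word):
    # CV_1: split after every vowel that immediately follows a consonant
    parts, start = [], 0
    for i, ch in enumerate(word):
        if ch in VOWELS and i > 0 and word[i - 1] not in VOWELS:
            parts.append(word[start:i + 1])
            start = i + 1
    if start < len(word):
        parts.append(word[start:])
    return parts

def char_ngrams(word, n=4):
    # overlapping character n-grams, the 4-gram baseline of Table 1
    return [word[i:i + n] for i in range(len(word) - n + 1)]

def syllable_bigrams(word):
    # overlapping bigrams of adjacent CV_1 syllables
    syls = cv1(word)
    return ["".join(syls[i:i + 2]) for i in range(len(syls) - 1)]

word = "carbohydrates"
print(len(char_ngrams(word)))       # 10 overlapping 4-grams
print(len(syllable_bigrams(word)))  # 5 syllable bigrams
```

A 13-character word yields ten 4-gram terms but only five syllable-bigram terms, which is the restriction on the number of grams mentioned in the abstract.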

For Turkish, proper syllabification did not perform any better than the CV procedures. Syllable unigrams achieved a MAP of 0,21, bigrams 0,27 and trigrams 0,20. Once again, bigrams were the best way to construct both the index and the queries. Syllable bigrams with proper Turkish syllabification achieved a 3% lower MAP than CV_1, and a 0.6% better MAP than CV_2.
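The proper Turkish syllabifier used here was not published with the paper; the following is our own hedged sketch based on the conventional rule for native Turkish words, namely that a syllable boundary falls before the last consonant of an intervocalic consonant cluster (equivalently, before every CV sequence, using the Turkish vowel inventory). Loanwords with onset clusters would need extra handling.

```python
TURKISH_VOWELS = set("aeıioöuü")  # the eight Turkish vowels

def syllabify_turkish(word):
    # Place a boundary before each consonant that is directly followed by a vowel.
    # For native Turkish words this yields the conventional orthographic syllabification.
    parts, start = [], 0
    for i in range(1, len(word) - 1):
        if word[i] not in TURKISH_VOWELS and word[i + 1] in TURKISH_VOWELS:
            parts.append(word[start:i])
            start = i
    if start < len(word):
        parts.append(word[start:])
    return parts

print(syllabify_turkish("kitaplar"))    # ['ki', 'tap', 'lar']  ("books")
print(syllabify_turkish("arabalarda"))  # ['a', 'ra', 'ba', 'lar', 'da']  ("in the cars")
```

Note that for native Turkish words this "proper" rule coincides with the coarse CV_2 procedure (with the Turkish vowel set), which may help explain why proper syllabification brought no gain over the CV procedures here.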

4. Discussion and conclusions

Our aim in this research was to study whether syllabification can be used effectively in the management of word form variation in full-text retrieval of different languages. As our test data we had CLEF collections for 13 different European languages and a separate Milliyet collection for Turkish. We first tried a crude CV syllabification procedure for all of the languages. The two versions of this procedure put a hyphen either after a CV sequence or before it. The results showed that in both cases the CV procedure was able to perform quite well when the textual indexes were built from bigrams of CV syllables. However, the CV_1 procedure was clearly better than CV_2. CV_2 clearly outperformed plain words in most of the languages, but remained 2–3 per cent behind CV_1 for most of them. We achieved good results with the CV procedures in four of the CLEF languages that were also morphologically complex (DE, FI, HU and SV), and in Turkish. The Turkish results were especially good, as the CV_1 procedure performed at the same level as 4-grams. After these initial results we tested three languages with elaborate language-specific syllabifiers. The results showed that language-specific syllabification was not able to outperform the simple CV procedure.

Having introduced our results, we can now turn to the question of why syllables should work at all as indexing and query terms that take reasonably good care of the morphological variation found in different languages. McNamee (2008; also McNamee et al. 2009) has studied the question of why n-grams have a performance advantage over plain words. They designed an experiment to remove morphological regularity from words by shuffling the characters of words randomly, and their results suggest strongly that the fundamental reason n-grams are effective is that they control for morphological variation. According to them, this also explains a variety of previously observed phenomena about n-grams, namely that n-grams yield greater improvements in more morphologically complex languages, and that n-grams of lengths 4 and 5 (about the size of root morphemes) are the most effective. We noted earlier that syllables and n-grams resemble each other quite closely, but syllables are of varying length, and when they are used there is not as much overlapping of characters as with n-grams. If we consider the lengths of our indexing terms, unigrams, bigrams and trigrams of syllables, in Table Z, we see that the mean length of bigram syllables in most of the languages is around Z, i.e. more or less the same length as 4- and 5-grams. Although we have not yet run a test of randomized character shuffling followed by syllabification of the shuffled words to confirm this hypothesis, we believe that the basic explanation of why syllables work in the management of word form variation is the same as with n-grams: they are able to control for morphological variation, and there is also an ideal length for the query and index terms made out of syllables. We wish to do more research in this respect later.

Overall, our results show that syllables can be used effectively in the management of word form variation for different languages. They are not able to outperform 4-grams, but at best they perform at the same level as, or slightly better than, a Snowball stemmer for morphologically complex languages, such as Finnish, German, Hungarian, Swedish and Turkish. As the best results were achieved with a very simple syllabification and indexing procedure (CV_1 syllabification and bigram syllable indexing), we believe that the approach has some promise even in practical IR settings. We also believe that some language-typological factors affecting the results could be identified (cf. Fenk-Oczlon and Fenk 1999). This aspect needs more consideration, and we wish to continue this work later on.
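The observation that syllable bigrams land near the effective 4–5 character n-gram length can be checked on a small example. This sketch uses our approximation of CV_1 (with an assumed vowel set including y); the word list and the resulting mean are illustrative only, not the per-language figures of Table Z.

```python
VOWELS = set("aeiouy")  # assumed vowel inventory

def cv1(word):
    # CV_1: split after every vowel that immediately follows a consonant
    parts, start = [], 0
    for i, ch in enumerate(word):
        if ch in VOWELS and i > 0 and word[i - 1] not in VOWELS:
            parts.append(word[start:i + 1])
            start = i + 1
    if start < len(word):
        parts.append(word[start:])
    return parts

def mean_bigram_length(words):
    # Mean character length of the overlapping syllable-bigram terms over a word list.
    bigrams = []
    for w in words:
        syls = cv1(w)
        bigrams += ["".join(syls[i:i + 2]) for i in range(len(syls) - 1)]
    return sum(map(len, bigrams)) / len(bigrams)

print(round(mean_bigram_length(["carbohydrates", "retrieval", "syllables"]), 2))  # 4.82
```

Even for these three English words the bigram terms average about 4.8 characters, i.e. within the 4–5 character band that McNamee (2008) found most effective for character n-grams.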

5. References

Adsett, Connie; Marchand, Y. 2009. A comparison of data-driven automatic syllabification methods. In: Karlgren, J.; Tarhio, J.; Hyyrö, H. (eds.) String Processing and Information Retrieval, 16th International Symposium, SPIRE 2009. Heidelberg: Springer. 174–181.
Airio, Eija 2006. Word normalization and decompounding in mono- and cross-lingual IR. Information Retrieval 9. 249–271.
Bartlett, Susan; Kondrak, G.; Cherry, C. 2008. Automatic syllabification with structured SVMs for letter-to-phoneme conversion. In: Proceedings of ACL-08: HLT. Columbus. 568–576.
Bouma, Gosse 2003. Finite state methods for hyphenation. Natural Language Engineering 9. 5–20.
Can, Fazli; Kocberber, S.; Balcik, E.; Kaynak, C.; Ocalan, H.C.; Vursavas, O.N. 2008. Information retrieval on Turkish texts. Journal of the American Society for Information Science and Technology 59. 407–421.
Cross Language Evaluation Forum. http://www.clef-campaign.org/.
Fenk-Oczlon, Gertraud; Fenk, A. 1999. Cognition, quantitative linguistics, and systemic typology. Linguistic Typology 3. 151–177.
Gouvea, Evandro B.; Raj, B. 2009. Word particles applied to information retrieval. In: Boughanem, M.; Berrut, C.; Mothe, J.; Soule-Dupuy, C. (eds.) Advances in Information Retrieval, 31st European Conference on IR Research, ECIR 2009. Heidelberg: Springer. 424–436.
Van der Hulst, Harry; Ritter, N.A. (eds.) 1999. The Syllable: Views and Facts. Berlin: Mouton de Gruyter.
Karlsson, Fred 1985. Automatic hyphenation of Finnish. In: Karlsson, F. (ed.) Computational Morphosyntax. Report on Research 1981–1984. Publications of the Department of General Linguistics, University of Helsinki, 13. 93–113.
Kettunen, Kimmo 2009. Reductive and generative approaches to management of morphological variation of keywords in monolingual information retrieval – an overview. Journal of Documentation 2. 267–290.
Larson, Martha; Eickeler, S. 2003. Using syllable-based indexing features and language models to improve German spoken document retrieval. In: Proceedings of Eurospeech 2003, 8th European Conference on Speech Communication and Technology. Retrieved 15 May, 2010, from http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.124.4455&rep=rep1&type=pdf.
Maddieson, Ian 2008. Chapter 12: Syllable structure. In: The World Atlas of Language Structures Online. Retrieved 30 April, 2010, from http://wals.info/feature/12.
Marchand, Yannick; Adsett, C.; Damper, R. 2007. Evaluating automatic syllabification algorithms for English. Retrieved 14 April, 2010, from http://eprints.ecs.soton.ac.uk/14285/1/MarchandAdsettDamper_ISCA07.pdf.
Marchand, Yannick; Adsett, C.; Damper, R. 2009. Automatic syllabification in English: a comparison of different algorithms. Language and Speech 52. 1–27.
McNamee, Paul 2008. Textual representations for corpus-based bilingual retrieval. PhD thesis, University of Maryland Baltimore County. Retrieved 4 May, 2010, from http://apl.jhu.edu/~paulmac/publications/thesis.pdf.
McNamee, Paul; Mayfield, J. 2004. Character n-gram tokenization for European language text retrieval. Information Retrieval 7. 73–97.
McNamee, Paul; Nicholas, C.; Mayfield, J. 2009. Addressing morphological variation in alphabetic languages. In: Proceedings of the 32nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2009), Boston, MA. 75–82.
Ng, Kenney; Zue, V.W. 2000. Subword-based approaches for spoken document retrieval. Speech Communication 32. 157–186.
Pellegrino, François; Coupé, C.; Marsico, E. 2007. An information theory-based approach to the balance of complexity between phonetics, phonology and morphosyntax. Retrieved 5 May, 2010, from http://www.ddl.ish-lyon.cnrs.fr/fulltext/pellegrino/Pellegrino_2007_PCM_LSA.pdf.
Wang, Hsin-Min 2000. Experiments in syllable-based retrieval of broadcast news speech in Mandarin Chinese. Speech Communication 32. 49–60.

Kimmo Kettunen is a research manager at the Kymenlaakso University of Applied Sciences in Kouvola, Finland. He has a Ph.D. in information retrieval from the University of Tampere and a Master's degree in general linguistics from the University of Helsinki. His current research interests are evaluation of machine translation quality with Normalized Compression Distance, mono- and multilingual information retrieval, and applications of human language technology in general. E-mail: [email protected]

Paul McNamee is a principal computer scientist with the Johns Hopkins University Applied Physics Laboratory. He earned a Ph.D. from the University of Maryland Baltimore County, where his doctoral research investigated the effects of tokenization in monolingual and cross-language information retrieval. Dr. McNamee is currently involved in several knowledge discovery projects at the JHU Human Language Technology Center of Excellence. Email: [email protected]

Feza Baskaya
