Unsupervised Knowledge-Free Morpheme Boundary Detection

Stefan Bordag
University of Leipzig
[email protected]
Abstract

A new algorithm is presented which performs fully unsupervised basic morphological analysis in any desired language without prior knowledge of that language. The algorithm detects morpheme boundaries and can also be modified to perform other tasks, e.g. clustering word forms of the same lemma and classifying the found morphemes. The primary aim is to reach maximum precision, so that the output can be used in a postprocessing machine learning step to increase recall. The algorithm is based on cooccurrence measures and letter successor variety and does not use any complex or computationally intensive methods such as LSA. Consequently it is fast, efficient and scales well.
1 Introduction
This paper describes first results of a study of unsupervised, knowledge-free and therefore language-independent acquisition of morphology. This topic involves many different goals, such as segmenting word forms into their morphemes, clustering different word forms of the same lemma, providing declensional and conjugational classes, extracting alternation rules etc. The identification of valid morpheme boundaries can be considered the first step of the analysis. This step can be further divided into two parts: first, the identification of morphemes with a precision as high as possible, no matter how low the recall; second, the extension of this knowledge with common machine learning methods, preferably without losing too much precision. In a kind of circular feedback mechanism, this combined knowledge can be used in a repeated first step in order to find more knowledge.

This paper focuses on the first part, the identification of morpheme boundaries (also called morphological segmentation of word forms). Both the identified morphemes and the method itself can be used to produce a clustering of word forms of the same lemma as a side effect, with quite high precision. Another possible application of this method is the classification of the found morphemes into prefixes, stems and suffixes. A brief evaluation for German and English is given, but it will be expanded in future work on an improved version of the algorithm.
1.1 Related work
Knowledge-free morphology segmentation has been the aim of several algorithms based on different approaches. Most of the methods can be divided into three general approaches: the minimum description length (MDL) model (first used in this context by (Brent et al. 95) and (Kazakov 97)), the semantically based model (Schone & Jurafsky 01) and the letter successor variety based model (Harris 55). They all make use of very different mechanisms, so it might be possible to combine them in order to further boost their good results. The approach described in this study is directly based on the letter successor variety method but also makes use of context similarity. It can therefore be viewed as one of the first attempts to merge such methods.

1.1.1 Minimum description length

One of the first successful knowledge-free algorithms is based on expectation maximization (Goldsmith 00). The initial algorithm cuts each word at one position based on a probability and the lengths of the hypothesized stems and affixes. It then attempts to generalize various words into signatures (classes of words that have the same morphology). The quality of the algorithm improved when the minimum description length model (see (de Marcken 95)) was included, which had already been used in such a context by (Brent et al. 95) and also directly as a fitness function in a genetic algorithm approach (Kazakov 97). MDL represents a kind of balance between over- and undergeneration of stemming rules: the optimum is the most compressed representation of the data (the words), in the sense of using the smallest necessary number of word forms and signatures at the same time. This removes all free parameters, which makes for a rather elegant solution. The method only considers the list of distinct words at any point and thus has an (unknown) upper bound of quality, since in a language certain things cannot be explained by any kind of frequency. Notably, Goldsmith's approach has since been used as the baseline against which other algorithms are compared.

Another approach from the category of minimum description length based algorithms is the one introduced by (Creutz 03). Adding maximum likelihood (ML) and later a Hidden Markov Model (Creutz & Lagus 05) for a classification of the found morphemes, the authors constructed an improved version of a segmentation algorithm: it randomly segments words and then measures how well the segmentation fits into the incrementally built knowledge base. This algorithm seems to be specialized for agglutinative languages such as Finnish and tends to overgenerate slightly in other cases. Moreover, it needs information on the length and frequency distributions of morphemes of the language in question (thus it is not entirely knowledge-free). Another enhancement worth mentioning was proposed by (Argamon et al. 04), who added a recursive component to the analysis which, while keeping results steady for morphology-poor languages, might improve results on morphology-rich languages.
1.1.2 Semantic context
An entirely different approach has been taken by (Schone & Jurafsky 01), who included the semantic context of the words to be segmented into their segmentation algorithm. First, a list of affix candidates is generated by simply counting frequencies. Using these candidates it is possible to generate, for each input word, a list of possible other word forms of the same lemma, such as listening and listen. Second, latent semantic analysis (LSA) (Deerwester et al. 90) is used in order to find out whether the generated pairs of words are semantically similar according to the corpus used (that is, whether they appear in similar contexts).
1.1.3 Letter successor variety
The oldest and seemingly least successful approach to date is the letter successor variety method (Harris 51). The idea is to count the number of different letters encountered after (or before, respectively) a part of a word and to compare it to the counts before and after that position. Morpheme boundaries are then likely to occur at sudden peaks or increases of that value. Parameters of this approach can be varied (Harris 55), but on the whole it has not yet been successfully employed for morpheme segmentation, see also (Hafer & Weiss 74) and (Frakes 92), because when applied to the whole list of distinct word forms, the 'noise' from too many different possibilities renders the results nearly useless. The method has also been used for the generation of 'good' candidate lists for postprocessing machine learning steps for morpheme segmentation by (Déjean 98), but unfortunately the author does not mention the quality of the results obtained by this method.

1.1.4 Corpus vs. word list

The existing approaches to morphology segmentation can also be distinguished by whether they make use of the list of word forms only (and possibly their frequencies), such as Linguistica (Goldsmith 01) or (Déjean 98). The work of (Baroni 03) can be included here, too. It is based neither on the minimum description length model nor on any of the other possibilities mentioned above; Baroni uses the cooccurrence of potential morphemes as an information source about the morphemes themselves. The other possibility is to include context information on the word level from the raw text in one way or another, such as the approach taken by (Schone & Jurafsky 01). This kind of classification also shows that the methods used have at least two independent components, which means that such algorithms might be able to boost each other's performance.
2 Context Similarity
The approach described in this study represents a combination of using context information (albeit in a different way from that described by (Schone & Jurafsky 01)) and the letter successor variety (LSV) idea described by (Harris 51). The idea is that the letter successor variety used on the plain list of word forms has to put up with too much noise from irregularities and other sources. However, a list of word forms that all have one or more kinds of syntactic information in common (such as gender, case, number) would make the noise for the LSV method manageable. This kind of approach would, of course, work even better if it were possible to generate such a list for each word.
For the word running, the list would contain word forms such as swimming, walking or diving.

In order to obtain such a list of word forms with the same syntactic information for a given input word form, it is necessary to reflect on the possibilities of language. Whichever language is considered, syntagmatic relations will always hold between word forms standing immediately next to each other. For example, it is very probable that after the verb goes, some kind of lexicalized direction information will appear, such as home, to or out. On the other hand, in front of such direction information tokens, all kinds of verbs are likely to occur. Some or many of them will also have the same grammatical markers as the input word form, such as runs, walks or jumps. These word forms are crucial for further analyses because they are morphologically similar to the input word.

Therefore, the first step is to compute all neighbouring word forms of a given word form A. At that point it is useful to discriminate mere frequent co-appearance from statistically significant cooccurrence. This can be done by a multitude of methods; in this case the log-likelihood measure (Dunning 93) has been chosen, because in other experiments (Bordag 05) it has proven to be one of the best measures. The typical (left or right) neighbours of the word form A, along with their significances as found by applying this or another significance formula, can be represented as a vector A_n (the index n stands for neighbour) in the assumed vector space of word forms.

The second step consists of comparing pairs of word forms based on their neighbourhood vectors. This can be done by simply counting the number of common words in the vectors or by using a distance or similarity measure such as the dice coefficient. Again, in other experiments (Bordag 05) the dice coefficient and simple counting proved to perform best, and thus simple counting has been chosen. Thus, for any given word A it is possible to retrieve a vector A_s of the words most similar to A. A_s then 'contains' all words that usually have similar left or right neighbours as A. For the example word running given above, the most similar words are run (108), using (99), runs (71), working (70), operating (70), moving (67), getting (65). The value 99 means that in the corpus used there are 99 different word forms appearing significantly often to the left and the right of both running and using.
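As a rough illustration of this two-step procedure (a minimal sketch, not the author's original implementation), the following Python fragment computes such similarity sets from precomputed sets of significant neighbours; the function name and data layout are illustrative assumptions, and the log-likelihood filtering of neighbours is taken as given.

from collections import defaultdict

def similarity_sets(neighbours, top_k=150):
    """neighbours: dict mapping each word form to the set of its statistically
    significant left/right neighbours (assumed to be filtered already,
    e.g. by log-likelihood).  Returns, for every word, its most similar
    other words ranked by the number of shared significant neighbours
    (the 'simple counting' variant)."""
    # Invert the index: neighbour -> set of words it is significant for.
    by_neighbour = defaultdict(set)
    for word, ns in neighbours.items():
        for n in ns:
            by_neighbour[n].add(word)

    sims = {}
    for word, ns in neighbours.items():
        shared = defaultdict(int)
        for n in ns:
            for other in by_neighbour[n]:
                if other != word:
                    shared[other] += 1        # one more neighbour in common
        ranked = sorted(shared.items(), key=lambda item: -item[1])
        sims[word] = ranked[:top_k]           # keep only the most similar words
    return sims

# e.g. similarity_sets({'running': {'is', 'fast', 'water'},
#                       'swimming': {'is', 'fast', 'pool'}})['running']
# -> [('swimming', 2)]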
3 Letter Successor Variety on Context Similarity Vectors
The second step of the algorithm described in this study is based on LSV and takes the context similarity vectors of the previous step as input. There is currently one free (probably language-independent) parameter required at this point: from the context similarity vectors, only the 150 most similar words are kept for further processing. This is because keeping all similar words reintroduces some of the noise discussed in section 1.1. Another noteworthy detail is that from this point on, all remaining word forms in the vector are treated as a set of words (dubbed the similarity set) without any ranking. Later, it might be possible to optimize the core algorithm by taking the similarity ratings into account.

In the next paragraphs, the German word form glückliche (happy) will be used as an example, since its morphological segmentation includes two suffixes: glück-lich-e. The -lich derives an adjective from the noun Glück (happiness) and the -e is an ambiguous inflectional ending; one of its meanings is female gender. In the examples given, the character # marks the beginning and end of the word form.

The LSV based algorithm computes several values for each transition between characters of a given word form: the left and the right letter successor variety, the overlap factor, the bigram weight and the multiletter weight. Finally, all of them are combined in order to produce a score for each transition. A high score translates into a morpheme boundary by means of a threshold. The complete computation for the example is additionally depicted in Table 1.

Computing the letter successor variety is done as follows: count the number of different letters encountered after a given string from the beginning (or before it from the end, respectively). In our example, after the string #glü only 1 distinct letter is found in the similarity set, although 4 word forms begin with this string. The same can be done for the other direction: before the string liche#, 7 different letters are encountered out of a total of 15 word forms ending with that string.
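A minimal sketch of these counts over the similarity set (my own illustration; the function name and output layout are assumptions, not the paper's):

def successor_variety(word, similarity_set):
    """For every transition i inside `word`, count the left and right letter
    successor variety within the similarity set, together with the number of
    word forms sharing the prefix word[:i] and the suffix word[i:]."""
    rows = []
    for i in range(1, len(word)):
        prefix, suffix = word[:i], word[i:]
        # distinct letters seen after the prefix / before the suffix
        # ('#' stands for the word boundary when a word ends exactly there)
        after = {w[i] if len(w) > i else '#'
                 for w in similarity_set if w.startswith(prefix)}
        before = {w[-len(suffix) - 1] if len(w) > len(suffix) else '#'
                  for w in similarity_set if w.endswith(suffix)}
        freq_left = sum(w.startswith(prefix) for w in similarity_set)
        freq_right = sum(w.endswith(suffix) for w in similarity_set)
        rows.append((prefix + '-' + suffix, len(after), len(before),
                     freq_left, freq_right))
    return rows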
Computing the overlap factors: in order to detect that a smaller string is part of a larger morpheme, it is possible to count how many of the words seen with the suffix -iche# (17 of 150 in the example) have also been seen with -liche# (15 of 150), see also (Déjean 98). In this case, this is strong evidence that -iche# is part of -liche#, and thus a weight of 15/17 = 0.88 is computed for the transition at this place: glück-liche.

A problem with the overlap factor is that in some languages there are phonemes which are realized as a succession of several letters. This is the case for th in English or sch in German. This means that after the single 'letter' sch in, for example, schlimme (bad), 7 different letters can be observed. The overlap factor at this point is wrongly 1.0, because all 18 words which begin with #sc- also begin with #sch-. But only 18 out of the 150 words begin with the multiletter sch at all, so the overlap factor should in fact be 18/150 = 0.12 (at the same time expressing the uncertainty of making a decision at this point, because only 18 out of 150 words begin with the same letter). In order to detect this, the hypothesis has been made that multiletters generally have a far higher frequency than other bi- or trigrams. This can be expressed as the multiletter weight. To compute it, a ranking of all bi- and trigrams is produced and a weight between 1.0 (for the highest ranked) and 0.0 (for the lowest ranked) is assigned to each n-gram. Using the weight of either the bigram or the trigram to the left (or to the right, respectively), it is possible to take the frequency count of the string one or two transitions further, depending on whether the weight of the bi- or the trigram is higher. This second overlap factor is then taken into account in the form of a weighted average. In the example in Table 1, for the transition li-ch this means that to the right there is the frequency based bigram weight of 0.6 for the bigram ch, which is larger than the trigram weight 0.5 for che. Therefore the final right overlap factor is the weighted average (36/39 ∗ 1.0 + 36/129 ∗ 0.6)/(1.0 + 0.6) = 0.68 instead of the initial 36/39 = 0.92. Over 3% of the German words begin with sch, and almost all of them were wrongly analyzed as having this prefix. After applying multiletter weights, almost none of them were falsely analyzed, with only minor changes to the analyses of other cases.

Since this algorithm does not (yet) take the distributions of the potential suffixes into account, it has a bias towards analyzing frequent strings at the end of a word as suffixes. This and other effects can lead to a word like Barbier being falsely analyzed as Barbi-er, because -er is one of the most frequent suffixes in German. However, in this case the bigram ie is divided, which in fact is a multiletter. Thus, the bigram frequency based ranking can be reused in order to distribute a bigram weight. This weight states how improbable it is for a division to occur within that bigram, where 0.9, for example, means that it is extremely improbable for a division to occur at that place. In Table 1, for example, the value 0.7 in the row of the bigram weight means that it is quite improbable to divide the string ch.

Another consequence of the missing distributions of affixes at this stage is that common strings encountered at the beginning and the end of word forms can be overestimated. Therefore, a trivial uncertainty weight has temporarily been introduced which weights down all short strings (uni- and bigrams) at the beginning or end of word forms. In the experiments conducted for the evaluation it was arbitrarily chosen to be 0.3 for unigrams and 0.6 for bigrams at the ends of the word forms. Thus, for the word form rote the final score for rot-e would be 11.2, but since -e is a unigram at the end of a word, the score becomes 11.2 ∗ 0.3 = 3.36. This mechanism removed all wrong and approximately half of the correct affix boundary identifications at the beginnings or ends of word forms. The remaining half of correct boundary identifications will still be enough to induce learning of these affixes in a postprocessing machine learning step. The missing distributions of affixes can later be added in a postprocessing step in order to refine the results. This bootstrapping process is subject to further research.

The final right score for any transition can then be computed quite easily (and in the same manner as the left score). It consists of the multiplication of the initially obtained LSV with the averaged overlap factor and the inverse bigram weight. In the example in Table 1, for the transition glückli-che the right score is computed as follows: the initial LSV is 4, the weighted overlap factor is (36/39 ∗ 1.0 + 36/129 ∗ 0.6)/(1.0 + 0.6) = 0.68 and the bigram weight is 0.2. Thus, the result is 4 ∗ 0.68 ∗ (1.0 − 0.2) = 2.2 (a small sketch of this combination follows below). After applying the same method (with left and right exchanged) in order to compute the final left score, the final overall score for a given transition is the maximum of the left and the right score.

The final scores can then be interpreted as representing morpheme boundaries. There are various possibilities to interpret such scores. In this first prototype, a simple threshold has been introduced as another free parameter. All final scores above the threshold are considered to mark morpheme boundaries, and the words are then segmented using these boundaries. The difference between the final left and right score can be used to classify the morphemes: if the right score is higher than the left score, then the morpheme discovered to the right is probably a suffix, and a prefix otherwise. This topic is subject to further research and an evaluation is not yet available. I am inclined to try a model such as the one described in (Creutz & Lagus 05) for a more proficient tagging of the categories.
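To make the combination of the individual weights concrete, the following sketch reproduces the final right score of the worked example. It is my own condensation of the description above, and all parameter names are assumptions rather than the author's terminology.

def right_score(lsv_right, freq_suffix, freq_shorter_suffix, freq_skip_ngram,
                bigram_w_right, trigram_w_right, bigram_weight):
    """Combine the raw right LSV with the averaged overlap factor and the
    inverse bigram weight for one transition.
    freq_suffix / freq_shorter_suffix is the plain overlap factor, while
    freq_suffix / freq_skip_ngram is the overlap against the string that
    skips the potential multiletter (bi- or trigram, whichever ranks higher)."""
    plain_overlap = freq_suffix / freq_shorter_suffix
    ml_weight = max(bigram_w_right, trigram_w_right)
    ml_overlap = freq_suffix / freq_skip_ngram
    averaged = (plain_overlap * 1.0 + ml_overlap * ml_weight) / (1.0 + ml_weight)
    return lsv_right * averaged * (1.0 - bigram_weight)

# The glückli-che transition from Table 1:
# (36/39 * 1.0 + 36/129 * 0.6) / (1.0 + 0.6) = 0.68 and 4 * 0.68 * (1 - 0.2) = 2.2
print(round(right_score(4, 36, 39, 129, 0.6, 0.5, 0.2), 1))  # -> 2.2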
the input word:         #     g     l     ü     c     k     l     i     c     h     e     #
LSV from left:                      6     2     1     1     1     1     1     1     2
LSV from right:                     2     1     1     2     7     2     4     3     16
freq. left string:      150         5     4     4     4     4     4     4     4     4
freq. right string:                 3     3     3     4     15    17    36    39    129   150
multil. bigram left:                      0.0   0.0   0.0   0.1   0.0   0.2   0.2   0.6
multil. trigram left:                           0.0   0.0   0.0   0.0   0.0   0.1   0.3
multil. bigram right:                           0.1   0.0   0.0   0.2   0.6   0.2
multil. trigram right:                          0.0   0.0   0.1   0.3   0.5
bigram weight:                                  0.0   0.1   0.0   0.2   0.2   0.7   0.3
final score left:                   0.2   0.4   0.8   0.9   1.0   0.8   0.8   0.2   0.4
final score right:                  2.0   1.0   0.8   0.5   6.3   0.7   2.2   0.2   3.0

Table 1: The LSV based algorithm depicted for the example of the German word form glück-lich-e. Weights are rounded, and the given scores and weights refer to the transition to the left of the letter.
3.1 Clustering of word forms of the same lemma
The simple detection of affixes described in the previous section can be used for the clustering of word forms belonging to the same lemma. In fact, this task can be reformulated as a retrieval task: for a given input word form A, retrieve all word forms of the same lemma. The first step is to identify and remove the affixes of a given word form based on the detected morpheme boundaries. In the example glückliche, the stem glück remains. The second step is to remove these and all trailing affixes from all words in the context. Thus, if the suffix -lich- was detected, then removing it and everything following it from the word form glücklichen also results in the string glück. The removals are only temporary, in order to detect which word forms have the same stem. After retrieving all word forms with the same stem, the initial word forms can be printed out as the result of the word form retrieval algorithm. Additionally, since the word form set used contains only contextually similar words, it is also possible to print out all word forms whose stems differ in only one letter, which might help to detect stem alternations. This is a further side effect of the algorithm which has to be investigated in detail.
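A minimal sketch of this retrieval step, assuming a segmentation function is already available (the helper names and the choice of the first morph as the stem are my own simplifications, not the paper's exact procedure):

def cluster_same_lemma(word, similarity_set, segment):
    """Retrieve word forms from the similarity set that share the stem of `word`
    once the detected affixes are stripped.  `segment` is assumed to return the
    list of morphs of a word form (e.g. ['glück', 'lich', 'e'])."""
    stem = segment(word)[0]
    cluster, alternations = [], []
    for other in similarity_set:
        other_stem = segment(other)[0]
        if other_stem == stem:
            cluster.append(other)              # same stem: lemma candidate
        elif (len(other_stem) == len(stem)
              and sum(a != b for a, b in zip(other_stem, stem)) == 1):
            alternations.append(other)         # stems differing in one letter
    return cluster, alternations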
4 Evaluation
There are almost as many different evaluation methods as there are algorithms for any of the tasks of morpheme identification, morphological segmentation and lemma-to-word-form clustering. Since one of the two main goals of the algorithm described in this paper is to produce correct morpheme segmentations, an evaluation is provided which measures the precision and recall of finding proper morpheme boundaries. However, most algorithms such as (Goldsmith 01) and (Schone & Jurafsky 01) provide an evaluation which measures the accuracy and productiveness of word form retrieval. Therefore the precision and recall numbers provided below cannot be compared to the evaluations of the cited algorithms. In general, it would be necessary to organize a standardized evaluation framework such as SENSEVAL 2 (SENSEVAL 01) for
the word sense disambiguation task. This framework should comprise several corpora of raw text from typologically distinct languages, along with lists of both morphologically correct segmentations and lemmatizations for each language. Since the evaluation given here is based on German and English, with the German and English parts of CELEX (Baayen et al. 95) as the gold standard, it is, for example, quite difficult to tell how this algorithm relates to the one described by (Creutz & Lagus 04), which was tested on Finnish and English. Providing more complete and comparable evaluations, as well as evaluating the word form clustering algorithm, will be the focus of future work.

As mentioned above, the languages used to evaluate the algorithm were German and English. I used the corpora available from (Quasthoff 98). The German corpus contained about 24 million sentences and the English corpus about 13 million sentences. The gold standard from which information about word form stems and correct morphological segmentation was acquired was CELEX (Baayen et al. 95). The computation of the neighbourhood cooccurrences and the similarities based on them takes up by far the most computation time (several days on a modern PC) due to the huge number of sentences. The computation of similarity has been optimized so that not every word is compared with every other: cues from sentence cooccurrences are used to single out candidate words which might have some neighbour cooccurrences in common. The computation time of the LSV based algorithm itself, once the similarity data is available, is negligible.

The evaluation was performed on the 20,000 most frequent word forms. In the evaluation, the overlap between the manually tagged morpheme boundaries and the computed ones is measured. Precision is the number of correct boundaries found divided by the total number of boundaries found. Recall is the number of correct boundaries found divided by the total number of boundaries present in the gold standard (a small sketch of this boundary-level scoring is given at the end of this section). Thus, a complex word analyzed by the algorithm as ent-zünde-t would have one correctly detected boundary for the prefix ent- and one wrongly detected boundary, because the correct analysis would be ent-zünd-et. According to CELEX, however, both boundaries count as wrong, because this word is not analyzed in CELEX at all. Such cases were, of course,
excluded from the evaluation. Table 2 gives an overview of both precision and recall for three different threshold settings, as well as the most frequent prefixes and suffixes found in the data.

                  German          English
threshold t=3     Prec. 75.59     Prec. 61.80
                  Rec.  44.83     Rec.  29.02
threshold t=4     Prec. 79.88     Prec. 62.97
                  Rec.  32.48     Rec.  21.00
threshold t=6     Prec. 83.19     Prec. 66.02
                  Rec.  15.24     Rec.  11.31

Prefixes: ver-, be-, ge-, Ver-, Be-, un-, ein-, Bundes-, aus-
Suffixes: -en, -e, -t, -er, -ung, -s, -es, -lich, -te

Table 2: Precision and recall of morpheme boundary detection for various threshold settings for both corpora, and the most frequent pre- and suffixes for the German corpus only.

As can be seen, precision cannot be raised much by increasing the threshold, but recall decreases significantly when doing so. An error analysis shows that over 50% of the 'errors' according to CELEX were not errors, and most of the remaining errors are at least arguable. For example, in most languages the gender marking is considered a suffix. In German, because of the absence of neuter and masculine suffixes, the feminine suffix -e is not considered a suffix in CELEX, even though there are word forms with the same stem and without this 'suffix', such as Schule and schulisch. Consequently, all occurrences of the feminine suffix are marked as wrong according to CELEX. Even worse, the rather low 61.80%-66.02% precision of the English evaluation results from the fact that the algorithm always analyzes words like lured as lur-ed instead of lure-d as given in CELEX. Another way to see this, however, is that a deletion occurs because the past tense -ed would otherwise produce a double ee in the form lureed, which is phonologically unsound. Over 58% of all errors in the English evaluation are due to this problem; if it did not count as a mistake (by adding some kind of deletion detection algorithm, perhaps based on semantics), precision would be around 85%. Since there are more examples like this (plopp-ed according to the algorithm and wrong according to CELEX, this time an addition, again for phonological reasons), the estimated precision of the algorithm in general lies somewhere between 90% and 95%. Other common sources of 'errors' are words of foreign origin, especially Latin words in the two evaluated languages.
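The boundary-level scoring referred to above can be written down in a few lines. This is a sketch under the assumption that segmentations are represented as sets of boundary positions per word form (my representation, not the paper's):

def boundary_precision_recall(predicted, gold):
    """predicted / gold: dicts mapping a word form to the set of character
    positions at which a morpheme boundary is placed.  Word forms that are
    not analyzed in the gold standard are skipped, as described above."""
    found = correct = present = 0
    for word, boundaries in predicted.items():
        if word not in gold:
            continue                      # e.g. not analyzed in CELEX
        found += len(boundaries)
        correct += len(boundaries & gold[word])
        present += len(gold[word])
    precision = correct / found if found else 0.0
    recall = correct / present if present else 0.0
    return precision, recall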
5 Conclusions
This study presents a method which performs morpheme analyses of the word forms of a given language based on a corpus of raw text. The results shown are competitive, and the algorithm can be improved in many different ways. Possible enhancements include iterative applications of the algorithm while utilizing knowledge such as affix frequency distributions or the affixes found in earlier steps. One particular advantage of the described algorithm is that the intermediary results, namely the sets of similar words based on neighbourhood cooccurrences, can be used to explain the results obtained by the algorithm. In fact, the algorithm works in such a way that its decisions are grounded directly in the rules of morphology, e.g. that a morpheme is a unit which can be replaced by another morpheme in order to produce another existing word form. Therefore, if the algorithm makes a seemingly wrong decision (according to CELEX), such as assigning the word Virologe the suffix -ologe, it is possible to produce the following explanation from the available data: not only do all words appearing in similar contexts have this suffix (e.g. Biologe), but there are also almost no words that have the suffix -loge without the preceding o. The few remaining words (e.g. Kataloge, catalogues) usually have no similarity (using a word similarity algorithm) with the words ending in -ologe. This makes the algorithm a suitable tool for a linguistic analysis of the morphology of an unknown language.
References

(Argamon et al. 04) Shlomo Argamon, Navot Akiva, Amihood Amir, and Oren Kapah. Efficient unsupervised recursive word segmentation using minimum description length. In Proceedings of Coling 2004, Geneva, Switzerland, 2004.

(Baayen et al. 95) R. Harald Baayen, Richard Piepenbrock, and Léon Gulikers. The CELEX lexical database (CD-ROM). Linguistic Data Consortium, University of Pennsylvania, Philadelphia, PA, http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC96L14, 1995.

(Baroni 03) Marco Baroni. Distribution-driven morpheme discovery: A computational/experimental study. Yearbook of Morphology, pages 213–248, 2003.

(Bordag 05) Stefan Bordag. Algorithms extracting linguistic relations and their evaluation. In preparation, 2005.

(Brent et al. 95) Michael Brent, Sreerama K. Murthy, and Andrew Lundberg. Discovering morphemic suffixes: A case study in MDL induction. In 5th International Workshop on Artificial Intelligence and Statistics, Ft. Lauderdale, Florida, 1995.

(Creutz & Lagus 04) Mathias Creutz and Krista Lagus. Induction of simple morphology for highly inflecting languages. In Proceedings of the 7th Meeting of the ACL Special Interest Group in Computational Phonology (SIGPHON), pages 43–51, Barcelona, July 2004.

(Creutz & Lagus 05) Mathias Creutz and Krista Lagus. Unsupervised morpheme segmentation and morphology induction from text corpora using Morfessor 1.0. Publications in Computer and Information Science, Report A81, Helsinki University of Technology, March 2005.

(Creutz 03) Mathias Creutz. Unsupervised segmentation of words using prior distributions of morph length and frequency. In Proceedings of ACL-03, the 41st Annual Meeting of the Association for Computational Linguistics, pages 280–287, Sapporo, Japan, July 2003.

(de Marcken 95) Carl de Marcken. The unsupervised acquisition of a lexicon from continuous speech. Memo 1558, MIT Artificial Intelligence Lab, 1995.

(Deerwester et al. 90) Scott Deerwester, Susan T. Dumais, George W. Furnas, Thomas K. Landauer, and Richard Harshman. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391–407, 1990.

(Déjean 98) Hervé Déjean. Morphemes as necessary concept for structures discovery from untagged corpora. In D.M.W. Powers, editor, NeMLaP3/CoNLL98 Workshop on Paradigms and Grounding in Natural Language Learning, ACL, pages 295–299, Adelaide, January 1998.

(Dunning 93) T. E. Dunning. Accurate methods for the statistics of surprise and coincidence. Computational Linguistics, 19(1):61–74, 1993.

(Frakes 92) William R. Frakes. Stemming algorithms. Chapter 8, pages 131–160, in Frakes and Baeza-Yates, 1992.

(Goldsmith 00) John Goldsmith. Linguistica: An automatic morphological analyzer. In Arika Okrent and John Boyle, editors, Proceedings from the Main Session of the Chicago Linguistic Society's Thirty-sixth Meeting, Chicago, 2000. Chicago Linguistic Society.

(Goldsmith 01) John Goldsmith. Unsupervised learning of the morphology of a natural language. Computational Linguistics, 27(2):153–198, 2001.

(Hafer & Weiss 74) Margaret A. Hafer and Stephen F. Weiss. Word segmentation by letter successor varieties. Information Storage and Retrieval, 10:371–385, 1974.

(Harris 51) Zellig S. Harris. Structural Linguistics. University of Chicago Press, Chicago, 1951.

(Harris 55) Zellig S. Harris. From phonemes to morphemes. Language, 31(2):190–222, 1955.

(Kazakov 97) Dimitar Kazakov. Unsupervised learning of naïve morphology with genetic algorithms. In A. van den Bosch, W. Daelemans, and A. Weijters, editors, Workshop Notes of the ECML/MLnet Workshop on Empirical Learning of Natural Language Processing Tasks, pages 105–112, Prague, Czech Republic, April 1997.

(Quasthoff 98) Uwe Quasthoff. Projekt: Der Deutsche Wortschatz. In Gerhard Heyer and Christian Wolff, editors, Tagungsband zur GLDV-Tagung, pages 93–99, Leipzig, March 1998. Deutscher Universitätsverlag.

(Schone & Jurafsky 01) Patrick Schone and Daniel Jurafsky. Language-independent induction of part of speech class labels using only language universals. In Machine Learning: Beyond Supervision, Workshop at IJCAI-2001, Seattle, WA, August 2001.

(SENSEVAL 01) SENSEVAL. Second International Workshop on Evaluating Word Sense Disambiguation Systems. Toulouse, France, http://www.sle.sharp.co.uk/senseval2/, 5–6 July 2001.