AMP: A SYSTEM FOR AUTOMATED MORPHOLOGICAL PROCESSING OF ANCIENT GREEK

George Tambouratzis (1) & Marina Vassiliou (1, 2)

(1) Institute for Language and Speech Processing & (2) University of Athens

Abstract

The present article describes AMP, a system for automated morphological processing of Ancient Greek word forms. It is a hybrid approach, combining pattern recognition techniques with limited linguistic knowledge to achieve accurate segmentation into stem and ending, and is expected to contribute substantially to the creation and/or enrichment of Greek morphological lexica. Though the current implementation concerns Attic dialect word forms, its modularity ensures its extensibility to other dialects and/or synchronies of the Greek language with minor modifications. AMP is initialised with a set of potential endings and, through an iterative masking-and-matching process, expands the sets of possible stems and endings. Subsequently, a group of criteria is used for ranking all possible morphological analyses and determining the most likely segmentation of each word form. In comparison to other approaches, AMP explicitly models stem-ending interactions during the word formation process, in order to improve segmentation accuracy.

1. Introduction

The Greek language is highly inflectional. This morphological variety poses a problem for many applications (e.g. information retrieval, machine translation), where word forms must be processed and their stems determined in an automated manner. To support this, the creation of morphological lexica is essential. Furthermore, given the long history of the Greek language, a series of morphological lexica is needed, each covering a specific time period and/or geographical area. Substantial benefits can thus be obtained if morphological information is generated not manually, but with the minimum possible human intervention.

The research presented here is intended to provide a system that allows the automated morphological processing (in particular the stemming, i.e. segmentation into stem and ending) of a given set of word forms, with the minimum amount of external guidance and post-processing. Tambouratzis & Carayannis (2001) have proposed the Automated Morphological Processor (AMP), a system for the morphological processing of Modern Greek (MG). This system draws on the pattern recognition paradigm and, more specifically, employs an iterative masking-and-matching technique for detecting stems and endings. A priori linguistic knowledge is limited, introduced only to achieve the segmentation accuracy necessary for the system to be used in practical applications. AMP has by design a modular architecture, so that it can be easily customised to capture the idiosyncrasies of the Greek language over time.

In this article, a substantial development of the AMP system is presented, which enables the accurate morphological analysis of Ancient Greek (AG). The Attic dialect has been selected as a case study, leading to modifications of the AMP system that reflect the characteristics and properties of this dialect. One of the main tasks within the design of AMP is to determine the extent to which it is possible to limit a priori linguistic knowledge. Though the introduction of such information has been shown to considerably improve the word segmentation accuracy (Tambouratzis & Carayannis 2001), it was found that a limited number of morphophonological rules suffices for the accurate segmentation of the majority of AG word forms.

For our experiments we have used as a test corpus a number of Attic dialect texts assembled from the Thesaurus Linguae Graecae (TLG) [1] and encoded in the Beta coding scheme [2]. The accuracy of the system is determined by automatically processing the corpus word forms and then comparing the system results with the segmentation produced by a human expert on the same texts (hereafter termed the gold-standard). The experimental results indicate that for the majority of word forms the AMP output coincides with the gold-standard segmentation.

2. Review of morphological processing approaches

Research on the morphological processing of word forms has focussed on two directions, rule-based systems and connectionist techniques. One of the first working rule-based systems is that proposed by Porter (1980) for suffix stripping of words in Modern English, which was implemented on an early computer system. More recently, approaches for the automated morphological processing of word forms have been proposed, in an effort to support the linguistic processing of texts using machines. Kazakov & Manandhar (1998) have proposed a hybrid approach, which combines genetic algorithms and inductive logic programming to generate an optimal set of word segmentation rules. Neuvel & Fulop (2002) propose a semi-supervised system that uses the word forms contained in a small reference lexicon as templates to determine the segmental differences between these word forms, in conjunction with part-of-speech information for each word.

[1] http://www.tlg.uci.edu/
[2] http://www.cs.utk.edu/~mclennan/OM/Beta-codes.html


Goldsmith (2001) presents a method for the morphological processing of word forms, which is to a large extent language-independent and is implemented as the software package Linguistica [3]. This method is based on the Minimum Description Length (MDL) principle, analysing word forms into morphemes, so that the representation created is the most compact possible. Details of the Linguistica implementation, together with a method for evaluating its performance in a given language, are presented in Goldsmith (2006). Creutz & Lagus (2002) propose two related unsupervised methods for segmenting word forms into morpheme-like units, designed especially for languages with rich morphology. These methods are based on (i) the MDL principle and (ii) the Maximum Likelihood Optimisation method. A comparison (Creutz 2003) indicates that the MDL-based variant outperforms Linguistica, in particular for larger corpora. Schone & Jurafsky (2000) introduce semantic information to improve the accuracy of morphology learning, by allowing associations only when the stem and stem + affix patterns are semantically similar. Based on this work, Schone & Jurafsky (2001) propose an algorithm for the knowledge-free induction of inflectional morphologies, combining cues from orthography, semantics and syntactic distributions.

The number of systems designed for the morphological processing of Greek is limited. To the authors’ knowledge, apart from the MG version of the AMP system (Tambouratzis & Carayannis 2001), only Sgarbas et al. (1995) have presented a two-level morphological processor for Modern Greek, based on the PC-KIMMO environment (Koskenniemi 1983).

3. Segmenting Ancient Greek word forms

This section presents the AG version of the AMP system. The AMP system comprises four phases, namely (1) Initialisation, (2) Masking-and-Matching, (3) Enrichment and (4) Evaluation, which are described in the following sections.

3.1 Phase 1: Initialisation

The system is initialised with a small set of a priori endings, which can (a) be manually compiled, (b) be statistically determined based on the frequency-of-occurrence of endings in the given text corpus or (c) be the output of the combination of the previous two methods.

3.2 Phase 2: Masking-and-Matching

After system initialisation, the main system module is invoked, i.e. the masking-and-matching process, which is inspired by the pattern recognition paradigm and determines the sets of possible stems and endings within a given corpus.

[3] http://linguistica.uchicago.edu/

The masking-and-matching approach relies on detecting the matching parts of different patterns, while ignoring the remaining parts. For instance, when comparing the patterns $x_1 x_2$ and $y_1 y_2$, masking-and-matching focusses on the similarity between $x_1$ and $y_1$ (attempting to match these parts exactly) and temporarily ignores $x_2$ and $y_2$. In the current implementation, the patterns $x_i$ and $y_i$ correspond to stems and endings respectively. Consequently, the first part of a word form is matched to a known stem, resulting in the determination of a candidate ending. Likewise, the second part of a word form is matched to a known ending, thus generating a candidate stem.

More specifically, the masking-and-matching module starts from the initialisation set and determines new candidate stems and endings. By iterative executions of this process the system (a) detects new segmentations, thus expanding the sets of potential stems and endings, and (b) updates the frequencies of already defined stems and endings, when encountered in new environments, by utilising the morphological knowledge accumulated within all previous iterations. During masking-and-matching a number of constraints, reflecting empirical linguistic knowledge, are activated, in order to optimise the system performance:

• Endings should have a maximum length of 7 characters;

• Stems should have a minimum length of 3 characters, to avoid the subsequent overgeneration of possible endings [4];

• Accents are omitted from the stem but retained when situated within the ending, since then they are considered to be characteristic of the inflectional paradigm;

• No null endings are allowed.

It is worth mentioning that rules such as the aforementioned ones have been integrated even in language-independent systems performing morphological analysis (cf. Goldsmith 2006).

3.3 Phase 3: Enrichment

A fundamental assumption when designing AMP for Modern Greek has been that each word form consists of one stem and one ending, with no interactions taking place during word formation. Assuming that an n-letter word is denoted as $\langle w_1 w_2 \ldots w_n \rangle$, a p-letter ending as $\langle e_1 e_2 \ldots e_p \rangle$ and an m-letter stem as $\langle s_1 s_2 \ldots s_m \rangle$, this assumption is expressed as in (1):

[4] This rule is activated during Phase 2 (masking-and-matching), while the segmentation of small-stem words is handled in the final analysis phase, to avoid the involvement of very short stems in the masking-and-matching iterations.

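$$
\langle w_1 w_2 \ldots w_n \rangle = \langle s_1 s_2 \ldots s_m \rangle + \langle e_1 e_2 \ldots e_p \rangle,
\quad \text{where} \quad
\begin{cases}
n = m + p \\
\text{if } i \le m, \text{ then } w_i \equiv s_i \\
\text{if } i > m, \text{ then } w_i \equiv e_{i-m}
\end{cases}
\tag{1}
$$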

The first relation indicates that the length of a word is equal to the sum of the lengths of its stem and ending. The second relation indicates that each of the first m characters in the word coincides with the corresponding character of the stem, while the third relation indicates that each of the last n-m word characters coincides with the corresponding character of the ending. The assumption expressed by (1) indeed holds for the vast majority of MG word forms, as the experimental results obtained have shown, with the segmentation accuracy approximating 95%, though it does not necessarily hold for AG or other synchronies of the Greek language. Therefore, when using AMP to process AG texts, it has been decided to abandon the above restriction and model almost all possible stem-ending interactions, as this phenomenon spans a much wider space of the AG inflectional paradigms. Indicative cases are given below:

1) φύλαξ -> φύλακ (stem) + ς (ending) [= guardian] (noun: masculine, nominative, singular)

2) λυθείς -> λυθέντ (stem) + ς (ending) [= untied] (passive aorist participle, masculine, nominative, singular of the verb ‘λύω’)

The stem-ending interactions modelled are distinguished into interactions between consonants (15 rules in total) and interactions between vowels (8 rules), the latter being most frequently associated with contracted verbs [5]. All interaction rules are associated with weights, which indicate the likelihood of their contributing to the generation of the correct segmentation. Indicative examples of such rules are shown in Table 1.

Table 1. Excerpt of the stem-ending interactions modelled in AMP

Rule | Contracted form | Expanded form | Example
1(a) | εῖ | ε+ε | δοκεῖτε -> δοκέ+ετε
1(b) | εῖ | ε+ει | δοκεῖ -> δοκέ+ει
2(a) | ξ | κ+σ | ἔδοξεν -> ἔδοκ+σεν
2(b) | ξ | γ+σ | λέξουσιν -> λέγ+σουσιν
2(c) | ξ | χ+σ | ἕξετε -> ἕχ+σετε
2(d) | ξ | ττ+σ | πράξας -> πράττ+σας

[5] The process of contraction unites in a single long vowel or diphthong two adjacent vowels or a vowel and a diphthong in successive syllables in the same word (Smyth 1956, Σταματάκος 1949, Τζάρτζανος 1960).

The interactions modelled help the system enrich the output of masking-and-matching (Phase 2), hence leading to the generation of more potential segmentations, out of which the most likely one will be selected at a later stage. Nevertheless, these interactions are only searched for after the completion of the initial masking-and-matching iterations, in order to avoid over-generalisation. Table 2 presents various possible segmentations for the word form ‘ἐξείλετο’ (3rd person singular, indicative, middle aorist of the verb ‘ἐξαιροῦμαι’ ‘take out for oneself’). Segmentations 5 and 6 are generated by applying rule 1 of Table 1, while segmentations 8 to 11 are generated by rule 2.

Table 2. List of possible segmentations for the word form “ἐξείλετο”

Segmentation number | Stem | Ending
1 | ἐξείλετ- | -ο
2 | ἐξείλε- | -το
3 | ἐξείλ- | -ετο
4 | ἐξεί- | -λετο
5 | ἐξε- | -ελετο
6 | ἐξε- | -είλετο
7 | ἐξ- | -είλετο
8 | ἐκ- | -σείλετο
9 | ἐγ- | -σείλετο
10 | ἐχ- | -σείλετο
11 | ἐττ- | -σείλετο
12 | ἐ- | -ξείλετο
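The way interaction rules enlarge the candidate set can be illustrated with a minimal sketch. It is not the AMP implementation: the function names are hypothetical, only the six rules of Table 1 are included (without their weights), and the word form ‘ἐξείλετο’ is written in a simplified Beta-code-like transliteration with breathings and accents dropped ("ECEILETO"), which is an assumption made to keep the example ASCII-only.

```python
# Simplified, hypothetical transliteration: Beta-code-like capitals with
# breathings and accents dropped, so the word form of Table 2 is "ECEILETO".
MAX_ENDING_LEN = 7  # Phase 2 constraint: endings have at most 7 characters

# (contracted form, stem-final part, ending-initial part), after Table 1.
INTERACTION_RULES = [
    ("EI", "E", "E"),   # rule 1(a)
    ("EI", "E", "EI"),  # rule 1(b)
    ("C", "K", "S"),    # rule 2(a)
    ("C", "G", "S"),    # rule 2(b)
    ("C", "X", "S"),    # rule 2(c)
    ("C", "TT", "S"),   # rule 2(d)
]


def plain_splits(word):
    """All stem/ending splits with a non-null ending of at most MAX_ENDING_LEN."""
    return [(word[:i], word[i:])
            for i in range(len(word) - 1, 0, -1)
            if len(word) - i <= MAX_ENDING_LEN]


def interaction_splits(word):
    """Splits where a contracted form is expanded across the stem/ending boundary."""
    results = []
    for contracted, stem_part, ending_part in INTERACTION_RULES:
        start = word.find(contracted)
        while start != -1:
            stem = word[:start] + stem_part
            ending = ending_part + word[start + len(contracted):]
            if len(ending) <= MAX_ENDING_LEN:
                results.append((stem, ending))
            start = word.find(contracted, start + 1)
    return results


candidates = plain_splits("ECEILETO") + interaction_splits("ECEILETO")
for stem, ending in candidates:
    print(f"{stem}- -{ending}")
```

The six rule-based candidates correspond to segmentations 5, 6 and 8 to 11 of Table 2. The sketch yields 13 candidates rather than 12, because it also keeps the plain split ECE- -ILETO, which falls inside the diphthong and does not appear in Table 2.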

It should be noted that, due to the modelling of the stem-ending interactions, the number of candidate segmentations increases substantially, rendering the processing of AG word forms much more intricate.

3.4 Phase 4: Evaluation

The fourth phase is associated with the evaluation and ranking of all possible segmentations generated by the system, the aim being to define the most likely solution per word form, i.e. the most likely segmentation into stem and ending. To this end, AMP uses the following set of criteria:

Criterion C1. Maximum frequency of ending: This criterion selects the solution for which the ending has the highest frequency-of-occurrence, i.e. the ending that appears with the largest number of distinct stems.

Criterion C2. Maximum frequency of stem: This criterion selects the solution involving the stem that appears with the largest number of distinct endings.

Criterion C3. Maximum combined frequency of stems and endings: This criterion selects the solution based on the highest frequency-of-occurrence of both the ending and the stem. This collective frequency is calculated by direct summation of the two frequencies.

Criterion C4. Minmax-frequency of the stem and ending parts: According to this criterion, the system chooses the solution by maximising the minimum of the frequencies-of-occurrence of the stem and the ending.

Criterion C5. Minimum length ending: Based on this criterion, the solution involving the shortest possible ending is selected.

Criterion C6. Minimum length ending with a priori information: This criterion extends and improves C5 by consulting linguistic knowledge. Hence, it favours those solutions containing a priori endings.

Criterion C7. Joint entropy of stem and ending: This criterion selects the solution for which the combined entropy (Duda et al. 2001), i.e. randomness, of the stem and the ending is minimised. The rationale for introducing this metric is that, by minimising the entropy of the final segmentation, the solutions involving frequent stems and endings will be favoured.

Criterion C8. Deviation of the frequency ratio from a median value: This criterion takes into account the relative frequencies of stems and endings and selects the segmentation which has the minimum distance from an “ideal” ratio of stem/ending frequencies.

Criterion C9. Ratio of stem/ending frequencies & PoS knowledge: This criterion improves on C8 by analysing the segmentation accuracy per part-of-speech (PoS). The two main categories examined are nominal and verbal word forms.

3.5 AMP Architecture

Figure 1 depicts the architecture of the revised AMP system. Its crucial difference from the implementation for MG is the introduction of the stem-ending interaction module. A given text corpus is pre-processed, in order to extract all the distinct word forms it contains, which are then fed to the system for segmentation. The system is initialised with a group of a priori endings (Phase 1) and then executes the masking-and-matching process in an iterative mode, yielding potential segmentations for each word form of the given corpus (Phase 2). The set of generated segmentations is subsequently enriched through the application of the interaction rules (Phase 3). Finally, the system consults a set of criteria, in order to evaluate and rank the solutions derived (Phase 4) and reach the optimal (i.e. the most likely) segmentation for each word form.
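The iterative masking-and-matching step (Phase 2) can be sketched as follows. This is a simplified illustration under stated assumptions rather than the AMP code: the function name and parameters are hypothetical, accents are not handled, frequencies are not tracked, and the toy word forms are unaccented transliterations chosen only for the example. The length constraints and the ban on null endings follow Section 3.2.

```python
def masking_and_matching(words, initial_endings, iterations=5,
                         min_stem_len=3, max_ending_len=7):
    """Iteratively grow the sets of stems and endings from a priori endings."""
    endings = set(initial_endings)
    stems = set()
    segmentations = set()
    for _ in range(iterations):
        new_stems, new_endings = set(), set()
        for word in words:
            # i is the split point; stopping before len(word) forbids null endings
            for i in range(min_stem_len, len(word)):
                stem, ending = word[:i], word[i:]
                if len(ending) > max_ending_len:
                    continue
                # Match one part against what is already known and
                # treat the other (masked) part as a new candidate.
                if ending in endings or stem in stems:
                    segmentations.add((stem, ending))
                    new_stems.add(stem)
                    new_endings.add(ending)
        if new_stems <= stems and new_endings <= endings:
            break  # nothing new was discovered in this iteration
        stems |= new_stems
        endings |= new_endings
    return stems, endings, segmentations


# Toy, hypothetical word list (unaccented transliterations) and one a priori ending.
toy_words = ["LOGOS", "LOGOU", "LOGON", "NOMOS", "NOMOU", "DWRON"]
stems, endings, segmentations = masking_and_matching(toy_words, {"OS"})
print(sorted(stems), sorted(endings))
```

On this toy input the first iteration finds the stems LOG and NOM through the known ending OS, the second finds the endings OU and ON through those stems, and the third finds the stem DWR through the new ending ON, after which no further stems or endings appear.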


Figure 1. AMP system architecture
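For the final phase of this architecture, the simpler ranking criteria of Section 3.4 (C1 to C5) can be expressed as scoring functions over candidate (stem, ending) pairs. The sketch below is illustrative only: the helper names are hypothetical, the frequency of a stem or ending is taken to be the number of distinct partners it occurs with (as stated for C1 and C2), and the toy segmentations are invented, so no conclusion about the relative merit of the criteria should be drawn from it.

```python
from collections import defaultdict


def frequencies(segmentations):
    """Frequency of a stem/ending = number of distinct partners it occurs with."""
    stems_per_ending = defaultdict(set)
    endings_per_stem = defaultdict(set)
    for stem, ending in segmentations:
        stems_per_ending[ending].add(stem)
        endings_per_stem[stem].add(ending)
    stem_freq = {s: len(e) for s, e in endings_per_stem.items()}
    ending_freq = {e: len(s) for e, s in stems_per_ending.items()}
    return stem_freq, ending_freq


def make_criteria(stem_freq, ending_freq):
    """Scoring functions for C1-C5; a higher score means a preferred solution."""
    def sf(seg):
        return stem_freq.get(seg[0], 0)

    def ef(seg):
        return ending_freq.get(seg[1], 0)

    return {
        "C1": ef,                                  # maximum frequency of ending
        "C2": sf,                                  # maximum frequency of stem
        "C3": lambda seg: sf(seg) + ef(seg),       # summed frequencies
        "C4": lambda seg: min(sf(seg), ef(seg)),   # maximise the minimum
        "C5": lambda seg: -len(seg[1]),            # shortest ending
    }


def rank(candidates, score):
    """Order candidate segmentations from most to least preferred."""
    return sorted(candidates, key=score, reverse=True)


# Toy, invented segmentations collected over a small corpus.
all_segmentations = {
    ("LOG", "OS"), ("LOG", "OU"), ("LOG", "ON"), ("NOM", "OS"), ("NOM", "OU"),
    ("DWR", "ON"), ("LOGO", "S"), ("NOMO", "S"), ("DWRO", "N"),
}
criteria = make_criteria(*frequencies(all_segmentations))
candidates = [("LOG", "OS"), ("LOGO", "S")]
for name in ("C2", "C3", "C4", "C5"):
    print(name, rank(candidates, criteria[name])[0])
```

Criteria C6 to C9 additionally need the a priori endings, entropy estimates or PoS information, and are therefore left out of this sketch.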

4. Experiments with AG texts

4.1 Experimental corpus

In order to evaluate the proposed method for AG, a text corpus has been assembled, containing rhetorical speeches delivered by seven orators in Athenian courts of law in the 5th and 4th centuries B.C. The corpus details are shown in Table 3. Though the number of word forms differs considerably over the seven orators, the ratio of the total number of words to distinct word forms for each orator is stable, ranging from 3.6 to 4.4. Only in the case of Isocrates does this ratio rise to 7.55, due to the larger size of the texts by this orator.

Table 3. Text corpus details

Orator | Number of texts | Total number of words | Number of distinct word forms | Ratio of words to distinct word forms
Andocides | 4 | 17455 | 4377 | 3.99
Antiphon | 15 | 18150 | 4135 | 4.39
Dinarchus | 3 | 10676 | 2936 | 3.64
Aeschines | 15 | 49534 | 11232 | 4.41
Isaeus | 12 | 12158 | 2894 | 4.20
Isocrates | 31 | 119303 | 15792 | 7.55
Lysias | 34 | 16749 | 4351 | 3.85
All orators | 114 | 244025 | 29228 | 8.35

For the evaluation of AMP results, a reference set has been created by manually segmenting a set of word forms into stem and ending to generate a gold-standard, with extensive cross-checking to ensure the consistency of the results. In addition, a PoS categorisation has been performed, each word form being labelled as verbal (both finite and non-finite forms), nominal (including nouns, adjectives, participles) or invariant (i.e. non-inflected forms such as adverbs). More specifically, the first 1,250 word forms of the speeches of Lysias have been manually annotated, followed by all word forms of Isaeus.

4.2 Experimental set-up

The experimental corpus was processed using the AMP system. The results for the texts of Lysias are shown in Figure 2, which gives the number of stems and endings determined at the end of each iteration. A total of 5 masking-and-matching iterations were performed within Phase 2, followed by the application of interaction rules (within the Enrichment Phase), aimed at fine-tuning the segmentation results. More specifically, in the first two iterations of Phase 2, a large number of new stems and endings are discovered. The discovery rate is thereafter reduced (3rd and 4th iterations) and eventually converges to zero (5th iteration). However, after the application of the interaction rules the number of potential solutions increases considerably. These results are valid irrespective of the criterion chosen, as the criterion modifies only the choice of the “preferred” segmentation rather than the segmentations generated. Other experiments, involving text corpora from single orators or combinations of several orators, have produced similar results.


Figure 2. Development of solutions during AMP operations for the texts of Lysias
[Chart: number of stems and endings (vertical axis, 0 to 7000) after Initialisation, Iterations 1 to 5 and Interactions (horizontal axis)]

4.3 Experimental results and assessment of ranking criteria

The ranking criteria of Phase 4 have been applied in turn to the task of choosing the best segmentation. The results obtained for the texts of Isaeus are shown in Table 4, with similar results being obtained for the remaining orators. In this table, the percentages of word forms for which the gold-standard segmentation corresponds to the nth highest-ranked AMP solution (for values of n equal to 1, 2, 3 and 4) are shown separately, while cases for which the gold-standard corresponds to a solution ranked from 5th up to 10th are grouped together. The last column indicates the percentage of word forms for which the gold-standard segmentation is not included within the 10 highest-ranked AMP solutions. Based on their performance, the criteria can be clustered in two groups, (i) the lower-performing criteria, achieving an accuracy substantially lower than 50% and (ii) the higher-performing ones, with a performance exceeding 65%.

Table 4. Comparison of the gold-standard segmentations and AMP results for Isaeus (in percentages)

Criterion | 1st solution | 2nd solution | 3rd solution | 4th solution | 5th-10th solutions | Not found
C1 | 20.83 | 45.94 | 11.81 | 7.63 | 7.91 | 5.88
C2 | 5.52 | 13.25 | 13.58 | 15.79 | 41.60 | 10.27
C3 | 14.01 | 36.91 | 16.73 | 10.13 | 16.66 | 5.55
C4 | 28.38 | 20.51 | 13.11 | 10.49 | 22.94 | 4.57
C5 | 66.97 | 18.44 | 7.08 | 2.65 | 1.41 | 3.45
C6 | 68.97 | 14.77 | 5.66 | 2.36 | 4.43 | 3.81
C7 | 40.47 | 15.39 | 11.69 | 11.32 | 17.17 | 3.96
C8 | 68.53 | 13.14 | 5.95 | 3.52 | 5.34 | 3.52
C9 | 88.42 | 5.05 | 1.34 | 0.80 | 0.94 | 3.45
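The layout of Table 4 corresponds to a simple tabulation of the rank at which the gold-standard segmentation appears in each ranked list of solutions. A minimal sketch of such a tabulation is given below; the function names and the toy data are hypothetical and are not taken from the AMP evaluation itself.

```python
from collections import Counter

BUCKETS = ["1st", "2nd", "3rd", "4th", "5th-10th", "not found"]


def bucket(ranked, gold):
    """Column of Table 4 into which a word form falls for one criterion."""
    try:
        position = ranked.index(gold) + 1
    except ValueError:
        return "not found"
    if position <= 4:
        return BUCKETS[position - 1]
    return "5th-10th" if position <= 10 else "not found"


def rank_distribution(ranked_by_word, gold_by_word):
    """Percentage of word forms per bucket, given ranked solutions and gold segmentations."""
    counts = Counter(bucket(ranked_by_word.get(word, []), gold)
                     for word, gold in gold_by_word.items())
    total = len(gold_by_word)
    return {name: 100.0 * counts[name] / total for name in BUCKETS}


# Toy, invented data: ranked solutions of one criterion for four word forms.
ranked_by_word = {
    "LOGOS": [("LOG", "OS"), ("LOGO", "S")],
    "NOMOU": [("NOM", "OU"), ("NOMO", "U")],
    "DWRON": [("DWRO", "N"), ("DWR", "ON")],
    "LOGON": [("LOGO", "N")],
}
gold_by_word = {
    "LOGOS": ("LOG", "OS"),
    "NOMOU": ("NOM", "OU"),
    "DWRON": ("DWR", "ON"),
    "LOGON": ("LOG", "ON"),
}
distribution = rank_distribution(ranked_by_word, gold_by_word)
print(distribution)
```

Each row of Table 4 would then be the distribution obtained with one criterion, and the values in a row sum to 100 up to rounding.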


Lower-performing criteria: In general, criteria C1-C3 tend to have the correct segmentation ranked second-best or lower. More specifically, criterion C2 gives the poorest results, since less than 6% of the top-ranked AMP segmentations coincide with the gold-standard ones. Criteria C1 and C3 also result in relatively low segmentation accuracy, with only 21% and 14% of word forms respectively being correctly segmented by AMP. Criterion C4 performs slightly better, providing an accuracy equal to 28%. The overall low segmentation accuracy (less than 30%) of the first four criteria, which rely solely on maximum frequency, is to be expected, since they implicitly presuppose a high frequency-of-occurrence for the majority of both stems and endings. The entropy-based criterion C7 has the highest accuracy within this group, retrieving the gold-standard segmentation as the top-ranking solution for approximately 40% of the word forms, yet it is not considered satisfactory. Its poor performance may be attributed to the fact that it maximises the significance of frequency-of-occurrence, while it cannot adequately capture the role of stem-ending interactions in word formation.

Higher-performing criteria: Criteria C5, C6 and C8 are broadly equivalent performance-wise, with a segmentation accuracy ranging between 67% and 69% with respect to the gold-standard. It is noteworthy that this group includes C6, which provided the highest segmentation accuracy (approximately 95% for several corpora) for MG (Tambouratzis & Carayannis 2001). Yet this criterion correctly determines only about 69% of segmentations in the current experiments, due to the much higher complexity of segmenting AG. Criteria C5 and C6 manage to achieve a much higher accuracy than the lower-performing group, because they take into consideration a characteristic property of endings, i.e. their tendency to be short. Furthermore, since shorter endings are characteristic of nominals, which are highly frequent within texts, these two criteria succeed in correctly segmenting the majority of word forms. The high segmentation accuracy of criterion C8 (> 65%) can be ascribed to the fact that it appropriately combines the actual frequencies of both stems and endings.

The best system performance, approaching a level that allows its utilisation in practical applications, is observed when using criterion C9. In this case, for 88.4% of word forms, the top-ranking AMP solution coincides with the gold-standard segmentation. C9 performs better with respect to verbal word forms. Though their frequency-of-occurrence is, as a rule, much lower than that of nominals, it still affects the overall segmentation accuracy. Hence the fact that C9 manages to accurately segment verbals leads to a superior overall system performance.

The four higher-performing criteria (C5, C6, C8 and C9) have been investigated more thoroughly, to determine their comparative strengths and weaknesses. More specifically, the accuracy with which nominal and verbal word forms are segmented has been studied, to reveal potential category-specific problems. The segmentation accuracy of nominals is very similar for all four higher-performing criteria, ranging from 91.0% to 94.5%, when comparing the top-ranked AMP segmentation to the gold-standard one. By contrast, the segmentation accuracy for verbal word forms varies substantially: for criteria C5, C6 and C8 it is lower than 33%, while in the case of C9 it exceeds 78%. Comparable results are obtained when studying the texts of the other orators from the experimental corpus.

4.4 Investigating the number of segmentations per word form

As noted before (Section 3.3), the activation of interaction rules, when processing an AG text corpus, augments the number of potential segmentations considerably; thus it renders the task of determining the most likely segmentation substantially harder than in MG. To illustrate this point, two large collections of texts, one in Modern Greek (a total of 43,150 distinct word forms) and one in Ancient Greek (29,228 distinct word forms), have been processed by AMP, the results in terms of the number of possible segmentations per word being shown in the histogram of Figure 3. It was found that the maximum number of candidate segmentations for a word form reaches 8 for MG, while for AG it is roughly 3.4 times higher, at 27 segmentations. Likewise, the average number of candidate segmentations per word form is 4.41 in MG compared to 8.87 for AG. These comparative results indicate the substantially increased complexity of segmenting AG word forms in comparison to MG.

Figure 3. Histogram illustrating the number of segmentations per word form for texts in MG and AG

[Histogram: percentage of word forms (vertical axis, 0.0 to 25.0) against the number of segmentations per word (horizontal axis, 1 to 27); series: Modern Greek, Ancient Greek]

5. Future directions

The work presented here is ongoing. Future directions involve investigating the ranking criteria further, in an effort to improve the segmentation quality. One specific direction involves tuning the different parameters in an automated manner, in order to ensure system optimisation. For this task, artificial intelligence techniques such as genetic algorithms appear to be prime candidates for improving the segmentation accuracy, using a training corpus to define the optimal parameter values. A further improvement in the segmentation accuracy may be achieved by appropriately combining several of the most promising criteria. The present work has reported on the successful application of the system to a well-defined, widely-used dialect of Ancient Greek for which a wealth of texts survive. Building on this, it is planned to study other synchronies of the Greek language in the future.

6. Conclusion

In this article, a system for the morphological analysis of Ancient Greek texts has been presented. The system is based on AMP, which employs a masking-and-matching approach and was originally designed and optimised for Modern Greek. Since the underlying AG model is substantially more complicated than that of MG, several modifications have proved essential, in order to achieve a high segmentation accuracy. The major modification involves explicitly modelling the stem-ending interactions, which occur frequently in Ancient Greek. The introduction of these interactions implies the provision to the AMP system of a limited amount of readily retrieved linguistic knowledge.

The number of potential segmentations per word form in AG has been found to be on average twice as high as that observed for MG. In order to ensure a high segmentation accuracy in this task of higher complexity, criteria based on information theory, such as entropy, as well as on statistical distributions have been evaluated in comparison to the criteria employed for MG. This effort has been supported by the creation of a gold-standard of word form segmentations, since, in contrast to the former MG implementation, no electronic access to a complete AG morphological lexicon could be secured to support the necessary evaluations.

The results obtained indicate that a segmentation accuracy of approximately 88% can be consistently achieved. It is believed that the system accuracy could be further improved by studying the use of criteria in more detail and fine-tuning the amount of linguistic knowledge provided to the system. In this respect, the presented system represents only a first stage; additional mechanisms, possibly integrating human input, could achieve an even higher accuracy and are the subject of future work.


References

Creutz, M. (2003) “Unsupervised Segmentation of Words Using Prior Distributions of Morph Length and Frequency”. Proceedings of ACL-03, Sapporo, Japan, 280-287.

Creutz, M. & K. Lagus (2002) “Unsupervised Discovery of Morphemes”. Proceedings of the Workshop on Morphological and Phonological Learning of ACL-02, Philadelphia, PA, 21-30.

Duda, R. O., P. E. Hart & D. G. Stork (2001) Pattern Classification (2nd edition). New York: Wiley Interscience.

Goldsmith, J. (2001) “Unsupervised Learning of the Morphology of a Natural Language”. Computational Linguistics 27 (2), 153-198.

Goldsmith, J. (2006) “An Algorithm for the Unsupervised Learning of Morphology”. Natural Language Engineering 12 (4), 353-371.

Kazakov, D. & S. Manandhar (1998) “A Hybrid Approach to Word Segmentation”. Proceedings of Inductive Logic Programming-98, Lecture Notes in Computer Science. Springer-Verlag, Vol. 1446, 125-134.

Koskenniemi, K. (1983) “Two-Level Morphology: A General Computational Model for Word-form Recognition and Production”. Technical Document No 11. University of Helsinki, Helsinki.

Neuvel, S. & S. Fulop (2002) “Unsupervised Learning of Morphology without Morphemes”. Proceedings of the 6th Workshop of the ACL Special Interest Group in Computational Phonology (SIGPHON), Morphological and Phonological Learning. Philadelphia, PA, 31-40.

Porter, M. (1980) “An Algorithm for Suffix Stripping”. Program 14 (3), 130-137.

Schone, P. & D. Jurafsky (2000) “Knowledge-Free Induction of Morphology Using Latent Semantic Analysis”. Proceedings of the Computational Natural Language Learning Conference, Lisbon, Portugal, 67-72.

Schone, P. & D. Jurafsky (2001) “Knowledge-Free Induction of Inflectional Grammars”. Proceedings of the 2nd Meeting of the North American Chapter of the ACL. Association for Computational Linguistics, Morgan-Kaufmann, 183-191.

Sgarbas, K., N. Fakotakis & G. Kokkinakis (1995) “A PC-KIMMO-Based Morphological Description of Modern Greek”. Literary and Linguistic Computing 10 (3), 189-201.

Smyth, H. W. (1956) A Greek Grammar for Colleges. Harvard: Harvard University Press.

Σταματάκος, Ι. (1949) Ιστορική Γραμματική της Αρχαίας Ελληνικής. Αθήνα.

Tambouratzis, G. & G. Carayannis (2001) “Automatic Corpora-based Stemming in Greek”. Literary and Linguistic Computing 16 (4), 445-466.

Τζάρτζανος, Α. (1960) Γραμματική της Αρχαίας Ελληνικής Γλώσσης. Αθήνα: Οργανισμός Εκδόσεως Διδακτικών Βιβλίων.
