A Hybrid Approach for Indexing and Retrieval of Archaeological Textual Information

Ammar Halabi*1, Ahmed-Derar Islim*2, Mohamed-Zakaria Kurdi*†3

* Department of Artificial Intelligence, Faculty of Informatics, University of Aleppo, Aleppo, Syria
† Department of Computer Science, Faculty of Engineering and Technology, Mamoun University of Science and Technology, Aleppo, Syria

1 [email protected]
2 [email protected]
3 [email protected]
Abstract. This paper focuses on the problem of archaeological textual information retrieval, covering various field-related topics and investigating issues related to the special characteristics of Arabic. The suggested hybrid retrieval approach employs various clustering and classification methods that enhance both retrieval and presentation, and it infers further information from the results returned by a primary retrieval engine, which, in turn, uses Latent Semantic Analysis (LSA) as its primary retrieval method. In addition, a stemmer for Arabic words was designed and implemented to facilitate the indexing process and to enhance the quality of retrieval. The performance of our module was measured through experiments on standard datasets, where the system showed promising results with many possibilities for future research and further development.

Keywords: Information Retrieval, Arabic Information Retrieval, Arabic Stemming, Arabic Lexical Analysis, Latent Semantic Analysis, Automatic Document Categorization.
1 Introduction
Today, with new archaeological sites being discovered and interesting findings being unearthed in already established ones, every mission places a growing amount of archaeological information at the community's disposal at the end of a successful season. In principle, archaeologists and researchers in associated disciplines should be able to access this information in a convenient and consistent manner, to effectively retrieve material in support of their own research, and to conduct collaborative research, via information exchange, with other researchers in their community or in other research communities. The needs of information recording, organization, acquisition, and dissemination in the archaeological community suggest interesting possibilities for the adoption of computer-based information systems. We started our research with the goal of designing and implementing a robust archaeological information retrieval system capable of indexing and searching cross-language as well as cross-media corpora. This information retrieval system is a basic need and a critical component of a complete archaeological information system, and can be considered the starting point for developing such a system [10]. In this paper, we tackle the problem of the retrieval of textual information in archaeology. We introduce a background on textual information retrieval systems,
followed by a proposed architecture for the Archaeological Textual Information Retrieval System, and we end by presenting our results and conclusions. The current application covers the Arabic language only, but the adopted design makes it relatively easy to add new languages: only a lightweight stemmer needs to be added per language.
2 State of the Art
2.1 Document similarity calculus

Latent Semantic Analysis (LSA) [5] is a technique used in statistics and natural language processing to find hidden, or latent, relations between a set of observations and a set of associated features. This is done by mapping features and observations onto an intermediate concept space which preserves only the most significant characteristics. In this new representation, relations between features and observations that were previously unobvious can be revealed. In the context of natural language, features are represented by terms, and observations are represented by documents. LSA can be applied to document retrieval by projecting user queries and indexed documents onto the concept space to uncover the relation between the user's needs and the documents in the corpus. The application of LSA in textual information retrieval is known as Latent Semantic Indexing (LSI). To find a lower-rank approximation of the 2-D term-document matrix, LSA makes use of the reduced Singular Value Decomposition (SVD), a matrix factorization tool used in signal processing and statistics [1]. Using SVD, the term-document matrix X (where row vectors represent terms and column vectors represent documents) is written as the product of three matrices, U, Σ, and V^T, where U and V are orthonormal and Σ is diagonal, as shown in Fig. 1.
Fig. 1. SVD of the term-document matrix (X)
As in Fig. 1, U holds the eigenvectors of the matrix XX^T, V holds the eigenvectors of the matrix X^TX, and Σ is a diagonal matrix whose diagonal is formed by the square roots of the eigenvalues of XX^T (or, equally, of X^TX):

X = U Σ V^T

The diagonal values σ_1 ≥ σ_2 ≥ … of Σ are the singular values of X, and the columns of U and V are the left and right singular vectors, respectively. The LSA concept space is extracted by keeping only the k largest singular values and the corresponding k left and right singular vectors. These k singular values
represent the new reduced concept space, where the left and right singular vectors provide the means to transform to and from this space. After computing the SVD of the term-document matrix, relations between terms, between documents, or between terms and documents can be revealed. Moreover, user query vectors can be projected onto the concept space to retrieve the set of documents closest to the user's needs. LSI can be applied to document indexing and retrieval, including cross-lingual corpora, and to modelling the process of human learning and text comprehension [13], [5]. LSI is also reported to outperform the Vector Space retrieval model, in which comparisons between query vectors and document vectors are done in the original space [14].

2.2 Automatic document clustering

One approach to textual document retrieval is automatic clustering. Clustering draws on a wide variety of probabilistic and statistical machine learning methods that aim to learn about classes of texts written in natural languages. Document clustering can be defined as the partitioning of a dataset of documents into subsets (clusters), where all documents in each subset share some common traits expressed by a certain distance measure. Clustering is an unsupervised learning method, in which no prior information about potential similarities between documents is used in the learning process [2]. Clustering can be used to organize documents into separate divisions so they can be browsed and processed more easily by humans, and it may enhance the quality of presentation of retrieval results by grouping semantically similar documents. Partitioning clustering methods, such as the k-means algorithm [21], attempt to cluster a given set of vector objects into k clusters by iteratively minimizing the total intra-cluster variance, i.e. the squared error function.
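The partitioning step just described can be sketched in a few lines. The following minimal k-means implementation is illustrative (the toy document vectors and seed centroids are invented); the initial centroids are passed in explicitly, so they could just as well come from a cut of an HAC dendrogram:

```python
import numpy as np

def kmeans(docs, centroids, iters=20):
    """Minimal k-means: docs is (n, d); centroids (k, d) are initial seeds."""
    labels = np.zeros(len(docs), dtype=int)
    for _ in range(iters):
        # Assignment step: nearest centroid under squared Euclidean distance.
        dists = ((docs[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid becomes the mean of its members.
        for j in range(len(centroids)):
            members = docs[labels == j]
            if len(members):
                centroids[j] = members.mean(axis=0)
    return labels, centroids

# Two obvious groups of toy 2-D "document vectors" (illustrative data).
docs = np.array([[0.0, 0.1], [0.2, 0.0], [5.0, 5.1], [5.2, 4.9]])
seeds = np.array([[0.0, 0.0], [5.0, 5.0]])
labels, _ = kmeans(docs, seeds.copy())
print(labels)  # [0 0 1 1]
```

Each iteration alternates the assignment step (which minimizes the squared error for fixed centroids) with the update step (which minimizes it for fixed assignments), so the total intra-cluster variance never increases.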
On the other hand, hierarchical clustering methods such as Hierarchical Agglomerative Clustering (HAC) [21] build up or break down a hierarchy of clusters (a dendrogram). Cutting the tree at a given height gives a clustering at a selected precision. Clustering in information retrieval can be used for category browsing, especially when an automatic label extraction method is applied to label the resulting document clusters [11], [28].

2.3 Automatic document classification

Document classification [21] is a supervised machine learning method which assigns documents to pre-defined labels (categories). This technique is used in pattern recognition and data analysis. It works by constructing a classifier which learns a model from a training dataset composed of documents along with their corresponding categories. This dataset has to contain enough information for the classifier model to be effective at predicting the classes of new documents. Reference [25] provides a review of classification techniques. By classifying user queries with a classifier trained on a categorized or clustered textual document dataset, we can achieve document retrieval: the category the classifier assigns to a query contains the documents related to that query.
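To make the query-classification idea concrete, here is a minimal multinomial Naive Bayes sketch with Laplace smoothing; the cluster labels and token lists are invented for illustration and are not from any dataset used in the paper:

```python
import math
from collections import Counter

def train_nb(clusters):
    """clusters: {label: [token lists]}. Returns, per class, the log prior,
    Laplace-smoothed per-token log-probabilities, and an unseen-token fallback."""
    vocab = {t for docs in clusters.values() for d in docs for t in d}
    total_docs = sum(len(docs) for docs in clusters.values())
    model = {}
    for label, docs in clusters.items():
        counts = Counter(t for d in docs for t in d)
        denom = sum(counts.values()) + len(vocab)
        model[label] = (
            math.log(len(docs) / total_docs),
            {t: math.log((counts[t] + 1) / denom) for t in vocab},
            math.log(1 / denom),  # fallback for tokens outside the vocabulary
        )
    return model

def classify(model, query_tokens):
    def score(entry):
        prior, logp, unseen = entry
        return prior + sum(logp.get(t, unseen) for t in query_tokens)
    return max(model, key=lambda label: score(model[label]))

# Hypothetical clusters, e.g. as produced by a clustering stage.
clusters = {
    "pottery": [["clay", "kiln", "shard"], ["clay", "glaze"]],
    "tablets": [["cuneiform", "clay", "tablet"], ["tablet", "script"]],
}
model = train_nb(clusters)
print(classify(model, ["cuneiform", "tablet"]))  # tablets
```

The classifier returns the category whose documents best explain the query tokens, which is exactly the retrieval-by-classification scheme described above.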
3 The Archaeological Text Retrieval System
In Fig. 2, we illustrate our proposed hybrid architecture for The Archaeological Text Retrieval System.
This architecture was primarily based on the general architecture and conventions used in text retrieval systems, with an additional layer of functionality inspired by the work of Sahami [22].
Fig. 2. Generic architecture of IR systems; the crescent refers to offline operations, and lightning bolts refer to online operations
For illustrative purposes, the architecture can be divided into two main blocks. The first block (1) is modelled after the general architecture of text retrieval systems and provides primary retrieval of documents. This is achieved by:
- Document pre-processing, which includes parsing files of different formats, term stopping, and stemming.
- Construction of an inverted index.
- A primary retrieval engine based on the well-defined LSI retrieval model.
The second block (2) adds extra levels of functionality in order to refine the results of primary retrieval (the set of retrieved documents) obtained from the first block, and to improve the quality of result presentation: documents are automatically clustered and classified into different categories, giving the user a better perception of the retrieval result than a plain list of retrieved documents. This is achieved by:
- Feature selection for primarily retrieved documents.
- Clustering of primarily retrieved documents.
- Classification of new documents into one of the learned categories.
By this architecture, we seek to improve the quality of retrieval and of result presentation in the overall system. In other words, the features provided by the Archaeological Text Retrieval subsystem are:
- Primary, semi-semantic text retrieval.
- Enhanced retrieval results, automatically refined using clustering and classification techniques.
The problem of archaeological text retrieval is similar to the problem of general text retrieval, where only minimal modifications of a general text retrieval system are required to adapt it for retrieval in a specific field, i.e. special stop lists and additional stemming rules. Therefore, we approached the problem from a general point of view, ensuring that the selected methods can be extended to build a cross-media retrieval system.

3.1 Document Pre-processing

Concerning Arabic in particular, we compiled our own stop-word list, which includes functional vocabulary such as prepositions, adverbs, pronouns, and others [Appendix A]. We also implemented a light stemming algorithm which strips frequently used prefixes and suffixes of Arabic words [Appendix C]. Light stemming of Arabic terms is reported to contribute to the effectiveness of retrieval better than root normalization, which adopts a more aggressive stemming approach by reducing words to their roots [15], [16]. Our algorithm is similar in concept to the stemming algorithms described in [16] and [4]; however, we suggested and implemented a different set of stemming rules, making more use of the knowledge of Arabic morphology [Appendix B, C].

3.2 Indexing

In this operation, the index is constructed and the term-document matrix is built, which serves as an abstract representation of documents for the retrieval model to act upon.

3.3 Primary Text Retrieval

Latent Semantic Indexing (LSI) was chosen as the primary retrieval model for the following reasons:
- LSI generally outperforms the vector space model and provides solutions to the problem of synonymy [3], [5].
- An LSI-based retrieval engine can be extended to achieve cross-language information retrieval [19], [20].
- LSI uses the reduced SVD decomposition to project document and query vectors onto a new dimensionally-reduced space. This new representation of documents can also be used for document clustering to refine primary search results [17], [24].
This LSA-based retrieval engine provides primary retrieval of relevant documents, to be further refined by subsequent processing of the retrieval result.

3.4 Document Clustering

Both the HAC and k-means clustering algorithms were used for document clustering, with the output of HAC used to seed k-means. By this approach, after documents are grouped under different levels of hierarchy using HAC, one level of the output hierarchy is used to seed k-means, which refines the clustering at the given level by making useful re-assignments of documents to clusters. In addition, using HAC to seed k-means can yield improvements in performance, since k-means will potentially converge faster than with randomized document seeding [22]. Upon clustering, cluster descriptors are extracted, which are the most representative
terms of the documents contained in the corresponding cluster. They effectively assist users in understanding the categories of clustered retrieval results. These descriptors are extracted using the Probabilistic Odds method [22].

3.5 Document Classification

A Naïve Bayesian classifier [18] was employed for the classification of user queries and new documents. Naïve Bayesian classifiers have yielded good results in text classification [6], [22], and they have been applied successfully in practice [23]. After primary retrieval results are clustered, the resulting clusters serve as a training set for the classifier, so that new documents or queries presented to the classifier while the system is online are classified as belonging to one of the learned classes. This allows users to accurately determine the cluster most relevant to their search queries or example documents.
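As a sketch of the LSI machinery behind the primary engine (Section 3.3), the following toy example builds the reduced SVD space and folds a query into it with the standard folding-in formula q̂ = Σ_k⁻¹ U_kᵀ q; the matrix values are invented, and this is not the authors' implementation:

```python
import numpy as np

# Toy term-document matrix (rows = terms, columns = documents); the
# values and sizes are illustrative, not from the paper's datasets.
X = np.array([
    [2.0, 0.0, 1.0, 0.0],
    [1.0, 1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 2.0],
    [0.0, 0.0, 1.0, 1.0],
    [1.0, 0.0, 2.0, 0.0],
])

U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2                                  # size of the reduced concept space
Uk, sk, Vtk = U[:, :k], s[:k], Vt[:k, :]

# Document coordinates in the concept space: rows of V_k.
doc_vecs = Vtk.T                       # shape (n_docs, k)

# Fold a query (a term-count vector) into the same space:
# q_hat = Sigma_k^-1 U_k^T q (the standard LSI folding-in formula).
q = X[:, 0].copy()                     # query identical to document 0
q_hat = np.diag(1.0 / sk) @ Uk.T @ q

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

# Rank documents by cosine similarity to the query in concept space.
ranking = sorted(range(doc_vecs.shape[0]),
                 key=lambda i: cos(doc_vecs[i], q_hat), reverse=True)
print(ranking[0])  # 0 -- the query matches document 0 best
```

Because the query here equals document 0's term vector, folding it in reproduces that document's concept-space coordinates exactly, so it ranks first; real queries land near semantically related documents even without shared terms.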
4 Results
By using our light Arabic stemmer [Appendix] and refining primarily retrieved documents, we obtained promising results, where these techniques proved efficient and effective in Arabic textual information retrieval. Fig. 3 compares the effect of our stemming technique to that of the light10 stemmer [16] on the performance of document clustering using the k-means algorithm, with document features projected onto the reduced SVD space. This experiment was performed using the Sulaiti dataset [27], which was assembled from newspapers, magazines, radio, TV, and webpages, summing to 411 texts manually classified under 8 different categories. The quality of document clustering was measured using the F-measure:

F = (2 × precision × recall) / (precision + recall)
Fig. 3. Document clustering performance as a function of (k)
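As a minimal illustration, the F-measure is just the harmonic mean of precision and recall (the input values below are arbitrary, not figures from the paper's experiments):

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (the F1 score)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(round(f_measure(0.8, 0.6), 4))  # 0.6857
```

The harmonic mean penalizes imbalance: a clustering with high precision but poor recall (or vice versa) scores low, which makes the F-measure a stricter summary than the arithmetic mean.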
These results show that our stemmer matches the light10 stemmer in effectiveness with regard to document clustering. Moreover, by looking at Table 1, it is shown by
stemming the documents of three datasets that our proposed stemmer does a better job of reducing the number of features (terms) to be used in indexing and retrieval. This considerably improves the performance of computationally demanding retrieval systems when stemming is applied. The datasets used include the Sulaiti dataset [27], in addition to two datasets we assembled from researchers in archaeology and from an online resource for ancient Arabic texts.

Dataset     No Stemmer   Light10   New Stemmer
Sulaiti     108209       44329     31198
Hammadeh    22921        10197     7980
Mashkat     656807       222478    135369

Table 1. Number of Terms Extracted from Arabic Datasets Using Different Stemmers
Regarding the performance of our proposed retrieval model, we were short of free Arabic datasets tailored for evaluating retrieval engines. Therefore, we were not able to conduct numerical experiments to measure the quality of our Arabic retrieval engine. However, empirical results and user reactions indicated that the retrieval results of the primary LSI engine on indexed Arabic corpora were good. In addition, the subsequent clustering and classification operations successfully improved the quality of presentation of the primarily retrieved group of documents, and assisted users in reaching the required information more rapidly.
5 Conclusions
In this paper, we have presented our work in Arabic information retrieval and described the architecture of our Archaeological Text Retrieval System. This architecture follows the generic architecture of retrieval systems, with an additional layer of functionality that improves the presentation of retrieval results. Our results showed that our stemming algorithm is highly effective, and that statistical and probabilistic methods for retrieval and language modeling, such as LSI, automatic document clustering, and classification, are effective for Arabic textual information.
References

1. G. Akritas and G. I. Malaschonok, “Applications of Singular-Value Decomposition,” Mathematics and Computers in Simulation, vol. 67, issue 1-2, 2004, pp. 15-31.
2. P. Berkhin, “Survey of clustering data mining techniques,” Tech. Rep., Accrue Software, San Jose, CA, 2002.
3. M. W. Berry, S. T. Dumais, and G. W. O'Brien, “Using Linear Algebra for Intelligent Information Retrieval,” SIAM Review, vol. 37, issue 4, 1995, pp. 573-595.
4. Chen and F. Gey, “Building an Arabic Stemmer for Information Retrieval,” in Proc. Eleventh Text Retrieval Conference TREC'02, Gaithersburg, Maryland, USA, 2002, pp. 19-22.
5. S. Deerwester, S. Dumais, G. W. Furnas, T. K. Landauer, and R. Harshman, “Indexing by Latent Semantic Analysis,” Journal of the Society for Information Science, vol. 41, issue 6, 1990, pp. 391-407.
6. S. Dumais, J. Platt, D. Heckerman, and M. Sahami, “Inductive learning algorithms and representations for text categorization,” in Proc. 7th ACM International Conference on Information and Knowledge Management ACM-CIKM'98, Bethesda, USA, 1998, pp. 148-155.
7. Fox, “Lexical Analysis and Stoplists,” in Information Retrieval: Data Structures and Algorithms, W. B. Frakes and R. Baeza-Yates, Ed. Prentice Hall, 1992.
8. W. B. Frakes, “Stemming Algorithms,” in Information Retrieval: Data Structures and Algorithms, W. B. Frakes and R. Baeza-Yates, Ed. Prentice Hall, 1992.
9. B. Frakes and R. Baeza-Yates, Information Retrieval: Data Structures and Algorithms. Prentice Hall, 1992.
10. Halabi, A. D. Islim, R. Keshishian, and O. Rehawi, “The Archaeological Text Retrieval System,” BSc. thesis, Dept. Artificial Intelligence, Faculty of Informatics, University of Aleppo, 2007.
11. M. A. Hearst and J. O. Pedersen, “Reexamining the Cluster Hypothesis: Scatter/Gather on Retrieval Results,” in Proc. 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'96), Zurich, Switzerland, June 1996, pp. 76-84.
12. Hull, “Stemming algorithms – A case study for detailed evaluation,” Journal of the American Society for Information Science, vol. 47, issue 1, 1996, pp. 70-84.
13. T. K. Landauer, P. W. Foltz, and D. Laham, “Introduction to Latent Semantic Analysis,” Discourse Processes, vol. 25, 1998, pp. 259-284.
14. T. K. Landauer and M. L. Littman, “A statistical method for language-independent representation of the topical content of text segments,” in Proc. Eleventh International Conference: Expert Systems and Their Applications, Avignon, France, May 1991, vol. 8, pp. 77-85.
15. L. Larkey, L. Ballesteros, and M. E. Connell, “Improving stemming for Arabic information retrieval: light stemming and co-occurrence analysis,” in Proc. 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Tampere, Finland, 2002, pp. 275-282.
16. L. Larkey, L. Ballesteros, and M. Connell, “Light Stemming for Arabic Information Retrieval,” in Arabic Computational Morphology: Knowledge-based and Empirical Methods, A. Soudi, A. van den Bosch, and G. Neumann, Ed. Kluwer/Springer's series on Text, Speech, and Language Technology, 2005.
17. K. Lerman, “Document Clustering in Reduced Dimension Vector Space,” unpublished, available at http://www.isi.edu/~lerman/papers/papers.html (retrieved on 13-08-2007), 1999.
18. Lewis, “Naive Bayes at forty: The independence assumption in information retrieval,” in Proc. European Conference on Machine Learning ECML'98, Heidelberg, DE, 1998, pp. 4-15.
19. M. L. Littman, S. T. Dumais, and T. K. Landauer, “Automatic cross-language information retrieval using latent semantic indexing,” in Cross-Language Information Retrieval, G. Grefenstette, Ed. Kluwer Academic Publishers, 1998, pp. 51-62.
20. M. L. Littman and F. Jiang, “A Comparison of Two Corpus-Based Methods for Translingual Information Retrieval,” Tech. Rep. CS-98-11, Duke University, Department of Computer Science, Durham, NC, Jun. 1998.
21. C. D. Manning, P. Raghavan, and H. Schütze. (2007, August 13). Introduction to Information Retrieval [Online], Cambridge University Press. Available: http://www-csli.stanford.edu/~schuetze/information-retrieval-book.html.
22. M. Sahami, “Using Machine Learning to Improve Information Access,” Ph.D. thesis, Dept. Computer Science, Stanford University, 1999.
23. M. Sahami, S. Dumais, D. Heckerman, and E. Horvitz, “A Bayesian Approach to Filtering Junk E-mail,” in Proc. AAAI-98 Workshop on Learning for Text Categorization, Madison, Wisconsin, USA, 1998, pp. 55-62.
24. H. Schutze and C. Silverstein, “Projections for efficient document clustering,” in Proc. 20th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Philadelphia, Pennsylvania, USA, 1997, pp. 74-81.
25. Sebastiani, “Machine Learning in Automated Text Categorization,” ACM Computing Surveys, vol. 34, no. 1, 2002, pp. 1-47.
26. Al-Sughaiyer and I. Al-Kharashi, “Arabic morphological analysis techniques: A comprehensive survey,” Journal of the American Society for Information Science and Technology, vol. 55, issue 3, 2004, pp. 189-213.
27. L. Al-Sulaiti and E. Atwell, “Designing and Developing a Corpus of Contemporary Arabic,” in Proc. Sixth TALC Conference, Granada, Spain, 2004, pp. 92.
28. Toda and K. Ryoji, “A search result clustering method using informatively named entities,” in Proc. Annual ACM International Workshop on Web Information and Data Management, Bremen, Germany, 2005, ACM Press, pp. 81-8.
Appendix

6 Background on Arabic Morphology
Arabic is a highly inflected language with a very rich vocabulary and morphology. In Classical Arabic, roots represent the origin of the vocabulary. Roots are combinations of three, four, or even five letters drawn from the Arabic alphabet. By taking a root and applying Arabic morphological rules, one can derive a vast number of valid Arabic words, where the meaning, tense, number, gender, and grammatical case of each derived word are determined by its original root and by its pattern, which denotes how the original root was inflected to derive the stem, in addition to the prefixes and suffixes added to the stem to form the given word, as shown in Fig. 4. Therefore, the derivation process may not only add letters to the beginning and the end of the root, but also add infixes and diacritics (short vowels or accents).
Fig. 4 Arabic derivational system; source [26]
For example, from the abstract root “درس” (drs), the verb “دَرَسَ” (darasa) can be derived, which has the meaning of the verb “studied” in its past tense. Another valid derivation leads to another word, “دَرْساً” (darsan), which corresponds to the noun “lesson” in its accusative grammatical case. Also, the word “المُدَرّسان” (almoudarrisaan) means “the two teachers”, where the prefix “الـ” (al) serves as the definite article “the” and the suffix “ان” serves as a number indicator of “two”. Normally, a group of words derived from a single root share a common meaning, as is the case in the example above, where all the derived words bore meanings related to the concept of “studying”. However, many words derived from the same root do not share an obvious common meaning. For example, taking the root “قلب” (Klb), the word “قَلْبٌ” (Kalbon) corresponds to the noun “heart” in its nominative case, whereas the word “قَلَبَ” (Kalaba) corresponds to the verb “turned something upside down” in its past tense. In addition to the morphological rules used to derive valid words, there are non-standard solid words (الأفعال والأسماء الجامدة), where a valid Arabic word is not derived from a valid original root, like the solid noun “انسان” (insaan), which means “human”, and the solid verb “نعم” (na'am), which means “yes”. Moreover, irregularities can be found in Modern Arabic texts due to modern advances in different fields and the emergence of many new words and technical terms, which led to the transliteration of these terms into Arabic; e.g. “Mechanics” is transliterated as “ميكانيك” (mikaneek).
These characteristics of Arabic morphology pose significant challenges for computational linguistics, where the stem of a given Arabic word cannot be obtained by merely stripping prefixes and suffixes. This problem of Arabic morphological analysis, however, has been approached with different methods and at different levels, influenced by the purpose of analysis. For a comprehensive background and survey of different Arabic morphological analysis techniques, please refer to [26].
7 The Arabic Stop-word List
For the construction of our stop-word list, we determined the categories of functional words in the Arabic language.

Table 2. Categories of Arabic Functional Words with Examples

Category (English)            Category (Arabic)                  Examples
Relative pronouns             الأسماء الموصولة                   الذي، التي
Demonstrative pronouns        أسماء الإشارة                      هذا، أولئك
Prepositions                  حروف الجر                          في، مع
Particles resembling verbs    الحروف المشبهة بالفعل              لكن، لعل
Modal verbs                   الأفعال الناقصة                    كان، صار
Interrogative pronouns        أسماء الاستفهام                    كيف، متى
Adverbs of time and place     ظروف الزمان والمكان                قبل، فوق
Prohibitive particles         أدوات النهي                        لا
Negation particles            أدوات النفي                        لم
Pronouns                      الضمائر المنفصلة                   أنا، أنتن
Conjunction particles         أدوات الوصل (العطف، الاستئناف)     و، حيث
Exceptive particles           أدوات الاستثناء                    إلا، عدا
Conditional particles         أدوات الشرط                        إذا، إن
The stop-word list was generated by taking words from these categories of functional words (nouns, verbs, and particles) and adding possible prefixes and suffixes indicating article, number, gender, preposition, interrogation, and conjunction. The initial stop-word list contained 700 words; after the addition of possible prefixes and suffixes, the final list contained 6,700 words.
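The expansion step can be sketched as follows; the base words and prefix set below are a tiny hypothetical sample, not the actual 700-word list or the full affix inventory:

```python
# Hypothetical sketch: expand a small sample of base Arabic functional
# words by attaching common single-letter prefixes, analogous to the
# paper's expansion of 700 base words into 6,700 surface forms.
base_words = ["هذا", "الذي", "كان"]       # demonstrative, relative, modal
prefixes = ["", "و", "ف", "ب", "ل", "ك"]  # bare form + common prefixes

stopwords = {p + w for p in prefixes for w in base_words}
print(len(stopwords))  # 18 surface forms from 3 base words
```

Generating surface forms ahead of time lets the tokenizer filter stop-words with a plain set lookup, with no morphological analysis needed at indexing time.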
8 The Arabic Stemmer
Due to the complex morphology of Arabic, the formalization of a morphological analyser capable of uncovering affixes, derivation patterns, and roots of words is a challenging task [16]. Fortunately, experiments show that complex morphological analysis is not necessary to improve the performance of Arabic IR systems; a simple processing of words that effectively strips frequently used prefixes and suffixes is sufficient to normalize words that express similar meanings. This is known as light stemming, which can be considered a lightweight morphological analysis of Arabic words. Root normalization, on the other hand, normalizes Arabic words by performing a complete morphological analysis to find their original roots. Our light stemmer is conceptually similar to the light stemmers described in [4], [16]; however, our stemmer makes more use of the knowledge of Arabic morphology. The stemming rules applied to each word by our Arabic light stemmer are as follows:
1. Lightly normalize the word by removing the diacritics and the shadda character, changing the letters (أ، إ، آ) to (ا), changing the letter (ى) to (ي), and changing the final letter (ه) to (ة).
2. If the word starts with a definite article (الـ، للـ، بالـ، كالـ، والـ، فالـ، وللـ، فللـ، وبالـ، فبالـ، وكالـ، فكالـ), remove the definite article if at least two letters are left after removal.
3. If a definite article was removed in step 2, then the word is definitely a noun. Check whether this noun has a subject pronoun suffix (ة، ان، تان، ون، ات، ين، ي، ية، يان، يون، يات، يين). Remove the suffix, when found, if at least two letters are left after removal, and terminate the stemming process.
4. If no definite article was found in step 2, check for the existence of an object pronoun suffix (ني، نا، كما، كم، كن، ه، هما، هم، ها، هن). Remove the suffix, when found, if at least two letters are left after removal, then attempt to remove a subject suffix as in step 3, but without terminating the stemming process.
5. Check for the existence of a general prefix (و، وكـ، ولـ، وبـ، فكـ، فلـ، فبـ). Remove the prefix, when found, if at least two letters are left after removal.
6. Check for the existence of a verb prefix (سا، سنـ، سيـ، ستـ). Remove the prefix, when found, if at least two letters are left after removal, then terminate the stemming process.
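Under the assumption that a simplified affix inventory suffices for illustration, the rules above can be sketched in Python as follows; the affix lists are abbreviated relative to the full rules, so this is an illustrative sketch rather than the authors' exact implementation:

```python
# Abbreviated affix lists (the full lists appear in the numbered rules).
DIACRITICS = "\u064B\u064C\u064D\u064E\u064F\u0650\u0651\u0652"  # tanween, short vowels, shadda, sukun
DEFINITE = ("فبال", "وبال", "وكال", "فكال", "فلل", "ولل",
            "فال", "وال", "كال", "بال", "لل", "ال")
SUBJECT_SUFFIXES = ("يون", "يان", "تان", "ات", "ون", "ان", "ية", "ين", "ي", "ة")
OBJECT_SUFFIXES = ("هما", "كما", "هن", "ها", "هم", "كن", "كم", "نا", "ني", "ه")
GENERAL_PREFIXES = ("فب", "فل", "فك", "وب", "ول", "وك", "و")
VERB_PREFIXES = ("ست", "سي", "سن", "سا")

def strip_one(word, affixes, suffix=True):
    """Remove the first matching affix, keeping at least two letters."""
    for a in sorted(affixes, key=len, reverse=True):
        if suffix and word.endswith(a) and len(word) - len(a) >= 2:
            return word[:-len(a)], True
        if not suffix and word.startswith(a) and len(word) - len(a) >= 2:
            return word[len(a):], True
    return word, False

def light_stem(word):
    # Rule 1: normalization of diacritics and letter variants.
    word = "".join(c for c in word if c not in DIACRITICS)
    word = word.replace("أ", "ا").replace("إ", "ا").replace("آ", "ا")
    word = word.replace("ى", "ي")
    if word.endswith("ه"):
        word = word[:-1] + "ة"
    # Rules 2-3: definite article, then a subject-pronoun suffix.
    word, had_article = strip_one(word, DEFINITE, suffix=False)
    if had_article:
        word, _ = strip_one(word, SUBJECT_SUFFIXES)
        return word
    # Rule 4: object-pronoun suffix, then a subject suffix.
    word, _ = strip_one(word, OBJECT_SUFFIXES)
    word, _ = strip_one(word, SUBJECT_SUFFIXES)
    # Rules 5-6: general prefix, then verb prefix.
    word, _ = strip_one(word, GENERAL_PREFIXES, suffix=False)
    word, _ = strip_one(word, VERB_PREFIXES, suffix=False)
    return word

print(light_stem("المدرسان"))  # مدرس ("the two teachers" -> "teacher" stem)
```

Trying longer affixes before shorter ones prevents, for example, the bare conjunction "و" from being stripped when the word actually carries the compound article "وال", and the two-letter minimum guards against reducing a word past a plausible stem.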