Expert Systems with Applications 37 (2010) 8705–8710


Expert Systems with Applications journal homepage: www.elsevier.com/locate/eswa

Chat mining: Automatically determination of chat conversations' topic in Turkish text based chat mediums

Özcan Özyurt *, Cemal Köse

Karadeniz Technical University, Department of Computer Engineering, Faculty of Engineering, 61080 Trabzon, Turkey

Keywords: Chat mining; Topic detection; Chat conversations; Feature selection; Text classification

Abstract

Conversations in chat mediums often carry important information about the speakers, such as their tendencies, habits, attitudes, possible guilt, and intentions. Analysing and processing these conversations is therefore of considerable importance, and many social and semantic inferences can be drawn from them. Determining the subject of a conversation provides a basis for characterising and analysing it. In this study, chat mining is treated as an application of text mining, and the subjects of Turkish text based chat conversations are determined. The conversations are classified with supervised learning methods, using Naive Bayes, k-Nearest Neighbor and Support Vector Machine classifiers. Ninety-one percent success is achieved in determining the subject. © 2010 Elsevier Ltd. All rights reserved.

* Corresponding author. Tel.: +90 0462 8716922x8562; fax: +90 462 8717424. doi:10.1016/j.eswa.2010.06.053

1. Introduction

With the development of the internet, the computer has become an important means of communication, and chat conversations are widely used as text based communication tools. Chat mediums are used frequently by people of all ages. As their use spreads, social and semantic inferences drawn from chat mediums become more important every day (Haichao, Siu, & Yulan, 2006; Khan, Fisher, Shuler, Tianhao, & Pottenger, 2002; Kose & Ozyurt, 2006; Kose, Ozyurt, & Ikibas, 2008). It is therefore necessary to analyse these conversations and to understand the characteristics of the speakers. One of the most important factors in analysing chat conversations is determining the conversation topic (Haichao et al., 2006). The logs kept on the computer are an important data source for communication in chat mediums. By processing these files and applying data mining rules, basic characteristics of the speakers can be deduced (Bengel, Gauch, Mitter, & Vijayaraghavan, 2004; Bing, Xiaoli, Wee, & Philip, 2004; Haichao et al., 2006). Thus, useful information such as indications of guilt, tendencies of the speakers, and areas of interest can be obtained from conversations. With the aid of machine learning techniques, data mining and careful analysis of chat conversations, it will become possible in the near future to develop applications such as detecting planned terrorist attacks and analysing criminal involvement (Khan et al., 2002; Kolenda, Hansen, & Larsen, 2001; Elnahrawy, 2002; Haichao et al., 2006).

In text mining or chat mining applications, two approaches, unsupervised and supervised learning, are applied for data classification (Amasyali & Diri, 2006; Han & Kamber, 2006; Han, Karypis, & Kumar, 2001; Koppel, Argamon, & Shimoni, 2002). In unsupervised learning, the data are examined and items that resemble each other are gathered in one cluster, while dissimilar items fall into other clusters. This operation is called clustering. Because clustering has no preset classes, this kind of learning is called unsupervised. Learning with preset classes is called supervised learning, and its counterpart of clustering is classification. One of the biggest problems of the supervised approach is determining the classes precisely and accurately from training sets. Once the classes are determined, supervised approaches are easier and more effective than unsupervised ones. Unsupervised approaches, on the other hand, can recover all topics present in the text, but they are more difficult and complicated (Bingham, Kabán, & Girolami, 2003; Han & Kamber, 2006; Joachims, 1998; Kolenda et al., 2001).

In text mining applications, determining the conversation topic is an important area of study. Most studies in this area classify news texts. Others try to determine characteristics of the author of a text (Koppel et al., 2002; Amasyali & Diri, 2006). With the spread of chat conversations and these mediums' becoming an


accumulation of data, various studies have been conducted in this area. These studies focus on determining speakers' characteristics and genders and on determining the subject of the conversation, and they can be gathered under the name of chat mining. Different classification techniques have been applied: regression models, k-Nearest Neighbors classification (k-NN), Decision Trees, Bayesian probabilistic approaches, neural networks and Support Vector Machines (SVM) are widely used in chat mining applications (Bengel et al., 2004; Elnahrawy, 2002; Haichao et al., 2006; Kose et al., 2008).

Elnahrawy (2002) presented an offline topic categorization approach for analysing chat conversation logs related to criminal activities. Logs are first pre-processed by removing stop-words and then converted into term frequency weighted vectors. Categorization techniques including k-NN, Naive Bayes and linear SVM are then employed for topic classification. Bengel et al. (2004) also adopted a categorization approach for analysing chat messages from Internet Relay Chat. In their study, archived chat conversations are filtered on the basis of time, channel and speaker. The resulting collections of chat conversations are grouped into “sessions” for processing and categorization. Each session is pre-processed with stop-word removal and stemming, and then represented using the TFIDF weighting scheme for classification. Haichao et al. (2006) used an indicative term-based approach to determine the subject discussed in chat conversations, classifying the subjects with supervised approaches. Kose et al. (2008) studied the determination of speakers' genders as an example of information inference from chat conversations. Chat conversations were analysed, and a comparative method was used to determine the speakers' genders.
In this study, the structure of chat conversations was analysed, and characteristics of the conversations were determined with machine learning and data mining methods. The subject of the conversation was determined as an example of data inference from chat conversations. The main purpose of the study is to answer the question “Which subject are the speakers talking about?” The primary originality of our study is obtaining social and logical inferences from Turkish text based chat conversations. For this purpose, we concentrated on determining the subject of chat conversations with a term-based approach. Supervised learning techniques were used: a specific set of conversations served as the training set, and the subjects were determined beforehand. The classes were fixed in advance so as to cover the topics commonly discussed in a limited number of conversations. For the test set, each conversation under analysis was assigned to the preset subjects it covered. Naive Bayes, k-Nearest Neighbor and Support Vector Machine methods were used to classify the conversations.

The rest of this paper is organized as follows. Collection and analysis of data are given in Section 2. Characteristics of chat conversations are discussed in Section 3. The phases of determining the subject of a chat conversation are presented in detail in Section 4. The implementation and results are discussed in Section 5. The conclusion and future work are given in Section 6.

2. Collection and analysis of data

For the evaluation of chat conversations, data were gathered from chat mediums using log files of MSN Messenger and mIRC (a shareware Internet Relay Chat client), both of which are widely used. The gathered data are text files in which conversations are kept. The data in these files were pre-processed, and were

Table 1. Information on conversations gathered from chat mediums.

| Statistical information | Numerical values |
| --- | --- |
| Total number of conversations | 154 |
| Total number of words used in the conversations | 24,993 |
| Total number of most frequently used 20 words | 4195 |
| Words with spelling error | 2454 |
| Proportion of number of words with spelling error to the total number of words | 9.8% |
| Number of acronyms, short forms and icons | 2167 |
| Proportion of the number of most frequently used 20 words to the total number of words | 16.8% |
| Proportion of the number of most frequently used 50 words to the total number of words | 31.4% |
| Proportion of the number of most frequently used 100 words to the total number of words | 44.3% |

prepared for data mining, following the basic steps of data mining. The total size of the conversations is 4.7 Mb. The shortest conversation lasts 1 min and the longest 155 min. In total, 154 conversations were gathered from these mediums; 75 of them were used as the training set and the remaining 79 as the test set. The main features of the conversations, including statistics such as the total number of words and the number of words with spelling errors, are presented in Table 1. A list of the most frequently used words in the conversations was also compiled. The 20, 50 and 100 most frequently used words account for 16.8%, 31.4% and 44.3% of all words, respectively. Given that the conversations contain 24,993 words in total, nearly half of the text is made up of the same 100 words, which shows that specific words are used very frequently in chat mediums. Another conspicuous feature of chat conversations is the proportion of words with spelling errors. As Table 1 shows, words with spelling errors make up 9.8% of all words. This count does not include acronyms, short forms and icons. When these are also counted, about 19% of the words used in conversations were written incorrectly or incompletely, that is, roughly one word in five.

Taken together, these figures show that chat conversations differ considerably from news or other conventionally written texts.
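The proportions quoted in this section can be rechecked directly from the counts in Table 1. The short sketch below does that arithmetic; the variable and function names are illustrative only, and it assumes that the "about 19%" figure is obtained by adding the acronym, short form and icon count to the count of words with spelling errors. Under that assumption the combined share comes to 18.5%, which the text rounds up to 19%.

```python
# Counts taken from Table 1 of the paper.
total_words = 24_993
top20_words = 4_195
misspelled_words = 2_454
acronyms_short_forms_icons = 2_167


def share(part, whole):
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100.0 * part / whole, 1)


top20_share = share(top20_words, total_words)
misspelled_share = share(misspelled_words, total_words)
# Assumption: the combined figure adds both counts before dividing.
nonstandard_share = share(misspelled_words + acronyms_short_forms_icons, total_words)

print(top20_share, misspelled_share, nonstandard_share)
```

The first two values reproduce the 16.8% and 9.8% entries of Table 1.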

3. Determining characteristics of chat conversations

When chat conversations are examined, their content is seen to differ from conventionally written material. This follows from the nature of chat conversations. Conversations in chat mediums contain many misspelled words, as well as many short forms, signs and words with special meanings. In addition, word frequencies and sentence lengths differ from those of normal texts (Haichao et al., 2006; Khan et al., 2002; Kose & Ozyurt, 2006; Kose et al., 2008). Chat language therefore differs considerably from normal written language in terms of syntax. In this study, firstly, conversation texts obtained

Table 2. Acronym and short form examples used in conversations.

| Acronyms | Meaning | Short forms | Meaning |
| --- | --- | --- | --- |
| KİB | Take care | tmm | OK |
| AEO | God be with you | tlf | Phone |
| SÇS | I love you | üniv | University |
| ARO | God bestow mercy upon you | inş | If God wills |
| SG | See you later | cvp | Response |

Table 3. Samples from signs used in conversations.

| Signs | Meaning |
| --- | --- |
| :), :)))))), :)), :-)) | Laughing |
| :D, :d | Laughing loudly |
| ?, ?-, ????, _?, ...!? | Question and asking other meanings |
| :P, :PPP, :p | To show tongue |
| ;), ;)), ;))) | To blink |
| :(, :((, :((( | Unhappiness |
| :-), :=), :-))) | Laughing |

from chat mediums were examined to determine the basic characteristics of chat conversations.

3.1. Chat language

In the real-time and informal environment of IM (Instant Messaging) systems, chat messages are very different from conventional text (Haichao et al., 2006; Kose & Ozyurt, 2006; Kose et al., 2008). Chat language includes acronyms, short forms, polysemes, synonyms and mis-spelled terms, and the texts also contain irregular short forms that follow no formal grammatical rule. Special expressions and spelling errors commonly encountered in chat conversations can be grouped as follows.

Acronyms are formed by extracting the first letters of a sequence of words. For example, “KİB” is an acronym for “Kendine İyi Bak (Take care)”, “SÇS” is an acronym for “Seni Çok Seviyorum (I love you)” and “AEO” is an acronym for “Allah'a Emanet Ol (God be with you)”.

Short forms refer to the case in which a lengthy word is replaced with a shorter alternative expression. For example, “tmm” is a short form for “tamam (okay)” and “tşk” is a short form for “teşekkür ederim (thank you)”. Unlike acronyms, only some popular short forms have fixed expressions among different chat participants; many depend heavily on the context of the conversation and on the chatters. Table 2 shows some example short forms and some of the most popular acronyms.

Icons are used in conversations, such as :), :)))))), :)) (laughing); ?, ?-, ????, _?, ...!? (question and asking other meanings); :P, :PPP, :p (to show tongue); and :(, :((, :((( (unhappiness). Some of these icons have the same meaning although they are written differently. Some icons used in conversations are listed in Table 3.

Mis-spelling of terms is seen more frequently in chat conversations than in formal text documents because of the nature of chat. There are also cases in which a chat participant purposely mis-spells a word to emphasize its meaning. A common case is the duplication of letters, such as “evettttt”, “yawwwww” and “okkkk” instead of “evet (yes)”, “yahu (Hey!)” and “okey (okay)” respectively. The number of duplications is not fixed.

4. Determination of conversations' topic

By the nature of chat conversations, the subject being discussed may change quickly within any chat session (Khan et al.,

2002; Tianhao, Khan, Fisher, Shuler, & Pottenger, 2002; Elnahrawy, 2002; Haichao et al., 2006). While talking about one matter, the speakers may switch to another subject and then return to the previous one, or move on to a different matter altogether within the same conversation. One or more subjects may therefore be discussed in any session. On the other hand, no subject can be deduced from some conversations; these are typically short conversations that are not about any specific matter. Consequently, determining the conversation topic is difficult.

Table 4 shows statistics on the number of topics discussed in the conversations of the training set. The topic could not be determined for 9.3% of these conversations, 33.3% focused on a single topic, and 57.4% focused on two or more topics. The conversations whose topics could not be determined are usually very short and consist of 5–6 words. A large majority of the single-topic conversations are composed of 15–25 words. Most of the long conversations focused on two or more topics.

4.1. Identifying basic patterns of conversation threads

In determining the subject of a chat conversation, the thread (beginning) and the ending of a topic are important (Khan et al., 2002). A chat conversation changes and takes shape continuously, so determining threads and endings before determining the topic helps, even at a simple level, to identify the topic or topics. The beginning of a topic can be seen as its thread (Khan et al., 2002). This is easier for conversations with a single topic; when there is more than one topic, fixing threads and endings becomes harder.

In this study, threads and endings were determined accurately for conversations with a single topic. For conversations with more than one topic, about 85% success was achieved in determining the thread and ending of each subject. Determining the threads and endings of subjects in multi-topic conversations requires a separate, detailed study.

Before determining the conversation topic, the patterns used to fix threads and endings were identified. At the thread of any topic, direct or indirect expressions indicating that the conversation or subject has started should be found (Khan et al., 2002). According to the data gathered from chat conversations, conversations generally start with calling, greeting, and asking names and addresses. A conversation is understood to start when expressions like “slm (hi, hello)” and “nbr, nasılsın (how are you)” are used. These words may be accompanied by questions or ordinary sentences, as in the postings “Slm Ahmet, dün akşamki maçı izledin mi? (Hi Ahmet, did you watch the football match last night?)”, “Ali, nbr, Nasılsın? (Hello, how are you, Ali?)” and “Tatilde nereye gideceksin? (Where will you go on holiday?)”. Our approach relies on the fact that utilization

Table 4. Characteristics of conversations used in training set.

| Determination of conversation topic | Numbers | Percentage |
| --- | --- | --- |
| Total number of conversations in training set | 75 | 100 |
| Number of conversations whose subject could not be determined | 7 | 9.3 |
| Number of conversations focusing on single topic | 25 | 33.3 |
| Number of conversations focusing on two or more topics | 43 | 57.4 |


of these patterns indicates that a new thread has started. Such patterns can be regarded as direct patterns and can be named “starting patterns”. Mis-spellings of greeting and introduction words were taken into account: words that mean the same but are written differently because of spelling errors were treated as the same word. For example, “meraba”, “mrb”, “mrh” and “merhaba” were all taken as “merhaba (hello)”, so the use of any of them was given the same meaning.

In any conversation text, it is important to know not only the topic but also whether the same topic continues. This makes it possible to tell whether the conversation has one topic or several. Given the difficulty of determining the thread and ending of a subject, direct patterns alone are not sufficient. Indirect patterns that signal the continuation of a topic should therefore be determined and used. One example is the use of anaphoric references such as “O (he/she)” and “Bu/Şu (this, that)”. These usually continue the previous topic, although they may occasionally start a new thread. Another kind of indirect pattern is the length of expressions. Short expressions generally do not mark the start of a new thread; when a new topic starts, long expressions are generally used. Short responses such as “evet (yeah, yes)”, “hayır (no)” and “katılıyorum (I agree)” generally continue a thread. They can also appear at the ending of a thread, but not at its start. These patterns can be named “continuing patterns”.

To determine the ending of a thread, the patterns that finish a conversation, that is, “stopping patterns”, should be fixed. From the gathered conversations, patterns such as “görüşmek üzere (bye)”, “anlaşıldı (got it)” and “tamam (ok)”, which are used to finish conversations, were extracted. When such stopping patterns are seen, the thread is understood to be complete. The short responses used to continue conversations are also frequently used to stop threads. Therefore, when a thread follows short responses, it is decided that a new topic has started. If no further expression followed the short responses, this was assessed as the end of the thread, since a conversation most often ends with short responses like “ok (okay)”, “peki (all right)”, “tamam (ok)” and “anlaşıldı (got it)”.

In evaluating the conversations, these patterns were used to determine the thread and ending of each topic, and the threads were in turn used to determine the conversation topic. Determining the number of topics in a conversation before determining the topic or topics themselves helped us in solving this problem.
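The pattern categories of this section can be illustrated with a small rule-based labeller. This is only a sketch under stated assumptions, not the procedure used in the study: the lookup tables contain just the example expressions quoted above, the function names `normalize` and `label` are hypothetical, and collapsing runs of three or more identical characters is one simple way to handle the duplicated letters described in Section 3.1.

```python
import re

# Hypothetical lookup tables assembled from the examples quoted in Section 4.1.
VARIANTS = {"meraba": "merhaba", "mrb": "merhaba", "mrh": "merhaba"}
STARTING = {"slm", "merhaba", "nbr", "nasılsın"}
CONTINUING = {"evet", "hayır", "katılıyorum"}
STOPPING = {"görüşmek üzere", "anlaşıldı", "tamam", "ok", "peki"}
ANAPHORA = {"o", "bu", "şu"}


def normalize(posting):
    """Lowercase, tokenize, collapse runs of 3+ identical characters, map variants.

    Note: str.lower() is not Turkish-aware for the letters I and İ, so a real
    system would need locale-specific case folding.
    """
    tokens = re.findall(r"\w+", posting.lower())
    tokens = [re.sub(r"(.)\1{2,}", r"\1", token) for token in tokens]
    return [VARIANTS.get(token, token) for token in tokens]


def label(posting):
    """Label one posting as a starting, stopping or continuing pattern."""
    tokens = normalize(posting)
    if not tokens:
        return "unknown"
    phrase = " ".join(tokens)
    if any(token in STARTING for token in tokens):
        return "starting"
    if phrase in STOPPING:
        return "stopping"
    if phrase in CONTINUING or tokens[0] in ANAPHORA:
        return "continuing"
    return "unknown"


for posting in ["Slm Ahmet, dün akşamki maçı izledin mi?", "evettttt", "Görüşmek üzere"]:
    print(posting, "->", label(posting))
```

The sketch labels each posting in isolation. Deciding whether a short response ends a thread, as described above, would additionally require looking at whether any posting follows it.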

4.2. Feature selection and topic detection

In this paper, the topic of chat conversations was determined using preset topics. The topics commonly discussed were identified by analysing the conversations in the training set, and each conversation was then examined to decide which of these topics it contained. According to the data gathered from chat conversations, the topics were classified into five divisions: “sports”, “love/marriage”, “education”, “slang/swearing” and “entertainment”. A sixth division named “others” was added for topics outside these. Naive Bayes, k-Nearest Neighbor and Support Vector Machine methods were used as classifiers. The classifiers were run in Weka, an open source software package available on the internet (Witten & Frank, 2000).

To determine the topic or topics of a conversation, feature sets must first be determined. In feature selection, indicative words and terms were identified and gathered into an Indicative Feature Set for each topic. An element of such a set can be a single word or a phrase. Example Indicative Feature Sets for the sports and slang/swearword topics are given in Table 5. When building the indicative set for a topic, all key words that can indicate the topic were extracted, and expressions that are close in meaning or directly related to each other were gathered in a single row. The high dimensionality of text datasets affects classifier algorithms negatively. Choosing the words that make up an Indicative Feature Set is therefore important. The indicative set for a topic should be neither too long nor too short. If it is too long, it can contain irrelevant words and increase processing complexity. If it is too short, the topic is not represented sufficiently and performance may suffer (Haichao et al., 2006; Yang, 1999; Yang & Pedersen, 1997). Thus, the training set for each topic was carefully evaluated and was represented using the TFIDF weighting scheme for classification.

In the topic determination process, the Indicative Feature Sets were first built from the texts chosen as the training set. Feature vectors were then extracted from the training dataset, and the Indicative Feature Sets were used to determine the preset topic or topics with which the subject of a conversation overlapped. In

Table 5. Sample Indicative Feature Sets for the “sport” and “slang/swearword” topics.

Sport topic

| Order | Term 1 | Term 2 | Term 3 |
| --- | --- | --- | --- |
| 1 | Spor (sports) | Futbol (football) | Maç (match) |
| 2 | Oyuncu (player) | Futbolcu (football player) | Kaleci (goalkeeper) |
| 3 | Forvet (striker) | orta saha (midfield) | Defans (defence) |
| 4 | Faul (foul) | Penaltı (penalty) | Ofsayt (offside) |
| ... | | | |

Slang/swearword topic

| Order | Term 1 | Term 2 | Term 3 |
| --- | --- | --- | --- |
| 1 | Lan (man) | Ulan (buddy!) | Laa (buddy!) |
| 2 | Yahu (for god's sake!) | Yaw (why!) | Yaws (why!) |
| 3 | a.k (fuck) | a.q (fuck) | a.g (fuck) |
| 4 | Manyak (maniac) | Salak (fool) | Aptal (stupid) |
| ... | | | |


Fig. 1. Feature selection and topic detection scheme.

that way, the conversation topic was determined by means of classifiers. Both single-topic and multiple-topic determination were carried out. The scheme for building the Indicative Feature Sets and determining topics on the test set is presented in Fig. 1.

5. Results

The aim of this study was to determine the topic of chat conversations, and supervised learning techniques were used for this purpose. Of the 154 conversations gathered from internet mediums, 75 were used as the training set and the remaining 79 as the test set. The Indicative Feature Sets composed for each topic were used with the training set. In the test process, the feature vectors of each conversation were extracted, and the preset topics to which the conversation belonged were determined. Weka, an open source classifier software package, was used for the classification (Witten & Frank, 2000). The results of the classification are given in Table 6.

The results show that SVM gave the best accuracy for every topic except sport, where k-NN scored higher. Slang/swearword is the topic with the highest accuracy, at about 92% (91.7% with SVM in Table 6). The main reason is that conversations on this topic contain more indicative terms.

Table 6. Result of conversation topic determination.

| Topic | Naïve Bayes (%) | k-NN (%) | SVM (%) |
| --- | --- | --- | --- |
| Sport topic | 86.4 | 88.6 | 87.1 |
| Love/marriage topic | 84.5 | 85.1 | 86.3 |
| Education topic | 86.9 | 87.3 | 87.7 |
| Slang/swearword topic | 89.8 | 90.2 | 91.7 |
| Entertainment topic | 86.5 | 87.3 | 87.7 |
| Other topic | 87.4 | 88.4 | 89.6 |
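To make the pipeline of Section 4.2 more concrete, the sketch below restricts the vocabulary to indicative terms, weights them with TFIDF, and assigns a conversation to the topic of its most similar training conversation. It is an illustration under several assumptions and does not reproduce the Weka experiments: the training postings are invented toy data, the terms are a subset of Table 5, the smoothed inverse document frequency log((1 + n) / (1 + df)) + 1 is one common variant chosen here, a single nearest neighbour with cosine similarity stands in for the k-NN classifier, and returning “others” when no indicative term matches is a simplification.

```python
import math
from collections import Counter

# Indicative terms taken from Table 5; flattening them into one vocabulary
# is an assumption made for this sketch.
INDICATIVE = {
    "sports": ["spor", "futbol", "maç", "oyuncu", "kaleci", "penaltı"],
    "slang/swearword": ["lan", "ulan", "yahu", "manyak", "salak", "aptal"],
}
VOCAB = sorted(term for terms in INDICATIVE.values() for term in terms)

# Hypothetical, already tokenized training conversations with preset topics.
TRAIN = [
    (["dün", "maç", "penaltı", "kaleci"], "sports"),
    (["futbol", "oyuncu", "maç"], "sports"),
    (["lan", "salak", "yahu"], "slang/swearword"),
    (["ulan", "aptal", "manyak"], "slang/swearword"),
]


def idf_weights(docs):
    """Smoothed inverse document frequency for every vocabulary term."""
    n = len(docs)
    return {t: math.log((1 + n) / (1 + sum(t in d for d in docs))) + 1.0 for t in VOCAB}


IDF = idf_weights([tokens for tokens, _ in TRAIN])


def tfidf(tokens):
    """TFIDF vector over the indicative vocabulary only."""
    counts = Counter(tokens)
    return [counts[t] * IDF[t] for t in VOCAB]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def classify(tokens):
    """Return the topic of the most similar training conversation (1-NN)."""
    query = tfidf(tokens)
    score, topic = max((cosine(query, tfidf(doc)), label) for doc, label in TRAIN)
    return topic if score > 0 else "others"


print(classify(["maç", "futbol", "izledin"]))
print(classify(["selam", "nasılsın"]))
```

Restricting the vectors to indicative terms keeps the dimensionality low, which is the motivation given in Section 4.2 for choosing the Indicative Feature Sets carefully.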

6. Conclusion and future work

In this study, supervised classification was used to determine automatically the topic or topics discussed in conversations. Chat conversations gathered from internet mediums were examined and classified according to their topics, and an accuracy of up to 92% was achieved. To determine the topic or topics, the number of topics discussed in a conversation was determined first. This step is a subject that deserves to be studied independently. Threads and endings of topics were determined with basic-level analysis. In future studies, more advanced analysis could be applied so that threads and endings of topics can be determined with higher accuracy.

References

Amasyali, M. F., & Diri, B. (2006). Automatic Turkish text categorization in terms of author, genre and gender. In 11th international conference on applications of natural language to information systems, NLDB 2006 (pp. 221–226).
Bengel, J., Gauch, S., Mitter, E., & Vijayaraghavan, R. (2004). Chattrack: Chat room topic detection using classification. Lecture Notes in Computer Science, 3073, 266–277.
Bing, L., Xiaoli, L., Wee, S. L., & Philip, S. Y. (2004). Text classification by labeling words. In Nineteenth national conference on artificial intelligence (pp. 425–430).
Bingham, E., Kabán, A., & Girolami, M. (2003). Topic identification in dynamic text by complexity pursuit. Neural Processing Letters, 17, 69–83.
Elnahrawy, E. (2002). Log-based chat room monitoring using text categorization: A comparative study. In Proceedings of the IASTED international conference on information and knowledge sharing. St. Thomas, US Virgin Islands.
Haichao, D., Siu, C. H., & Yulan, H. (2006). Structural analysis of chat messages for topic detection. Online Information Review, 30(5), 496–516.
Han, E., Karypis, G., & Kumar, V. (2001). Text categorization using weight adjusted k-nearest neighbor classification. Lecture Notes in Computer Science, 2035, 53–65.
Han, J., & Kamber, M. (2006). Data mining concepts and techniques. New York: Morgan Kaufmann.
Joachims, T. (1998). Text categorization with support vector machines: Learning with many relevant features. Lecture Notes in Computer Science, 1398, 137–142.
Khan, F. M., Fisher, T. A., Shuler, L. A., Tianhao, W., & Pottenger, W. M. (2002). Mining chat-room conversations for social and semantic interactions. Lehigh University Technical Report, LU-CSE-02-011.
Kolenda, T., Hansen, L. K., & Larsen, J. (2001). Signal detection using ICA: Application to chat room topic spotting. In Proceedings of the 3rd international conference on independent component analysis and signal separation (ICA2001) (pp. 540–545).
Koppel, M., Argamon, S., & Shimoni, A. R. (2002). Automatically categorizing written texts by author gender. Literary and Linguistic Computing, 17(4), 401–412.
Kose, C., & Ozyurt, O. (2006). A target oriented agent to collect specific information in a chat medium. Lecture Notes in Computer Science, 4263, 697–706.
Kose, C., Ozyurt, O., & Ikibas, C. (2008). A comparison of textual data mining methods for sex identification in chat conversations. Lecture Notes in Computer Science, 4993, 638–643.
Tianhao, W., Khan, F. M., Fisher, T. A., Shuler, L. A., & Pottenger, W. M. (2002). Error-driven boolean-logic-rule-based learning for mining chat-room conversations. Lehigh University Technical Report, LU-CSE-02-008.
Yang, Y. (1999). An evaluation of statistical approaches to text categorization. Information Retrieval Journal, 1(2), 69–90.
Yang, Y., & Pedersen, J. O. (1997). A comparative study on feature selection in text categorization. In Proceedings of the fourteenth international conference on machine learning (pp. 412–420).
Witten, I. H., & Frank, E. (2000). Data mining: Practical machine learning tools and techniques with Java implementations. New York: Morgan Kaufmann (Chapter 8).