Hybrid Text-based Deception Models for Native and Non-Native English Cybercriminal Networks

Alex V. Mbaziira
George Mason University
4400 University Dr., Fairfax, VA
E-mail: [email protected]

James H. Jones
George Mason University
4400 University Dr., Fairfax, VA
E-mail: [email protected]
ABSTRACT
Cybercriminals are increasingly using Internet messaging to exploit their victims. We develop and apply a text-based deception detection approach to build hybrid models for detecting cybercrime in the text-based Internet communications of native and non-native English speaking cybercriminal networks, where our models use both computational linguistics (CL) and psycholinguistic (PL) features. We study four types of deception-based cybercrime: fraud, scam, favorable fake reviews, and unfavorable fake reviews. We build two types of generalized hybrid models for both native and non-native English speaking cybercriminal networks: 2-dataset and 3-dataset hybrid models using Naïve Bayes, Support Vector Machines, and kth Nearest Neighbor algorithms. Each 2-dataset model is trained on two forms of cybercrime in different web genres and then used to detect and analyze other types of cybercrime in web genres that were not part of the training set, to establish model generalizability. Similarly, each 3-dataset model is trained on three forms of cybercrime in different web genres and used to detect and analyze cybercrime in a web genre that was not part of the training set. Model performance on the test datasets ranges from 60% to 80% accuracy, with the best performance on detection of unfavorable reviews and fraud, and notable differences emerged between detection in messages from native and non-native English speaking groups. Our work may be applied in provider- or user-based filtering tools to identify cybercriminal actors and block or label undesirable messages before they reach their intended targets.

CCS Concepts
• Social and professional topics➝Computer crime
• Computing methodologies➝Natural language processing
• Computing methodologies➝Machine learning
• Computing methodologies➝Supervised learning by classification

Keywords
Cybercrime; computational linguistics; psycholinguistics; deception; natural language processing; machine learning.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
ICCDA '17, May 19-23, 2017, Lakeland, FL, USA
© 2017 Association for Computing Machinery.
ACM ISBN 978-1-4503-5241-3/17/05…$15.00
http://dx.doi.org/10.1145/3093241.3093280

1. INTRODUCTION
As Internet-carried text messaging becomes ubiquitous and widely adopted as a mode of communication, cybercriminals are increasingly leveraging these text-based communications for their operations against victims. These incidents are surging due to low barriers to committing crime and high financial incentives. Cybercriminals use deception to trick and exploit their victims through text messages, resulting in various types of cybercrime such as fraud, scams, and fake online reviews [15]. Most cybercrime is committed in English, since this is the dominant language on the web for text messages [27]. However, not all cybercriminals are native English speakers, and the number of non-native English speaking cybercriminals profiting from cybercrime is growing [9]. There are also linguistic variations between native and non-native English cybercriminal networks. In this paper, we demonstrate how hybrid models can be applied to detect and analyze cybercrime in native and non-native English cybercriminal networks. We use three algorithms that are well studied in binary text classification and were applied to early forms of text-based cybercrime such as spam [11, 21, 24]: Naïve Bayes (NB), Support Vector Machines (SVM), and kth Nearest Neighbor (kNN). NB discriminates between binary classes by computing posterior probabilities for each of the two classes in a given dataset [25]. SVM uses a maximal-margin hyperplane to linearly discriminate instances into binary classes [3], while kNN uses a distance function to determine the closest known instances to a given unknown instance.

In this work, we also extend a previous paper [18] on detecting deception and cybercrime using CL and PL processes. This work shows that we can train models on combined web genres that generalize across deception and cybercrime. It further demonstrates that it is possible to build hybrid cybercrime detection models targeting cybercriminal networks characterized as native and non-native English speaking. The remainder of the paper is organized as follows: Section 2 presents a literature review on deception detection and cybercrime, while Section 3 describes the methodology and experimental setup for data collection and analysis. Section 4 presents the results of the experiments, Section 5 discusses the findings, and Section 6 concludes the paper.

1.1 Key Contributions
Our first contribution is training generalizable hybrid cybercrime detection models from four datasets categorized in three web genres: email, social media, and websites, as shown in Table 1. For 2-dataset hybrid models, we combine any two datasets to generate the training model and test on the other datasets. Similarly, three datasets are combined to generate 3-dataset hybrid models.
For our second contribution, we build 2-dataset and 3-dataset hybrid models for detecting cybercrime in non-native English speaking cybercriminal networks. One dataset is collected from non-native English speaking cybercriminal networks while the other three datasets are collected from native English speaking cybercriminal networks. We derive dataset-specific features to build hybrid models to detect and analyze cybercrime in an actual non-native English speaking cybercriminal network.
2. RELATED WORK
There is limited work on detecting deception and cybercrime in text-based messages from native and non-native English speaking cybercriminal networks. Early work on deception detection in text messages extends similar work on non-verbal communication [28] and machine learning. Our paper complements prior research by identifying lexical, syntactic, and psycholinguistic features which are relevant to real-world problems in deception and cybercrime.

We reviewed research on psychological processes linked to cybercrime with respect to text-based communication. Prior work in PL identifies features such as frequency of lexical items, average word and sentence length, first-person pronouns, and exclusive words that can be linked to deception and psychological processes [19, 26]. For our work, we identify features for PL processes that are relevant to cybercrime and deception in text messages, which we further complement with CL features to build our hybrid models. The work in [8] identifies linguistic features for detecting deception in written statements and interviews for criminal investigations, but not cybercrime. The deception features identified there, such as frequencies of lexical items, verb phrases, non-prompted negation, and moderating adverb phrases, are all computational linguistic features. Our work uses more CL features in addition to PL features to map linguistic psychological processes to deception and cybercrime.

The work in [10] detects deception in favorable fake reviews using n-grams and deeper syntactical features such as weight, location, and price, which are combined with n-grams to build the learning models. We do not use n-gram analysis in our approach since such models are not robust [5, 22]. We also use an additional dataset of unfavorable product reviews and test our models on other datasets. In [4], scams on Twitter are detected using n-grams and component analysis with semi-supervised learning. Our paper uses supervised learning to detect cybercrime in web genres, and we do not use n-grams because such models are not robust even when principal components are applied [9].

In [16], cluster analysis is used to study deception in emails using self-references, exclusive words, negative emotion, and action verbs. Our paper uses more PL features, which we complement with CL features to build our cybercrime models; we also use a fraud dataset from the Department of Justice (DoJ). Work on stylometry studies deception and authorship attribution for disputed documents. The paper in [1] uses n-grams, n-gram words, syntactical features, function words, and specific keywords used by spammers, in addition to nine CL features. Another paper, on adversarial stylometry [2], uses word and character n-grams and some CL features to detect imitation and obfuscation of documents. Our paper differs in that it uses real-world data to study deception and cybercrime, hence we do not use any tools to imitate or obfuscate messages to make them deceptive, and we use more CL and PL features linked to deception and cybercrime.

3. METHODOLOGY AND EXPERIMENT SETUP
3.1 Data Description
We use four datasets to study cybercrime in the form of fraud, scam, and favorable and unfavorable fake reviews across web genres. For the Facebook web genre, we generated the dataset from publicly leaked emails of an actual non-native English speaking cybercriminal network that ran an online data theft service [23]. For the email web genre, we use emails made public by the Federal Energy Regulatory Commission [7] in the Enron scandal, together with the 89 emails that were part of the court evidence made public by the DoJ when prosecuting the two Enron executives for securities and wire fraud. For the website web genre, we use two datasets of favorable and unfavorable reviews [20].

For ground truth, we manually verified all Facebook messages containing work-at-home scams, lottery scams, advertisements for hacking services, and carding, as well as the truthful messages. A team of three graduate students performed the labeling exercise, and a majority vote was taken for each instance. All emails used as evidence to prosecute the former Enron corporation executives for securities and wire fraud were labeled as deceptive, because none of the charges in the case were overturned on appeal [13]. The other emails made public by the Federal Energy Regulatory Commission were labeled as truthful. The public datasets of hotel reviews were already labeled [20]; however, we considered as truthful only those reviews containing transactional data, such as deals on hotel bookings, services, meal prices, and valet services, as assurance that the reviewer was a guest at that hotel [12]. We manually identified 84 out of 400 truthful unfavorable reviews and 78 out of 400 truthful favorable reviews.

From each dataset, we randomly sampled 100 instances to obtain a training set for each web genre. For the test set, we randomly sampled 20 instances which are not part of the training set, as shown in Table 1. The size of the training and test sets in our experiments is restricted by the ground truth data: the 89 Enron court emails for the deceptive class, as well as the 84 truthful unfavorable reviews and 78 truthful favorable reviews with transactional data. For the hybrid models, two datasets are combined to generate the training set for the 2-dataset hybrid models; similarly, three datasets are combined to generate the training set for the 3-dataset hybrid models.

All instances in the four datasets are preprocessed and normalized using WEKA's normalization filter. For the Facebook data, we preprocessed the data by removing non-English posts, non-ASCII characters, and text-based emoticons, while for the Enron data we removed all email headers and prior thread content. The favorable and unfavorable reviews consist of plain text messages which did not require any preprocessing. We also use 10-fold cross-validation on the training sets when building the models.
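The Facebook preprocessing step in Section 3.1 can be sketched as follows. The paper uses WEKA filters for normalization; this standalone Python fragment is our own illustration, and the emoticon pattern is an assumed approximation, not the authors' actual filter.

```python
import re

# Hypothetical emoticon pattern -- an assumption, not the authors' exact filter.
EMOTICON_RE = re.compile(r"[:;=8][\-o\*']?[\)\(\[\]dDpP/\\|]")

def preprocess(post: str) -> str:
    """Strip non-ASCII characters and text-based emoticons from a post,
    then collapse the leftover whitespace."""
    text = post.encode("ascii", errors="ignore").decode("ascii")
    text = EMOTICON_RE.sub(" ", text)
    return " ".join(text.split())

print(preprocess("Cheap loans :) contact us nòw!!"))  # -> Cheap loans contact us nw!!
```

Equivalent removal of email headers and thread content for the Enron data would follow the same pattern with header-matching expressions.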
3.2 Feature Selection and Engineering
We identify CL features from CL processes linked to cybercrime and deception detection: verbs, modifiers, average sentence length, average word length, pausality, modal verbs, emotiveness, lexical diversity, redundancy, characters, punctuation marks, sentences, adjectives, adverbs, nouns, and function words [28].
The PL processes relevant to cybercrime are: analytical words, words per sentence, six-letter words, the pronouns I, we, you, and she/he, affect, positive emotion, negative emotion, insight, causation, and certainty words.
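Several of the simpler CL and PL features above can be computed from token counts alone. The paper derives these with LIWC-style tooling; the sketch below uses simplified surrogate formulas (for instance, lexical diversity as a type-token ratio) and is illustrative only.

```python
def cl_features(text: str) -> dict:
    """Compute a handful of the CL/PL features used in the paper.
    Simplified surrogate formulas; illustrative only."""
    sentences = [s for s in text.replace("!", ".").replace("?", ".").split(".") if s.strip()]
    words = text.replace(".", " ").replace("!", " ").replace("?", " ").split()
    n = max(len(words), 1)
    return {
        "avg_sentence_length": len(words) / max(len(sentences), 1),   # words per sentence
        "avg_word_length": sum(len(w) for w in words) / n,
        "lexical_diversity": len({w.lower() for w in words}) / n,     # type-token ratio
        "six_letter_words": sum(len(w) >= 6 for w in words) / n,      # PL "big words" rate
        "first_person": sum(w.lower() in {"i", "we", "me", "us", "my", "our"} for w in words),
    }

f = cl_features("I won the lottery. Send me your account details now.")
print(f["first_person"], f["avg_sentence_length"])  # -> 2 5.0
```

Part-of-speech-based features such as emotiveness and pausality would additionally require a tagger, which is omitted here.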
For the hybrid models for non-native English cybercriminal networks, we drop all features whose normalized average value in a dataset falls below 0.1.
3.3 Evaluating Classifier Performance
We use precision (P), recall (R), F-measure (F), and Receiver Operating Characteristic (ROC) curves. Precision measures the fraction of the instances the classifier declared deceptive that are actually deceptive, while recall measures the fraction of the deceptive messages that are correctly predicted. F-measure is the harmonic mean of precision and recall, and ROC curves illustrate the tradeoff between the true positive rate and the false positive rate of the classifier.
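These metrics can be computed directly from confusion-matrix counts. A minimal self-contained sketch for the deceptive (positive) class follows; the label vectors are invented for illustration.

```python
def prf(y_true, y_pred):
    """Precision, recall and F-measure for the deceptive (positive) class."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))  # true positives
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))  # false positives
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f

# Ten test messages: 1 = deceptive, 0 = truthful.
y_true = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 1, 0, 0, 0, 0]
p, r, f = prf(y_true, y_pred)
print(p, r, round(f, 2))  # -> 0.75 0.6 0.67
```

ROC curves additionally require per-instance classifier scores rather than hard labels, so they are not reproduced here.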
Table 1. Description of random sample

Dataset                  | Web Genre | Cybercrime               | # Train Set | # Test Set
Enron (EN)               | Email     | Fraud                    | 100         | 20
Facebook (FB)            | Facebook  | Scam                     | 100         | 20
Unfavorable Reviews (NR) | Website   | Unfavorable Fake Reviews | 100         | 20
Favorable Reviews (PR)   | Website   | Favorable Fake Reviews   | 100         | 20
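The hybrid training sets of Section 3.1 pool the 100-instance samples of the training genres and test on 20 instances from a genre held out of training. A sketch of that assembly follows; the function name and placeholder corpora are hypothetical.

```python
import random

def make_hybrid(datasets, train_genres, test_genre, n_train=100, n_test=20, seed=7):
    """Pool n_train sampled instances from each training genre and hold out
    n_test instances from an unseen genre, as in the 2-dataset models."""
    assert test_genre not in train_genres            # the test genre is never trained on
    rng = random.Random(seed)
    train = [x for g in train_genres for x in rng.sample(datasets[g], n_train)]
    test = rng.sample(datasets[test_genre], n_test)
    return train, test

# Hypothetical stand-ins for the four corpora in Table 1.
datasets = {g: [f"{g}_{i}" for i in range(120)] for g in ("EN", "FB", "NR", "PR")}
train, test = make_hybrid(datasets, ("EN", "FB"), "PR")
print(len(train), len(test))  # -> 200 20
```

A 3-dataset model would pass three genres as `train_genres`, yielding a 300-instance training pool.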
Table 2. Predictive accuracy of 2-dataset models for native English cybercriminal networks

Model | Classifier | Unfav. fake reviews | Fav. fake reviews | Scam | Fraud
EN+FB | NB  | 50% | 60% | –   | –
EN+FB | SVM | 50% | 30% | –   | –
EN+FB | KNN | 40% | 50% | –   | –
EN+NR | NB  | –   | 50% | 40% | –
EN+NR | SVM | –   | 60% | 40% | –
EN+NR | KNN | –   | 50% | 50% | –
EN+PR | NB  | 50% | –   | 40% | –
EN+PR | SVM | 70% | –   | 40% | –
EN+PR | KNN | 70% | –   | 40% | –
FB+NR | NB  | –   | 70% | –   | 50%
FB+NR | SVM | –   | 80% | –   | 50%
FB+NR | KNN | –   | 50% | –   | 50%
FB+PR | NB  | 50% | –   | –   | 60%
FB+PR | SVM | 40% | –   | –   | 80%
FB+PR | KNN | 40% | –   | –   | 70%
NR+PR | NB  | –   | –   | 70% | 60%
NR+PR | SVM | –   | –   | 60% | 60%
NR+PR | KNN | –   | –   | 60% | 60%

Table 3. Evaluating classifiers for 2-dataset models for native English cybercriminal networks

Model | Classifier | P    | R    | F    | ROC
EN+FB | NB  | 0.94 | 0.94 | 0.94 | 0.94
EN+FB | SVM | 0.75 | 0.75 | 0.75 | 0.75
EN+FB | KNN | 0.78 | 0.77 | 0.76 | 0.75
EN+NR | NB  | 0.67 | 0.67 | 0.66 | 0.73
EN+NR | SVM | 0.78 | 0.77 | 0.77 | 0.77
EN+NR | KNN | 0.75 | 0.75 | 0.75 | 0.75
EN+PR | NB  | 0.73 | 0.72 | 0.72 | 0.82
EN+PR | SVM | 0.81 | 0.81 | 0.81 | 0.81
EN+PR | KNN | 0.79 | 0.79 | 0.79 | 0.79
FB+NR | NB  | 0.60 | 0.60 | 0.59 | 0.59
FB+NR | SVM | 0.65 | 0.65 | 0.65 | 0.65
FB+NR | KNN | 0.61 | 0.61 | 0.61 | 0.60
FB+PR | NB  | 0.63 | 0.61 | 0.60 | 0.65
FB+PR | SVM | 0.73 | 0.72 | 0.71 | 0.72
FB+PR | KNN | 0.66 | 0.66 | 0.66 | 0.65
NR+PR | NB  | 0.67 | 0.66 | 0.66 | 0.71
NR+PR | SVM | 0.76 | 0.76 | 0.76 | 0.76
NR+PR | KNN | 0.68 | 0.68 | 0.67 | 0.74
Table 4. Evaluating training models for 2-dataset models for non-native English cybercriminal networks

Model | Classifier | P    | R    | F    | ROC
EN+NR | NB  | 0.77 | 0.74 | 0.73 | 0.77
EN+NR | SVM | 0.74 | 0.73 | 0.73 | 0.73
EN+NR | KNN | 0.72 | 0.70 | 0.71 | 0.71
EN+PR | NB  | 0.76 | 0.71 | 0.70 | 0.78
EN+PR | SVM | 0.85 | 0.85 | 0.85 | 0.85
EN+PR | KNN | 0.75 | 0.74 | 0.74 | 0.80
NR+PR | NB  | 0.57 | 0.55 | 0.50 | 0.61
NR+PR | SVM | 0.64 | 0.64 | 0.63 | 0.64
NR+PR | KNN | 0.62 | 0.60 | 0.58 | 0.62

4. RESULTS
We train 2-dataset and 3-dataset hybrid models using three well-studied classification algorithms, namely Naïve Bayes, Support Vector Machines, and kth Nearest Neighbor, to detect cybercrime in native English and non-native English cybercriminal networks. In this section, we present the performance of these models.

4.1 2-Dataset Hybrid Models for Native English Cybercriminal Networks
Table 2 summarizes the predictive accuracy of the 2-dataset hybrid models for cybercrime detection. The EN+FB model is trained on fraud and scam and detects only favorable fake reviews, with 60% accuracy. The EN+NR model is trained on fraud and unfavorable fake reviews to detect scam and favorable fake reviews, and detects favorable fake reviews with 60% accuracy. The EN+PR model is trained on fraud and favorable fake reviews and detects unfavorable fake reviews with 70% accuracy. The FB+NR model is trained on scam and unfavorable fake reviews, and its performance ranges from 60% to 80% accuracy. The FB+PR model is trained on scam and favorable fake reviews and detects fraud and unfavorable reviews with a performance range of 60% to 80% accuracy. The NR+PR model is trained on favorable and unfavorable reviews and detects fraud with 60% accuracy.

Table 3 summarizes the performance of the classifiers for the 2-dataset models for detecting cybercrime in native English speaking cybercriminal networks. Evaluation of these classifiers reveals that they perform and generalize well in detecting cybercrime in other web genres.

4.2 2-Dataset Hybrid Models for Non-Native English Speaking Cybercriminal Networks
We build three hybrid models for predicting scam in non-native English speaking cybercriminal networks. The EN+NR model is trained on fraud and unfavorable fake reviews and detects scam with 60% accuracy with the NB classifier. The EN+PR model is trained on fraud and favorable fake reviews and detects scam with 60% accuracy with NB. The NR+PR model is trained on favorable and unfavorable fake reviews and detects scam with 60% accuracy with both the NB and kNN classifiers. Evaluation of these classifiers reveals that they perform and generalize well in detecting cybercrime in other web genres in non-native English speaking cybercriminal networks, as shown in Table 4.

4.3 3-Dataset Hybrid Models for Native English Speaking Cybercriminal Networks
Table 5 shows the predictive accuracy of the 3-dataset hybrid models for native English cybercriminal networks. The EN+FB+PR model is trained on fraud, scam, and favorable fake reviews and detects unfavorable fake reviews using the NB, SVM, and kNN classifiers with 60%, 70%, and 70% accuracy respectively. The FB+NR+PR model is trained on scam, favorable, and unfavorable fake reviews to detect and analyze fraud; both the NB and SVM classifiers detect fraud with 70% accuracy, while the kNN classifier detects fraud with 60% accuracy.

Table 6 summarizes the performance of the classifiers for the 3-dataset models for detecting cybercrime in native English speaking cybercriminal networks. Evaluation of these classifiers reveals that they also perform well.

Table 5. Predictive accuracy of 3-dataset models for native English cybercriminal networks

Model    | Classifier | Unfav. fake reviews | Fav. fake reviews | Fraud
EN+FB+PR | NB  | 60% | –   | –
EN+FB+PR | SVM | 70% | –   | –
EN+FB+PR | KNN | 70% | –   | –
FB+NR+PR | NB  | –   | –   | 70%
FB+NR+PR | SVM | –   | –   | 70%
FB+NR+PR | KNN | –   | –   | 60%
EN+FB+NR | NB  | –   | 50% | –
EN+FB+NR | SVM | –   | 50% | –
EN+FB+NR | KNN | –   | 60% | –

Table 6. Evaluating training models for 3-dataset models for native English cybercriminal networks

Model    | Classifier | P    | R    | F    | ROC
EN+FB+NR | NB  | 0.66 | 0.66 | 0.65 | 0.68
EN+FB+NR | SVM | 0.71 | 0.70 | 0.70 | 0.70
EN+FB+NR | KNN | 0.71 | 0.70 | 0.70 | 0.70
EN+FB+PR | NB  | 0.67 | 0.66 | 0.65 | 0.71
EN+FB+PR | SVM | 0.75 | 0.75 | 0.75 | 0.75
EN+FB+PR | KNN | 0.73 | 0.72 | 0.72 | 0.81
FB+NR+PR | NB  | 0.62 | 0.62 | 0.61 | 0.63
FB+NR+PR | SVM | 0.64 | 0.64 | 0.64 | 0.64
FB+NR+PR | KNN | 0.63 | 0.62 | 0.62 | 0.66

4.4 3-Dataset Hybrid Models for Non-Native English Speaking Cybercriminal Networks
We train one 3-dataset hybrid model to detect scam in a non-native English speaking cybercriminal network. This model, EN+PR+NR, is trained on fraud and on favorable and unfavorable fake reviews, and detects scam with 60% accuracy, while the SVM classifier detects scam with 70% accuracy. Table 7 summarizes the performance of the classifiers for the 3-dataset model for detecting cybercrime in non-native English speaking cybercriminal networks. Evaluation of these classifiers reveals that they also perform well on the training models in detecting and analyzing scam.

Table 7. Evaluating training models for the 3-dataset model for non-native English cybercriminal networks

Model    | Classifier | P     | R     | F     | ROC
EN+PR+NR | NB  | 0.656 | 0.65  | 0.647 | 0.723
EN+PR+NR | SVM | 0.997 | 0.997 | 0.997 | 0.997
EN+PR+NR | KNN | 0.732 | 0.727 | 0.725 | 0.833

5. DISCUSSION
Findings on model performance reveal that more classifiers of the hybrid models detect fraud and unfavorable reviews because these crimes exhibit more patterns of deception and cybercrime. The first pattern is that cybercriminals are less committal in their text-based communication, hence use fewer verbs and modal verbs; we observed this pattern in fraud, scams, and unfavorable fake reviews. The second pattern is that cybercriminals are verbose in their text messages, hence use more punctuation marks and function words; we observed this pattern in scams and in favorable and unfavorable fake reviews. The third pattern is that cybercriminals are vague and ambiguous in their text messages, hence use more function words, adverbs, and adjectives; we observed this pattern in scams, fraud, and unfavorable reviews. The fourth pattern is that cybercriminals avoid being held accountable by using fewer self-pronouns in their text messages; we observed this pattern in scam and fraud. The fifth pattern is that cybercriminals use messages with low cognitive complexity, characterized by fewer analytical, insight, and causation words and by shorter words; we observed this pattern in scams and in favorable and unfavorable fake reviews. The sixth pattern is that cybercriminals are emotional, hence use more emotion words; we observed this pattern in fraud and in favorable and unfavorable fake reviews.

All models whose predictive accuracies for the truthful and deceptive classes were less than 50% were disregarded, because the human accuracy rate for detecting deception in text messages is 50% [12].

The datasets we use to build and evaluate our models are small; however, there is no theoretical guarantee binding the number of instances of a learning model to good generalization, hence deductions from such a model can still be correct [6]. Similarly, there is no widely accepted threshold for the size of test datasets that generalize well to training models [17].

6. CONCLUSION
We demonstrate that we can build hybrid deception models that discriminate deception and cybercrime from benign text-based communication in both native and non-native English speaking cybercriminal networks. We use dataset-specific features for models to detect cybercrime in non-native English speaking cybercriminal networks. The generalizability of our approach suggests that we can train models to detect deception across different activities, enabling the detection of new but criminal activity without prior knowledge of the new activity's details.
7. ACKNOWLEDGMENTS
We would like to thank all the anonymous reviewers whose comments helped us improve this paper.

8. REFERENCES
[1] Afroz, S., Brennan, M. and Greenstadt, R. 2012. Detecting Hoaxes, Frauds, and Deception in Writing Style Online. 2012 IEEE Symposium on Security and Privacy (SP) (May 2012), 461–475.
[2] Brennan, M., Afroz, S. and Greenstadt, R. 2012. Adversarial Stylometry: Circumventing Authorship Recognition to Preserve Privacy and Anonymity. ACM Trans. Inf. Syst. Secur. 15, 3 (Nov. 2012), 12:1–12:22.
[3] Chang, C.-C. and Lin, C.-J. 2001. LIBSVM: a Library for Support Vector Machines.
[4] Chen, X., Chandramouli, R. and Subbalakshmi, K.P. 2014. Scam detection in Twitter. Data Mining for Service. Springer. 133–150.
[5] Chen, Y., Zhou, Y., Zhu, S. and Xu, H. 2012. Detecting Offensive Language in Social Media to Protect Adolescent Online Safety. Privacy, Security, Risk and Trust (PASSAT), 2012 International Conference on and 2012 International Conference on Social Computing (SocialCom) (Sep. 2012), 71–80.
[6] Domingos, P. 2012. A Few Useful Things to Know About Machine Learning. Commun. ACM. 55, 10 (Oct. 2012), 78–87.
[7] Enron Email Dataset: 2015. http://www.cs.cmu.edu/~enron/. Accessed: 2016-03-29.
[8] Exploiting Verbal Markers of Deception Across Ethnic Lines: An Investigative Tool for Cross-Cultural Interviewing: 2015. https://leb.fbi.gov/2015/july/exploiting-verbal-markers-of-deception-across-ethnic-lines-an-investigative-tool-for-cross-cultural-interviewing. Accessed: 2016-11-27.
[9] Exploring Underweb forums: How cybercriminals communicate: http://www.techrepublic.com/blog/it-security/exploring-underweb-forums-how-cybercriminals-communicate/. Accessed: 2016-11-27.
[10] Feng, V.W. and Hirst, G. 2013. Detecting Deceptive Opinions with Profile Compatibility. International Joint Conference on Natural Language Processing (Nagoya, Japan, 2013), 338–346.
[11] Firte, L., Lemnaru, C. and Potolea, R. 2010. Spam detection filter using KNN algorithm and resampling. 2010 IEEE International Conference on Intelligent Computer Communication and Processing (ICCP) (Aug. 2010), 27–33.
[12] Fitzpatrick, E., Bachenko, J. and Fornaciari, T. 2015. Automatic Detection of Verbal Deception. Synthesis Lectures on Human Language Technologies. 8, 3 (Sep. 2015), 1–119.
[13] Former Enron CEO Jeffrey Skilling Resentenced to 168 Months for Fraud, Conspiracy Charges: 2013. https://www.justice.gov/opa/pr/former-enron-ceo-jeffrey-skilling-resentenced-168-months-fraud-conspiracy-charges. Accessed: 2017-04-02.
[14] Hancock, J.T., Curry, L.E., Goorha, S. and Woodworth, M. 2007. On Lying and Being Lied To: A Linguistic Analysis of Deception in Computer-Mediated Communication. Discourse Processes. 45, 1 (Dec. 2007), 1–23.
[15] ISIS has mastered a crucial recruiting tactic no terrorist group has ever conquered: 2015. http://www.businessinsider.com/isis-is-revolutionizing-international-terrorism-2015-5. Accessed: 2016-03-16.
[16] Keila, P.S. and Skillicorn, D.B. 2005. Detecting Unusual Email Communication. Proceedings of the 2005 Conference of the Centre for Advanced Studies on Collaborative Research (Toronto, Ontario, Canada, 2005), 117–125.
[17] Matykiewicz, P. and Pestian, J. 2012. Effect of Small Sample Size on Text Categorization with Support Vector Machines. Proceedings of the 2012 Workshop on Biomedical Natural Language Processing (Stroudsburg, PA, USA, 2012), 193–201.
[18] Mbaziira, A. and Jones, J. 2016. A Text-based Deception Detection Model for Cybercrime. International Conference on Technology and Management. (Jul. 2016).
[19] Newman, M.L., Pennebaker, J.W., Berry, D.S. and Richards, J.M. 2003. Lying words: predicting deception from linguistic styles. Personality & Social Psychology Bulletin. 29, 5 (May 2003), 665–675.
[20] Ott, M., Choi, Y., Cardie, C. and Hancock, J.T. 2011. Finding Deceptive Opinion Spam by Any Stretch of the Imagination. Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1 (Stroudsburg, PA, USA, 2011), 309–319.
[21] Pearl, L. and Steyvers, M. 2012. Detecting authorship deception: a supervised machine learning approach using author writeprints. LLC. 27, (2012), 183–196.
[22] Reynolds, K., Kontostathis, A. and Edwards, L. 2011. Using Machine Learning to Detect Cyberbullying. 2011 10th International Conference on Machine Learning and Applications and Workshops (ICMLA) (Dec. 2011), 241–244.
[23] Sarvari, H., Abozinadah, E., Mbaziira, A. and McCoy, D. 2014. Constructing and Analyzing Criminal Networks. IEEE Security and Privacy Workshops. (2014), 8.
[24] Shojaee, S., Murad, M.A.A., Azman, A.B., Sharef, N.M. and Nadali, S. 2013. Detecting deceptive reviews using lexical and syntactic features. 2013 13th International Conference on Intelligent Systems Design and Applications (ISDA) (Dec. 2013), 53–58.
[25] Tan, P.-N., Steinbach, M. and Kumar, V. 2014. Introduction to Data Mining. Dorling Kindersley.
[26] Tausczik, Y.R. and Pennebaker, J.W. 2010. The Psychological Meaning of Words: LIWC and Computerized Text Analysis Methods. Journal of Language and Social Psychology. 29, 1 (Mar. 2010), 24–54.
[27] The digital language divide: 2014. http://labs.theguardian.com/digital-language-divide/. Accessed: 2016-11-27.
[28] Zhou, L., Burgoon, J.K., Twitchell, D.P., Qin, T. and Nunamaker, J.F. 2004. A Comparison of Classification Methods for Predicting Deception in Computer-Mediated Communication. Journal of Management Information Systems. 20, 4 (2004), 139–165.