Recognize User Identity in Twitter Social Networks via Text Mining
Sara Keretna Centre for Intelligent Systems Research Deakin University Australia
[email protected]
Ahmad Hossny Centre for Intelligent Systems Research Deakin University Australia
[email protected]
Abstract—Social networks have recently become very influential. Many people use them to communicate and to lead and manage mass activity, either in support of or in opposition to different causes. This raises the issue of verifying the owners of social accounts in order to eliminate the effect of fake accounts on the public. The proposed research aims to distinguish an original account from a fake account using the writeprint, a biometric based on writing style. We define a set of features that are extracted using text mining techniques and then used to train a supervised machine learning model that builds the knowledge base. Recognition starts by extracting the same features, then measures the similarity of the resulting feature vector to all feature vectors in the knowledge base and takes the most similar vector as the verified account.
Keywords-text mining; identity recognition; social networks; machine learning
I. INTRODUCTION
Recognizing the identity of users in social networks is a growing concern because of the substantial increase in the number of fake accounts, parody accounts and accounts used for hate speech and violence. This problem raises moral and legal issues that affect the social life of innocent people. Celebrities in media and politics, for example, are regular victims of electronic identity framing. The growing use of social networks has aggravated the problem because fake accounts are so simple to create. Many studies attempt to introduce new methods for identity recognition to overcome this problem. Research in ‘identity recognition’ intersects with many fields of science, including biometrics, text mining, pattern recognition and social networks. Biometric techniques identify people from their physical or behavioral features. The current research combines biometric concepts with text mining techniques to identify people using features extracted from their own text. We will run our experiments on user text collected from social networks. There are various types of social networks. Some are general purpose and allow their users to share different media types such as text, images and videos. Facebook, Twitter, Ning and Google+ are examples of general-purpose social
978-1-4799-0652-9/13/$31.00 ©2013 IEEE
Doug Creighton Centre for Intelligent Systems Research Deakin University Australia
[email protected]
networks. Other social networks are special purpose, such as Instagram and Flickr, which are used for sharing images. LinkedIn is also a special-purpose social network, specialized in business communications. Social networks are considered a rich source of unstructured digital data that is growing at a substantial rate.
Although writing styles vary from one person to another, every person has a unique writing style. This style is considered a biometric, or text-print. The challenge is to define the features that differentiate between writing styles. As people frequently communicate through written text, we propose to build an AI-based system that learns the writing style of a user from text extracted from his emails, social networks or text messages, and then uses the learned features that represent his text-print as a reference to recognize it later. This paper aims to detect fake identities of social network users by comparing a claimed text against the original text of the user. We propose a technique that analyses short messages from Twitter and extracts linguistic features that can distinguish between the writing styles of different users.
The rest of this paper is structured as follows: Section II discusses related work. Section III explains the proposed system, including the experimental design. Section IV closes with conclusions.
II. BACKGROUND AND LITERATURE REVIEW
Writing style differs from one person to another. It is affected by factors such as a person’s culture, educational background and the environment in which he was raised. By extracting the right set of features and applying classification techniques, a person can be automatically identified from the text he wrote. The process of extracting linguistic features from anonymous text with the aim of identifying its author is called ‘writeprint identification’ [1]. Writeprint identification is a relatively recent area of study that has attracted more interest with the increasing use of social networks. The term ‘writeprint’ was first introduced in 2006 by Li [2]; the task was formerly known as ‘author identification’. Researchers use writeprint identification to analyze different types of unstructured text such as emails [3-5], online product reviews [6-8], news columns [1, 9], and text messages and chat messages [10, 11].
Short texts such as instant messages and chat messages pose greater challenges for writeprint identification, because fewer features can be extracted from a short text than from a long one such as a news article. Furthermore, short text is usually informal, which means that it is likely to contain spelling mistakes and poorly structured sentences.
Author identification is a classic classification problem. It depends on selecting relevant features of the data using feature extraction techniques. The extracted features are used as input to the learning module, which produces a classification model as output. That model is used to classify anonymous text into the relevant category which, in this case, is the author name. Text classification techniques are divided into three types: rule-based, statistics-based and machine learning techniques. Each of these can be applied at the lexical, morphological, syntactic or semantic level, as shown in Figure 1. Text classification sometimes uses a hybrid of techniques, depending on the nature of the problem. If the domain of the text is limited or constrained, the text is easier to analyse.
Figure 1. Text classification techniques are divided into three categories: rule based, statistics based and machine learning based. Each of these categories can be implemented at the lexical level, the morphological level, the syntactic level or the semantic level of the text being analyzed.
III. PROPOSED METHODOLOGY
In this paper, we propose a technique for writeprint (authorship) identification that analyzes short anonymous text extracted from Twitter. We use a supervised machine learning technique to learn the features that identify every user; these features are extracted from a training dataset collected from different Twitter accounts.
A. Data Collection
To collect information from social networks, we need a data crawler. The crawler used in our experiment is Python Twitter Tools (PTT) [12], a software tool implemented in the Python programming language. It provides a method that takes a Twitter account name as input and outputs a list of all tweets from that account. For our experiment, we will use the crawler to collect 1000 Twitter messages from 30 different accounts. The accounts will be selected from celebrities and popular figures. Each group of 10 accounts will come from a different domain so that the differences in the results can be analyzed, as shown in Table I.

TABLE I. THE NUMBER OF TWITTER ACCOUNTS THAT WILL BE USED IN OUR EXPERIMENT. THE TABLE SHOWS THE NUMBER OF ACCOUNTS THAT BELONG TO MALES AND FEMALES FOR EACH SELECTED DOMAIN
          Politicians   Writers   Actors   Total
Male      5             5         5        15
Female    5             5         5        15
Total     10            10        10       30
B. Preprocessing
Because tweets have a special nature, some preprocessing is required. For example, Twitter does not support embedding pictures, audio or video in its tweets, but it supports hosting them externally and attaching a link to such media in the text message. This requires either excluding URL attachments or parsing them to determine their nature. Another issue is that messages are limited to 140 characters. Some people deal with this limit by using abbreviations, such as “NY” instead of “New York”. Others shorten words by removing vowels, for example writing “story” as “stry” and relying on the reader to recover the intended term. A further issue is the lack of punctuation and the use of special characters such as hashtags. All of these issues should be handled before the training and recognition processes.
C. Feature Extraction
Feature extraction plays a crucial role in the classification process. Selecting the right set of features can significantly affect the learning process and the final results. We selected a set of features that we believe are relevant to our problem. Extracting some of these features requires a part-of-speech tagger; we will use the Stanford POS tagger [13] for this task. The selected features are listed in Table II.

TABLE II. FEATURES EXTRACTED FROM TWEETS TO BE USED IN THE TRAINING PHASE
Feature Name      Description
N_hash_word       Percentage of # tags used
N_mentions        Percentage of @ tags used
Ex_link           Percentage of external links used
Is_pic_exist      1 if images are linked to the Twitter message, 0 otherwise
N_words           Total number of words in the message
Is_abbreviate     Does the message use abbreviations
Is_rem_vowels     Does the message contain summarized forms of words
NE_types          What named entities are used in the message
N_noun            Number of nouns used
N_verb            Number of verbs used
N_participle      Number of participles used
N_interjection    Number of interjections used
N_pronoun         Number of pronouns used
N_preposition     Number of prepositions used
N_adverb          Number of adverbs used
N_conjunction     Number of conjunctions used
Freq_words        Top five most frequently used words
N_special_chars   Number of symbols, for example _
N_capital         Number of capital letters
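The preprocessing described above, separating links, mentions and hashtags from the plain text before features are computed, might be sketched as follows. This is a minimal illustration and not the authors' implementation: the function names `split_tweet` and `has_picture`, the regular expressions, and the rule that a `pic.twitter.com` link indicates an attached picture are all assumptions made for the example.

```python
import re

# Assumed patterns for this sketch; real tweets may need a more robust tokenizer.
# A link is either an explicit http(s)/www URL or a bare "domain.tld/path" token.
LINK_RE = re.compile(r"(?:https?://|www\.)\S+|\b[\w-]+(?:\.[\w-]+)+/\S+", re.IGNORECASE)
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#\w+")


def split_tweet(tweet):
    """Separate links, mentions and hashtags from the plain text of a tweet."""
    links = LINK_RE.findall(tweet)
    rest = LINK_RE.sub(" ", tweet)
    mentions = MENTION_RE.findall(rest)
    rest = MENTION_RE.sub(" ", rest)
    hashtags = HASHTAG_RE.findall(rest)
    rest = HASHTAG_RE.sub(" ", rest)
    return {
        "links": links,
        "mentions": mentions,
        "hashtags": hashtags,
        "text": " ".join(rest.split()),  # collapse the gaps left by removed tokens
    }


def has_picture(links):
    """Assumption: a link hosted on pic.twitter.com points to an attached picture."""
    return any(link.lower().startswith("pic.twitter.com") for link in links)


if __name__ == "__main__":
    # Hypothetical tweet using the abbreviation and vowel-dropping habits mentioned above.
    parts = split_tweet("Gr8 stry from NY #news @bob http://t.co/xyz")
    print(parts)
    print(has_picture(parts["links"]))
```

Keeping the removed tokens in separate lists, rather than discarding them, lets later steps count them as features while the word-level features are computed on the remaining plain text only.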
D. Learning
The learning process generates a model that helps classify anonymous text. The input to this process is a set of Twitter messages labeled with the user name; these messages are used to train the machine learning model. The learning process is illustrated in Figure 2. A set of features is then extracted for each user into a feature vector. The feature vector is sent to the learning module, which generates a classification model to be used for classifying anonymous text. The learning module will be built on inductive logic programming, where the output model is a set of rules that guide the classifier to the message author.

Figure 2. The feature extraction process starts from the crawler, which crawls the internet to get all tweets for the given users. The tweets then go to the ‘Feature Extractor’, where the features of the analyzed tweets of each user are represented in a feature vector. Feature vectors for all users go through a training module that tries to find a functional representation that will later help us classify anonymous tweets back to their user.

The data used for learning will be a sample of the data extracted using our Python crawler. We will start the training process with samples of 50, 100, 150 and then 200 tweets. The results will then be compared to determine which sample size achieves the best accuracy.

Example: a tweet from David Cameron on 13 May:
Doing a US phone-in ahead of my meeting with @BarackObama @Whitehouse. Plenty to discuss - will keep you updated pic.twitter.com/vCaoP9kwZI

TABLE III. FEATURES EXTRACTED FROM TWEETS TO BE USED FOR THE TRAINING PHASE
Feature           Value       Description
N_hash_word       0           No # tags in this tweet
N_mentions        2           @whitehouse, @BarackObama
Ex_link           1           pic.twitter.com/vCaoP9kwZI
Is_pic_exist      1           pic.twitter.com/vCaoP9kwZI
N_words           16          @s, links and symbols are not counted
Is_abbreviate     1           US
Is_rem_vowels     0           Words are all spelled correctly
NE_types          [Country]   US
N_noun            3           US, phone-in, meeting
N_verb            4           Discuss, Keep, Updated, Plenty
N_participle      1           Doing
N_interjection    0
N_pronoun         2           My, You
N_preposition     2           Of, with
N_adverb          1           Ahead
N_conjunction     0
N_special_chars   2           '-' in 'phone-in' and in 'discuss - will'
N_capital         4           Capital letters in mentions and links are not counted
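As a rough check of the surface features reported in Table III, the sketch below recomputes the count-based values for the example tweet. It is an illustration under stated assumptions, not the authors' extractor: the function name `surface_features`, the regular expressions, and the decision to ignore sentence punctuation when counting special characters are hypothetical choices. The tweet is written with the dash that the N_special_chars row of Table III refers to, and the features are treated as per-message counts, as in Table III. Features that need a part-of-speech tagger or a named-entity recognizer are left out.

```python
import re

# Assumed patterns; links are explicit URLs or bare "domain.tld/path" tokens.
LINK_RE = re.compile(r"(?:https?://|www\.)\S+|\b[\w-]+(?:\.[\w-]+)+/\S+", re.IGNORECASE)
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#\w+")
WORD_RE = re.compile(r"[A-Za-z]+(?:[-'][A-Za-z]+)*")  # keeps "phone-in" as one word
SENTENCE_PUNCTUATION = ".,!?;:'\""  # assumption: not counted as special characters


def surface_features(tweet):
    """Count-based features of one tweet; mentions, hashtags and links are removed first."""
    links = LINK_RE.findall(tweet)
    rest = LINK_RE.sub(" ", tweet)
    mentions = MENTION_RE.findall(rest)
    rest = MENTION_RE.sub(" ", rest)
    hashtags = HASHTAG_RE.findall(rest)
    rest = HASHTAG_RE.sub(" ", rest)
    specials = [c for c in rest
                if not c.isalnum() and not c.isspace() and c not in SENTENCE_PUNCTUATION]
    return {
        "N_hash_word": len(hashtags),
        "N_mentions": len(mentions),
        "Ex_link": len(links),
        "Is_pic_exist": int(any(l.lower().startswith("pic.twitter.com") for l in links)),
        "N_words": len(WORD_RE.findall(rest)),
        "N_special_chars": len(specials),
        "N_capital": sum(c.isupper() for c in rest),
    }


EXAMPLE_TWEET = ("Doing a US phone-in ahead of my meeting with @BarackObama @Whitehouse. "
                 "Plenty to discuss - will keep you updated pic.twitter.com/vCaoP9kwZI")

if __name__ == "__main__":
    for name, value in surface_features(EXAMPLE_TWEET).items():
        print(name, value)
```

Under these assumptions the counts agree with the corresponding rows of Table III (0 hashtags, 2 mentions, 1 link, 1 picture, 16 words, 2 special characters, 4 capital letters).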
E. Classification
The classification process starts when an unlabeled Twitter message is given as input. Features are extracted from the unlabeled message into a feature vector, which is passed to the classifier built in the learning stage. The output of the classifier is the suggested author of the input message. The classification process is illustrated in Figure 3. The classifier measures the similarity between feature vectors using the Jaccard coefficient, which divides the size of the intersection of the two feature vectors by the size of their union, as shown in Equation 1, where A and B denote the two feature vectors being compared.
$$J(A, B) = \frac{|A \cap B|}{|A \cup B|} \qquad (1)$$
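The similarity matching in Equation 1 can be illustrated with a short sketch. The paper does not specify how a feature vector is turned into a set, so this example assumes each vector is treated as a set of (feature name, value) pairs; the function names, the author labels and the toy values are hypothetical.

```python
def jaccard(a, b):
    """Jaccard coefficient |A ∩ B| / |A ∪ B| of two sets (0.0 if both are empty)."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def as_item_set(feature_vector):
    """Assumption: a feature vector is compared as a set of (name, value) pairs."""
    return set(feature_vector.items())


def most_similar_author(query, knowledge_base):
    """Return the author whose stored vector has the highest Jaccard score with the query."""
    scores = {
        author: jaccard(as_item_set(query), as_item_set(vector))
        for author, vector in knowledge_base.items()
    }
    best = max(scores, key=scores.get)
    return best, scores[best]


if __name__ == "__main__":
    # Toy knowledge base with made-up values, for illustration only.
    knowledge_base = {
        "author_a": {"N_mentions": 2, "Ex_link": 1, "N_hash_word": 0},
        "author_b": {"N_mentions": 0, "Ex_link": 1, "N_hash_word": 3},
    }
    query = {"N_mentions": 2, "Ex_link": 1, "N_hash_word": 1}
    print(most_similar_author(query, knowledge_base))
```

With exact-match pairs, two numeric values that are close but not equal contribute nothing to the intersection, so a practical system would probably bin or discretize numeric features before comparing them.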
Figure 3. The classification process starts with an anonymous tweet as input. The feature vector of this tweet is extracted in the ‘Feature Extractor’. The classifier uses the functional representation saved in the database by the learning module to suggest an identity that matches the input tweet.
IV. CONCLUSION AND FUTURE WORK
We believe that the writing style of each person has unique linguistic features. Being able to extract these features will greatly increase the chances of identifying fake accounts in social networks, which will help to counter malicious activities against social network users. In this paper, we introduced a new idea for recognizing identities in social networks from text. Using Twitter messages as our case study, we extract a set of features suited to the special nature of Twitter, which limits messages to 140 characters and allows pictures or videos only through external links. The preliminary results are very promising, and the next step is to apply this technique to thousands of accounts and check whether the much larger number of users decreases the accuracy of the similarity measure.
References
[1] Z. Liu, Z. Yang, S. Liu, and Y. Shi, "Semi-random subspace method for writeprint identification," Neurocomputing, vol. 108, pp. 93–102, May 2013.
[2] J. Li, R. Zheng, and H. Chen, "From fingerprint to writeprint," Communications of the ACM - Supporting exploratory search, pp. 76–82, April 2006. Available: http://dl.acm.org/citation.cfm?id=1121951
[3] F. Iqbal, H. Binsalleeh, B. C. M. Fung, and M. Debbabi, "Mining writeprints from anonymous emails for forensic investigation," Digital Investigation, vol. 7, pp. 56–64, Oct 2010.
[4] M. Khonji, Y. Iraqi, and A. Jones, "Mitigation of spear phishing attacks: A Content-based Authorship Identification framework," in 6th International Conference on Internet Technology and Secured Transactions, Abu Dhabi, United Arab Emirates, 2011, pp. 416–421.
[5] F. Iqbal, R. Hadjidj, B. C. M. Fung, and M. Debbabi, "A novel approach of mining write-prints for authorship attribution in e-mail forensics," in the Eighth Annual DFRWS Conference, Baltimore, MD, 2008, pp. S42–S51.
[6] J. Sun, Z. Yang, P. Wang, and S. Liu, "Variable Length Character N-Gram Approach for Online Writeprint Identification," in 2010 International Conference on Multimedia Information Networking and Security (MINES), Nanjing, Jiangsu, 2010, pp. 486–490.
[7] J. Sun, Z. Yang, P. Wang, L. Liu, and S. Liu, "Feature Selection for Online Writeprint Identification Using Hybrid Genetic Algorithm," in 2010 International Symposium on Computational Intelligence and Design (ISCID), Hangzhou, 2010, pp. 76–79.
[8] S. Liu, Z. Liu, J. Sun, and L. Liu, "A Method of Online Writeprint Identification Based on Principal Component Analysis," in 2010 International Symposium on Information Science and Engineering (ISISE), Shanghai, 2010, pp. 319–321.
[9] E. F. Legara, C. Monterola, and C. Abundo, "Ranking of predictor variables based on effect size criterion provides an accurate means of automatically classifying opinion column articles," Physica A: Statistical Mechanics and its Applications, vol. 390, pp. 110–119, Jan 2011.
[10] F. Iqbal, H. Binsalleeh, B. C. M. Fung, and M. Debbabi, "A unified data mining solution for authorship analysis in anonymous textual communications," Information Sciences, vol. 231, pp. 98–112, May 2013.
[11] T. Kucukyilmaz, B. B. Cambazoglu, C. Aykanat, and F. Can, "Chat mining: Predicting user and message attributes in computer-mediated communication," Information Processing & Management, vol. 44, pp. 1448–1466, July 2008.
[12] M. Verdone, "Python Twitter Tools (PTT)," 1.9.4 ed, 2013.
[13] K. Toutanova. (2000). Stanford Log-linear Part-Of-Speech Tagger. Available: http://nlp.stanford.edu/software/tagger.shtml