arXiv:1708.06068v1 [cs.CL] 21 Aug 2017
Vector Space Model as Cognitive Space for Text Classification
Barathi Ganesh HB, Anand Kumar M and Soman KP
Center for Computational Engineering and Networking, Amrita School of Engineering, Coimbatore, Amrita Vishwa Vidyapeetham, Amrita University, India
Abstract
In this era of digitization, knowing a user's sociolect aspects has become essential for building user-specific recommendation systems. These sociolect aspects can be found by mining the language the user shares as text in social media and reviews. This paper describes the experiment performed for the PAN Author Profiling 2017 shared task. The objective of the task is to find the sociolect aspects of users from their tweets. The sociolect aspects considered in this experiment are the user's gender and native language. The users' tweets, written in a language different from their native language, are represented as a Document-Term Matrix with document frequency as the constraint. Classification is then performed with a Support Vector Machine, taking gender and native language as the target classes. This experiment attains an average accuracy of 73.42% in gender prediction and 76.26% in the native language identification task.
1 Introduction
User profiling is an indirect crowdsourcing task that collects users' sociolect aspects, such as age, gender and language variation, from the language they share (i.e. tweets, Facebook data and reviews)¹ ². Knowing a user's sociolect aspects paves the way to improving recommendation systems, targeted internet advertising, consumer behavior analysis and forensic science. This work focuses on mining sociolect aspects, namely gender and language variation (native language), from a user's (author's) tweets, taking their word usage as evidence [1][8]. The PAN 2017 author profiling task is to predict an author's gender and language variation from 100 of their tweets [1]. The task can be viewed as a text classification problem in which the target classes are the author's sociolect aspects and the texts are their tweets.

The primary components of a text classification problem are representation, feature learning and classification. Here we treat the Vector Space Model (VSM) as a cognitive space⁴ that captures the author's cognition³, through the Document-Term Matrix (DTM) representation and feature learning methods. Gender and language variation are then predicted with Support Vector Machine (SVM) based classification.

Most recent research focuses on representing the context of the text through distributional and distributed text representation methods [2][3][4]. This context representation of text alone will not contribute much towards the prediction of an author's sociolect aspects, since these also depend on the author's sources of cognitive knowledge, such as education, cultural background, age group and working domain [5]. In this author profiling task, "how" an author expresses the shared language has more impact than "what" they actually share [5]. Based on this observation, the Document-Term Matrix (DTM) is used as the representation method, in which the fundamental features are the words used by the users and their frequencies.

¹ http://pan.webis.de/clef17/pan17-web/author-profiling.html
² http://ttg.uni-saarland.de/resources/DSLCC/
³ en.wikipedia.org/wiki/Cognition
⁴ dictionary.sensagent.com/cognitive%20space/en-en/

Figure 1: Word frequency used by the users with respect to the gender
Figure 2: Word frequency used by the users with respect to the native language

Figures 1 and 2 show how word frequency varies with the gender and native language of the users. It can be observed that the choice of words and their frequencies vary with the user's sociolect aspects. In Figure 1, the frequency of the words used by female users is noticeably greater than that of the words used by male users. As noted above, this cognitive nature also applies to native language variation. We have experimented with the VSM to model this cognitive ability. The words in Figures 1 and 2 were obtained by taking the words common to all target classes from the list of the 20 most frequent words in the final model. Figures 1 and 2 are plotted for English; the same applies to the other languages.
2 Text Representation - Vector Space Model
The Vector Space Model is built from the documents provided. This set of documents is processed to find its equivalent optimized numerical representation in matrix form.

$$ D = d_1, d_2, d_3, \ldots, d_n \tag{1} $$
Here d represents a sentence or document (here, tweets), D represents the equivalent matrix form (here, the DTM) and n is the total number of documents. The commonly used methods under the VSM are the Document-Term Matrix (DTM) and the Term Frequency - Inverse Document Frequency (TF-IDF) matrix. These methods are generally referred to as Bag of Words (BoW) methods and follow the bag-of-words hypothesis [6]. In our methodology TF-IDF is not used, because control over document frequency and term significance is modeled through the minimum document frequency.

Given a set of texts as in Equation 1, the function f() converts it into the Document-Term Matrix (DTM). In this application a document is a user's tweet collection, hence the matrix can be viewed as a user-term matrix. The function f() measures the event e, where e is the occurrence of terms in the texts. A term may be a word, a phrase (n-gram) or both. This can be represented as

$$ D = f(e) \quad \text{if } e \geq k \tag{2} $$

where D is the DTM of size m × n, m is the number of users, k is the minimum document frequency and n is the total number of unique words or phrases (types) present in the document set. A word with document frequency less than k is not included in the DTM.
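As a rough illustration of Equation 2, the following minimal sketch builds a user-term count matrix from whitespace-tokenized texts and keeps only the terms whose document frequency is at least k. The function name `build_dtm`, the toy texts and the value of k are assumptions made for illustration; they are not taken from the experiment.

```python
from collections import Counter


def build_dtm(documents, k):
    """Return (vocabulary, matrix), keeping terms with document frequency >= k."""
    tokenized = [doc.split() for doc in documents]  # whitespace tokenization

    # Document frequency: number of documents in which each term occurs.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    # Vocabulary constrained by the minimum document frequency k.
    vocabulary = sorted(term for term, df in doc_freq.items() if df >= k)
    index = {term: j for j, term in enumerate(vocabulary)}

    # One row per document (user), one column per retained term.
    matrix = [[0] * len(vocabulary) for _ in tokenized]
    for i, tokens in enumerate(tokenized):
        for term in tokens:
            j = index.get(term)
            if j is not None:
                matrix[i][j] += 1
    return vocabulary, matrix


# Hypothetical toy "users", each represented by one concatenated text.
users = [
    "good morning good people",
    "good night people",
    "morning coffee",
]
vocabulary, dtm = build_dtm(users, k=2)
print(vocabulary)
print(dtm)
```

In this toy example the words "night" and "coffee" each occur in a single document, so they are dropped from the vocabulary when k = 2.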
3 Feature Learning
Feature learning produces an abstract-level representation of the original text content. It improves model performance by filtering out unwanted information and reduces the computation time required to build the model. In this work, the frequency of a word across documents (document frequency) is used as the threshold that limits the number of unique words in the vocabulary used to build the user-term matrix [6]. Because the terms in the vocabulary are constrained by the document frequency, the resulting matrix can be viewed as a user-feature matrix. In this matrix each row refers to a user and each column refers to a word used by the users. Table 1 compares the number of words in the user-term matrix and in the user-feature matrix.

Language      User - Term Matrix    User - Feature Matrix
English       322412                14074
Spanish       420563                6038
Portuguese    98490                 20474
Arabic        333646                17773

Table 1: Feature Scaling
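The reduction summarized in Table 1 can be reproduced in miniature with scikit-learn's `CountVectorizer`, whose `min_df` parameter sets the minimum document frequency. The sketch below is only an illustration on a hypothetical four-user corpus; the `token_pattern` value is an assumption chosen to mimic whitespace tokenization, not a setting reported for the experiment.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus: one concatenated text per user.
users = [
    "good morning good people",
    "good night people",
    "morning coffee",
    "coffee people",
]

# token_pattern=r"\S+" treats every whitespace-delimited string as a token.
term_vectorizer = CountVectorizer(token_pattern=r"\S+", min_df=1)
feature_vectorizer = CountVectorizer(token_pattern=r"\S+", min_df=2)

user_term = term_vectorizer.fit_transform(users)
user_feature = feature_vectorizer.fit_transform(users)

print(user_term.shape)     # (number of users, all distinct terms)
print(user_feature.shape)  # (number of users, terms used by >= 2 users)
print(sorted(feature_vectorizer.vocabulary_))
```

Only the column count changes between the two matrices; the number of rows (users) stays the same.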
4 Classification
This experiment focuses on the representation of texts rather than on classification. Therefore the user-feature matrix, along with its corresponding target classes, is used to train a Support Vector Machine (SVM) classifier with a Radial Basis Function (RBF) kernel. SVM is a well-known non-probabilistic classification algorithm that is often used with VSM-based representations [7]. The SVM from the Scikit-learn library is used directly with its default parameters to build the classifier (i.e. C=1.0, degree=3, gamma='auto', kernel='rbf').
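A minimal sketch of this classifier configuration is given below. The feature matrix and labels are hypothetical toy values, not data from the experiment. The parameters are passed explicitly because `gamma='auto'` is no longer the library default in recent scikit-learn releases.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical user-feature matrix (rows: users, columns: word counts)
# and hypothetical gender labels, for illustration only.
X = np.array([
    [3, 0, 1, 0],
    [2, 1, 0, 0],
    [4, 0, 2, 1],
    [0, 3, 0, 2],
    [1, 4, 0, 3],
    [0, 2, 1, 4],
])
y = np.array(["female", "female", "female", "male", "male", "male"])

# Parameters as listed in the text. gamma is set explicitly because the
# library default changed from 'auto' to 'scale' in later releases.
# degree only affects the polynomial kernel and is ignored by 'rbf'.
classifier = SVC(C=1.0, kernel="rbf", degree=3, gamma="auto")
classifier.fit(X, y)

predictions = classifier.predict(X)
print(predictions)
```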
5 Experiments and Observations
This section details the statistics of the given training data set, the experimental system and its outcomes. The experiment was conducted on a machine with an Intel i7 processor and 16 GB of RAM. The data set contains 100 tweets per author, and the tweets are written in 4 different languages. Within each of these languages, the native language of the users also varies. Detailed statistics of the data set are given in the PAN 2017 Author Profiling overview paper [1]. The number of native language variations is shown in Table 2.

Language      # Native Language Classes    Total # Tweets
English       6                            6,000
Spanish       7                            7,000
Portuguese    2                            2,000
Arabic        4                            4,000

Table 2: Data-set statistics: Classes

The 100 tweets of each author are embedded in an XML file and are provided with a truth file that contains the gender and language variation class of the corresponding author. The XML files are parsed with the xml Python library⁵ as a document object model and the tweets are extracted from it. The 100 tweets per author are then concatenated to form a single document per author. Tokenization is performed with white space as the delimiter, and the CountVectorizer from the scikit-learn Python library⁶ is used to build the DTM. Here each row (document) corresponds to a user. The User-Feature Matrix with the class tags is given to the SVM with a Radial Basis Function (RBF) kernel and C = 1 to build the classification models. Two independent classification models are built, one for gender prediction and one for the identification of language variation. The SVM from the scikit-learn Python library⁷ is used to build the classifier. To assess the performance, 10-fold cross-validation is performed and the average of the 10 validation accuracies is taken as the final accuracy. This can be represented as

$$ \text{Accuracy} = \frac{\text{correctly predicted authors}}{\text{total authors}} \tag{3} $$

$$ \text{Average Accuracy} = \frac{\sum_{i=1}^{10} \text{Accuracy}_i}{10} \tag{4} $$
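The extraction, concatenation and tokenization steps described in this section can be sketched with the standard-library XML parser as follows. The XML layout (an `author` root with `document` elements holding one tweet each) and the function name `author_document` are assumptions made for illustration; the actual PAN 2017 files may be structured differently.

```python
import xml.etree.ElementTree as ET

# Hypothetical author file; the real PAN 2017 layout may differ.
AUTHOR_XML = """<author lang="en">
  <documents>
    <document><![CDATA[good morning people]]></document>
    <document><![CDATA[coffee   first]]></document>
  </documents>
</author>"""


def author_document(xml_text):
    """Concatenate all tweets of one author into a single document."""
    root = ET.fromstring(xml_text)
    tweets = [node.text or "" for node in root.iter("document")]
    return " ".join(tweets)


document = author_document(AUTHOR_XML)
tokens = document.split()  # white space as the delimiter
print(document)
print(tokens)
```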
The experiment described above was conducted with the minimum document frequency varying between 2 and 25 while representing the tweets as a User-Feature Matrix. The final classification model is built by selecting the minimum document frequency that gives the highest accuracy across all four languages. Here the minimum document frequency acts as a constraint on the selection of words used to build the vocabulary. It is the minimum number of documents in which a word must appear (i.e. the minimum number of users who used the given word). The performance of the models against the minimum document frequency is given in Figures 3 and 4. It can be observed that the accuracy of the models remains nearly constant for minimum document frequencies greater than 4. Based on this observation, the final model in this experiment is built with a minimum document frequency of 10.
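The selection procedure can be sketched as a loop over candidate minimum document frequencies, scoring each with 10-fold cross-validation and averaging the fold accuracies. The corpus below is a hypothetical toy set, the sweep is shortened to 2 to 5, and the use of `make_pipeline` and `cross_val_score` is an assumption about how such a sweep might be written, not a description of the original code.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Hypothetical toy corpus: 20 "users" per class, one document each.
female_docs = ["lovely cute happy day user%d" % i for i in range(20)]
male_docs = ["match game score win user%d" % i for i in range(20)]
documents = female_docs + male_docs
labels = ["female"] * 20 + ["male"] * 20

mean_accuracy = {}
for min_df in range(2, 6):  # the text sweeps 2..25; shortened for the toy data
    model = make_pipeline(
        CountVectorizer(token_pattern=r"\S+", min_df=min_df),
        SVC(C=1.0, kernel="rbf", gamma="auto"),
    )
    # cv=10 with a classifier uses stratified 10-fold splits; each score is
    # the accuracy of one fold and the mean is the average accuracy.
    scores = cross_val_score(model, documents, labels, cv=10, scoring="accuracy")
    mean_accuracy[min_df] = scores.mean()

best_min_df = max(mean_accuracy, key=mean_accuracy.get)
print(mean_accuracy)
print(best_min_df)
```

Placing the vectorizer inside the pipeline means the vocabulary is rebuilt from the training folds only, so the held-out fold does not influence which words are kept.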
Figure 3: Accuracy of the model with respect to the minimum document frequency

Figure 4: Accuracy of the model with respect to the minimum document frequency

⁵ docs.python.org/2/library/xml.etree.elementtree.html
⁶ scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
⁷ scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html
As the minimum document frequency increases, the size of the User-Feature Matrix is reduced, because fewer words form the vocabulary. This has a significant impact on the time taken to build the classification model, as shown in Figures 5 and 6.
Figure 5: Time taken to build the model with respect to the minimum document frequency
Figure 6: Time taken to build the model with respect to the minimum document frequency

The User-Feature Matrix of the test data is built against the vocabulary of the selected classification model, and thereafter gender and language variation are predicted. The results obtained on the test data were evaluated by the PAN 2017 Author Profiling shared task committee and are shown in Table 3⁸. It can be observed that, in native language identification, the highest performance was attained for Portuguese and the lowest for English. We attribute this to the number of native language variations within each language (Table 2). On the whole, averaged over the four languages, this experiment attains an accuracy above 70% in both tasks (73.42% for gender and 76.26% for native language).

⁸ http://pan.webis.de/clef17/pan17-web/author-profiling.html

            Gender                  Native Language
Language    Max     Min     Ours    Max     Min     Ours
En          0.82    0.54    0.78    0.90    0.19    0.60
Es          0.83    0.64    0.72    0.96    0.35    0.77
Pt          0.87    0.61    0.75    0.99    0.91    0.97
Ar          0.80    0.59    0.68    0.83    0.45    0.71

Table 3: Results Statistics
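Building the test matrix against the training vocabulary can be sketched as below: the vectorizer is fitted on the training documents only, and `transform` then maps the test documents onto the same columns. The documents, labels and parameter values are hypothetical and serve only to illustrate the step.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import SVC

# Hypothetical training and test documents (one per user).
train_docs = [
    "lovely happy day", "happy lovely morning", "lovely day today",
    "match score win", "win the match", "score match tonight",
]
train_labels = ["female", "female", "female", "male", "male", "male"]
test_docs = ["lovely unseen words day", "another match win"]

vectorizer = CountVectorizer(token_pattern=r"\S+", min_df=2)
X_train = vectorizer.fit_transform(train_docs)  # learns the vocabulary

# transform() reuses the training vocabulary; unseen test words are dropped.
X_test = vectorizer.transform(test_docs)

classifier = SVC(C=1.0, kernel="rbf", gamma="auto").fit(X_train, train_labels)
test_predictions = classifier.predict(X_test)
print(sorted(vectorizer.vocabulary_))
print(X_test.shape)
print(test_predictions)
```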
6 Conclusion
The word usage of the users has been represented in the cognitive space by a Document-Term Matrix as the Vector Space Model. Taking word usage as features, classification has been carried out using a Support Vector Machine. The observed results attain an average accuracy above 70% in both tasks across the different languages. Under the assumptions made, these preliminary results are satisfactory and suggest that the Vector Space Model can be considered as the user's cognitive space. The accuracy of the system may be improved by representing the tweets using Vector Space Models of Semantics (distributional and distributed representation methods). Our future work will derive Vector Space Models of Semantics representations for modelling the user's cognitive space from the representation method used in this paper.
References
[1] Rangel, Francisco, Paolo Rosso, Martin Potthast, and Benno Stein. Overview of the 5th Author Profiling Task at PAN 2017: Gender and Language Variety Identification in Twitter. Working Notes Papers of the CLEF (2017).
[2] Barathi Ganesh HB, Anand Kumar M and Soman KP. Distributional Semantic Representation for Text Classification and Information Retrieval. 2016.
[3] Barathi Ganesh HB, Reshma U and Anand Kumar M. Author identification based on word distribution in word space. In Advances in Computing, Communications and Informatics (ICACCI), 2015.
[4] Socher, Richard, Eric H. Huang, Jeffrey Pennington, Christopher D. Manning, and Andrew Y. Ng. Dynamic pooling and unfolding recursive autoencoders for paraphrase detection. In Advances in Neural Information Processing Systems, 2011.
[5] Barathi Ganesh HB, Anand Kumar M and Soman KP. Statistical Semantics in Context Space. 2016.
[6] Turney, Peter D., and Patrick Pantel. From frequency to meaning: Vector space models of semantics. Journal of Artificial Intelligence Research 37 (2010).
[7] Hsu, Chih-Wei, Chih-Chung Chang, and Chih-Jen Lin. A practical guide to support vector classification. 2003.
[8] Tan, Liling, Marcos Zampieri, Nikola Ljubešić, and Jörg Tiedemann. Merging Comparable Data Sources for the Discrimination of Similar Languages: The DSL Corpus Collection. Proceedings of the 7th Workshop on Building and Using Comparable Corpora (BUCC) (2014).