POLITECNICO DI MILANO
School of Industrial and Information Engineering Master of Science in Computer Engineering
Event-based User Profiling in Social Media Using Data Mining Approaches
Supervisor: Dr. Marco Brambilla
Authors: Behnam Rahdari (Student ID: 10480057) Tahereh Arabghalizi (Student ID: 10481546)
Academic Year 2016/17
Abstract
Social Networks have undergone dramatic growth and influenced everyone's life in recent years. People share everything from daily life stories to the latest local and global news and events using social media. This rich and continuous flow of user-generated content has received significant attention from many organizations and researchers and is increasingly becoming a primary source for social and marketing research, to name a few. Accordingly, a great number of works have been conducted to extract valuable information from different platforms. However, there are no specific studies that focus on categorizing social media users based on the texts they share about a specific event. Given that identifying online users with a common interest in a particular event can help event organizers attract more visitors to future similar events, this thesis concentrates on examining the similarity between such users from the perspective of the textual content they publish. In this work, different approaches are proposed and various experiments are carried out to support an explanation of this notion. We take a systematic approach to this objective by applying topic modeling techniques and statistical and data mining algorithms, combined with information visualization.
Sommario
In recent years, Social Networks have seen exponential growth and have influenced the lives of all of us. People share everything through social media, from daily life stories to the latest local and global news. The rich and continuous flow of user-generated content has received significant attention from various organizations and researchers, and it keeps growing, becoming a primary source for social and marketing research, to name a few. Consequently, a great number of works have been conducted to extract useful information from different platforms. Nevertheless, there are no specific studies that focus on categorizing users based on the texts about specific events that they share on social media. That said, identifying users with common interests in a specific event can help event organizers attract more visitors to similar events in the future. This thesis focuses on examining the similarities between such users based on the textual content they have published through social media. Furthermore, different methodologies are proposed, and several experiments have been carried out to support an explanation of this phenomenon. We proceeded with a systematic approach to achieve this objective, applying topic modeling techniques, using statistical and data mining algorithms, and finally combining these with information visualization techniques.
Contents

1 Introduction
  1.1 Context
  1.2 Problem Statement
  1.3 Proposed Solution
  1.4 Structure of the thesis
2 Background
  2.1 Relevant Concepts
    2.1.1 Knowledge Discovery and Data Mining
    2.1.2 Text Similarity
    2.1.3 Information Retrieval Models
    2.1.4 Topic Modeling
    2.1.5 Dimensionality Reduction
    2.1.6 Clustering Process
  2.2 Relevant Technologies
    2.2.1 Language Identification and Translation
    2.2.2 Gender Detection
    2.2.3 Twitter API
    2.2.4 Instagram API
    2.2.5 Text Normalization Library
    2.2.6 Cloud Computing
3 Related Work
  3.1 Clustering of People in Social Network Based on Textual Similarity
  3.2 Clustering a Customer Base Using Twitter Data
  3.3 Clustering Users Based on Interests
  3.4 Crowdsourcing Search for Topic Experts in Microblogs
  3.5 Using Internal Validity Measures to Compare Clustering Algorithms
4 Event-based User Profiling in Social Media
  4.1 Main Idea
    4.1.1 Twitter Users
    4.1.2 Instagram Users
  4.2 Motivation
    4.2.1 Why social media as data source?
    4.2.2 Why Twitter and Instagram?
  4.3 Approach
    4.3.1 Data Extraction
    4.3.2 Data Preprocessing
    4.3.3 Data Loading
    4.3.4 Data Analysis
5 Experiments and Discussion
  5.1 The Floating Piers Datasets
  5.2 Reports of Analysis
    5.2.1 Content Specific Results
    5.2.2 User Specific Results
6 Conclusions
  6.1 Summary
  6.2 Critical Discussion
  6.3 Possible Future Works
Bibliography
List of Figures

Figure 2.1: Knowledge Discovery Process
Figure 2.2: Information Retrieval Models
Figure 2.3: Graphical model representation of LDA
Figure 2.4: Steps of clustering process
Figure 2.5: How machine translation works at Yandex
Figure 2.6: Twitter Rest API Design
Figure 2.7: Cloud Computing
Figure 2.8: Windows Azure Platform
Figure 3.1: Graph after spectral k-means clustering for real dataset
Figure 3.2: Graph after spectral k-means clustering for dummy dataset
Figure 3.3: Percentage of followers for a set of chosen influencers
Figure 3.4: Silhouette coefficient as a function of number of clusters
Figure 3.5: Representative clusters in R2
Figure 3.6: Spectral clustering solutions selected by various measures
Figure 4.1: Number of social media users from 2010 to 2020 (in billions)
Figure 4.2: Percentage of adult users who use different social networks
Figure 4.3: Percentage of adult users who use at least one social media, by age
Figure 4.4: Comparison between four major photo-sharing networks
Figure 4.5: Architecture design
Figure 4.6: Identify the number of topics for LDA
Figure 4.7: Elbow method representation
Figure 4.8: k-nearest neighbor distances to determine eps in DBSCAN
Figure 5.1: The Floating Piers (Project for Lake Iseo, Italy)
Figure 5.2: Twitter total retweets vs. favorites
Figure 5.3: Most frequent words (a) and hashtags (b) in tweets
Figure 5.4: Number of tweets for top 10 locations using the field “Place”
Figure 5.5: Instagram total likes vs. comments
Figure 5.6: Most frequent words (a) and hashtags (b) in Instagram posts
Figure 5.7: Distribution of Instagram posts in the world
Figure 5.8: Density of Instagram posts – Italy
Figure 5.9: Density of Instagram posts – Brescia
Figure 5.10: Density of Instagram posts – Sulzano
Figure 5.11: Tweets vs Instagram posts timeline
Figure 5.12: Dendrogram representation of Twitter users
Figure 5.13: The percentage of user engagement in each cluster
Figure 5.14: 2D representation of cluster objects
Figure 5.15: Word-cloud representation of first cluster based on bio
Figure 5.16: Word-cloud representation of second cluster based on bio
Figure 5.17: Word-cloud representation of third cluster based on bio
Figure 5.18: Hashtag word-cloud for Travel Lovers, Art Lovers and Tech Lovers
Figure 5.19: Tweet text word-cloud for Travel Lovers, Art Lovers and Tech Lovers
Figure 5.20: List slug word-cloud for Travel Lovers, Art Lovers and Tech Lovers
Figure 5.21: Percentage of users whose number of followers lie in each category
Figure 5.22: Percentage of users whose number of followings lie in each category
Figure 5.23: Percentage of users whose number of favorites lie in each category
Figure 5.24: Percentage of users whose number of tweets lie in each category
Figure 5.25: Summary statistics of numbers of followers in each cluster
Figure 5.26: Summary statistics of numbers of followings in each cluster
Figure 5.27: Summary statistics of numbers of favorites in each cluster
Figure 5.28: Summary statistics of numbers of tweets in each cluster
Figure 5.29: Language timeline per cluster
Figure 5.30: Gender timeline per cluster
Figure 5.31: Number of Users - Tweets timeline per cluster
Figure 5.32: Tweet – User ratio timeline per cluster
Figure 5.33: Twitter top 20 active users
Figure 5.34: Instagram top 20 active users
Figure 5.35: Twitter top 20 active contributors
Figure 5.36: Instagram top 20 active contributors
Figure 5.37: Twitter top 10 influencers using FtF ratio
Figure 5.38: Twitter top 10 influencers using UTW ratio
Figure 5.39: Twitter top 10 influencers using Influence Ratio
Figure 5.40: Twitter top 10 influencers per cluster using Influence Ratio
Figure 5.41: Number of followers of top 10 influencers in each cluster
Figure 5.42: Number of followings of top 10 influencers in each cluster
Figure 5.43: Number of tweets of top 10 influencers in each cluster
Figure 5.44: Number of favorites of top 10 influencers in each cluster
List of Tables

Table 3.1: The most common topics of expertise as identified from Lists
Table 3.2: Top 5 results by Cognos and Twitter WTF for query “music”
Table 3.3: Average relative SI, CH and DB score over data set
Table 4.1: Twitter extracted features
Table 4.2: Instagram extracted features
Table 4.3: Topic probabilities by user
Table 4.4: Top terms of each extracted topic by LDA
Table 4.5: Formulas for Silhouette, Dunn and Entropy indices
Table 5.1: Evaluation results of cluster validation indices
Chapter 1
1 Introduction
1.1 Context
Social Networks have undergone a dramatic growth in recent years. Such networks provide a powerful reflection of the structure and dynamics of the society of the 21st century and the interaction of the Internet generation with both technology and other people (Sfetcu 2017). Social media has a great influence on our daily lives: people share their opinions, stories and news, and broadcast events using social media. Monitoring and analyzing this rich and continuous flow of user-generated content can yield unprecedentedly valuable information, enabling users and organizations to acquire actionable knowledge.

Due to the immediacy and rapidity of social media, news events are often reported and spread on Twitter, Instagram or Facebook ahead of traditional news media. With the rapid growth of social media, Twitter has become one of the most widely adopted platforms for posting short, instant messages. Because of this wide adoption, events like breaking news and the release of popular videos can easily capture people's attention and spread rapidly on Twitter. The popularity and importance of an event can therefore be approximately gauged by the volume of tweets covering it. Moreover, the relevant tweets also reflect the public's opinions and reactions to events. It is therefore very important to identify and analyze events on Twitter (Diao 2015).

Another very popular social network platform is Instagram: 300 million people use the app every day to share photos. Users can also add a caption to a photo they share, mention other users and use hashtags. As on Twitter, users can follow the accounts they are interested in and share their posts publicly or privately according to their preference. This makes Instagram one of the best channels for sharing experiences, especially those about events, through pictures as well as textual content such as hashtags. Hashtags have become a uniform way to categorize content on many social media platforms, especially Instagram, and they allow Instagrammers to discover content to view and accounts to follow.
Research from TrackMaven found that posts with more than 11 hashtags tend to get more engagement.
1.2 Problem Statement
In social networking websites and applications, people generally use unstructured or semi-structured language for communication. As in everyday conversation, people do not care about spelling and accurate grammatical construction, which may lead to different types of ambiguities: lexical, syntactic and semantic. Therefore, extracting logical patterns with accurate information from such unstructured text is a critical task. Text mining, a knowledge discovery technique that provides computational intelligence, can be a solution to the above-mentioned problem (Rizwana Irfan 2015). Social networks such as Twitter are rich in text and enable users to create content in the form of comments, posts and other media. Applying text mining techniques to social networking websites can reveal significant results about person-to-person interaction behaviors. Moreover, text mining approaches such as clustering can be used to find the general opinion about a specific subject, human thinking patterns, and groups in large-scale systems.

In spite of the large amount of research conducted on extracting information from particular social networks, there are no specific studies that address differently structured social networks to explore the profiles and activities of users based on the texts they share about an event. This thesis proposes that there may be similarities in interest and activity between social media users who engage in actions such as posting, liking and replying to text or media about an event. These similarities may suggest ways to improve the current event and help identify potential users with the same interests for similar future events.
1.3 Proposed Solution
The first step toward the objective of this study is to decide which social media platforms to consider. Since the availability of public posts is the main selection criterion, we chose Twitter and Instagram, which both provide a great number of publicly available posts. The second step is to collect the required data, including tweets, Instagram posts and the users involved, during a specific time interval. Textual features, namely biographies, hashtags, tweet/post texts and the Twitter lists of which a user is a member, are then cleaned, preprocessed, translated to English and stored in CSV files. After this transformation phase, we perform the analysis at several levels. The first level explores the main topics in the data using topic modeling. We then analyze the results at other levels; for example, three clustering algorithms, namely K-means, hierarchical clustering and DBSCAN, are applied separately to the outputs of the topic modeling process. Given the outcomes of the cluster analysis, we evaluate the results using cluster validity measures such as the Silhouette, Dunn and Entropy indices. Based on this evaluation, we perform further analyses to probe the categories of the users and their activities during the event. Finally, we visualize the outcomes of all levels of analysis.
1.4 Structure of the thesis
The thesis is organized as follows. A general overview of the relevant concepts and technologies used in this project is given in Chapter 2. Chapter 3 is dedicated to scientific works that address similar issues; we discuss the associated publications and position our own strategy with respect to them. In Chapter 4, we first describe the main idea of this project and the motivations behind it; the details of our proposed approach are explained in the following sections of that chapter. Chapter 5 describes our dataset and the outcomes of the analysis, divided into sections corresponding to each level of analysis conducted on the data from the social media we used. Finally, in Chapter 6 we review the study with a short summary of what has been done and a discussion of our results, along with some suggestions for future work.
Chapter 2
2 Background

2.1 Relevant Concepts
In this section we discuss the concepts that are relevant to our work.
2.1.1 Knowledge Discovery and Data Mining
Knowledge Discovery in Databases (KDD) is the process of identifying valid, novel, useful and understandable patterns in large datasets. Data Mining (DM) is the mathematical core of the KDD process, comprising the inference algorithms that explore the data, develop mathematical models and discover significant patterns (implicit or explicit), which are the essence of useful knowledge. The knowledge discovery process (Figure 2.1) is iterative and interactive, consisting of the steps below. Note that the process is iterative at each step, meaning that moving back to adjust previous steps may be required (Oded Maimon 2010).
Figure 2.1: Knowledge Discovery Process
2.1.1.1 Data Selection
This phase includes finding out what data is available, obtaining additional necessary data, and then integrating all the data for the knowledge discovery task into one dataset, including the attributes that will be considered for the process. This step is very important because Data Mining learns and discovers from the available data, which is the evidence base for constructing the models. If some important attributes are missing, the entire study may fail. In this respect, the more attributes are considered, the better. On the other hand, collecting, organizing and operating complex data repositories is expensive, so there is a tradeoff against the opportunity to best understand the phenomena. This tradeoff is one place where the interactive and iterative nature of KDD comes into play: one starts with the best available dataset, later expands it, and observes the effect in terms of knowledge discovery and modeling.
2.1.1.2 Data Pre-processing
The operations performed during preprocessing can be reduced to two main families of techniques: Detection Techniques (DT), which detect imperfections in datasets, and Transforming Techniques (TT), which aim to obtain more manageable datasets. DT includes outlier detection, missing data detection, influential observation detection, normality assessment, linearity assessment and independence assessment. TT includes outlier treatment, missing data imputation, dimensionality reduction or data projection techniques, techniques for deriving new attributes, filtering and resampling. Additionally, the statistical technique of data cleaning and visualization techniques also play an important role in the preprocessing of data (José Luis Díaz 2010).
2.1.1.3 Data Transformation
In this step, better data for the data mining is generated and prepared. Methods here include dimension reduction (such as feature selection and extraction, and record sampling) and attribute transformation (such as discretization of numerical attributes and functional transformation). This step is often crucial for the success of the entire KDD project, but it is usually very project-specific. However, even if we do not use the right transformation at the beginning, we may obtain a surprising effect that hints at the transformation needed (in the next iteration). Thus the KDD process reflects upon itself and leads to an understanding of the transformation needed. The main techniques of data transformation include (Äyrämö 2007):
- Smoothing (binning, clustering, regression, etc.)
- Aggregation (use of summary operations, e.g. averaging, on data)
- Generalization (primitive data objects can be replaced by higher-level concepts)
- Normalization (min-max scaling, z-score)
- Feature construction from the existing attributes (PCA, Principal Component Analysis; MDS, Multi-Dimensional Scaling)
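As a small illustration of the two normalization schemes named in the list above, the following Python sketch applies min-max scaling and the z-score to a toy numeric attribute; the data values and the use of NumPy are purely illustrative and not part of the thesis pipeline.

```python
# A minimal sketch of min-max scaling and z-score standardization on a toy
# numeric attribute; the values and the NumPy dependency are illustrative only.
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0])

min_max = (x - x.min()) / (x.max() - x.min())   # rescales values into [0, 1]
z_score = (x - x.mean()) / x.std()              # zero mean, unit variance

print(min_max)   # approximately [0, 0.33, 0.67, 1]
print(z_score)
```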
2.1.1.4 Data Mining
The two high-level primary goals of data mining in practice tend to be prediction and description. Prediction involves using some variables or fields in the database to predict unknown or future values of other variables of interest, while description focuses on finding human-interpretable patterns describing the data. These goals can be achieved using a variety of data-mining methods, including (Usama Fayyad 1996):
- Classification: learning a function that maps (classifies) a data item into one of several predefined classes.
- Regression: learning a function that maps a data item to a real-valued prediction variable.
- Clustering: a common descriptive task where one seeks to identify a finite set of categories or clusters to describe the data. The categories can be mutually exclusive and exhaustive, or consist of a richer representation, such as hierarchical or overlapping categories. More details about clustering algorithms and their validation techniques are given in Section 2.1.6.
- Summarization: methods for finding a compact description for a subset of data. A simple example would be tabulating the mean and standard deviation for all fields. Summarization techniques are often applied to interactive exploratory data analysis and automated report generation.
- Dependency modeling: finding a model that describes significant dependencies between variables. Dependency models exist at two levels: (1) the structural level specifies (often in graphic form) which variables are locally dependent on each other, and (2) the quantitative level specifies the strengths of the dependencies using some numeric scale.
- Change and deviation detection: discovering the most significant changes in the data from previously measured or normative values.
2.1.1.5 Interpretation and Evaluation of Patterns
This phase involves the evaluation, and possibly interpretation, of the patterns in order to decide what qualifies as knowledge (Gonzalo Mariscal 2010). This step focuses on the comprehensibility and usefulness of the induced model.
2.1.1.6 Knowledge Representation
This is the last step of the knowledge discovery process, where visualization and knowledge representation techniques such as logical formulas, decision trees and neural networks are used to present the mined knowledge to users.
2.1.2 Text Similarity
Text similarity measures play an important role in text-related research and applications such as topic detection, information retrieval, document clustering and text classification. Finding the similarity between words is a fundamental part of text similarity, which is then used as a primary stage for sentence, paragraph and document similarities. Words can be similar in two ways: lexically and semantically. Words are similar lexically if they have a similar character sequence. Words are similar semantically if they mean the same thing, are opposites of each other, are used in the same way or in the same context, or if one is a type of the other (Wael H. Gomaa 2013).

Lexical similarity is captured by string-based similarity measures, which operate on string sequences and character composition. Some of these measures are as follows:
- Manhattan Distance computes the distance that would be traveled to get from one data point to the other if a grid-like path is followed. The Block distance between two items is the sum of the differences of their corresponding components (Wael H. Gomaa 2013).
- Cosine Similarity is a measure of similarity between two vectors of an inner product space that measures the cosine of the angle between them (Wael H. Gomaa 2013).
- Euclidean distance is the ordinary distance between two points. It is widely used in clustering problems, including text clustering; it satisfies all the conditions of a true metric and is the default distance measure used with the K-means algorithm (Rugved Deshpande 2014).
- Jaccard similarity measures similarity as the intersection divided by the union of the objects. For text documents, it compares the total weight of shared terms to the total weight of terms that are present in either of the two documents but are not shared (Rugved Deshpande 2014).

Semantic similarity is captured by corpus-based and knowledge-based algorithms. Corpus-based similarity determines the similarity between words according to information gained from large corpora. The most famous corpus-based similarity measures are:
- Hyperspace Analogue to Language (HAL) considers context only as the words that immediately surround a given word. HAL computes an N×N matrix, where N is the number of words in its lexicon, using a 10-word reading frame that moves incrementally through a corpus of text (Wikipedia 2016).
- Latent Semantic Analysis (LSA) is the most popular corpus-based similarity technique. LSA assumes that words that are close in meaning will occur in similar pieces of text. A matrix containing word counts per paragraph (rows represent unique words and columns represent paragraphs) is constructed from a large piece of text, and a mathematical technique called singular value decomposition (SVD) is used to reduce the number of columns while preserving the similarity structure among rows. Words are then compared by taking the cosine of the angle between the two vectors formed by any two rows (Wael H. Gomaa 2013).
- Explicit Semantic Analysis (ESA) is a vectorial representation of text that uses a document corpus as a knowledge base. Specifically, in ESA a word is represented as a column vector in the tf-idf matrix of the text corpus, and a document is represented as the centroid of the vectors representing its words. Typically, it represents the meaning of texts in a high-dimensional space of concepts derived from Wikipedia (Wikipedia 2016).

Knowledge-based similarity is a class of semantic similarity measures based on identifying the degree of similarity between words using information derived from semantic networks. WordNet is the most popular semantic network for measuring knowledge-based similarity between words. WordNet is a large lexical database of English in which nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept. Synsets are interlinked by means of conceptual-semantic and lexical relations (Wael H. Gomaa 2013).
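To make the string-based measures above concrete, the following Python sketch computes Jaccard similarity on token sets and cosine similarity on term-frequency vectors for two short texts. The lowercased whitespace tokenization is a simplifying assumption, not the preprocessing used in this thesis.

```python
# A minimal sketch of two string-based measures discussed above: Jaccard
# similarity on token sets and cosine similarity on term-frequency vectors.
# Whitespace tokenization is a simplifying assumption.
from collections import Counter
import math

def jaccard_similarity(a: str, b: str) -> float:
    """Size of the token-set intersection divided by the size of the union."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def cosine_similarity(a: str, b: str) -> float:
    """Cosine of the angle between the two term-frequency vectors."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm_a = math.sqrt(sum(v * v for v in va.values()))
    norm_b = math.sqrt(sum(v * v for v in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

print(jaccard_similarity("the floating piers", "the piers event"))  # 0.5
print(cosine_similarity("the floating piers", "the piers event"))   # ~0.67
```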
2.1.3 Information Retrieval Models
Information retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers) (Christopher D. Manning 2008). To effectively retrieve relevant documents, IR strategies typically transform the documents into a suitable representation, and each retrieval strategy incorporates a specific model for its document representation purposes. Figure 2.2 illustrates the relationship between some common models. Three of the most well-known models are explained in more detail below (Wikipedia 2016).
Figure 2.2: Information Retrieval Models
Standard Boolean model: The Boolean model is a simple retrieval model based on Boolean algebra, where the significance of an index term is represented by a binary weight $w_{i,j} \in \{0,1\}$. Queries are also defined as Boolean expressions over index terms. The similarity between a document $d_j$ and a query $q$ is binary:

$$\mathrm{sim}(d_j, q) = \begin{cases} 1 & \text{if } q \text{ evaluates to true over the index terms of } d_j \\ 0 & \text{otherwise} \end{cases}$$

Vector space model: In this model, documents and queries are represented as vectors,

$$d_j = (w_{1,j}, w_{2,j}, \ldots, w_{t,j}), \qquad q = (w_{1,q}, w_{2,q}, \ldots, w_{t,q})$$

Each dimension corresponds to a separate term; if a term occurs in the document, its value in the vector is non-zero. In the classic vector space model, the term-specific weights in the document vectors are products of local and global parameters. The model is known as the term frequency-inverse document frequency (tf-idf) model, where the weight is defined as

$$w_{t,d} = \mathrm{tf}_{t,d} \cdot \mathrm{idf}_t$$

with $\mathrm{tf}_{t,d}$ the term frequency of term $t$ in document $d$ and $\mathrm{idf}_t$ the inverse document frequency of $t$. Using the cosine, the similarity between document $d_j$ and query $q$ can be calculated as

$$\mathrm{sim}(d_j, q) = \frac{d_j \cdot q}{\lVert d_j \rVert \, \lVert q \rVert} = \frac{\sum_{i=1}^{t} w_{i,j}\, w_{i,q}}{\sqrt{\sum_{i=1}^{t} w_{i,j}^2}\; \sqrt{\sum_{i=1}^{t} w_{i,q}^2}}$$
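As a hedged sketch of the vector space model in practice, the snippet below builds tf-idf document vectors and ranks documents against a query by cosine similarity. The use of scikit-learn and the toy corpus are illustrative assumptions; the thesis itself does not prescribe this toolchain.

```python
# A minimal sketch of tf-idf vector space retrieval with cosine ranking.
# scikit-learn and the toy corpus are illustrative assumptions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "the floating piers opened on lake iseo",
    "visitors walked on water at the floating piers",
    "a new art museum opened in milan",
]
query = ["floating piers lake"]

vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(docs)   # one tf-idf vector per document
query_vector = vectorizer.transform(query)     # same vocabulary and weights

scores = cosine_similarity(query_vector, doc_vectors)[0]
for i in scores.argsort()[::-1]:               # most similar document first
    print(f"{scores[i]:.3f}  {docs[i]}")
```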
Probabilistic model: This model estimates the probability that a document $d_j$ is relevant to a query $q$. It assumes that this probability of relevance depends on the query and document representations. Furthermore, it assumes that there is a subset of all documents that the user prefers as the answer set for query $q$. Such an ideal answer set is called $R$ and should maximize the overall probability of relevance to that user. The prediction is that documents in this set $R$ are relevant to the query, while documents not in the set are non-relevant.
2.1.4 Topic Modeling
Topic models are probabilistic latent variable models of documents that exploit the correlations among words and latent semantic themes. A document is seen as a mixture of topics. This intuitive explanation of how documents can be generated is modeled as a stochastic process, which is then “reversed” by machine learning techniques that return estimates of the latent variables. With these estimates it is possible to perform information retrieval or text mining tasks on a document corpus (Ponweiser 2012). The most prominent topic model is latent Dirichlet allocation (LDA), a three-level hierarchical Bayesian model in which each item of a collection is modeled as a finite mixture over an underlying set of topics. Each topic is, in turn, modeled as an infinite mixture over an underlying set of topic probabilities. In the context of text modeling, the topic probabilities provide an explicit representation of a document (Recognition 2015). The graphical model of LDA is shown in Figure 2.3. The boxes are “plates” representing replicates: the outer plate represents documents, while the inner plate represents the repeated choice of topics and words within a document.
Figure 2.3: Graphical model representation of LDA
The LDA model assumes the following generative process for a document $w = (w_1, \ldots, w_N)$ of a corpus $D$ containing $N$ words from a vocabulary of $V$ different terms, $w_i \in \{1, \ldots, V\}$ for all $i = 1, \ldots, N$. The generative model consists of the following three steps (Bettina Grun 2011):

Step 1: The term distribution $\beta$ is determined for each topic by $\beta \sim \mathrm{Dirichlet}(\delta)$.
Step 2: The proportions $\theta$ of the topic distribution for the document $w$ are determined by $\theta \sim \mathrm{Dirichlet}(\alpha)$.
Step 3: For each of the $N$ words $w_i$:
(a) Choose a topic $z_i \sim \mathrm{Multinomial}(\theta)$.
(b) Choose a word $w_i$ from a multinomial probability distribution conditioned on the topic $z_i$: $p(w_i \mid z_i, \beta)$.

Here $\beta$ is the term distribution of the topics and contains the probability of a word occurring in a given topic.
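To ground the generative story above, the following sketch fits a tiny LDA model and prints each topic's top terms. scikit-learn, the toy posts and the choice of two topics are illustrative assumptions, not the actual pipeline or parameters used in this thesis (the number of topics is chosen empirically later, cf. Figure 4.6).

```python
# A minimal sketch of fitting LDA; scikit-learn, the toy posts and the
# two-topic choice are illustrative assumptions, not the thesis pipeline.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

posts = [
    "walking on the floating piers over lake iseo",
    "christo installation art on the lake",
    "sunset photo from the pier today",
    "modern art and installations in italy",
]

vectorizer = CountVectorizer(stop_words="english")   # LDA works on word counts
counts = vectorizer.fit_transform(posts)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)   # theta: per-document topic proportions

terms = vectorizer.get_feature_names_out()
for k, topic in enumerate(lda.components_):   # (unnormalized) beta per topic
    top = topic.argsort()[::-1][:5]
    print(f"topic {k}:", ", ".join(terms[i] for i in top))
```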
2.1.5 Dimensionality Reduction
Dimensionality reduction, or dimension reduction, is the process of reducing the number of random variables under consideration by obtaining a set of principal variables. It can be divided into feature selection and feature extraction. Feature selection is the process of selecting a subset of relevant features (variables, predictors) for use in model construction. Feature selection techniques are used for four reasons:
- simplification of models to make them easier to interpret by researchers/users
- shorter training times
- avoiding the curse of dimensionality
- enhanced generalization by reducing overfitting (formally, reduction of variance)
The central premise when using a feature selection technique is that the data contains many features that are either redundant or irrelevant, and can thus be removed without incurring much loss of information. Feature extraction, in contrast, transforms the data in the high-dimensional space to a space of fewer dimensions. The data transformation may be linear, as in Principal Component Analysis (PCA), but many nonlinear dimensionality reduction techniques also exist.

The main linear technique for dimensionality reduction, PCA, performs a linear mapping of the data to a lower-dimensional space in such a way that the variance of the data in the low-dimensional representation is maximized. In other words, it uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components. The number of principal components is less than or equal to the smaller of the number of original variables and the number of observations. The transformation is defined in such a way that the first principal component has the largest possible variance (that is, it accounts for as much of the variability in the data as possible), and each succeeding component in turn has the highest variance possible under the constraint that it is orthogonal to the preceding components. The resulting vectors form an uncorrelated orthogonal basis set. Note that PCA is sensitive to the relative scaling of the original variables (Wikipedia 2016).
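A hedged sketch of PCA as described above: standardize the data first (since PCA is sensitive to relative scaling), then project onto the two components of largest variance. The random data and the use of scikit-learn are illustrative assumptions.

```python
# A minimal sketch of PCA for dimensionality reduction; the random data and
# scikit-learn are illustrative assumptions.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))        # 100 observations, 10 features

X_scaled = StandardScaler().fit_transform(X)   # PCA is sensitive to scaling

pca = PCA(n_components=2)             # keep the two largest-variance directions
X_2d = pca.fit_transform(X_scaled)

print(X_2d.shape)                     # (100, 2)
print(pca.explained_variance_ratio_)  # variance share kept by each component
```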
2.1.6 Clustering Process
As mentioned before, clustering is one of the most useful tasks in the data mining process for discovering groups and identifying interesting distributions and patterns in the underlying data. The main concern of the clustering process is to reveal the organization of patterns into “sensible” groups, which allows us to discover similarities and differences and to derive useful conclusions about them. The basic steps of the clustering process are presented in Figure 2.4 and can be summarized as follows (Maria Halkidi 2001):
Figure 2.4: Steps of clustering process
Feature selection: The goal is to properly select the features on which clustering is to be performed, so as to encode as much information as possible concerning the task of interest. Thus, preprocessing of the data may be necessary prior to its utilization in the clustering task.
Clustering algorithm: This step refers to the choice of an algorithm that results in the definition of a good clustering scheme for a data set. Clustering algorithms can be broadly classified into the following types:
o Partitional clustering attempts to directly decompose the data set into a set of disjoint clusters. In this category, K-means is a commonly used algorithm.
o Hierarchical clustering proceeds successively by either merging smaller clusters into larger ones, or by splitting larger clusters. The result of the algorithm is a tree of clusters, called a dendrogram, which shows how the clusters are related. By cutting the dendrogram at a desired level, a clustering of the data items into disjoint groups is obtained.
o Density-based clustering groups neighboring objects of a data set into clusters based on density conditions. A widely known algorithm of this category is DBSCAN.
o Grid-based clustering is mainly proposed for spatial data mining. Its main characteristic is that it quantizes the space into a finite number of cells and then performs all operations on the quantized space.
o Fuzzy clustering uses fuzzy techniques to cluster data and allows an object to be classified into more than one cluster. This type of algorithm leads to clustering schemes that are compatible with everyday life experience, as they handle the uncertainty of real data. The most important fuzzy clustering algorithm is Fuzzy C-Means.
o Crisp clustering considers non-overlapping partitions, meaning that a data point either belongs to a class or not. Most clustering algorithms produce crisp clusters and can thus be categorized as crisp clustering.
o Kohonen net clustering is based on the concepts of neural networks.
Validation of the results: The procedure of evaluating the results of a clustering algorithm is known as cluster validity. In general terms, there are three approaches to investigating cluster validity:
o External Criteria: The basic idea is to test whether the points of the data set are randomly structured or not. The Rand index, Jaccard coefficient, Entropy and Purity can be mentioned as external measures, to name a few.
o Internal Criteria evaluate the result with respect to information intrinsic to the data alone. The Silhouette index, Davies-Bouldin (DB) index, Calinski-Harabasz (CH) index and Dunn index are the most famous measures in this category (Eréndira Rendón 2011).
o Relative Criteria evaluate the quality of a partition by comparing it to other clustering schemes produced by the same algorithm but with different parameter values.
Interpretation of the results: In many cases, the experts in the application area have to integrate the clustering results with other experimental evidence and analysis in order to draw the right conclusions.
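The steps above can be sketched end to end: run the three algorithm families used later in this thesis (K-means, hierarchical and density-based) on the same data, then compare the results with an internal criterion, the silhouette index. The synthetic blobs, the parameter values and the use of scikit-learn are illustrative assumptions.

```python
# A minimal sketch of the clustering-and-validation steps: run K-means,
# hierarchical clustering and DBSCAN, then compare them with the silhouette
# index. Synthetic data, parameters and scikit-learn are assumptions.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

models = {
    "k-means": KMeans(n_clusters=3, n_init=10, random_state=0),
    "hierarchical": AgglomerativeClustering(n_clusters=3),
    "dbscan": DBSCAN(eps=0.8, min_samples=5),
}

for name, model in models.items():
    labels = model.fit_predict(X)
    if len(set(labels)) > 1:          # silhouette needs at least two clusters
        print(name, round(silhouette_score(X, labels), 3))
```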
2.2 Relevant Technologies
In this section, the relevant technologies used in this thesis are discussed.
2.2.1 Language Identification and Translation
Language identification (LID) refers to the process of determining the natural language in which a given text is written (Pienaar 2010). In this thesis, LID is used as part of the preprocessing phase, which aims to make the textual content uniform by first detecting the language and then translating it into a single language (en-us). The Yandex.Translate Application Programming Interface (API) is an easy-to-use automatic translation service provided by the Russian Internet company Yandex. As a statistical machine translation system, it is based on statistics derived from web sources (Hees 2015). Yandex.Translate offers synchronized translation for 91 languages, predictive typing, a dictionary with transcription, pronunciation and usage examples, and many other features.
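As a hedged sketch of how LID and translation could be wired together, the snippet below calls the Yandex.Translate v1.5 JSON API that was current when this thesis was written; the endpoint paths, the response fields and the YANDEX_API_KEY placeholder are assumptions based on that version of the service, which has since changed.

```python
# A minimal sketch of language detection and translation via the
# Yandex.Translate v1.5 JSON API; endpoints and response fields are assumptions
# based on that historical version, and YANDEX_API_KEY is a placeholder.
import requests

API = "https://translate.yandex.net/api/v1.5/tr.json"
KEY = "YANDEX_API_KEY"

def detect_language(text: str) -> str:
    r = requests.get(f"{API}/detect", params={"key": KEY, "text": text})
    return r.json()["lang"]                  # e.g. "it"

def translate_to_english(text: str) -> str:
    r = requests.get(f"{API}/translate",
                     params={"key": KEY, "text": text, "lang": "en"})
    return r.json()["text"][0]               # list of translated segments

post = "Che spettacolo le passerelle galleggianti!"
print(detect_language(post), "->", translate_to_english(post))
```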
Figure 2.5: How machine translation works at Yandex
Yandex.Translate has an automated dictionary that sets it apart from the limited number of similar existing services. The technology, developed by a Yandex team of linguists and programmers, combines current statistical machine translation approaches with traditional linguistic tools. The translation model constructs a graph containing all the possible ways to translate a sentence, and the language model selects the best translation in terms of the optimal word combinations in natural language. The translation model learns from extensive bilingual parallel corpora; the language model is built from large single-language corpora and contains all the language's most frequent n-word combinations, where n may be from 1 to 7 (usually 5). Yandex uses the BLEU metric to automatically evaluate the quality of machine translation; it determines the percent of n-grams (n