Ain Shams University
Faculty of Engineering
Computer & Systems Engineering Department

Text Mining on Social Networking using NLP Techniques

Thesis Submitted in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy in Electrical Engineering (Computer and Systems)

By

Walaa Mohamed Medhat Asal
B.Sc. in Computer and Systems Engineering, 2002
M.Sc. in Computer and Systems Engineering, 2008

Supervised By

Prof. Hoda Korashy Mohamed
Computer & Systems Engineering Dept.
Faculty of Engineering, Ain Shams University

Dr. Ahmed Hassan Youssef
Computer & Systems Engineering Dept.
Faculty of Engineering, Ain Shams University

Cairo 2014


ABSTRACT
Text Mining on Social Networking using NLP Techniques

The web has recently become a very important source of information as it has turned into a read-write platform. The dramatic increase of online social networks (OSN), video sharing sites, online news, online review sites, online forums, and blogs has made user-generated content, in the form of unstructured free text, gain considerable attention due to its importance for many businesses. The web is used by speakers of many languages; it is no longer used by English speakers only. Text mining has become necessary to extract information and discover knowledge from this huge amount of textual data, and working on text data requires a better understanding of the text. Natural Language Processing (NLP) techniques can help achieve that understanding. Sentiment Analysis (SA) is one of the well-known text mining techniques. It is the computational study of people's opinions, attitudes, and emotions towards individuals, events, or topics covered by reviews or news. The target of SA is to find opinions, identify the sentiments they express, and then classify their polarity.

The thesis proposes a framework for preparing and using corpora from OSN and review sites for the SA task in two different natural languages (English and Arabic). The framework consists of three phases. The first phase is the preprocessing and cleaning of the collected data, followed by data annotation. The second phase applies various text processing (NLP) techniques to the prepared corpora, including removing stopwords, replacing negation words and the negated words that follow them with the antonyms of those negated words, and using selective words of part-of-speech tags (adjectives and verbs). The third phase is text classification using Naïve Bayes and Decision Tree classifiers and two feature selection approaches, unigrams and bigrams. The framework components were analyzed at each stage; this analysis is important for determining which scenario is better for each corpus used. The analysis is enhanced by applying the framework components to the English-language benchmark movie reviews corpus in addition to the corpora prepared from OSN sites and a review site.

There is a lack of Arabic language resources, as most of them are under development. In order to use the Arabic language in the framework, some resources are needed, such as stopword lists, Arabic WordNet, and an Arabic POS tagger. The problem is that previously generated stopword lists were built for Modern Standard Arabic (MSA), which is not the common language used in OSN. We have generated a stopword list of Egyptian dialect and a corpus-based list to be used with the OSN Arabic corpora. We compare the efficiency of text classification when using the generated lists, previously generated MSA lists, and the Egyptian dialect list combined with the MSA list. The other resources are still under development and were created for MSA, not the Dialectal Arabic used by OSN users.

The framework was applied and tested on the English language corpora with all its stages. For the Arabic language, only the text processing technique of removing stopwords was applied. The experiments show that the OSN data is extremely unbalanced for both languages. The results show that applying text processing techniques improves the classification accuracy of the NB classifier and reduces the training time of both classifiers. The performance was measured with accuracy, F-measure, and training time criteria. The results also show that the Decision Tree classifier gives better results for imbalanced data in both languages. The experiments on the Arabic corpora show that the general lists containing Egyptian dialect stopwords give better performance than lists of MSA stopwords only.

Key Words: Text Mining, Sentiment Analysis, Natural Language Processing.


Table of Contents

Chapter 1: Introduction .......................................................... 1
1.1 Background of Research Work .................................................. 1
1.2 Thesis Challenges and Contributions .......................................... 1
1.3 Thesis Layout ................................................................ 2
Chapter 2: Text Mining ........................................................... 3
2.1 Introduction ................................................................. 3
2.1.1 Algorithms of Text Mining .................................................. 3
2.2 Information Extraction of Text ............................................... 3
2.2.1 Named Entity Recognition ................................................... 4
2.2.2 Relation Extraction ........................................................ 4
2.3 Transfer Learning ............................................................ 4
2.4 Text Analytics in Social Media ............................................... 5
2.4.1 Distinct Aspects of Text in Social Media ................................... 5
2.4.2 Applying Text Analytics in Social Media .................................... 6
2.5 Mining Text Streams .......................................................... 8
2.5.1 Clustering Text Streams .................................................... 8
2.5.2 Classification of Text Streams ............................................. 8
2.5.3 Evolution Analysis of Text Streams ......................................... 9
2.6 Opinion Mining and Sentiment Analysis ........................................ 9
2.6.1 Opinion Mining ............................................................. 10
2.6.2 Document Sentiment Classification .......................................... 10
2.6.2.1 Classification based on supervised learning .............................. 10
2.6.2.2 Classification based on unsupervised learning ............................ 11
2.6.3 Sentence Subjectivity and Sentiment Classification ......................... 11
2.6.4 Opinion Lexicon Expansion .................................................. 12
2.6.5 Aspect-based Sentiment Analysis ............................................ 13
2.7 Future Trend in Research ..................................................... 14
Chapter 3: Sentiment Analysis .................................................... 15
3.1 Introduction ................................................................. 15
3.2 Feature Selection Methods .................................................... 16
3.3 Sentiment Analysis Techniques ................................................ 16
3.3.1 Machine Learning Approach .................................................. 18
3.3.1.1 Supervised Learning ...................................................... 18
3.3.1.2 Weakly, Semi and Unsupervised Learning ................................... 22
3.3.2 Lexicon-based Approach ..................................................... 23
3.3.2.1 Dictionary-based Approach ................................................ 23
3.3.2.2 Corpus-based Approach .................................................... 23
3.3.2.3 Lexicon-based and NLP Techniques ......................................... 25
3.4 Building Resources ........................................................... 25
Chapter 4: The Proposed Framework ................................................ 27
4.1 Introduction ................................................................. 27
4.2 Literature Review ............................................................ 27
4.3 System Architecture .......................................................... 29
4.3.1 Corpora Preparation ........................................................ 29
4.3.1.1 Corpora Preparation for English Language corpora ......................... 30
4.3.1.2 Corpora Preparation for Arabic Language corpora .......................... 31
4.3.2 Text Processing Techniques ................................................. 33
4.3.2.1 Replacing Negations with Antonyms for English Language corpora ........... 33
4.3.2.2 Replacing Negations with Antonyms for Arabic Language corpora ............ 34
4.3.2.3 Removing Stopwords for English Language corpora .......................... 34
4.3.2.4 Removing Stopwords for Arabic Language corpora ........................... 35
4.3.2.5 Part-of-Speech Tagging for English Language corpora ...................... 37
4.3.2.6 Part-of-Speech Tagging for Arabic Language corpora ....................... 38
4.3.3 Feature Selection .......................................................... 38
4.3.4 Sentiment Classification ................................................... 38
Chapter 5: Tests, Results, and Discussion ........................................ 40
5.1 Introduction ................................................................. 40
5.4 Tests, Results and Discussion of English Corpora ............................. 40
5.4.1 Data Annotation ............................................................ 41
5.4.2 Tests and Results .......................................................... 42
5.4.3 Component Analysis of the SA Framework ..................................... 49
5.4.4 Discussion ................................................................. 56
5.4.4.1 Corpora Analysis ......................................................... 56
5.4.4.2 Results Analysis ......................................................... 57
5.5 Tests, Results and Discussion of Arabic Corpora .............................. 59
5.5.1 Data Annotation ............................................................ 59
5.5.2 Tests and Results .......................................................... 60
5.5.3 Discussion ................................................................. 66
5.5.3.1 Corpora Analysis ......................................................... 66
5.5.3.2 Specializations of Arabic Language ....................................... 67
5.5.3.3 Results Analysis ......................................................... 68
Chapter 6: Conclusion and Future Work ............................................ 69
6.1 Conclusion ................................................................... 69
6.2 Future Work .................................................................. 71
References ....................................................................... 72
Appendix A: English Stopword List ................................................ 81
Appendix B: Corpus-based Arabic Stopword List .................................... 83
Appendix C: General Egyptian Dialect Stopword List ............................... 93


List of Tables

Table 4-1 Number and Percentage of removed English OSN data ..... 31
Table 4-2 Number and Percentage of removed Arabic OSN data ..... 33
Table 4-3 Testing the effectiveness of choosing shifters ..... 34
Table 4-4 Testing the effectiveness of choosing tags ..... 38
Table 5-1 Number of positive and negative English reviews, comments, and tweets from IMDB, Facebook and Twitter ..... 41
Table 5-2 Accuracy of Sentiment Analysis on English Movie Reviews, IMDB, Facebook and Twitter corpora using NB and DT classifiers with unigrams and bigrams as Feature Selection before and after applying text processing techniques ..... 43
Table 5-3 Classification training time of Sentiment Analysis on English Movie Reviews, IMDB, Facebook and Twitter corpora using NB and DT classifiers with unigrams and bigrams as Feature Selection before and after applying text processing techniques ..... 44
Table 5-4 F-measure of Sentiment Analysis on English Movie Reviews, IMDB, Facebook and Twitter corpora using NB and DT classifiers with unigrams and bigrams as Feature Selection before and after applying text processing techniques ..... 45
Table 5-5 Accuracy of Sentiment Analysis on English movie reviews, IMDB, Facebook and Twitter corpora using NB and DT classifiers with unigram and bigram as features after applying each path of text processing ..... 50
Table 5-6 Classification training time of Sentiment Analysis on English movie reviews, IMDB, Facebook and Twitter corpora using NB and DT classifiers with unigram and bigram as features after applying each path of text processing ..... 51
Table 5-7 F-measure of Sentiment Analysis on English movie reviews, IMDB, Facebook and Twitter corpora using NB and DT classifiers with unigram and bigram as features after applying different paths of text processing techniques ..... 52
Table 5-8 Sample of abbreviations and smiley faces found in English Facebook, Twitter, and IMDB ..... 57
Table 5-9 Maximum accuracies, F-measure and minimum training time achieved by each corpus with combination of text processing paths, features and classifiers ..... 58
Table 5-10 Number of positive, negative and neutral Arabic reviews, comments, and tweets from Review site, Facebook and Twitter ..... 60
Table 5-11 Accuracy of Sentiment Analysis on Arabic Reviews, Facebook and Twitter corpora using NB and DT classifiers with unigram and bigram as FS after removing stopwords from different lists ..... 61
Table 5-12 Classification training time of Sentiment Analysis on Arabic Reviews, Facebook and Twitter corpora using NB and DT classifiers with unigram and bigram as FS after removing stopwords from different lists ..... 62
Table 5-13 F-measure of Sentiment Analysis on Arabic Reviews, Facebook and Twitter corpora using NB and DT classifiers with unigram and bigram as FS after removing stopwords from different lists ..... 63
Table 5-14 Sample of abbreviations and smiley faces found in Arabic Facebook, Twitter, and Reviews ..... 67


List of Figures

Figure 3-1 SA Steps on Product Reviews ..... 15
Figure 3-2 Sentiment Analysis Techniques ..... 17
Figure 3-3 SVM Classification Problem ..... 20
Figure 3-4 Multi-Layer neural network for non-linear separation ..... 20
Figure 4-1 Sentiment Analysis Framework Architecture ..... 29
Figure 4-2 English Corpora Preparation from IMDB, Facebook, and Twitter ..... 30
Figure 4-3 Arabic Corpora Preparation from Reviews, Facebook, and Twitter ..... 32
Figure 4-4 A methodology of generating Egyptian Dialect and corpus-based stopword lists ..... 36
Figure 5-1 Text Processing and Sentiment classification of the prepared English corpora ..... 42
Figure 5-2 Classification accuracy of English Movie Reviews corpus after applying different paths of text processing ..... 45
Figure 5-3 Classification training time of English Movie Reviews corpus after applying different paths of text processing ..... 46
Figure 5-4 Classification accuracy of English IMDB corpus before and after applying text processing techniques ..... 46
Figure 5-5 Classification training time of English IMDB corpus before and after applying text processing techniques ..... 46
Figure 5-6 Classification accuracy of English Facebook corpus before and after applying text processing techniques ..... 47
Figure 5-7 Classification training time of English Facebook corpus before and after applying text processing techniques ..... 47
Figure 5-8 Classification accuracy of English Twitter corpus before and after applying text processing techniques ..... 47
Figure 5-9 Classification training time of English Twitter corpus before and after applying text processing techniques ..... 48
Figure 5-10 Component Analysis of the Sentiment Analysis framework ..... 49
Figure 5-11 Classification accuracy of English Movie Reviews corpus after applying different paths of text processing ..... 53
Figure 5-12 Classification training time of English Movie Reviews corpus after applying different paths of text processing ..... 53
Figure 5-13 Classification accuracy of English IMDB corpus after applying different paths of text processing ..... 53
Figure 5-14 Classification training time of English IMDB corpus after applying different paths of text processing ..... 54
Figure 5-15 Classification accuracy of English Facebook corpus after applying different paths of text processing ..... 54
Figure 5-16 Classification training time of English Facebook corpus after applying different paths of text processing ..... 54
Figure 5-17 Classification accuracy of English Twitter corpus after applying different paths of text processing ..... 55
Figure 5-18 Classification training time of English Twitter corpus after applying different paths of text processing ..... 55
Figure 5-19 Text Processing and Text classification of Arabic corpora ..... 60
Figure 5-20 Classification accuracy of Arabic Reviews corpus ..... 64
Figure 5-21 Classification training time of Arabic Reviews corpus ..... 64
Figure 5-22 Classification accuracy of Arabic Facebook corpus ..... 64
Figure 5-23 Classification training time of Arabic Facebook corpus ..... 65
Figure 5-24 Classification accuracy of Arabic Twitter corpus ..... 65
Figure 5-25 Classification training time of Arabic Twitter corpus ..... 65


List of Abbreviations

BN      Bayesian Network
BOW     Bag of Words
BR      Building Resources
DA      Dialectal Arabic
DT      Decision Tree
FS      Feature Selection
IMDB    Internet Movie Database
LSA     Latent Semantic Analysis
ML      Machine Learning
MSA     Modern Standard Arabic
NB      Naïve Bayes
NER     Named Entity Recognition
NLP     Natural Language Processing
OSN     Online Social Networks
PMI     Point-wise Mutual Information
POS     Part of Speech
SA      Sentiment Analysis
SC      Sentiment Classification
SO      Semantic Orientation
SVM     Support Vector Machines
TF-IDF  Term Frequency–Inverse Document Frequency
WOM     Word of Mouth


Chapter 1: Introduction

1.1 Background of Research Work
The aim of this research is to find a solution for Sentiment Analysis (SA) on social network data. Web sites and web technologies have evolved rapidly in recent years, and this growth has led to a huge amount of data on the internet. Online Social Networks (OSN) have seen great evolution in the last decade and are considered a very good source of information on the web. The web is used by speakers of many languages; it is no longer used by English speakers only. SA systems that can analyze OSN content in languages other than English are therefore needed, and it is important to find a platform or framework that can deal with these kinds of text data for better analysis.

1.2 Thesis Challenges and Contributions
Text mining has become necessary to extract information and discover knowledge from the huge amount of textual data on the web, especially on social networks. Working on text data requires a better understanding of the text, and Natural Language Processing (NLP) techniques can help achieve that understanding. Sentiment Analysis is one of the text mining techniques that analyze people's opinions about certain topics. The thesis proposes a framework that can help in applying SA on OSN for two natural languages (English and Arabic). The framework is designed for preparing and using corpora from OSN and review sites, and it consists of three phases. The first phase is the preprocessing and cleaning of the collected data, followed by data annotation. The second phase applies various text processing (NLP) techniques to the prepared corpora, including removing stopwords, replacing negation words and the negated words that follow them with the antonyms of those negated words, and using selective words of part-of-speech tags (adjectives and verbs). The third phase is text classification using Naïve Bayes (NB) and Decision Tree (DT) classifiers and two feature selection approaches, unigrams and bigrams. The framework components were analyzed at each stage for the English corpora. For the Arabic corpora, only one text processing technique is used, which is removing stopwords. We have proposed a methodology to generate a stopword list of Egyptian dialect and a corpus-based list to be used with the OSN Arabic corpora.


In the thesis, we focus on investigating six research questions:
- Question 1: What is the difference between the Arabic and English languages with respect to preparing corpora and text processing? To investigate this question, we have prepared corpora in both languages.
- Question 2: Is text processing important in SA of OSN? To investigate this question, we have tested SA on OSN with and without applying text processing techniques.
- Question 3: Does text processing give the same effect with any classifier? Which is better? To investigate this question, we tested two classifiers from two different families (NB and DT).
- Question 4: Is the data from OSN balanced? If not, does the data imbalance affect the SA task?
- Question 5: Is the OSN data different from review data?
- Question 6: Is there a difference between Facebook and Twitter users? If yes, what is it?
The tests and experiments investigate Questions 4, 5, and 6.

1.3 Thesis Layout
Chapter 2 gives an introduction to text mining and its various algorithms, including sentiment analysis; brief details of the various text mining algorithms are presented. Chapter 3 discusses sentiment analysis with respect to the algorithms used and its applications; brief details of sentiment analysis algorithms are presented. Readers with prior knowledge of text mining and sentiment analysis can skip these two chapters. Chapter 4 presents the proposed framework. A detailed explanation of the implemented components of the framework is given, the discrepancies between handling English and Arabic corpora are explained, and the phase of preparing and classifying corpora from OSN is presented in full detail. This chapter investigates the first three research questions. Chapter 5 presents the tests, results, and discussions that were made to evaluate the framework. The framework was tested on four different corpora in the English language (a benchmark corpus and corpora from IMDB, Facebook, and Twitter), and a component analysis is also presented. The framework was also tested on three different corpora in the Arabic language, from a review site, Facebook, and Twitter. The corpora and results are analyzed and discussed. This chapter investigates the last three research questions. Chapter 6 concludes the thesis, answers all research questions, and presents future work.


Chapter 2: Text Mining

2.1 Introduction
Large amounts of text data are created nowadays in many social network sites, on the web, and in other information-centric applications. The evolution of Web 2.0 technologies, which allow users to interact with each other and share personal information or opinions about general topics, helps increase the amount of data on the web. The development of hardware and software platforms for the web and social networks has created huge repositories of different kinds of data, i.e. text, images, videos, etc. Text data is a sort of unstructured data, which is managed mainly by search engines, while structured data is managed by database systems. This emphasizes the role of text mining, where we need to analyze this huge amount of text data in order to discover patterns, unlike traditional research that focuses on facilitating information access. The main characteristics of text data are its sparsity and high dimensionality: a document of a few hundred words can be drawn from a corpus of thousands of distinct words. Text can be represented as a bag of words, but a semantic representation of text allows a more meaningful analysis where mining is done. Most text mining algorithms still depend on the Bag of Words representation, as there is no accurate semantic representation of data, although it does not work well in open text domains.

2.1.1 Algorithms of Text Mining
These are some of the most common text mining techniques and algorithms, as illustrated in [1]:
a) Information Extraction: extract entities and their relations to reveal more semantic information.
b) Transfer Learning: transfer the learned knowledge from one domain to another.
c) Mining Text Streams: mine data from social network sites and news blogs.
d) Text Mining on Social Networks.
e) Opinion Mining.
The following sections contain brief details about each of them.

2.2 Information Extraction of Text
The two main tasks of information extraction are:
- Named Entity Recognition (NER)
- Relation Extraction


2.2.1 Named Entity Recognition
NER is the most fundamental task in information extraction, as the extraction of relations depends on accurate NER. We have to find proper names and specify to which of the three main categories (Person, Organization, and Location) each proper name belongs. We should take the context of the word into consideration; e.g. JFK could mean a person (an American president's name) or a location (an American airport). Two main approaches are used in NER:
a) Rule-based approach: depends on recognition rules, e.g. "Mr." + capitalized word → person's name. These rules can be written manually at first, and then the machine can be trained to learn further rules. This approach was used in [2-6].
b) Statistical learning approach: uses machine learning techniques to assign labels to entities, such as:
- Hidden Markov Models (HMM)
- Maximum Entropy Markov Models (MEMM)
- Conditional Random Fields (CRF)
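As a toy illustration of the rule-based approach (the pattern and the location gazetteer below are invented for illustration, not taken from the cited systems), a couple of such rules can be expressed directly as code:

import re

# Toy rule: an honorific followed by capitalized tokens -> Person.
PERSON_RULE = re.compile(r"\b(?:Mr\.|Mrs\.|Dr\.|Prof\.)\s+((?:[A-Z][a-z]+\s?)+)")
# Toy gazetteer rule: tokens found in a known-location list -> Location.
LOCATIONS = {"JFK", "Cairo", "London"}

def rule_based_ner(text):
    entities = [(m.group(1).strip(), "Person") for m in PERSON_RULE.finditer(text)]
    entities += [(tok.strip(".,"), "Location") for tok in text.split()
                 if tok.strip(".,") in LOCATIONS]
    return entities

print(rule_based_ner("Mr. John Smith landed at JFK airport."))
# [('John Smith', 'Person'), ('JFK', 'Location')]

Hand-written rules of this kind cannot resolve the JFK ambiguity mentioned above; that is exactly where the statistical models listed in (b) are stronger.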

2.2.2 Relation Extraction
Relation extraction is based on the task definition of the Automatic Content Extraction (ACE) program [7]. It focuses on discovering binary relations between two entities, which are divided into three categories:
- Physical
- Personal/Social (e.g. family relations)
- Employment/Affiliation

2.3 Transfer Learning
Problem definition: transfer learning is extracting knowledge from an auxiliary domain to improve the learning process in a target domain, e.g. transferring knowledge from Wikipedia documents to tweets, or from a search in English to one in Arabic. Transfer learning is considered a new cross-domain learning technique, as it addresses the various aspects of domain differences. It is used to enhance many text mining tasks, such as text classification [8], sentiment analysis [9], named entity recognition [10], and part-of-speech tagging [11].

Transfer learning helps solve the problem of cross-domain text classification. Support Vector Machines can be used as a representative base model to illustrate how labeled data in auxiliary domains can be used to achieve knowledge transfer from those domains to the target domain. In the SA problem, for example, where the target is to classify reviews as positive or negative, the label space becomes Y = {+1, -1}. Transfer learning in SA can also be applied to transfer sentiment classification from one domain to another [12] or to build a bridge between two domains [13]. The other kind of transfer learning is instance-based transfer: identifying a subset of source instances and inserting them into the training set of the target domain data. Some instances in auxiliary domains are helpful for training the target domain model and others are harmful, so the useful ones should be selected and the others should be discarded. This is done by applying instance weighting to the source domain data according to their importance to learning in the target domain.

2.4 Text Analytics in Social Media
Traditional media like newspapers, television, and radio represent one-way communication between businesses and consumers, while Web 2.0 technologies, including social media, enable consumer-to-consumer communication; users interact in social media dialogue as a virtual community. Social media is used for sharing and discussing information and experiences between human beings in an efficient way. Social media carries different formats of data, such as text, images, and video. Text mining in social media uses techniques from data mining, machine learning, information retrieval, and natural language processing.

2.4.1 Distinct Aspects of Text in Social Media
Social media text has many characteristics, but first let us describe a general framework for dealing with text in social media. A traditional text analytics framework consists of three consecutive phases:
- Text preprocessing: includes the removal of stopwords and stemming [14]. Opinion mining needs a syntactic point of view, so it retains the original sentence structure.
- Text representation: transforms documents into sparse numeric vectors that can be handled with linear algebraic operations, such as the "Bag of Words (BOW)" or "Vector Space Model (VSM)". In the BOW model, a word is represented as a separate variable having a numeric weight of varying importance. A famous weighting scheme called "Term Frequency–Inverse Document Frequency (TF-IDF)" is often used:

    tf-idf(w, d) = tf(w, d) × log(N / df(w))

where tf(w, d) is the number of occurrences of word w in document d, df(w) is the number of documents containing the word, and N is the total number of documents.
- Knowledge Discovery: after transforming the text into numeric vectors, any classification or clustering algorithm can be applied.

Illustrative example (taken from [1]): we have three microblog messages about a movie review.
"watching the King's speech"
"I like the king's speech"
"they decide to watch a movie"

After applying Phase 1, they become:
"watch king's speech"
"king's speech"
"decid watch movi"


After applying Phase 2, using BOW to model the three messages with TF-IDF weights, the corpus can be represented as a words × documents matrix; each row represents a word and each column represents a message, as shown below.
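A minimal sketch of both phases on these three messages, using NLTK and scikit-learn (tool choices assumed for this sketch, not prescribed by [1]; note that scikit-learn applies a smoothed variant of the TF-IDF weighting above):

from nltk.corpus import stopwords                     # requires nltk.download("stopwords")
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer

messages = ["watching the King's speech",
            "I like the king's speech",
            "they decide to watch a movie"]

# Phase 1: preprocessing -- remove stopwords and stem the remaining words.
stop = set(stopwords.words("english"))
stemmer = PorterStemmer()
cleaned = [" ".join(stemmer.stem(w) for w in m.lower().split() if w not in stop)
           for m in messages]
print(cleaned)

# Phase 2: representation -- BOW vectors with (smoothed) TF-IDF weights.
vectorizer = TfidfVectorizer()
doc_term = vectorizer.fit_transform(cleaned)          # documents x words
print(vectorizer.get_feature_names_out())             # the vocabulary (rows of the matrix below)
print(doc_term.T.toarray())                           # words x documents matrix, as in the example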

There are several distinct characteristics of text in social media:
- Time sensitivity: the information on blogs may change every few days, while on microblogs or OSN it may change several times a day.
- Short length: some OSN, such as Twitter, limit their messages to 140 characters, so these messages cannot provide sufficient context information for effective similarity measures [15], unlike normal text, in addition to the huge amount of data they produce.
- Unstructured phrases: users enter text on OSN expressing their opinion about some topic, so the sentences are not structured at all, because this is not a formal form of text, and we face the problem of acronyms and shorthand, e.g. "how r u?" or "good9t".
- Abundant information: non-content information, such as the hashtags in tweets.

2.4.2 Applying Text Analytics in Social Media
A number of methods have been proposed to handle text analytics in social media while dealing with the unique characteristics of its text. Some of them are described below.
i) Event Detection
These methods monitor a data source and detect the occurrence of events in that source, which may consist of images, video, or text documents [16]. They can detect real-time events, such as earthquakes, on Twitter. This is done by training a classifier with messages and features to identify positive and negative sentiments, then applying a probabilistic model to identify the location of the event; this can then be used to predict earthquakes [17]. News sites and newspapers can also be used to detect headlines [18]. Considering social media as a network, the relations between people and the tags on photos can be analyzed to detect events [19].
ii) Collaborative Question Answering
Many sites offer question answering, and they have become widely used, creating huge amounts of data organized in question-answering databases. Therefore, anyone can search the frequently asked questions, looking for questions related to the topic of concern, instead of performing a web search. To enhance this method, question-answering quality is assessed by dividing the questions into two categories: informational ("Is coke good?") and conversational ("Do you like coke?"). In [20], a graph-based approach is proposed to perform question retrieval by segmenting multi-sentence questions. Question sentences are detected using a classifier built from both lexical and syntactic features; then similarity methods are used to measure the closeness score between the question and the context sentences. On the other hand, some systems provide corresponding high-quality question-answer pairs from the answer's point of view.
iii) Social Tagging
Social tagging services organize, store, and search tags and bookmarks of online resources. The resources themselves are not shared, unlike in file sharing; merely the tags that describe them or the bookmarks that reference them. The users of tagging services have created a large volume of tagging data, which has attracted recent attention from the research community. Given the huge number of tags, it is difficult for a user to quickly locate the relevant resources he wants. Therefore, tagging services provide keyword-based search, which returns resources annotated with the given tags. The results are not accurate due to the short, unstructured nature of tags: because of their short length and sparseness, current systems designed for keyword-based search fail to capture the semantic relationship between semantically related tags. For example, when a user searches for an event like "Egyptian Revolution", the systems will return results tagged as "Egyptian" or "Revolution" but not resources tagged with "Mubarak's resignation" or "Protest", which are highly related to "Egyptian Revolution". This "semantic gap" leaves many valuable and interesting results overlooked and buried in disorganized resources. The research work in this field has two main targets: the first is to improve the quality of tag recommendation [21]; the second is to utilize social tagging resources to facilitate other applications.
iv) Bridging the Semantic Gap
When processing textual data in social media, the BOW approach is limited, as it can only use pieces of information that are explicitly mentioned in the documents [22]. It is inadequate for building semantic relationships with other relevant concepts due to the semantic gap. For example, "The Dark Knight" and "Batman" are different names of one movie, but they cannot be linked as the same concept without additional information from external knowledge. Researchers have proposed semantic knowledge bases to bridge the widely extant semantic gap in short text representation. Some of the proposed algorithms [23] use information from Wikipedia along with the short messages to enrich them, so that they can then be used in classification or clustering.


2.5 Mining Text Streams
Text streams are the input from OSN, news, chats, and blogs. The main characteristics of this data are its huge volume and its unstructured nature, especially for data from OSN. Stream mining techniques generally depend on summarization, so there is a need to design online summarization methods suited to the unstructured nature of text data. In the case of multi-dimensional and time-series data, such summarization often takes the form of methods such as histograms, wavelets, and sketches, which can be used in order to create a structured summary of the underlying data [24].

2.5.1 Clustering Text Streams
Clustering is widely studied in the context of numerical data; two of the most famous techniques are:
- COBWEB: assumes nominal attributes [25].
- CLASSIT: assumes real-valued attributes [26].
Both algorithms are considered extensions of the k-means algorithm, but applied to streams. There is also:
- Online Spherical k-means (OSKM): one of the earliest methods for clustering text streams; the data stream is divided into segments and the k-means algorithm is applied to each segment [27].
Other algorithms depend on distributional assumptions about words in documents. Some techniques treat data that do not fit in any cluster as outliers, which may later form clusters of their own. There are also micro-clusters, which create summaries from data points to estimate the assignment of incoming data [28]. Topic modeling and tracking in text streams is another direction: the techniques in [29, 30] create unsupervised clusters from a text stream and then determine the sets of clusters that match real events, i.e. documents identified by humans, while the technique in [31] divides documents into groups representing an event and then processes the documents sequentially to determine whether an incoming document corresponds to a new event or an existing one.
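A rough sketch of the segment-wise idea behind OSKM: buffer the stream into fixed-size segments and update the cluster model on each segment. The hashing vectorizer, segment size, and cluster count are illustrative assumptions, and scikit-learn's MiniBatchKMeans stands in for a true spherical k-means:

from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.cluster import MiniBatchKMeans

vectorizer = HashingVectorizer(n_features=2**12, alternate_sign=False)  # stateless, stream-friendly
model = MiniBatchKMeans(n_clusters=3, random_state=0)

def cluster_stream(stream, segment_size=100):
    """Consume an iterable of texts segment by segment, updating the clusters online."""
    segment = []
    for text in stream:
        segment.append(text)
        if len(segment) == segment_size:
            model.partial_fit(vectorizer.transform(segment))   # update clusters on this segment
            segment.clear()
    if segment:                                                 # flush the final partial segment
        model.partial_fit(vectorizer.transform(segment))

# Afterwards, model.predict(vectorizer.transform(["some new message"])) assigns cluster labels.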

2.5.2 Classification of Text Streams
Classification of text streams is useful mainly in the context of information retrieval tasks, e.g. news filtering (assigning incoming documents to pre-defined categories) [32] or email spam filtering [33]. There are two scenarios for the form of the training and test data:
- The training data is available for batch learning, but the test data arrives in the form of a stream.
- Both the training and the test data arrive in the form of streams. The patterns in the training data continuously change over time, so the models need to be updated dynamically.


The first case is easy to handle, as most classifier models are compact and classify individual test instances efficiently. The second case needs the training model to be updated automatically to capture the continuous change in the input patterns. This can be done by incorporating temporal decay factors into the model construction algorithms in order to age out the old data; a Naïve Bayes classifier or k-nearest neighbor can then be applied, as illustrated in [34]. There is also work on one-class classification of text streams, illustrated in [35], in which only training data for the positive class is available and there is no training data for the negative class. This is quite common in many real applications in which it is easy to find representative documents for a particular topic, but hard to find representative documents to model the background collection. The method works by designing an ensemble of classifiers in which some of the classifiers correspond to a recent model, whereas others correspond to a long-term model. Neural networks are also good at classifying text streams, as they are designed as a classification model with a network of perceptrons and corresponding weights associated with the term-class pairs. There is also a rule-based technique that can learn classifiers incrementally from data streams, called the sleeping-experts system [36, 37]. This rule-based system uses the position of the words in generating the classification rules; the rules correspond to sets of words that are placed close together in a given text document and are related to a class label.

2.5.3 Evolution Analysis of Text Streams
A key problem in the case of text is to determine evolutionary patterns in temporal text streams. An early survey on the topic of evolution analysis in text streams may be found in [38]. These evolutionary patterns are very useful in many applications, e.g. summarizing events in news articles and revealing research trends in the scientific literature. An event may have a life cycle in the underlying theme patterns, such as the beginning, duration, and end of a particular event. Similarly, the evolution of a particular topic in the research literature may have a life cycle in terms of how the different topics affect one another. This problem was defined in [39] and contains three main parts:
- Discovering the themes in text.
- Creating an evolution graph of themes.
- Studying the life cycle of themes.
A theme is defined as a semantically related set of words, with a corresponding probability distribution, which coherently represents a particular topic or sub-topic.

2.6 Opinion Mining and Sentiment Analysis
Sentiment analysis, or opinion mining, is the computational study of people's opinions, attitudes, and emotions toward entities, individuals, events, topics, and their attributes.


2.6.1 Opinion Mining
Every opinion has a target (a product, service, event, etc.), which is called the entity, and a holder (me, him, etc.). The entity can be a hierarchy or a tree; e.g. an iPhone has components (battery, screen, etc.) and attributes (sound, resolution, etc.), which together make up the aspects of the iPhone as a two-level tree. Opinions can be [40, 41]:
- Regular: an opinion about a single entity, such as a positive or negative sentiment, attitude, or emotion about that entity.
- Comparative: finding similarities or differences between two or more entities.
A formal definition of opinions as quintuples helps to see the unstructured data as structured data that can be handled by database management systems. To form the quintuples, we have to follow these steps [42]:
- Extract all entities and group similar entities into clusters to obtain ei.
- Extract all aspects and group similar aspects into clusters to obtain aij.
- Extract the opinion holder and time to obtain hk and tl.
- Determine whether each opinion on an aspect is positive, negative, or neutral to obtain ooijkl.
- Generate the quintuples (ei, aij, ooijkl, hk, tl).
We can then generate an aspect-based (also called feature-based) opinion summary. We need a form of summary other than the quintuples themselves because we want to study opinions about entities from a huge number of holders. In [43, 44] they produce summaries like the one shown below.

Cellular phone 1:
  Aspect: GENERAL
    Positive: 125
    Negative: 7
  Aspect: Voice quality
    Positive: 120
    Negative: 8

2.6.2 Document Sentiment Classification
Problem definition: classify a single opinion document d, which expresses an opinion on a single entity e from a single holder h, as a positive or negative opinion or sentiment. There are two major kinds of classification.

2.6.2.1 Classification based on supervised learning
Sentiment classification can be formulated as a supervised learning problem with three classes: positive, negative, and neutral. The training and testing data used for this kind of problem are mostly product reviews. Several supervised learning methods can be applied to sentiment classification, e.g. Naïve Bayes (NB) classifiers and support vector machines (SVM). The work in [45] took this approach to classify movie reviews into two classes, positive and negative, and showed that using BOW or unigram features in classification performed well with either NB or SVM. The main task of sentiment classification is to find an effective set of features. Some of the commonly used features are:
- Terms and their frequencies: individual words and their counts, possibly with a weighting scheme such as TF-IDF.
- Part of speech: adjectives in particular, as they are important indicators of opinions.
- Opinion words and phrases: words commonly used to express opinions, such as good and like, or bad and hate; sometimes they are phrases, like "cost me an arm and a leg".
- Negations: the appearance of negation words may change the opinion orientation; for example, "not good" is equivalent to "bad".
- Syntactic dependency: word dependencies generated by parsing.
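A compact sketch of this supervised setup, with unigram (BOW) counts feeding an NB classifier; the tiny training set below is invented purely for illustration:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_docs   = ["great acting and a wonderful plot",
                "boring movie, bad acting",
                "I really like this camera",
                "terrible battery, I hate it"]
train_labels = ["positive", "negative", "positive", "negative"]

# Unigram bag-of-words counts feeding a Naive Bayes classifier;
# ngram_range=(1, 2) would add bigram features such as "not good" as well.
classifier = make_pipeline(CountVectorizer(ngram_range=(1, 1)), MultinomialNB())
classifier.fit(train_docs, train_labels)

print(classifier.predict(["great camera", "boring and terrible movie"]))
# ['positive' 'negative']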

2.6.2.2 Classification based on unsupervised learning
Since opinion words and phrases are the dominating indicators of sentiment, unsupervised learning based on such words and phrases can be used. The method in [93] uses known opinion words for classification, while [46] defines some phrases that are likely to be opinionated. The algorithm in [46] consists of three steps:
- Extract all phrases containing adjectives or adverbs.
- Estimate the semantic orientation (SO) of the extracted phrases.
- Compute the average SO of all phrases in the review, then classify the review as recommended if the average SO is positive, and not recommended otherwise.
The main advantage of document-level sentiment classification is that it provides a general opinion on an entity, topic, or event. The main disadvantage is that it does not give details on what people liked or disliked, and it is not easily applicable to non-reviews, such as forum and blog postings, as they evaluate multiple entities and compare them.
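Returning to step 2 of the algorithm above: one well-known way to estimate semantic orientation is Turney-style PMI-IR, which compares how often a phrase co-occurs with a positive reference word versus a negative one. A sketch of that computation, where hits() is an abstract stand-in for the underlying corpus or search-engine counts (an assumption of this sketch):

import math

def so_pmi(phrase, hits, pos_ref="excellent", neg_ref="poor"):
    """SO(phrase) = PMI(phrase, pos_ref) - PMI(phrase, neg_ref), via co-occurrence counts."""
    near_pos = hits(f"{phrase} NEAR {pos_ref}") + 0.01      # small constant avoids log(0)
    near_neg = hits(f"{phrase} NEAR {neg_ref}") + 0.01
    return math.log2((near_pos * hits(neg_ref)) / (near_neg * hits(pos_ref)))

def classify_review(phrases, hits):
    """Average the SO of all extracted phrases; a positive average means 'recommended'."""
    average_so = sum(so_pmi(p, hits) for p in phrases) / len(phrases)
    return "recommended" if average_so > 0 else "not recommended"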

2.6.3 Sentence Subjectivity and Sentiment Classification
The same document-level sentiment classification techniques can also be applied to individual sentences. The task of classifying a sentence as subjective or objective is often called subjectivity classification [47, 48]. The subjective sentences are then classified as expressing positive or negative opinions, which is called sentence-level sentiment classification. Two sub-tasks are performed:
- Subjectivity classification: determine whether sentence s is a subjective or an objective sentence.
- Sentence-level sentiment classification: if s is subjective, determine whether it expresses a positive, negative, or neutral opinion.


Much of the research on sentence-level sentiment classification assumes that the sentence expresses a single opinion from a single opinion holder. This assumption is only appropriate for simple sentences with a single opinion, e.g. "The picture quality of this camera is amazing." In compound and complex sentences, a single sentence may express more than one opinion, e.g. "The picture quality of this camera is amazing and so is the battery life, but the viewfinder is too small for such a great camera." This sentence expresses both positive and negative opinions (mixed opinions), so sentence-level classification is not suitable for compound and complex sentences. It was pointed out in [49] that a single sentence may contain not only multiple opinions but also both subjective and factual clauses.

2.6.4 Opinion Lexicon Expansion
Opinion words are employed in many sentiment classification tasks, but how are they generated? Positive opinion words are used to express desired states, while negative opinion words are used to express undesired states; together with opinion phrases and idioms, they form the opinion lexicon. There are three main approaches to compiling or collecting the opinion word list.
i) Manual approach
The manual approach is very time consuming, so it is not usually used alone but combined with the following automated approaches, serving as a final check because automated methods make mistakes.
ii) Dictionary-based approach
The strategy is to first collect a small set of opinion words manually with known orientations, and then to grow this set by searching WordNet [50] or a thesaurus [51] for their synonyms and antonyms. The newly found words are added to the seed list and the next iteration starts; the iterative process stops when no more new words are found. This approach is used in [43, 52]. After the process completes, a manual inspection can be carried out to remove or correct errors.
iii) Corpus-based approach
The technique in [53] starts with a list of seed opinion adjectives and uses them, together with a set of linguistic constraints, to identify additional adjective opinion words and their orientations. One of the constraints concerns the conjunction AND: conjoined adjectives usually have the same orientation. Rules or constraints are also designed for other connectives such as OR, BUT, and EITHER-OR; this idea is called sentiment consistency. Learning is applied to a large corpus to determine whether two conjoined adjectives have the same or different orientations; the links between adjectives form a graph, and clustering is then performed on the graph to produce two sets of words: positive and negative.
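A minimal sketch of the dictionary-based bootstrapping described in (ii), using NLTK's WordNet interface (the seed words are illustrative, and a real run would finish with the manual inspection mentioned above):

from nltk.corpus import wordnet as wn      # requires nltk.download("wordnet")

def expand_lexicon(seeds):
    """Grow a seed set by repeatedly adding WordNet synonyms until no new words appear."""
    lexicon = set(seeds)
    frontier = set(seeds)
    while frontier:
        new_words = set()
        for word in frontier:
            for synset in wn.synsets(word, pos=wn.ADJ):
                for lemma in synset.lemmas():
                    new_words.add(lemma.name().lower())
                    # lemma.antonyms() would feed the opposite-orientation list
        frontier = new_words - lexicon
        lexicon |= frontier
    return lexicon

positive_words = expand_lexicon(["good", "great", "excellent"])
print(len(positive_words), sorted(positive_words)[:10])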


2.6.5 Aspect-based Sentiment Analysis
Classifying text at the document level or at the sentence level does not provide the detailed opinions on all aspects of the entity that are needed in many applications; to obtain these details, we need to go to the aspect level. Aspect-based sentiment analysis introduces a suite of problems that require deeper natural language processing capabilities and also produces a richer set of results. Recall from Section 2.6.1 that the mining objective is to discover every quintuple (ei, aij, ooijkl, hk, tl) in a given document d. Five tasks were needed to generate the quintuples; here we focus on tasks 2 and 4, which were:
- Extract all aspects and group similar aspects into clusters.
- Determine whether each opinion on an aspect is positive, negative, or neutral.

i) Aspect Sentiment Classification
The sentence-level and clause-level sentiment classification methods discussed above can be used here, applied to each sentence or clause containing some aspects; the aspects in it take the opinion orientation of the sentence or clause. This, however, has difficulty dealing with mixed opinions in a sentence and with opinions that need phrase-level analysis. The lexicon-based approach proposed in [43, 54] addresses this problem. It basically uses an opinion lexicon, i.e. a list of opinion words and phrases, and a set of rules to determine the orientations of opinions in a sentence; it also considers opinion shifters and but-clauses. The approach works as follows:
- Mark opinion words and phrases: given a sentence, mark all opinion words and phrases; positive words are given a score of +1 and negative words a score of -1.
- Handle opinion shifters: words and phrases that can shift or change opinion orientations, such as negation words, e.g. "not".
- Handle but-clauses: the word "but" indicates contrast, so the clause before "but" carries the opposite opinion of the clause after it.
- Aggregate opinions: apply an opinion aggregation function to the resulting opinion scores to determine the final orientation of the opinion on each aspect in the sentence.

A commonly used aggregation function takes the form

    score(ai, s) = Σ owj.so / dist(owj, ai),

where s is a sentence that contains a set of aspects {a1 . . . am} and a set of opinion words or phrases {ow1 . . . own} with their opinion scores, owj.so is the opinion score of owj, the sum runs over the opinion words in s, and dist(owj, ai) is the distance between owj and the aspect ai in s.
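A simplified sketch of this lexicon-based scoring (the tiny lexicon, the tokenization, and the clause splitting on "but" are illustrative simplifications of the scheme in [43, 54]):

NEGATION_WORDS = {"not", "no", "never"}
OPINION_SCORES = {"amazing": +1, "great": +1, "good": +1, "small": -1, "bad": -1}

def aspect_scores(sentence, aspects):
    """Distance-weighted aggregation: sum of score(ow) / dist(ow, aspect) per clause."""
    scores = {}
    for clause in sentence.lower().replace(",", "").split(" but "):   # handle but-clauses
        tokens = clause.split()
        for aspect in aspects:
            if aspect not in tokens:
                continue
            pos_aspect = tokens.index(aspect)
            total = 0.0
            for i, token in enumerate(tokens):
                if token in OPINION_SCORES:
                    score = OPINION_SCORES[token]
                    if i > 0 and tokens[i - 1] in NEGATION_WORDS:     # opinion shifter
                        score = -score
                    total += score / (abs(i - pos_aspect) or 1)       # distance weighting
            scores[aspect] = total
    return scores

print(aspect_scores("the picture quality is amazing, but the viewfinder is too small",
                    ["quality", "viewfinder"]))
# {'quality': 0.5, 'viewfinder': -0.333...}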


ii) Aspect Extraction
Existing research on aspect extraction is mainly carried out on online reviews. There are some unsupervised methods for finding aspect expressions, which are usually nouns and noun phrases. The method in [43] consists of two steps:
- Find frequent nouns and noun phrases, identified using a part-of-speech (POS) tagger.
- Find infrequent aspects by exploiting the relationship between aspects and opinion words.
There are many other algorithms and enhancements that use information extraction techniques such as Conditional Random Fields (CRF) [55, 56] or Hidden Markov Models (HMM) [57-59].

2.7 Future Trend in Research
The rapid growth of online textual data creates an urgent need for powerful text mining techniques. Text data mining is of interest to multiple research communities, such as data mining, natural language processing, information retrieval, and machine learning, with applications in many different areas. Many models and algorithms have been developed for various text mining tasks, as illustrated in the previous sections. The general future directions in this field are [1]:
- Scalable and robust methods for natural language understanding: understanding text information is fundamental to text mining. Current approaches mostly rely on BOW, but it is desirable to go beyond such a simple representation. Information extraction techniques provide one step toward semantic representation, but current information extraction methods mostly rely on supervised learning and generally only work well when sufficient training data is available. It is thus important to develop effective and robust information extraction and other natural language processing methods that can scale to multiple domains.
- Domain adaptation and transfer learning: although supervised learning depends heavily on the amount of training data available, it is generally labor-intensive to create large amounts of training data. Domain adaptation and transfer learning methods can address this problem by attempting to exploit training data that might be available in a related domain or for a related task. Further development of more effective domain adaptation and transfer learning methods is necessary for more effective text mining.
- Contextual analysis of text data: text data is generally associated with a lot of context information, such as authors, sources, and time, or more complicated information networks associated with the text. In many applications, it is important to consider the context as well as user preferences in text mining. It is thus important to extend existing text mining approaches to incorporate context and information networks for more powerful text analysis.


Chapter 3: Sentiment Analysis
3.1 Introduction
Sentiment Analysis (SA) or Opinion Mining (OM) is the computational study of people's opinions, attitudes and emotions toward an entity. The entity can be individuals, events, or topics, and the topics are most likely those covered by reviews. The two expressions SA and OM are interchangeable and express essentially the same meaning: Opinion Mining extracts and analyzes people's opinions about an entity, while Sentiment Analysis identifies the sentiment expressed in a text and then analyzes it. Therefore, the target of SA is to find opinions, identify the sentiments they express, and then classify their polarity, as shown in Figure 3-1.

Figure ‎3-1 SA Steps on Product Reviews

There are three main classification levels in SA: document-level, sentence-level and aspect-level SA. Document-level SA aims to classify an opinion document as expressing a positive or negative opinion or sentiment; it considers the whole document a basic information unit (talking about one topic). Sentence-level SA aims to classify the sentiment expressed in each sentence. The first step is to identify whether the sentence is subjective or objective; if the sentence is subjective, sentence-level SA determines whether it expresses a positive or negative opinion. It was pointed out in [60] that sentiment expressions are not necessarily subjective in nature. However, there is no fundamental difference between document- and sentence-level classification, because sentences are just short documents [61]. Classifying text at the document level or at the sentence level does not provide the detailed opinions on all aspects of the entity that are needed in many applications; to obtain these details, we need to go to the aspect level. Aspect-level SA aims to classify the sentiment with respect to the specific aspects of entities. The first step is to identify the entities and their aspects. The opinion


holders can give different opinions about different aspects of the same entity, as in the sentence "The voice quality of this phone is not good, but the battery life is long". The corpora used in SA are an important issue in this field. The main sources of data are product reviews, which are important to business holders as they can take business decisions based on the analysis of users' opinions about their products. SA is not only applied to product reviews; it can also be applied to stock markets [62, 63], news articles [64], or political debates [65]. In political debates, for example, we could figure out people's opinions about certain election candidates or political parties, and could predict election results from political posts. The main sources of reviews are review sites, joined more recently by social network and microblogging sites, which are considered a very good source of information as customers share and discuss their opinions about a certain product freely. Many applications and enhancements of SA algorithms have been proposed in the last few years, and the number of articles presented every year in the SA field keeps increasing. This creates a need for survey papers that summarize the recent research trends and directions of SA. There are some sophisticated and detailed surveys, including [61, 66-70]; those surveys discussed the problem of SA from the applications point of view, while [71] discussed the problem from the SA-techniques point of view.

3.2 Feature Selection Methods
The Sentiment Analysis task is considered a sentiment classification (SC) problem. The first step in the SC problem is to extract and select text features. Feature selection methods can be divided into lexicon-based methods, which need human annotation, and statistical methods, which are automatic and more frequently used. Lexicon-based approaches usually begin with a small set of 'seed' words and bootstrap this set through synonym detection or on-line resources to obtain a larger lexicon, but they have proved to have many difficulties, as reported in [72]. Statistical approaches, on the other hand, are fully automatic. The feature selection techniques treat the documents either as a group of words (BOWs) or as a string that retains the sequence of words in the document; BOWs are used more often because of their simplicity for the classification process. The most common feature selection steps are the removal of stopwords and stemming (returning the word to its stem or root, e.g. flies → fly).

3.3 Sentiment Analysis Techniques
Sentiment Analysis techniques can be roughly divided into [73]:
Machine Learning (ML) Approach: applies the well-known ML algorithms and uses linguistic features.
Lexicon-based Approach: relies on a sentiment lexicon, a collection of known and precompiled sentiment terms. It is divided into the dictionary-based approach and the corpus-based approach, which use statistical or semantic methods to find sentiment polarity.


Hybrid Approach: combines the two approaches and is very common, with sentiment lexicons playing a key role in the majority of methods.
Figure 3-2 illustrates the well-known SC techniques and the most popular algorithms used in these fields.

Figure ‎3-2 Sentiment Analysis Techniques

The text classification methods using the ML approach can be roughly divided into supervised and unsupervised learning methods. The supervised methods make use of a large number of labeled training documents; when such labeled training documents are difficult to find, unsupervised methods are used. The lexicon-based approach depends on finding an opinion lexicon, which is then used to analyze the text. There are two methods in this approach: the dictionary-based approach, which depends on finding opinion seed words and then searching a dictionary for their synonyms and antonyms, and the corpus-based approach, which begins with a seed list of opinion words and then finds other opinion words in a large corpus, which helps in finding opinion words with context-specific orientations. This can be done using statistical or semantic methods. A brief explanation of both approaches' algorithms and related articles is given in the next subsections.


3.3.1 Machine Learning Approach
The machine learning approach relies on the well-known ML algorithms and treats the problem of SA as a regular classification problem that makes use of syntactic and/or linguistic features.
Text Classification Problem Definition: we have a set of training records, where each record is labeled with a class. The classification model relates the features in each record to one of the class labels; then, for a given instance of unknown class, the model is used to predict a class label for it. Hard classification assigns only one label to an instance, while soft classification assigns a probabilistic value over the labels to an instance.

3.3.1.1 Supervised Learning
The supervised learning methods depend on the existence of labeled training documents. There are many kinds of supervised classifiers in the literature. In the next subsections, we briefly present some of the most frequently used classifiers in SA.

3.3.1.1.1 Probabilistic Classifiers
Probabilistic classifiers use mixture models for classification; the mixture model assumes that each class is a component of the mixture. Each mixture component is a generative model that provides the probability of sampling a particular term for that component. These kinds of classifiers are also called generative classifiers.
3.3.1.1.1.1 Naïve Bayes Classifier

The Naive Bayes (NB) classifier is the simplest and most commonly used classifier. The NB classification model computes the posterior probability of a class based on the distribution of the words in the document. The model works with the BOW feature extraction, which ignores the position of the words in the document. It uses Bayes' theorem to predict the probability that a given feature set belongs to a particular label.
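Concretely, Bayes' theorem expresses this probability as:

P(\mathrm{label} \mid \mathrm{features}) = \frac{P(\mathrm{label}) \, P(\mathrm{features} \mid \mathrm{label})}{P(\mathrm{features})}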

P(label) is the prior probability of a label, i.e., the likelihood that a random feature set has that label. P(features|label) is the probability of observing a given feature set for that label. P(features) is the prior probability that a given feature set occurs. Given the naïve assumption that all features are independent, the equation can be rewritten as:
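P(\mathrm{label} \mid \mathrm{features}) = \frac{P(\mathrm{label}) \prod_{i=1}^{n} P(f_i \mid \mathrm{label})}{P(\mathrm{features})}

where f_1, ..., f_n are the individual features of the feature set.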


3.3.1.1.1.2 Bayesian Network

The main assumption of the NB classifier is the independence of the features; taking the other extreme, where the features are fully dependent, results in the Bayesian Network (BN) model. BNs are directed acyclic graphs whose nodes represent random variables and whose edges represent conditional dependencies. A BN is considered a complete model of the variables and their relationships; therefore, a complete joint probability distribution (JPD) over all the variables is specified for a model. In text mining, the computational complexity of BN is very high, which is why it is not frequently used [1].
3.3.1.1.1.3 Maximum Entropy Classifier

The Maximum Entropy Classifier (also known as a conditional exponential classifier) converts labeled feature sets to vectors using an encoding. This encoded vector is then used to calculate weights for each feature, which can then be combined to determine the most likely label for a feature set. The classifier is parameterized by a set of weights, which are used to combine the joint-features that are generated from a feature set by an encoding. In particular, the encoding maps each (feature set, label) pair to a vector. The probability of each label is then computed using the following equation:
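P(\mathrm{label} \mid \mathrm{featureset}) = \frac{\exp\Big(\sum_i w_i \, f_i(\mathrm{featureset}, \mathrm{label})\Big)}{\sum_{l'} \exp\Big(\sum_i w_i \, f_i(\mathrm{featureset}, l')\Big)}

where the w_i are the weights and the f_i(featureset, label) are the joint-features produced by the encoding; this is the standard conditional exponential form, and the exact parameterization may differ slightly in a given implementation.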

3.3.1.1.2 Linear Classifiers
Given that X is the normalized document word-frequency vector, A is a vector of linear coefficients with the same dimensionality as the feature space, and b is a scalar, the output of the linear predictor is defined to be p = A · X + b, which is the output of the linear classifier. The predictor p defines a separating hyperplane between the different classes. There are many kinds of linear classifiers; among them are Support Vector Machines [74, 75], which are a form of classifiers that attempt to determine good linear separators between the different classes.
3.3.1.1.2.1 Support Vector Machines Classifiers

The main principle of SVMs is to determine linear separators in the search space that can best separate the different classes. In Figure 3-3 there are two classes, x and o, and three hyperplanes A, B and C. Hyperplane A provides the best separation between the classes, because the normal distance of any of the data points from it is the largest, so it represents the maximum margin of separation.


Figure ‎3-3 SVM Classification Problem

Text data is ideally suited for SVM classification because of the sparse nature of text, in which few features are irrelevant, but they tend to be correlated with one another and generally organized into linearly separable categories [76]. SVM can construct a non-linear decision surface in the original feature space by mapping the data instances non-linearly to an inner product space where the classes can be separated linearly with a hyperplane [77].
3.3.1.1.2.2 Neural Network

A neural network consists of neurons as its basic units. The inputs to a neuron are denoted by the vector X_i, which corresponds to the word frequencies in the ith document. Each neuron is associated with a set of weights A, which are used to compute a function f(·) of its inputs. The linear function of the neural network is p_i = A · X_i. In a binary classification problem, it is assumed that the class label of X_i is denoted by y_i ∈ {−1, +1}, and the sign of the predicted function p_i yields the class label.

Figure ‎3-4 Multi-Layer neural network for non-linear separation


Multi-layer neural networks are used for non-linear boundaries, as shown in Figure 3-4, when no single linear boundary can separate the classes. The multiple layers are used to induce multiple piece-wise linear boundaries, which approximate enclosed regions belonging to a particular class. The outputs of the neurons in the earlier layers feed into the neurons in the later layers. The training process is more complex because the errors need to be back-propagated over the different layers. Implementations of NNs for text data can be found in [78, 79]. An empirical comparison between SVM and neural networks for document-level sentiment analysis is presented in [80]. They made this comparison because SVM has been widely and successfully used in SA, while neural networks have attracted little attention as an approach for sentiment learning. Their experiments indicated that neural networks produced results superior to SVM except in some unbalanced data contexts. They tested benchmark corpora of Movie, GPS, Camera and Books reviews from amazon.com and showed that, on the movie reviews, neural networks outperformed SVM by a statistically significant difference. They also showed that using information gain (a computationally cheap feature selection method) can reduce the computational effort of both neural networks and SVM without significantly affecting the resulting classification accuracy.

3.3.1.1.3 Decision Tree Classifiers
A decision tree is a hierarchical decomposition of the training data space in which a condition on an attribute value is used to divide the data [81]. The condition or predicate is the presence or absence of one or more words. The division of the data space is done recursively until the leaf nodes contain a certain minimum number of records, which are then used for classification. There are other kinds of predicates that depend on the similarity of documents to correlated sets of terms, which may be used for further partitioning of documents. The different kinds of splits are: single-attribute splits, which use the presence or absence of particular words or phrases at a particular node in the tree in order to perform the split [82]; similarity-based multi-attribute splits, which use document or frequent-word clusters and the similarity of the documents to these clusters in order to perform the split; and discriminant-based multi-attribute splits, which use discriminants such as the Fisher discriminant for performing the split [83]. The decision tree (DT) implementations in text classification tend to be small variations on standard packages such as ID3 and C4.5. In [84] they used the successor to the C4.5 algorithm, known as the C5 algorithm.

3.3.1.1.4 Rule-based Classifiers
In rule-based classifiers, the data space is modeled with a set of rules. The left-hand side of a rule represents a condition on the feature set expressed in disjunctive normal form (DNF), while the right-hand side is the class label. The conditions are on term presence; term absence is rarely used because it is not informative in sparse data.


There are a number of criteria for generating rules; the training phase constructs all the rules based on these criteria. The two most common criteria are [85]:
Support: the absolute number of instances in the training data set that are relevant to the rule.
Confidence: the conditional probability that the right-hand side of the rule is satisfied if the left-hand side is satisfied.
Decision trees and decision rules both tend to encode rules on the feature space, but the decision tree achieves this goal with a hierarchical approach. In [81], the decision tree and decision rule problems are studied within a single framework, since a certain path in a decision tree can be considered a rule for classifying a text instance. The main difference between decision trees and decision rules is that a DT is a strict hierarchical partitioning of the data space, while rule-based classifiers allow overlaps in the decision space.

3.3.1.2 Weakly, Semi- and Unsupervised Learning
The main purpose of text classification is to classify documents into a certain number of predefined categories. To accomplish this, a large number of labeled training documents is used for supervised learning, as illustrated before. In text classification it is sometimes difficult to create these labeled training documents, while it is easy to collect unlabeled documents. Unsupervised learning methods overcome this difficulty. Many research works have been presented in this field, among them [86], which proposed a method that divides the documents into sentences and categorizes each sentence using keyword lists of each category and a sentence similarity measure. The concept of weak and semi-supervision is used in many applications, for example in [87]. They proposed a strategy that works by providing weak supervision at the level of features rather than instances, and obtained an initial classifier by incorporating prior information extracted from an existing sentiment lexicon into sentiment classifier model learning. In their work they were able to identify domain-specific polarity words, clarifying the idea that the polarity of a word may differ from one domain to another. They showed that their approach attains better performance than other weakly-supervised sentiment classification methods and is applicable to any text classification task where some relevant prior knowledge is available. There are other unsupervised approaches that depend on semantic orientation using point-wise mutual information (PMI), as in [46], or lexical association using PMI, semantic spaces, and distributional similarity to measure the similarity between words and polarity prototypes, as in [88]. Supervised and unsupervised approaches can also be combined, as done in [89]. They proposed the use of meta-classifiers that combine supervised and unsupervised learning in order to develop a polarity classification system, working on a Spanish corpus of film reviews along with its parallel corpus translated into English.


First, they generated two individual models using these two corpora and applied ML algorithms (SVM, NB, C4.5 and others). Second, they integrated SentiWordNet into the English corpus, generating a new unsupervised model using the semantic orientation (SO) approach. Third, they combined the three systems using a meta-classifier. Their combined results outperformed the individual systems and showed that their approach can be considered a good strategy for polarity classification when working with parallel corpora.

3.3.2 Lexicon-based Approach
Opinion words are employed in many sentiment classification tasks, but how are they generated? Positive opinion words are used to express desired states, while negative opinion words are used to express undesired states; there are also opinion phrases and idioms, which together are called the opinion lexicon. There are three main approaches, as illustrated in Chapter 2, to compile or collect the opinion word list. The manual approach is very time consuming and is not usually used alone; rather, it is combined with the two automated approaches as a final check, because automated methods make mistakes. The two automated approaches are illustrated in the following subsections.

3.3.2.1 Dictionary-based Approach
The main strategy of the dictionary-based approach used in [43, 52] is to first collect manually a small set of opinion words with known orientations, and then grow this set by searching WordNet [50] or a thesaurus [51] for their synonyms and antonyms. The newly found words are added to the seed list and the next iteration starts; the iterative process stops when no more new words are found. After the process completes, manual inspection can be carried out to remove or correct errors. The dictionary-based approach has the major disadvantage that it is unable to find opinion words with domain- and context-specific orientations. Despite this disadvantage, it is used in some applications, such as contextual advertising in the work presented in [90].
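As an illustration, a minimal sketch of this iterative expansion using NLTK's WordNet interface is shown below; the seed words and the iteration bound are illustrative only, and the method-call style follows recent NLTK versions.

# Minimal sketch of dictionary-based lexicon expansion with WordNet.
# Synonyms keep the polarity of the seed word; antonyms take the opposite one.
from nltk.corpus import wordnet as wn

def expand_lexicon(pos_seeds, neg_seeds, max_iterations=3):
    positive, negative = set(pos_seeds), set(neg_seeds)
    for _ in range(max_iterations):            # bounded for illustration
        changed = False
        for words, opposite in ((positive, negative), (negative, positive)):
            for word in list(words):
                for synset in wn.synsets(word):
                    for lemma in synset.lemmas():
                        if lemma.name() not in words:
                            words.add(lemma.name())
                            changed = True
                        for antonym in lemma.antonyms():
                            if antonym.name() not in opposite:
                                opposite.add(antonym.name())
                                changed = True
        if not changed:                        # stop when no new words are found
            break
    return positive, negative

pos, neg = expand_lexicon(['good', 'excellent'], ['bad', 'poor'])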

3.3.2.2 Corpus-based Approach
The corpus-based approach helps to solve the problem of finding opinion words with context-specific orientations. Its methods depend on syntactic patterns, or patterns of co-occurrence, along with a seed list of opinion words, to find other opinion words in a large corpus. One of these methods was presented in [53]; it starts with a list of seed opinion adjectives and uses them, along with a set of linguistic constraints, to identify additional adjective opinion words and their orientations. The constraints concern connectives such as AND, OR, BUT, and EITHER-OR; the conjunction AND, for example, suggests that conjoined adjectives usually have the same orientation. This idea is called sentiment consistency, although it does not always hold in practice. There are also adversative expressions, such as but and however, which indicate opinion changes. In order to determine whether two conjoined adjectives have the same or different orientations,


learning is applied to a large corpus; the links between adjectives then form a graph, and clustering is performed on the graph to produce two sets of words: positive and negative. Using the corpus-based approach alone is not as effective as the dictionary-based approach, because it is hard to prepare a huge corpus that covers all English words, but it has the major advantage that it can help find domain- and context-specific opinion words and their orientations using a domain corpus. The corpus-based approach is carried out using a statistical approach or a semantic approach, as illustrated in the following subsections.

3.3.2.2.1 Statistical Approach
Finding co-occurrence patterns or seed opinion words can be done using statistical techniques, sometimes called corpus statistics. This can be done by deriving posterior polarities using the co-occurrence of adjectives in a corpus, as proposed in [91]. It is possible to use the entire set of indexed documents on the Web as the corpus for dictionary construction, which overcomes the problem of the unavailability of some words by using a corpus that is large enough [46]. The polarity of a word can be identified by studying its occurrence frequency in a large annotated corpus of texts [92]: if the word occurs more frequently among positive texts, its polarity is positive; if it occurs more frequently among negative texts, its polarity is negative; and if it has equal frequencies, it is a neutral word. Similar opinion words frequently appear together in a corpus; this is the main observation on which the state-of-the-art methods are based. Therefore, if two words appear together frequently within the same context, they are likely to have the same polarity, so the polarity of an unknown word can be determined by calculating the relative frequency of its co-occurrence with another word. This can be done using PMI, as in [46]. Latent Semantic Analysis (LSA) is a statistical approach for analyzing the relationships between a set of documents and the terms in these documents, producing a set of meaningful patterns related to the documents and terms [93]. LSA was used to find the semantic characteristics of review texts and to examine the impact of the various features in [94].
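For reference, the PMI measure mentioned above, and the semantic-orientation score built on it in [46] (with excellent and poor as the usual positive and negative reference words), take the form:

PMI(w_1, w_2) = \log_2 \frac{P(w_1, w_2)}{P(w_1)\,P(w_2)}, \qquad SO(w) = PMI(w, \text{excellent}) - PMI(w, \text{poor})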

3.3.2.2.2 Semantic Approach
The semantic approach gives sentiment values directly as well, but it relies on different principles for computing the similarity between words; it gives similar sentiment values to semantically close words. WordNet, for example, provides different kinds of semantic relationships between words that are used to calculate sentiment polarities. WordNet can also be used for obtaining a list of sentiment words by iteratively expanding an initial set with synonyms and antonyms, and then determining the sentiment polarity of an unknown word from the relative count of its positive and negative synonyms [52].


The semantic approach is used in many applications; it has been used to build a lexicon model describing verbs, nouns and adjectives for use in SA, as in the work presented in [65]. Semantic analysis of electronic word-of-mouth (eWOM) content was proposed in [95]; they extracted both positive and negative appraisals to help consumers in their decision making. Semantic methods can also be used along with statistical methods to perform the SA task, as in the work presented in [96], which used both methods to find product weaknesses from online reviews.

3.3.2.3 Lexicon-based and NLP Techniques
Natural Language Processing (NLP) techniques [97] are sometimes used with the lexicon-based approach to find the syntactic structure and help in finding the semantic relations, which has attracted researchers recently. In [98] NLP techniques were used as a preprocessing stage before the proposed lexicon-based SA algorithm. Their proposed system consists of an automatic Focus Detection Module and a SA Module capable of assessing user opinions on topics in news items, using a taxonomy-lexicon specifically designed for news analysis. Their results were promising in scenarios where colloquial language predominates. In [99] NLP was used from a different perspective: they used NLP techniques to identify tense and time expressions, along with mining techniques and a ranking algorithm, and proposed a metric to rank product reviews by 'mentions about experiences'. Their proposed metric has two parameters that capture time expressions related to the use of products and product entities over different purchasing time periods, and they identified important linguistic clues for the parameters through an experiment with crawled review data with the aid of NLP techniques.

3.4 Building Resources
Building Resources (BR) aims at creating lexica and corpora in which opinion expressions are annotated according to their polarity, and sometimes dictionaries. Building resources is not an SA task in itself, but it can help improve SA. The main challenges that confront work in this category are, as illustrated in [70]:
-Ambiguity of words.
-Multilinguality: the need for linguistic resources in different languages.
-Granularity: the fact that opinions can be expressed in words, sentences or entire phrases.
-The differences in opinion expression among textual genres.
Building a lexicon was presented in [100]. In their work they proposed a random-walk algorithm to construct a domain-oriented sentiment lexicon by simultaneously utilizing sentiment words and documents from both an old domain and the target domain. They conducted their experiments on three domain-specific sentiment data sets. Their experimental results indicate that the proposed algorithm improves the performance of automatic construction of domain-oriented sentiment lexicons.


Building a corpus was introduced in [101]. They proposed OpinionMining-ML, a new XML-based formalism for tagging textual expressions that convey opinions on objects considered relevant in the state of affairs; it is a new standard beside Emotion-ML and WordNet. Their work consists of two parts. First, they presented a standard methodology for the annotation of affective statements in text that is strictly independent of any application domain. Second, they addressed the domain-specific adaptation, which relies on the use of a supporting, domain-dependent ontology. They started with a dataset of restaurant reviews, applying a query-oriented extraction process, and evaluated their proposal by means of a fine-grained analysis of the disagreement between different annotators. Their results indicate that their proposal is an effective annotation scheme that is able to cover high complexity while preserving good agreement among different people. In [102] the focus was on the creation of EmotiBlog, a fine-grained annotation scheme for labeling subjectivity in non-traditional textual genres. They focused on annotation at different levels: document, sentence and element. They also presented the EmotiBlog corpus, a collection of blog posts composed of 270,000 tokens about 3 topics and in 3 languages: Spanish, English and Italian. They checked the robustness of the model and its applicability to NLP tasks with regard to the 3 languages, and tested their model on other corpora, e.g. ISEAR, with satisfactory results. They applied EmotiBlog to sentiment polarity classification and emotion detection, and showed that their resource improved the performance of systems built for these tasks. Building a dictionary was presented in [103]. In their work they proposed a semi-automatic approach to creating sentiment dictionaries in many languages. They first produced high-level gold-standard sentiment dictionaries for two languages and then translated them automatically into a third language. Words that can be found in both target-language word lists are likely to be useful, because their word senses are likely to be similar to those of the two source languages. They addressed two issues in their work: morphological inflection and the subjectivity involved in the human annotation and evaluation effort. They worked on news data, compared their triangulated lists with non-triangulated machine-translated word lists, and verified their approach.


Chapter 4: The Proposed Framework
4.1 Introduction
Web 2.0 technologies have evolved considerably in recent years. This growth has led to a huge amount of data on the internet, which is considered a very good source of information. Status messages posted on social media websites present a new and challenging style of text for language technology due to their noisy and informal nature. Many online social networks (OSN) exist nowadays, such as Facebook, the blogosphere, and microblogging sites, e.g. Twitter [104, 105]. The web is used by speakers of many languages; it is no longer used by English speakers only, so SA systems that can analyze OSN content in languages other than English are needed. Facebook is a web service where millions of people join together to form an online community. A huge number of public pages have been created on Facebook; these pages may discuss movies, TV shows, political topics, or social topics, and they are considered a good source of information about users' opinions concerning these topics. Twitter, on the other hand, is a popular microblogging service where users create status messages ("Tweets") to express opinions about different topics; it is characterized by its short messages of up to 140 characters. The data from OSN is characterized by being noisy. It contains a lot of spam and advertising URLs, especially on the official pages of a product, and frequent use of hashtags. The hashtags mentioned in Facebook data do not carry much sentiment; unlike Facebook, hashtags are used frequently on Twitter and they do express sentiments. Abbreviations and smiley faces are used very frequently in OSN, and sometimes in reviews too. Many preprocessing and cleaning steps are therefore needed to prepare this kind of data. Because of their popularity, OSN are considered a very good source of opinions about certain topics and can be used for the SA task like reviews. The main target of the framework is to prepare and use corpora from OSN and review sites for the SA task in two languages, English and Arabic. The framework consists of three phases. The first phase is the preprocessing and cleaning of the collected data, followed by data annotation. The second phase is applying various text processing techniques, including: removing stopwords, replacing the negation words and the following negated words with the antonyms of the negated words, and using selective words of part-of-speech tags (adjectives and verbs) on the prepared corpora. The third phase is text classification using NB and DT classifiers and two feature selection approaches, unigrams and bigrams.

4.2 Literature Review
The target of SA is to find opinions, identify the sentiments they express, and then classify their polarity. SA can also be considered a classification process, which is the task


of classifying text as representing a positive or a negative sentiment [45, 106]. The classification process is usually formulated as a two-class classification problem: positive and negative. Since it is a text classification problem, any existing supervised learning method can be applied, e.g. the NB classifier. In the literature, the first paper that took this approach to classify the benchmark movie reviews into two classes, positive and negative, was [45]. In their work they used unigrams and bigrams as FS techniques, and it was shown that using unigrams as features in classification gives the highest accuracy with NB. Other research has tried to find more effective features to improve classification performance, such as removing stopwords, part-of-speech tagging, and handling negation. Stopwords are very common words used in many sentences, like "a", "the", "of", that do not affect the meaning [1]. Part-of-speech (POS) tagging converts the sentences into a list of tags, which signify whether each word is a noun, adjective, verb, etc. [107]; a good survey of using POS can be found in [108]. Many researchers have used POS tagging and syntactic dependencies as features [109]. Sentiment shifters are words that reverse the sentiment, such as the negation word not. Many models for handling negation have been proposed [110, 111]; a good survey of negation modeling is found in [112]. Twitter has been tackled much more frequently than Facebook for the SA problem, and there has been much research on Twitter data. At the beginning, tweets were selected by emotion icons (emoticons) to extract positive and negative tweets from a Twitter API [113]; later, tweets were downloaded using hashtags [114]. Hashtags can also indicate sarcasm, e.g. #sarcasm [105]. Other frameworks proposed in the literature took nearly the same preprocessing and cleaning steps, and most of them apply stopword removal and POS tagging, as in [115, 116]. SA for the Arabic language has not been tackled in the literature as frequently as for English. Arabic is spoken by more than 300 million people and is the fastest-growing language on the web (with an annual growth rate of 2,501.2% in the number of Internet users as of 2010, compared to 1,825.8% for Russian, 1,478.7% for Chinese and 301.4% for English) (http://www.internetworldstats.com/stats7.htm). Arabic is a Semitic language [117] and consists of many different regional dialects. These dialects are true native language forms, used in informal daily communication, and are not standardized or taught in schools [7]. In reality, however, internet users, especially on OSN sites and on some blogs and review sites as well, use their own dialect to express their feelings. The only formal written standard for Arabic is Modern Standard Arabic (MSA), which is commonly used in written media and education. There is a large degree of difference between MSA and most Arabic dialects, as MSA is not actually the native language of any Arabic country [118]. Some SA systems have been proposed for the Arabic language, most of them for MSA. In [119], SA was applied on Twitter data at the sentence level. A SA system was proposed and applied to news in MSA [120]. In [121], they proposed a system for SA on social media


data. They used POS tag sets and found features in Dialectal Arabic (DA). A good survey on Arabic SA can be found in [118]. There is a lack of language resources for Arabic, as most of them are under development. In order to use the Arabic language in SA, some text processing techniques are needed, such as stopword removal and POS tagging. Some stopword lists and POS taggers are publicly available, but they work on MSA, not on DA. Some research works have generated stopword lists, but to the best of our knowledge no one has generated a stopword list for DA. In [122] an algorithm was proposed for removing stopwords based on a finite state machine; they used a previously generated stopword list for MSA. In [123] a corpus-based list was created from newswire and query sets, along with a general list from the same corpus; they then compared the effectiveness of these lists in information retrieval systems. These lists are also in MSA. In [124] a stopword list of MSA was generated from the most frequent meaningless words that appear in their corpus.

4.3 System Architecture
This is a proposed framework for Sentiment Analysis on OSN data. The system consists of the following components, as shown in Figure 4-1:
 Data Preparation
 Text Processing
 Feature Selection
 Sentiment Classification

Figure ‎4-1 Sentiment Analysis Framework Architecture

We have implemented and tested all the components of the framework on the English-language corpora, and applied only stopword removal on the Arabic-language corpora, due to the lack of DA resources. The following subsections contain a detailed explanation of every component of the system.

4.3.1 Corpora Preparation
The data downloaded from OSN should be prepared and cleaned from all the noise and spam before it can be fed to the classifier. There are some steps common to the different sources, and common between the English and Arabic corpora. The steps are almost the same, except that for Arabic two steps are added: the translation of English words that appear inside Arabic sentences, and the translation of Franco-Arab (Arabic words written in English letters) to Arabic.

4.3.1.1 Corpora Preparation for English Language Corpora
Data was downloaded from Twitter and Facebook on the same topic, in order to distinguish the differences in handling data from both sites as well as from review sites. We chose a hot topic about a movie that had just been shown in theatres and was ranked at the top of the box office in May 2014: "The Amazing Spiderman 2". We downloaded related tweets from Twitter, comments from the movie's Facebook page, and users' reviews from the Internet Movie Database (IMDB1) website, a well-known source of movie reviews. Tweets about the movie were downloaded using the hashtag #amazingspiderman2 from two sites (http://searchhash.com and http://topsy.com), last visited on May 25, 2014. These sites retrieve only the latest 100 tweets related to the hashtag, so we made more than one retrieval call at different times to download more tweets. We also downloaded the tweets on the movie's page on Twitter. In order to enhance the study, we tested the system on four different corpora: the three corpora from OSN and IMDB illustrated above, and, as a fourth corpus, the benchmark movie reviews. The benchmark is a corpus of classified movie reviews that contains 2000 movie reviews: 1000 positive and 1000 negative. The reviews were originally collected from the IMDB review site, and their classification as positive or negative was automatically extracted from the ratings, as specified by the original reviewers [45]. The downloaded data is prepared so that it can be fed to the classifier, as shown in Figure 4-2.

Figure ‎4-2 English Corpora Preparation from IMDB, Facebook, and Twitter

1 http://www.imdb.com/


The number of comments from Facebook was 2430. After removing the comments that contained only URLs or advertising links, they were reduced to 2266; some comments were links to users' reviews, some were links to certain scenes or related videos on Youtube, and some were unrelated comments. After removing non-English comments, they were reduced to 1994. Removing comments expressed by photos only reduced them to 1963, and removing comments that contained only mentions of friends reduced them to 1915. The final number of tweets downloaded, after removing duplicates that came from more than one source, was 1428. After removing the tweets that contained only URLs or advertising links, they were reduced to 659; some were links to certain scenes or related videos on Youtube, and many of the tweets downloaded from the movie's page contained links to cast interviews or photos. Removing non-English tweets reduced them to 561. The number of reviews downloaded from the IMDB site was 536; the reviews needed only two steps of preparation, as shown in Figure 4-2. The number and percentage of noisy and removed data in OSN is illustrated in Table 4-1.
Table 4-1 Number and Percentage of removed English OSN data

                   Facebook                 Twitter
                 Number   Percentage     Number   Percentage
Contains URLs      164       6.7%          769      53.8%
Non-English        272      11.1%           98       6.8%
Photos              31       1.2%          ---       ---
Mentions            48       1.9%          ---       ---
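As an illustration of the URL-only and mention-only filtering described above (not the exact scripts used for these corpora; the regular expressions and sample posts are assumptions), such a filter can be sketched as:

# Drop posts that contain only URLs, only mentions of friends, or no text at all.
import re

URL = re.compile(r'https?://\S+|www\.\S+')
MENTION = re.compile(r'@\w+')

def keep_post(text):
    """Return True if the post still carries textual content after cleaning."""
    stripped = URL.sub(' ', text)            # remove advertising / video links
    stripped = MENTION.sub(' ', stripped)    # remove mentions of friends
    # keep the post only if any English or Arabic letters remain
    return bool(re.search(r'[A-Za-z\u0600-\u06FF]', stripped))

posts = ["http://youtu.be/xyz", "@friend", "Loved the movie! http://imdb.com"]
cleaned = [p for p in posts if keep_post(p)]   # keeps only the last post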

After the preprocessing, cleaning and filtering of the data, it must be annotated before being fed to the supervised classifiers. The first experiment in the next chapter shows the method of annotation and the numbers of positive and negative items.

4.3.1.2 Corpora Preparation for Arabic Language Corpora
We downloaded data from Twitter, Facebook, and a review site on the same topic in the Arabic language for SA. We chose a hot topic about the movies recently shown in theatres for the last festival at the beginning of August 2014. The movies were: "الفيل الأزرق" meaning "The blue elephant"; "صنع فى مصر" meaning "Made in Egypt"; "الحرب العالمية التالتة" meaning "The third world war"; and "جوازة ميري" meaning "Official marriage". We downloaded related tweets from Twitter, comments from some of the movies' Facebook pages, and users' reviews from the review site elcinema.com. Tweets about the movies were downloaded using the regular Twitter search, as many of the sites that retrieve tweets, such as http://searchhash.com/ and http://topsy.com/, which were used before to download tweets in English, are now closed. We searched using the movies' names and downloaded all the tweets that appeared at the time of the search. There


were many unrelated tweets downloaded, as some of the movie titles, like "Made in Egypt" and "The third world war", can hold meanings other than the movies' titles. The retrieved tweets are those that contain the whole title, or any word of it, either in the text or in a hashtag. The downloaded data is prepared so that it can be fed to the classifier, as shown in Figure 4-3.

Figure ‎4-3 Arabic Corpora Preparation from Reviews, Facebook, and Twitter

The original number of comments from Facebook was 1478. Removing the comments that contained only URLs or advertising links reduced them to 1459. Removing comments expressed by photos only reduced them to 1415, and removing comments that contained only mentions of friends reduced them to 1296. Then, after removing non-Arabic comments, they were reduced to 1261. The original number of tweets downloaded was 1787. After removing the tweets that contained only URLs or advertising links, or only links to watch the movie, they were reduced to 1069; some were links to certain scenes or related videos on Youtube. After removing unrelated tweets (since the Twitter search was done just by the movies' names, which can imply other meanings), they were reduced to 862. Removing non-Arabic tweets reduced them to 781. The number of reviews downloaded from the review site was 32; the reviews needed only two steps of preparation, as shown in Figure 4-3. The number and percentage of noisy and removed data in OSN is illustrated in Table 4-2.


Table ‎4-2 Number and Percentage of removed Arabic OSN data

                   Facebook                 Twitter
                 Number   Percentage     Number   Percentage
Contains URLs       19       1.3%          718      40.1%
Non-Arabic          35       2.3%           72       4%
Photos              44       3%            ---       ---
Mentions           119       8%            ---       ---
Franco-arab        229      15.5%           18       1%

After the preprocessing, cleaning and filtering of the data, it must be annotated before being fed to the supervised classifiers. The first experiment in the next chapter shows the method of annotation and the numbers of positive and negative items.

4.3.2 Text Processing Techniques
Three text processing techniques are used in the tests; they are presented in detail in the following subsections.

4.3.2.1 Replacing Negations with Antonyms for English Language Corpora
Negations are words that reverse the sentiment orientation of a sentence. For example, consider the sentence "This movie is good" versus "This movie is not good". In the first one the word "good" is a positive term, so the sentence is positive; when "not" is applied to the clause, the word "good" is used in a negative context, so the sentence is negative [110]. The basic approach to handling negation in the literature was to add artificial words: if a word "x" is preceded by a negation word, then rather than considering this as an occurrence of the feature "x", a new feature "NOT x" is created [45]. The approach to handling negation proposed in the framework is different: it simply replaces the negation word and the following negated word with the unambiguous antonym of the negated word. We suggest that negated words can come after three shifters (not, don't, can't). Whenever one of the shifters appears in the sentence, the shifter and the word after it are replaced with that word's unambiguous antonym; unambiguous means that if the word has more than one antonym, it is not replaced. The antonyms were retrieved from WordNet [50]. We made an experiment on the benchmark movie reviews to determine which is better: using the shifter not alone, or the three shifters not, don't, and can't. Table 4-3 shows that using the three shifters is better than using the not shifter alone in most cases, or performs equally.
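A minimal sketch of this replacement step is shown below, using NLTK's WordNet interface; the tokenized input and the antonym lookup are illustrative, not the exact implementation used for the experiments.

# Sketch of the proposed negation handling: the shifter and the following
# word are replaced by the word's antonym only when exactly one antonym
# exists in WordNet (i.e., the antonym is unambiguous).
from nltk.corpus import wordnet as wn

SHIFTERS = {"not", "don't", "can't"}

def unambiguous_antonym(word):
    antonyms = {antonym.name()
                for synset in wn.synsets(word)
                for lemma in synset.lemmas()
                for antonym in lemma.antonyms()}
    return antonyms.pop() if len(antonyms) == 1 else None

def replace_negations(tokens):
    result, i = [], 0
    while i < len(tokens):
        if tokens[i].lower() in SHIFTERS and i + 1 < len(tokens):
            antonym = unambiguous_antonym(tokens[i + 1])
            if antonym is not None:        # replace shifter + word with antonym
                result.append(antonym)
                i += 2
                continue
        result.append(tokens[i])
        i += 1
    return result

tokens = replace_negations("this movie is not exciting".split())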


Table ‎4-3 Testing the effectiveness of choosing shifters

Classifier      Feature selection   Replacing Negations        Accuracy   F-measure
                                                               (Movie Reviews)
Naïve Bayes     Unigram             Using not                    73.2%      0.71
Naïve Bayes     Unigram             Using not, don't, can't      73.6%      0.72
Naïve Bayes     Bigram              Using not                    76.8%      0.76
Naïve Bayes     Bigram              Using not, don't, can't      82.2%      0.82
Decision Tree   Unigram             Using not                    65.2%      0.64
Decision Tree   Unigram             Using not, don't, can't      66.6%      0.66
Decision Tree   Bigram              Using not                    67%        0.67
Decision Tree   Bigram              Using not, don't, can't      66.8%      0.67

4.3.2.2 Replacing Negations with Antonyms for Arabic Language Corpora
The problem of handling negation in Arabic has not, as far as we know, been tackled in the literature. If we try to apply the same method proposed for handling negation in the English corpora, we face the problem of the availability of an Arabic WordNet. Creating an Arabic WordNet has been addressed in much recent research, but it is still under development and it works for MSA. The other issue is that the language used by OSN users is not MSA but DA, and there are many different regional dialects of Arabic for which no DA WordNet is available. The corpora here were mostly written in Egyptian dialect by Egyptian users. We suggest that the common Egyptian negation word is "مش", which means "not", or the word "ما" followed by a word that ends with "تش" or "ش". Whenever these words appear, they would be replaced by the antonym of the negated word. This can be implemented when an Egyptian-dialect WordNet becomes available.

4.3.2.3 Removing Stopwords for English Language Corpora
Stopwords are common words that generally do not contribute to the meaning of a sentence, specifically for the purposes of information retrieval and natural language processing. Common English words that do not affect the meaning of a sentence include "a", "the", "of", etc. Removing stopwords reduces the corpus size without losing important information. The English stopword list used is general and contains 127 words, such as (all, just, being, ...); they are shown in Appendix A.
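A short sketch of this step with NLTK (assuming its built-in English stopword list corresponds to the 127-word list referred to above):

# Remove English stopwords from a piece of text using NLTK's built-in list.
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

STOPWORDS = set(stopwords.words('english'))

def remove_stopwords(text):
    return [word for word in word_tokenize(text)
            if word.lower() not in STOPWORDS]

tokens = remove_stopwords("The movie was just being all of a sudden amazing")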


4.3.2.4 Removing Stopwords for Arabic Language Corpora
Some sources of stopword lists are publicly available, but they work on MSA, not on DA. Stopwords are very common words used in many sentences and have no significant semantic relation to the context in which they appear. In order to apply stopword removal effectively, we have generated two stopword lists: one is a corpus-based list, shown in Appendix B, and the other is an Egyptian dialect list, shown in Appendix C. The common strategy for determining a stopword list is to calculate the frequency of appearance of each word in the document collection and then take the most frequent words. The selected terms are often hand-filtered for their semantic content relative to the domain of the documents being indexed and marked as a stopword list. Since Arabic is a lexically very rich language, we generated the stopword lists through several steps. First, we specify some general conditions for a word to be a stopword:
-It gives no meaning if used alone.
-It appears frequently in the text.
-It is a general word, not used specifically in a certain field.
The methodology of generating the stopword lists is shown in Figure 4-4. The methodology consists of three phases, as illustrated in the following subsections.

4.3.2.4.1 Calculating Word Frequencies
The three corpora are tokenized into words. This phase was done fully automatically using Python code and the nltk 2.0 toolkit. The results are not totally meaningful, as the tokenization may treat a comma as part of a word if it is not correctly used, so some manual filtering is applied after tokenization. The reviews corpus gives 3781 unique words, the Facebook corpus gives 1451 unique words, and the Twitter corpus gives 1160 unique words. This shows that although the number of reviews is much smaller than the size of the OSN corpora, the reviews are lexically rich. After combining the corpora and removing duplicates, the list of all unique words contains 4818 words. We then calculated the frequency of occurrence of each word from this list over the three corpora combined.
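A sketch of this phase with NLTK is shown below; the file names are placeholders, and most_common follows the current NLTK API (older versions instead expose the frequency-sorted keys of FreqDist).

# Tokenize the three corpora, build the combined vocabulary, and count
# word frequencies over the combined text to get candidate stopwords.
import codecs
import nltk

corpus_files = ['reviews.txt', 'facebook.txt', 'twitter.txt']  # placeholder paths
tokens = []
for path in corpus_files:
    with codecs.open(path, encoding='utf-8') as f:
        tokens.extend(nltk.word_tokenize(f.read()))

vocabulary = set(tokens)                                     # all unique words
frequencies = nltk.FreqDist(tokens)
candidates = [w for w, _ in frequencies.most_common(200)]    # 200 most frequent words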


Figure ‎4-4 A methodology of generating Egyptian Dialect and corpus-based stopword lists

4.3.2.4.2 The Validity of a Word as a Stopword
To generate the corpus-based list, we took the 200 most frequent words. These words are not all general; some are domain specific, like the words "المشاهد" or "الفيلم", which mean (the spectator, the movie) respectively. This list contains words in MSA as well as in Egyptian dialect. Diacritics can change the meaning of a word; for example, the word "المشاهد" could mean (the spectator) or (the scenes), and the difference can only be told from the meaning of the sentence. OSN users use simple language without diacritics; since the word occurs in the context of the corpora, it is likely to appear frequently expressing both meanings. A problem would occur only if a word appeared as a frequent word outside the context of the corpora, which did not happen here. To generate a general list of Egyptian dialect stopwords, we took the 200 most frequent words and removed the semantically recognized words, which are likely to be nouns and verbs; then we added every Egyptian-dialect word in the corpora to the most frequent words that are semantically meaningless. To validate whether a word is a stopword: if the word is an MSA word, we check its existence in the MSA stopword lists, and if it does not exist there, we check its corresponding meaning in the English stopword list. If the word is in Egyptian dialect, we look for its correspondent in the MSA list, and if that does not exist, we check its corresponding meaning in the English stopword list. For example, the word "بس" has the MSA correspondent "فقط" and also a corresponding meaning in the English stopword list, "only". In contrast, the word "لازم" has no correspondent in the MSA list (which should be "لابد"), but it has a corresponding meaning in the English list, the word "should"; therefore, it is considered a stopword. The final list of valid unique words


contains 100 words. This phase was done in a semi-automatic way that included a manual check.

4.3.2.4.3 Adding Possible Prefixes and Suffixes to the Words
Arabic is a lexically very rich language, with a large number of prefixes and suffixes that can be added to a word to change its meaning. For example, the prefix "ال", which means "the", changes a word from indefinite to definite, and the suffix "هم" gives the meaning of the pronoun "them". We added some frequently used prefixes to the words generated in both lists, namely (ال، و، ب، ف، ل). Where appropriate, we added the pronoun suffixes (نا، ى، ه، ها، هم); we added these suffixes to possession words in Egyptian dialect, such as the word "بتاعى", which means (mine). Some letters are also written in different forms, so for any word containing such letters we include the possible forms, such as (ى، ي), (ه، ة), (ا، أ، إ), the last depending on the word itself. The lists were manually revised for improper or meaningless words. After adding the prefixes and suffixes, the final corpus-based list contains 1061 words and can be found at (http://goo.gl/JW0jKP); the final general Egyptian dialect list contains 620 words and can be found at (http://goo.gl/263J5L).
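As an illustration (the base word below is an example, not an entry copied from the lists), the expansion with prefixes and pronoun suffixes can be sketched as:

# -*- coding: utf-8 -*-
# Generate surface variants of a base stopword by attaching the frequent
# prefixes and pronoun suffixes mentioned above.
PREFIXES = ["ال", "و", "ب", "ف", "ل"]
SUFFIXES = ["نا", "ى", "ه", "ها", "هم"]

def expand(word, add_suffixes=False):
    variants = {word}
    variants.update(prefix + word for prefix in PREFIXES)
    if add_suffixes:                   # e.g. possession words like "بتاع"
        variants.update(word + suffix for suffix in SUFFIXES)
    return variants

variants = expand("بتاع", add_suffixes=True)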

4.3.2.5 Part-of-Speech Tagging for English Language Corpora
POS tagging is the process of converting a sentence, in the form of a list of words, into a list of tuples, where each tuple has the form (word, tag). The tag is a POS tag and signifies whether the word is a noun, adjective, verb, etc. [107]. A classifier-based tagger was used to tag the different corpora; this tagger uses the NB classifier and is trained on the Penn Treebank tagged corpus [125]. It was pointed out in the literature that adjectives are natural indicators of sentiment, as proposed in [45], and SA on a corpus containing adjectives only can give good results. There have been some targeted comparisons of the effectiveness of other POS tags, including verbs, adverbs and nouns [126-128]. Verbs were chosen here as they are good indicators of sentiment too, e.g. "love" or "hate". The different corpora were tagged, and then only some selective POS tags were considered in the classification: adjectives (base form, comparative and superlative) and verbs (base form, past tense, gerund or present participle, past participle, non-3rd person singular present, and 3rd person singular present). We made an experiment on the benchmark movie reviews to determine which is better: using the adjective tags only, or adjectives and verbs together. Table 4-4 shows that using adjectives and verbs is better than using adjectives only in most cases, or performs equally.
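A sketch of this selective-POS filtering is shown below, using NLTK's default tagger as a stand-in for the classifier-based tagger trained on the Penn Treebank:

# Keep only adjectives (JJ, JJR, JJS) and verbs (VB, VBD, VBG, VBN, VBP, VBZ)
# from a tagged sentence; all other words are discarded.
import nltk

SELECTED_TAGS = {'JJ', 'JJR', 'JJS', 'VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ'}

def selective_pos(text):
    tagged = nltk.pos_tag(nltk.word_tokenize(text))
    return [word for word, tag in tagged if tag in SELECTED_TAGS]

words = selective_pos("I loved the amazing action scenes but hated the ending")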


Table ‎4-4 Testing the effectiveness of choosing tags

Classifier      Feature selection   POS tagging               Accuracy   F-measure
                                                              (Movie Reviews)
Naïve Bayes     Unigram             adjectives only             70.2%      0.68
Naïve Bayes     Unigram             adjectives and verbs        70.8%      0.68
Naïve Bayes     Bigram              adjectives only             71.6%      0.70
Naïve Bayes     Bigram              adjectives and verbs        78.6%      0.78
Decision Tree   Unigram             adjectives only             66.8%      0.66
Decision Tree   Unigram             adjectives and verbs        66.4%      0.66
Decision Tree   Bigram              adjectives only             64.6%      0.64
Decision Tree   Bigram              adjectives and verbs        67.4%      0.67

4.3.2.6 Part-of-Speech Tagging for Arabic Language Corpora
There is an Arabic POS tagger available that is based on MSA and trained on the Arabic Treebank, but there is a lack of Egyptian dialect POS taggers; that is why we did not apply this text processing step to the Arabic corpora. We suggest that an Egyptian dialect Treebank should exist first, one that can recognize a word like "تحفة" as an adjective: this word is a noun meaning "antique", but in Egyptian dialect it is used as an adjective meaning "wonderful". A tagger could then be trained on this Treebank and used with social network corpora.

4.3.3 Feature Selection
Two feature extraction methods are used in the tests:
-Unigram: treats the documents as BOWs and constructs a word-presence feature set from all the words of an instance.
-Bigram: the same as unigram, but uses pairs of words.
These two methods are widely used in the SA field [1]. These tests can be extended by extracting more features, i.e. n-grams.
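As a sketch, the two methods map a token list to the presence-style feature dictionaries expected by the NLTK classifiers (the sample sentence is illustrative):

# Unigram and bigram presence features for a tokenized document.
from nltk import bigrams

def unigram_features(words):
    return {word: True for word in words}

def bigram_features(words):
    return {pair: True for pair in bigrams(words)}

features = unigram_features("the amazing spiderman was great".split())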

4.3.4 Sentiment Classification
Two supervised learning classifiers are used: Naïve Bayes (NB) [129] and Decision Tree (DT) [81]. There are many other kinds of supervised classifiers in the literature [1].


The two chosen classifiers represent two different families of classifiers. NB is one of the probabilistic classifiers; it is the simplest and most commonly used classifier. DT, on the other hand, is a hierarchical decomposition of the data space and does not depend on calculating probabilities. The two classifiers were run with the nltk 2.0 toolkit. Some parameters are passed to the DT classifier [107]:
-Entropy cutoff: used during the tree refinement process. If the entropy of the probability distribution of label choices in the tree is greater than the entropy_cutoff, the tree is refined further; if the entropy is lower than the entropy_cutoff, tree refinement is halted. Entropy is the uncertainty of the outcome: as entropy approaches 1.0, uncertainty increases, and vice versa. Higher values of entropy_cutoff decrease both accuracy and training time. It was set to 0.8.
-Depth cutoff: used during refinement to control the depth of the tree. The final decision tree will never be deeper than the depth_cutoff. Decreasing the depth_cutoff decreases the training time and most likely decreases the accuracy as well. It was set to 5.
-Support cutoff: controls how many labeled feature sets are required to refine the tree. When the number of labeled feature sets is less than or equal to support_cutoff, refinement stops, at least for that section of the tree. Support_cutoff specifies the minimum number of instances required to make a decision about a feature. It was set to 30.
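A sketch of training both classifiers with NLTK using the parameter values stated above (the toy training set is illustrative only):

# Train NB and DT classifiers on (featureset, label) pairs with NLTK.
from nltk.classify import NaiveBayesClassifier, DecisionTreeClassifier

train_feats = [({'amazing': True}, 'pos'), ({'great': True}, 'pos'),
               ({'boring': True}, 'neg'), ({'awful': True}, 'neg')]  # toy data

nb_classifier = NaiveBayesClassifier.train(train_feats)
dt_classifier = DecisionTreeClassifier.train(train_feats,
                                             entropy_cutoff=0.8,
                                             depth_cutoff=5,
                                             support_cutoff=30)
label = nb_classifier.classify({'amazing': True})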


Chapter 5: Tests, Results, and Discussion
5.1 Introduction
We used an HP Pavilion desktop computer, model p6714me-m, with an Intel(R) Core(TM) i5-2300 CPU @ 2.80 GHz, 4 GB of RAM, and a 64-bit operating system. The training time was calculated in seconds using a built-in Python timing function. We ran many experiments to test the components of the SA framework described in the previous chapter on the English and Arabic corpora. The tests were made after splitting each corpus into 75% training data and 25% testing data. The standard accuracy and F-measure were used to evaluate the performance of each test. Accuracy is defined as the ratio of the number of correctly classified instances to the total number of instances. The F-measure combines Precision and Recall in the following way:

F-measure = (2 × Precision × Recall) / (Precision + Recall)

where Precision is defined as the ratio of the number of reviews correctly assigned to category C to the total number of reviews classified as category C, and Recall is the ratio of the number of reviews correctly assigned to category C to the total number of reviews actually in category C. Since the F-measure is computed for each category separately, we aggregated the scores using the macro average of the F-measure, which gives each category equal weight. The following sections contain the experiments made. The first is the data annotation; the second is the sentiment classification using the two feature selection techniques and testing the various stages of the text processing techniques.
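For clarity, here is a small sketch of these evaluation measures written directly from the definitions above (this is not the thesis code; the label names are illustrative):

    # Accuracy and macro-averaged F-measure from gold and predicted labels.
    def evaluate(gold, predicted):
        labels = sorted(set(gold))
        correct = sum(g == p for g, p in zip(gold, predicted))
        acc = correct / len(gold)

        f_scores = []
        for c in labels:
            tp = sum(g == c and p == c for g, p in zip(gold, predicted))
            fp = sum(g != c and p == c for g, p in zip(gold, predicted))
            fn = sum(g == c and p != c for g, p in zip(gold, predicted))
            precision = tp / (tp + fp) if (tp + fp) else 0.0
            recall = tp / (tp + fn) if (tp + fn) else 0.0
            f = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
            f_scores.append(f)

        macro_f = sum(f_scores) / len(f_scores)  # every category gets equal weight
        return acc, macro_f

    print(evaluate(['pos', 'neg', 'pos', 'neg'], ['pos', 'pos', 'pos', 'neg']))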

5.4 Tests, Results and Discussion of English Corpora
The following subsections contain the various tests we made to analyze the framework components on the English corpora. The first subsection covers the data annotation and the second presents the tests and results. A component analysis of the framework is given in the third subsection, and the fourth contains a discussion and analysis of the corpora and the results.


5.4.1 Data Annotation
We tested four corpora, as mentioned in the previous chapter. The first corpus is the benchmark movie reviews corpus, which contains 2000 classified movie reviews: 1000 positive and 1000 negative. The other corpora were data downloaded from two online social networks (OSN), Twitter and Facebook, and from the review site IMDB, all about the same movie. The reviews from IMDB were already rated on the site with a score from 1 to 10. Ratings greater than 5 are considered positive, ratings less than 5 negative, and ratings equal to 5 neutral. We annotated the reviews according to the site rating. For the OSN data, we used a well-known public list of sentiment words called "The Subjectivity Lexicon" [130]. This is a list of 8222 subjective words labeled with their polarity (positive or negative), their POS, and their subjective strength. It contains 4913 negative words and 2718 positive words; the remaining words have double polarity or are neutral. The results of annotation using the lexicon were not good: some items were not annotated because none of the lexicon words occurred in them, and others were annotated wrongly. For example, the word "emotional" has negative polarity in the list, but the users actually used it to express that the movie was emotional and great. We therefore also annotated the corpora manually. The manual annotation was more reliable, since human analysis of the data is still better than the machine's, and it can also detect implicit sentiments. Table 5-1 shows the number of positive, negative, and neutral reviews, comments, and tweets resulting from the annotation.

Table 5-1 Number of positive, negative, and neutral English reviews, comments, and tweets from IMDB, Facebook and Twitter

                   IMDB      Facebook                   Twitter
                   Rating    Manual   Using Lexicon     Manual   Using Lexicon
No. of positive    375       920      188               245      53
No. of negative    113       201      194               21       8
No. of neutral     48        794      1533              295      500
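To make the lexicon-based step above concrete, here is a rough sketch of majority-vote annotation with a polarity word list; the two tiny word sets and the tie-handling rule are assumptions, and the real Subjectivity Lexicon also stores POS and subjectivity strength:

    # Lexicon-based annotation: count positive and negative lexicon words in a
    # post and label it by the majority; ties and zero matches become neutral.
    POSITIVE_WORDS = {'great', 'amazing', 'love', 'wonderful'}
    NEGATIVE_WORDS = {'bad', 'boring', 'hate', 'emotional'}  # 'emotional' is negative in the lexicon

    def annotate(text):
        words = text.lower().split()
        pos = sum(w in POSITIVE_WORDS for w in words)
        neg = sum(w in NEGATIVE_WORDS for w in words)
        if pos > neg:
            return 'positive'
        if neg > pos:
            return 'negative'
        return 'neutral'

    # A case like the one discussed above: 'emotional' drags the label away
    # from the user's clearly positive intent.
    print(annotate("such an emotional and amazing movie"))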


5.4.2 Tests and Results
For each corpus, we first replaced the negation words and the following negated words with the antonyms of the negated words. We then removed stopwords. The last step of text processing was POS tagging: the four corpora were tagged and only selected POS tags (adjectives and verbs) were kept for the classification. These text processing techniques are described in detail in the previous chapter and shown in Figure 5-1.

Figure ‎5-1 Text Processing and Sentiment classification of the prepared English corpora

Tables 5-2 to 5-4 contain the accuracy, training time, and F-measure of the four corpora before and after applying the text processing techniques. The precision and recall sometimes come out as zero for a certain polarity. This happens with the extremely unbalanced Twitter data when using DT: the precision and recall for the negative polarity are zero. Therefore, the F-measure of Twitter when using DT, as shown in Table 5-4, is for the positive polarity only.


Table ‎5-2 Accuracy of Sentiment Analysis on English Movie Reviews, IMDB, Facebook and Twitter corpora using NB and DT classifiers with unigrams and bigrams as Feature Selection before and after applying text processing techniques

Classifier      Feature selection   Text Processing      Movie Reviews   IMDB     Facebook   Twitter
Naïve Bayes     Unigram             Without processing   72.80%          69.10%   83.27%     67.64%
Naïve Bayes     Unigram             With processing      74.80%          77.23%   71.53%     83.82%
Naïve Bayes     Bigram              Without processing   81.60%          55.30%   80.07%     45.60%
Naïve Bayes     Bigram              With processing      79.80%          77.23%   67.61%     69.11%
Decision Tree   Unigram             Without processing   68.80%          79.67%   83.63%     91.17%
Decision Tree   Unigram             With processing      67.00%          76.42%   82.92%     91.17%
Decision Tree   Bigram              Without processing   68.60%          79.67%   81.85%     91.17%
Decision Tree   Bigram              With processing      66.80%          76.42%   82.20%     91.17%

Table ‎5-3 Classification training time of Sentiment Analysis on English Movie Reviews, IMDB, Facebook and Twitter corpora using NB and DT classifiers with unigrams and bigrams as Feature Selection before and after applying text processing techniques

(values are training time in seconds)

Classifier      Feature selection   Text Processing      Movie Reviews   IMDB     Facebook   Twitter
Naïve Bayes     Unigram             Without processing   2.69            0.362    0.056      0.024
Naïve Bayes     Unigram             With processing      0.95            0.140    0.025      0.01
Naïve Bayes     Bigram              Without processing   25.30           2.170    0.240      0.085
Naïve Bayes     Bigram              With processing      3.29            0.440    0.075      0.026
Decision Tree   Unigram             Without processing   985.71          16.4     7.033      0.898
Decision Tree   Unigram             With processing      473.30          7.46     2.362      0.288
Decision Tree   Bigram              Without processing   46760.6         93.87    32.603     3.188
Decision Tree   Bigram              With processing      1319.85         22.31    7.97       0.938

Table ‎5-4 F-measure of Sentiment Analysis on English Movie Reviews, IMDB, Facebook and Twitter corpora using NB and DT classifiers with unigrams and bigrams as Feature Selection before and after applying text processing techniques

Classifier      Feature selection   Text Processing      Movie Reviews   IMDB    Facebook   Twitter
Naïve Bayes     Unigram             Without processing   0.706           0.661   0.756      0.522
Naïve Bayes     Unigram             With processing      0.732           0.726   0.593      0.627
Naïve Bayes     Bigram              Without processing   0.811           0.540   0.730      0.397
Naïve Bayes     Bigram              With processing      0.794           0.738   0.598      0.535
Decision Tree   Unigram             Without processing   0.682           0.598   0.536      0.95
Decision Tree   Unigram             With processing      0.665           0.513   0.507      0.95
Decision Tree   Bigram              Without processing   0.683           0.598   0.468      0.95
Decision Tree   Bigram              With processing      0.661           0.513   0.470      0.95

The following figures from 5-2 to 5-9 illustrate the accuracies and training times of the four corpora.

Figure ‎5-2 Classification accuracy of English Movie Reviews corpus after applying different paths of text processing


Figure ‎5-3 Classification training time of English Movie Reviews corpus after applying different paths of text processing

Figure ‎5-4 Classification accuracy of English IMDB corpus before and after applying text processing techniques

Figure ‎5-5 Classification training time of English IMDB corpus before and after applying text processing techniques


Figure ‎5-6 Classification accuracy of English Facebook corpus before and after applying text processing techniques

Figure ‎5-7 Classification training time of English Facebook corpus before and after applying text processing techniques

Figure ‎5-8 Classification accuracy of English Twitter corpus before and after applying text processing techniques


Figure ‎5-9 Classification training time of English Twitter corpus before and after applying text processing techniques

We can notice from Tables 5-2 to 5-4 that using the DT classifier with unigrams or bigrams gives the best accuracy for the reviews data, while NB gives a better F-measure. For the Facebook data, using DT with unigrams gives the best accuracy, and for Twitter, DT gives the best results overall. Text processing generally increases the accuracy when using NB; for the DT classifier it slightly decreases the accuracy, but it also decreases the training time.

Figure 5-2 shows that applying the text processing techniques on the movie reviews corpus increases the accuracy of the NB classifier in the case of unigrams but decreases it in the case of bigrams. It also shows that bigrams are a better feature selection than unigrams and that the NB classifier gives better accuracy than DT. After applying the text processing techniques the accuracy is slightly decreased, but the training time of DT decreases dramatically, as shown in Figure 5-3. The training time of NB is much smaller than that of DT, and the text processing decreases the training time of both classifiers.

Figure 5-4 shows that applying the text processing techniques on the IMDB corpus increases the accuracy of the NB classifier by over 10%. It also shows that unigrams are a better feature selection than bigrams and that the DT classifier gives better accuracy than NB. After applying the text processing techniques the accuracy is slightly decreased, but the training time decreases, as shown in Figure 5-5. The training time of NB is smaller than that of DT, and the text processing decreases the training time of both classifiers.

Figure 5-6 shows that the accuracy of both classifiers is nearly the same on the Facebook corpus, with no significant difference between unigrams and bigrams either. Applying the text processing techniques decreases the accuracy and decreases the training time of both classifiers, as shown in Figure 5-7. The training time of NB is smaller than that of DT.

Figure 5-8 shows that using DT gives much better results than NB on the Twitter corpus. Applying the text processing techniques increases the accuracy of the NB classifier by about 15% and does not affect the accuracy of the DT classifier, but it decreases the training time, as shown in Figure 5-9. The training time of NB is smaller than that of DT, and the text processing decreases the training time of both classifiers.

5.4.3 Component Analysis of the SA Framework
The main target of the component analysis is to determine which scenario gives the best performance for each corpus, i.e. the combination of text processing technique, feature extraction method, and classifier that gives the highest accuracy and F-measure with the least training time. In order to analyze the components of the framework, the classification was performed after applying each stage of text processing, as follows:
-Path A: testing on the original data without any text processing.
-Path B: testing on the original data after replacing the negation words and the following negated words with the antonyms of the negated words.
-Path C: testing on the original data after the negation replacement of Path B, followed by removing the stopwords. The stopwords could not be removed first, because the negation shifters would then be removed and the negation words could not be detected.
-Path D: testing on the original data after the negation replacement and stopword removal of Path C, followed by selective POS tagging of adjectives and verbs.
After applying each text processing stage, as shown in Figure 5-10, the corpus size is reduced and a new modified version of the original corpus is produced. The classification was done four times after each path, as follows (a minimal sketch of the Path B and Path C steps is given after this list):
-NB with unigrams
-NB with bigrams
-DT with unigrams
-DT with bigrams
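The sketch below illustrates the Path B step (negation replacement via WordNet antonyms) followed by the Path C step (stopword removal); it is a simplified stand-in for the procedures of Chapter 4, and the negation word set and fallback behaviour are assumptions:

    # Simplified Path B and Path C text processing steps.
    from nltk.corpus import stopwords, wordnet

    NEGATIONS = {'not', 'no', 'never', "n't"}

    def replace_negations(tokens):
        """Path B: replace 'not X' with a WordNet antonym of X when one exists."""
        out, i = [], 0
        while i < len(tokens):
            if tokens[i] in NEGATIONS and i + 1 < len(tokens):
                antonyms = [ant.name()
                            for syn in wordnet.synsets(tokens[i + 1])
                            for lemma in syn.lemmas()
                            for ant in lemma.antonyms()]
                if antonyms:
                    out.append(antonyms[0])  # e.g. 'not good' -> an antonym of 'good'
                    i += 2
                    continue
            out.append(tokens[i])
            i += 1
        return out

    def remove_stopwords(tokens):
        """Path C: drop common function words after negation handling."""
        stops = set(stopwords.words('english'))
        return [t for t in tokens if t not in stops]

    tokens = "this movie is not good at all".split()
    print(remove_stopwords(replace_negations(tokens)))  # e.g. ['movie', 'evil']; antonym choice depends on WordNet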

Figure ‎5-10 Component Analysis of the Sentiment Analysis framework


Tables 5-5 to 5-7 contain the accuracy, training time, and F-measure of the four corpora after applying the different paths of text processing.

Table 5-5 Accuracy of Sentiment Analysis on English movie reviews, IMDB, Facebook and Twitter corpora using NB and DT classifiers with unigram and bigram as features after applying each path of text processing

Classifier      Features   Text Processing   Movie Reviews   IMDB     Facebook   Twitter
Naïve Bayes     Unigram    Path A            72.80%          69.10%   83.27%     67.64%
Naïve Bayes     Unigram    Path B            73.60%          78.86%   79.30%     63.23%
Naïve Bayes     Unigram    Path C            72.80%          82.11%   75.80%     45.60%
Naïve Bayes     Unigram    Path D            74.80%          77.23%   71.53%     83.82%
Naïve Bayes     Bigram     Path A            81.60%          55.30%   80.07%     45.60%
Naïve Bayes     Bigram     Path B            82.20%          69.10%   72.24%     47.00%
Naïve Bayes     Bigram     Path C            77.60%          78.05%   70.10%     26.47%
Naïve Bayes     Bigram     Path D            79.80%          77.23%   67.61%     69.11%
Decision Tree   Unigram    Path A            68.80%          79.67%   83.63%     91.17%
Decision Tree   Unigram    Path B            66.60%          79.67%   82.56%     91.17%
Decision Tree   Unigram    Path C            67.40%          77.23%   82.92%     91.17%
Decision Tree   Unigram    Path D            67.00%          76.42%   82.92%     91.17%
Decision Tree   Bigram     Path A            68.60%          79.67%   81.85%     91.17%
Decision Tree   Bigram     Path B            66.80%          79.67%   82.56%     91.17%
Decision Tree   Bigram     Path C            68.60%          77.23%   82.92%     91.17%
Decision Tree   Bigram     Path D            66.80%          76.42%   82.20%     91.17%

Table ‎5-6 Classification training time of Sentiment Analysis on English movie reviews, IMDB, Facebook and Twitter corpora using NB and DT classifiers with unigram and bigram as features after applying each path of text processing

(values are training time in seconds)

Classifier      Features   Text Processing   Movie Reviews   IMDB     Facebook   Twitter
Naïve Bayes     Unigram    Path A            2.69            0.362    0.056      0.024
Naïve Bayes     Unigram    Path B            2.71            0.362    0.054      0.024
Naïve Bayes     Unigram    Path C            2.38            0.324    0.052      0.022
Naïve Bayes     Unigram    Path D            0.95            0.140    0.025      0.01
Naïve Bayes     Bigram     Path A            25.30           2.170    0.240      0.085
Naïve Bayes     Bigram     Path B            25.12           2.160    0.230      0.086
Naïve Bayes     Bigram     Path C            6.34            0.911    0.149      0.076
Naïve Bayes     Bigram     Path D            3.29            0.440    0.075      0.026
Decision Tree   Unigram    Path A            985.71          17.030   7.033      0.926
Decision Tree   Unigram    Path B            992.10          16.660   6.930      0.944
Decision Tree   Unigram    Path C            993.53          16.480   6.379      0.793
Decision Tree   Unigram    Path D            473.30          7.630    2.362      0.294
Decision Tree   Bigram     Path A            46760.59        97.690   32.603     3.274
Decision Tree   Bigram     Path B            57079.60        95.650   31.100     3.288
Decision Tree   Bigram     Path C            1946.17         38.140   19.953     2.478
Decision Tree   Bigram     Path D            1319.85         22.310   7.970      0.961

Table ‎5-7 F-measure of Sentiment Analysis on English movie reviews, IMDB, Facebook and Twitter corpora using NB and DT classifiers with unigram and bigram as Features after applying different paths of text processing techniques

Classifier      Features   Text Processing   Movie Reviews   IMDB    Facebook   Twitter
Naïve Bayes     Unigram    Path A            0.706           0.661   0.756      0.522
Naïve Bayes     Unigram    Path B            0.719           0.730   0.696      0.501
Naïve Bayes     Unigram    Path C            0.708           0.771   0.686      0.346
Naïve Bayes     Unigram    Path D            0.732           0.726   0.593      0.627
Naïve Bayes     Bigram     Path A            0.811           0.540   0.730      0.397
Naïve Bayes     Bigram     Path B            0.817           0.669   0.654      0.402
Naïve Bayes     Bigram     Path C            0.768           0.733   0.647      0.239
Naïve Bayes     Bigram     Path D            0.794           0.738   0.598      0.535
Decision Tree   Unigram    Path A            0.682           0.598   0.536      0.95
Decision Tree   Unigram    Path B            0.661           0.617   0.489      0.95
Decision Tree   Unigram    Path C            0.669           0.538   0.507      0.95
Decision Tree   Unigram    Path D            0.665           0.513   0.507      0.95
Decision Tree   Bigram     Path A            0.683           0.598   0.468      0.95
Decision Tree   Bigram     Path B            0.668           0.617   0.489      0.95
Decision Tree   Bigram     Path C            0.683           0.538   0.507      0.95
Decision Tree   Bigram     Path D            0.661           0.513   0.470      0.95

The following figures from 5-11 to 5-18 illustrate the accuracies and training times of the four corpora.


Figure ‎5-11 Classification accuracy of English Movie Reviews corpus after applying different paths of text processing

Figure ‎5-12 Classification training time of English Movie Reviews corpus after applying different paths of text processing

Figure ‎5-13 Classification accuracy of English IMDB corpus after applying different paths of text processing


Figure ‎5-14 Classification training time of English IMDB corpus after applying different paths of text processing

Figure ‎5-15 Classification accuracy of English Facebook corpus after applying different paths of text processing

Figure ‎5-16 Classification training time of English Facebook corpus after applying different paths of text processing


Figure ‎5-17 Classification accuracy of English Twitter corpus after applying different paths of text processing

Figure ‎5-18 Classification training time of English Twitter corpus after applying different paths of text processing

We can notice from Tables 5-5 and 5-6 that using the NB classifier with bigrams gives the highest accuracy on the benchmark data, while DT gives the higher accuracy on the OSN data and the review data. Table 5-7 shows the F-measure of the various tests: the NB classifier gives the highest F-measure on the benchmark data and a slightly better F-measure than DT on the OSN and review data, except for the Twitter corpus, which is extremely unbalanced.

Figure 5-11 shows that NB gives higher accuracy than DT on the benchmark data and that bigrams are better than unigrams when using NB. Path D increases the accuracy in the case of NB with unigrams. The text processing techniques decrease the accuracy of the DT classifier slightly but decrease its training time dramatically, as shown in Figure 5-12. This figure shows a logarithmic plot of the classifiers' training time on the movie reviews; it shows that DT takes a much longer training time than NB. The text processing techniques also decrease the training time of the NB classifier.

Figure 5-13 shows that DT gives higher accuracy than NB on the reviews corpus when no text processing is applied, and that unigrams are better than bigrams when using NB. There is no significant difference between unigrams and bigrams when using DT. The text processing techniques increase the accuracy of the NB classifier but do not affect the accuracy of the DT classifier much. The training time of DT is still higher than that of NB, but the difference is not as big as for the benchmark corpus. The text processing techniques decrease the training time of both classifiers, as shown in Figure 5-14.

Figure 5-15 shows that DT gives higher accuracy than NB on the Facebook corpus. Unigrams are slightly better than bigrams with both classifiers. The text processing techniques decrease the accuracy of the NB classifier but do not affect the accuracy of the DT classifier much. The training time of DT is higher than that of NB, and the text processing techniques decrease the training time of both classifiers, as shown in Figure 5-16.

Figure 5-17 shows that DT gives much higher accuracy than NB on the Twitter corpus and that unigrams are better than bigrams when using NB. There is no significant difference between unigrams and bigrams when using DT. Applying all the text processing techniques dramatically increases the accuracy of the NB classifier but does not affect the accuracy of the DT classifier. The training time of DT is still higher than that of NB, and the text processing techniques decrease the training time of both classifiers, as shown in Figure 5-18.

5.4.4 Discussion
The following two subsections contain an analysis of the corpora themselves and of the results.

5.4.4.1 Corpora Analysis
The neutral reviews from IMDB represent 8% of the whole data, which is not a large number. We believe that people who write whole reviews on review sites mostly have a complete opinion about the movie and want to show it; they do not tend to be neutral. The positive reviews represent 69% of the data, while the negative reviews represent 21%. The data are obviously unbalanced: since the movie was a big hit, not many user reviews were negative. The neutral tweets represent 52% of the whole data. These are not neutral opinions about the movie either; the neutral tweets are mainly objective sentences that do not contain any sentiment. Many of the tweets on the movie page were debates between the movie cast and the users (e.g. #AskAndrew), and others expressed the users' personal feelings, such as being excited to see the movie. The positive tweets represent 44% of the data and the negative tweets only 3%, which is an extremely small percentage. We believe that people who mention the hashtag of the movie do like it; the tweets from the movie page were mainly retweets from people who loved the movie, and they do not retweet negative opinions about their movie. A number of implicit sentiments were present in the OSN data, while there were no implicit sentiments in the reviews. The number of comments with implicit sentiment in


Facebook was 54, representing 2.8% of the whole data, and the number of tweets with implicit sentiment was 11, representing 2% of the whole data. Implicit sentiments are sentences that do not express a direct sentiment, like "Tobey Maguire is still my Spiderman". The actor mentioned here is the hero of the older Spiderman movies, so the sentence implies a negative sentiment towards the current movie, as the user prefers the old ones. Abbreviations and smiley faces are used very frequently in OSN, and some abbreviations were also used in IMDB. The meanings of these abbreviations and smiley faces were found from different sources on the web (Yahoo Answers and Facebook emoticon sites). Table 5-8 contains a sample of the abbreviations and smiley faces found in the three corpora.

Table 5-8 Sample of abbreviations and smiley faces found in English Facebook, Twitter, and IMDB

Abbreviation / smiley    Meaning
cgi                      computer generated imagery
Idk                      I don't know
(heart emoticon)         قلب (heart)
^_^                      مبسوط (happy)
*_*                      متشوق (excited)

5.5.3.2 Specializations of the Arabic Language
Words with the same meaning can be written in different correct ways, like the words "هنروح، حنروح". Both give the future tense of the verb "نروح", which means "we will go"; three English words are written as a single Arabic word with the same meaning. Pronouns that are separate words in English are expressed in Arabic by adding a prefix letter that modifies the verb, especially in the middle of a sentence, as in "اروح، نروح", which mean (I go, we go) respectively. Some prepositions and causal words are expressed in Arabic by a single letter, as in "انى، النى", which mean (I am, because I am) respectively.


The many forms that Arabic words can take are a very common characteristic of MSA and make dealing with the language complicated. For DA the situation is even harder: every Arab country has its own dialect, and there are different dialects within the same country. In the Egyptian dialect there are many words that have no resemblance in MSA, like the word "مفيش", which means (there is not); its only correspondent in MSA is "ال يوجد", a completely different expression. In the OSN corpora some other dialects appear as well, like the Moroccan word "بزاف", which means (too much), and the Syrian word "مليح", which means (good). Other dialects represent 1% of the Facebook corpus and 0.5% of the Twitter corpus, which are very small percentages, and there were no other dialects in the reviews corpus; the reviews mix MSA words with Egyptian dialect words, as they are user reviews rather than formal reviews from critics. The other phenomenon of Arab users is the use of Franco-Arab, which means writing Arabic words with English letters, like "de7k", which stands for "ضحك" and means (laugh). Franco-Arab comments represent 18% of the Facebook corpus, which is not a big percentage, and Franco-Arab tweets represent 3% of the Twitter corpus, which is a small percentage. However, we have to unify the language for the classifier to perform well, and these are not English words with meanings, so they must be rewritten in Arabic letters. We used the website (www.yamli.com), which gives variations for each word from which the correct one has to be chosen. Sometimes the users do not even write correct words in Franco-Arab; in this case the site transliterates the letters only, which gives odd Arabic words. This transformation was therefore revised manually.

5.5.3.3 Results Analysis
In the case of a lexically rich corpus like the reviews, using the corpus-based list decreases the accuracy of classification, which is similar to what [123] found. In the case of the OSN corpora, which are not lexically rich, the three lists did not vary the accuracies much, but the general lists containing Egyptian dialect stopwords still give better results than using MSA stopwords only. The difference in performance between the Facebook and Twitter data is due to the degree of imbalance: the nature of the data is the same, but the Facebook corpus is much more unbalanced than the Twitter corpus. A Decision Tree is a hierarchical decomposition of the data space and does not depend on calculating probabilities, while Naïve Bayes calculates probabilities over the whole data. Although NB usually gives higher accuracy than DT, this was not the case when testing these corpora, because of the imbalance of the data: the positive class in these cases was much bigger than the negative class. NB computes its probabilities over the whole data, whereas DT builds a more specific hierarchical decomposition of it, which is why DT is better for unbalanced data. DT still has a longer processing time than NB because it builds the hierarchical decomposition over the whole data, but the difference in time is not big since the data size was not large. In the NB tests, the accuracy is better when using unigrams, which is similar to what [45] found. In the DT tests, unigrams and bigrams give nearly similar results.


Chapter 6: Conclusion and Future Work
6.1 Conclusion
In this thesis we have proposed a framework for preparing and using corpora from OSN and review sites for the SA task in two different natural languages (English and Arabic). The framework consists of three phases. The first phase is the preprocessing and cleaning of the collected data, followed by data annotation, which produces the prepared corpora. The second phase applies various text processing techniques to the prepared corpora, including replacing the negation words and the following negated words with the antonyms of the negated words, removing stopwords, and keeping only selected POS tags (adjectives and verbs). The third phase is text classification using the Naïve Bayes (NB) and Decision Tree (DT) classifiers with two feature selection approaches, unigrams and bigrams. We evaluated the performance of the classifiers in terms of accuracy, F-measure, and training time. All the text processing techniques were applied to the English language corpora, while only stopword removal was applied to the Arabic language corpora, due to the lack of Arabic language resources.

We selected a specific topic (a single movie) in English and downloaded data about it from three different sources (IMDB, Facebook, and Twitter). We also downloaded Arabic data about movies from three different sources (elcinema.com, Facebook, and Twitter). The data were extremely unbalanced, contained much spam such as advertising URLs and debates, and made heavy use of abbreviations and smiley faces, so many preprocessing and cleaning steps were needed to prepare them for classification.

We also presented a component analysis of the proposed framework on the English language corpora, aiming to analyze the effectiveness of applying each stage of the text processing techniques proposed in the framework. The corpora used for testing were the benchmark movie reviews corpus, reviews downloaded from the IMDB site, comments from Facebook, and tweets from Twitter, all on the same topic (a single movie). These tests were made after splitting each corpus into 75% training data and 25% testing data.

We also proposed a methodology for generating a stopword list from OSN corpora. The methodology consists of three phases: calculating the words' frequency of occurrence, checking the validity of a word to be a stopword, and adding possible prefixes and suffixes to the generated words. We generated a stopword list of Egyptian dialect and a corpus-based list to be used with the OSN corpora, and compared them with other lists: previously generated MSA lists, the corpus-based generated list, the general generated list of Egyptian dialect, and a combination of the Egyptian dialect list with the MSA list.

The results of the DT classifier are better than those of the NB classifier for OSN and review data in general: the accuracies were higher, and the F-measure was sometimes higher when the imbalance of the data was extreme, and otherwise comparable. NB is better for the balanced benchmark data, and using bigrams gives better results than unigrams for the benchmark data. Applying the text processing techniques increases the accuracy in some tests with the NB classifier but does not affect the DT classifier much; the penalty in accuracy is small, while it gives the least training time.

The benchmark movie reviews corpus gives its highest accuracy and highest F-measure after replacing the negation words and the following negated words with the antonyms of the negated words, when using NB with bigrams as features. The IMDB corpus gives its highest accuracy and highest F-measure after the negation replacement plus stopword removal, when using NB with unigrams as features. The Facebook corpus gives its highest accuracy with no text processing when using DT with unigrams, and its highest F-measure with no text processing when using NB with unigrams. The Twitter corpus gives its highest accuracy and highest F-measure with any combination of text processing and features when using the DT classifier. This suggests that DT is the better classifier for unbalanced OSN data, while NB is better for the balanced review data. The least training time was always achieved after applying all the text processing techniques along with NB and unigrams, for all corpora.

Applying stopword removal to the Arabic corpora with multiple lists shows that the corpus-based list negatively affects the accuracy of classification in the case of the reviews, which are more lexically rich than the OSN corpora. It also shows that the general lists containing Egyptian dialect words give better performance than using lists of MSA stopwords only. The results of the DT classifier are better than those of the NB classifier for these kinds of corpora, and unigrams give better results than bigrams.

To sum up the conclusion, these are the answers to the research questions asked in the first chapter:
-Question 1: The difference between the Arabic and English languages in preparing corpora is that Arabic needs two more steps: translating the English words that appear in the text and converting Franco-Arab into Arabic words, to unify the language for classification. The difference in text processing is the lack of resources for Arabic, in addition to the fact that the language of OSN users is DA, not MSA. The taggers and WordNet for Arabic are still under development and do not yet cover DA. That is the reason only one stage of the framework, removing stopwords, was applied to Arabic; a stopword list of Egyptian dialect was created in order to apply stopword removal to the Arabic data written in Egyptian dialect.
-Question 2: Text processing is important in SA of OSN because it reduces the classification time. This decrease is dramatic in the case of hierarchical classifiers such as DT.
-Question 3: Text processing does not have the same effect with every classifier; it gives better results with NB than with DT.


-Question 4: The data from OSN is extremely unbalanced. The data imbalance negatively affects the SA results; text processing improves the results in the case of NB, but DT gives better results than NB whether text processing is applied or not.
-Question 5: The OSN data is different from user review data: the review data is lexically rich and does not contain much spam, while the OSN data is noisy and contains much spam such as advertising URLs and hashtags. Abbreviations and smiley faces are used much more frequently in OSN than in user reviews.
-Question 6: There is a difference between Facebook and Twitter users. Facebook gives more space to say anything anywhere, so the amount of unrelated data posted on a movie page is large, while Twitter users who mention a hashtag with the movie name really do mean the movie. The short messages of Twitter make the users say everything to the point. The use of hashtags in Twitter is more effective than in Facebook, and Twitter contains more advertising URLs.

6.2 Future Work
In the future we plan to test the algorithms with different classifiers and feature selection methods. We plan to try more text processing techniques on Arabic OSN data, such as POS tagging, and to try to fill the gap of handling Arabic dialects in OSN data, since almost all resources are built for MSA. We could also tackle dialects other than Egyptian. This could be done by building resources for Dialectal Arabic, such as a POS tagger and a WordNet, and by designing a dialect detector that distinguishes the different dialects of Arabic through morphological analysis.


References [1]

[2]

[3]

[4]

[5]

[6]

[7] [8] [9] [10]

[11]

[12]

[13]

[14] [15]

Charu C. Aggarwal , ChengXiang Zhai, "Mining Text Data". Springer New York Dordrecht Heidelberg London: © Springer Science+Business Media, LLC’12, 2012. Douglas E. Appelt, Jerry R. Hobbs, John Bear, David Israel, and Mabry Tyson, "A finite-state processor for information extraction from real-world text," in 13th International Joint Conference on Artificial Intelligence, 1993. Michele Banko, Michael J. Cafarella, Stephen Soderland, Matthew Broadhead, and Oren Etzioni, "Open information extraction from the Web.," in 20th International Joint Conference on Artificial Intelligence, 2007, pp. 2670–2676. Mary Elaine Califf and Raymond J. Mooney, "Relational learning of patternmatch rules for information extraction.," in 16th National Conference on Artificial Intelligence and the 11th Innovative Applications of Artificial Intelligence, 1999, pp. 328–334. F. Ciravegna, "Adaptive information extraction from text by rule induction and generalisation.," presented at the 17th International Joint Conference on Artificial Intelligence, 2001. Stephen Soderland, David Fisher, Jonathan Aseltine, and Wendy Lehnert, "CRYSTAL inducing a conceptual dictionary.," in 14th International Joint Conference on Artificial Intelligence,, 1995, pp. 1314–1319. N. Habash, "Introduction to Arabic natural language processing," Synthesis Lectures on Human Language Technologies, vol. 3, 2010. T. Joachims, Learning to Classify Text Using Support Vector Machines: Methods, Theory and Algorithms: Norwell, MA, USA, 2002. Bo Pang and Lillian Lee, "Opinion mining and sentiment analysis," Found. Trends Information Retrieval, 2008. Tong Zhang and David Johnson, "A robust risk minimization based named entity recognition system.," presented at the seventh conference on Natural language learning at HLT-NAACL, 2003. A. Ratnaparkhi, "A maximum entropy model for part-ofspeech tagging.," in Proceedings of the Conference on Empirical Methods in Natural Language Processing, April 1996. Songbo Tan, Yuefen Wang, "Weighted SCL model for adaptation of sentiment classification," Expert Systems with Applications, vol. 38, pp. 10524-10531, 2011. Qiong Wu, Songbo Tan, "A two-stage framework for cross-domain sentiment classification," Expert Systems with Applications, vol. 38, pp. 14269-14275, 2011. M. F. Porter, "An algorithm for suffix stripping," Program, vol. 14, pp. 130–137, 1980. X.-H. Phan, L.-M. Nguyen, and S. Horiguchi, "Learning to classify short and sparse text & web with hidden topics from large-scale data collections," in


[16] [17]

[18] [19]

[20]

[21]

[22]

[23]

[24] [25] [26] [27] [28]

[29] [30] [31] [32] [33]

Proceeding of the 17th international conference on World Wide Web, 2008, pp. 91–100. D. Margineantu, W. Wong, and D. Dash, "Machine learning algorithms for event detection," Machine Learning, vol. 79, pp. 257–259, 2010. T. Sakaki, M. Okazaki, and Y. Matsuo, "Earthquake shakes twitter users: realtime event detection by social sensors," in Proceedings of the 19th international conference on World wide web,, 2010, pp. 851–860. C. Macdonald, I. Ounis, and I. Soboroff, "Overview of the trec-2009 blog track," in Proceedings of TREC 2009, 2010. C. Lin, B. Zhao, Q. Mei, and J. Han, "Pet: a statistical model for popular events tracking in social communities," in Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining, 2010, pp. 929–938. K. Wang, Z. Ming, X. Hu, and T. Chua, "Segmentation of multisentence questions: towards effective question retrieval in cQA services," in Proceeding of the 33rd international ACM SIGIR conference on Research and development in information retrieval, 2010, pp. 387–394. B. Sigurbjornsson and R. Van Zwol, "Flickr tag recommendation based on collective knowledge," in Proceeding of the 17th international conference on World Wide Web, 2008, pp. 327–336. E. Gabrilovich and S. Markovitch, "Feature generation for text categorization using world knowledge," presented at the International joint conference on artificial intelligence, 2005. X. Hu, X. Zhang, C. Lu, E. K. Park, and X. Zhou, "Exploiting wikipedia as external knowledge for document clustering," in Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, 2009, pp. 389–396. C. C. Aggarwal, "Data Streams: Models and Algorithms," 2007. D. Fisher, "Knowledge Acquisition via incremental conceptual clustering," Machine Learning, vol. 2, pp. 139–172, 1987. J. H. Gennari, P. Langley, D. Fisher, "Models of incremental concept formation," Journal of Artificial Intelligence, pp. 11–61, 1989. S. Zhong, "Efficient Streaming Text Clustering," Neural Networks, vol. 18, 2005. C. C. Aggarwal, P. S. Yu, "A Framework for Clustering Massive Text and Categorical Data Streams," presented at the SIAM Conference on Data Mining, 2006. T. Brants, F. Chen, and A. Farahat, "A system for new event detection," presented at the ACM SIGIR Conference, 2003. G. Fung, J. Yu, P. Yu, and H. Lu, "Parameter Free Bursty Events Detection in Text Streams," presented at the VLDB Conference, 2005. Y.Yang, J. Carbonell, and C. Jin, "Topic-conditioned Novelty Detection," presented at the ACM KDD Conference, 2008. D. Lewis, "The TREC-4 filtering track: description and analysis," in Proceedings of TREC-4, 4th Text Retrieval Conference, 1995, pp. 165–180. Androutsopoulos, J. Koutsias, K. V. Chandrinos, and C. D. Spyropoulos, "An experimental comparison of naive Bayesian and keyword based anti-spam


[34]

[35] [36]

[37]

[38]

[39]

[40]

[41] [42] [43]

[44]

[45]

[46]

[47]

[48]

[49]

filtering with personal e-mail messages," in Proceedings of the ACM SIGIR Conference, 2000. T. Salles, L. Rocha, G. Pappa, G. Mourao, W. Meira Jr., and M. Goncalves, "Temporally-aware algorithms for document classification," presented at the ACM SIGIR Conference, 2010. Y. Zhang, X. Li, and M. Orlowska, "One Class Classification of Text Streams with Concept Drift," presented at the ICDMW Workshop, 2008. W. Cohen, Y. Singer, "Context-sensitive learning methods for text categorization," ACM Transactions on Information Systems, vol. 17, pp. 141– 173, 1999. Y. Freund, R. Schapire, Y. Singer, and M. Warmuth, "Using and combining predictors that specialize," in Proceedings of the 29th Annual ACM Symposium on Theory of Computing, 1997, pp. 334–343. A. Kontostathis, L. Galitsky,W. M. Pottenger, S. Roy, and D. J. Phelps, "A survey of emerging trend detection in textual data mining," Survey of Text Mining, pp. 185–224, 2003. Q. Mei, X. Zhai, "Discovering Evolutionary Theme Patterns from Text- An Exploration of Temporal Text Mining," presented at the ACM KDD Conference, 2005. N. Jindal and B. Liu, "Identifying comparative sentences in text documents," in Proceedings of ACM SIGIR Conf. on Research and Development in Information Retrieval (SIGIR-2006), 2006. N. Jindal and B. Liu, "Mining comparative sentences and relations," in Proceedings of National Conf. on Artificial Intelligence (AAAI-2006), 2006. B. Liu, Sentiment analysis and subjectivity, second ed., 2010. Minging Hu and Bing Liu, "Mining and summarizing customer reviews," in Proceedings of ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD’04), 2004. B. Liu, M. Hu, and J. Cheng, "Opinion observer: analyzing and comparing opinions on the web," in Proceedings of International Conference on World Wide Web (WWW-2005), 2005. B. Pang, L. Lee, and S. Vaithyanathan, "Thumbs up?: sentimentclassification using machine learning techniques," in Proceedings of Conference on Empirical Methods in Natural Language Processing (EMNLP-2002), 2002. P. Turney, "Thumbs up or thumbs down?: semantic orientation applied to unsupervised classification of reviews," in Proceedings of Annual Meeting of the Association for Computational Linguistics (ACL-2002), 2002. E. Riloff, S. Patwardhan, and J. Wiebe, "Feature subsumption for opinion analysis," in Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP-2006), 2006. E. Riloff and J. Wiebe, "Learning extraction patterns for subjective expressions," in Proceedings of Conference on Empirical Methods in Natural Language Processing (EMNLP-2003), 2003. T. Wilson, J. Wiebe, and R. Hwa, "Just how mad are you? finding strong and weak opinion clauses," in Proceedings of National Conference on Artificial Intelligence (AAAI-2004), 2004.


[50] [51]

[52] [53]

[54]

[55]

[56]

[57]

[58]

[59]

[60] [61] [62]

[63]

[64]

[65]

G. Miller, R. Beckwith, C. Fellbaum, D. Gross, and K. Miller, "WordNet: an online lexical database," Oxford Univ. Press., 1990. S. Mohammad, C. Dunne., and B. Dorr, "Generating high-coverage semantic orientation lexicons from overly marked words and a thesaurus," in Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP-2009), 2009. S. Kim and E. Hovy, "Determining the sentiment of opinions," in Proceedings of Interntional Conference on Computational Linguistics (COLING’04), 2004. V. Hatzivassiloglou and K. McKeown, "Predicting the semantic orientation of adjectives," in Proceedings of Annual Meeting of the Association for Computational Linguistics (ACL’97), 1997. X. Ding, B. Liu, and P. Yu, "holistic lexicon-based approach to opinion mining," in Proceedings of the Conference on Web Search and Web Data Mining (WSDM2008), 2008. N. Jakob and I. Gurevych, "Extracting opinion targets in a singleand crossdomain setting with conditional random fields," in Proceedings of Conference on Empirical Methods in Natural Language Processing (EMNLP-2010), 2010. J. Lafferty, A. McCallum, and F. Pereira, "Conditional random fields: probabilistic models for segmenting and labeling sequence data," in Proceedings of International Conference on Machine Learning (ICML-2001), 2001. D. Freitag and A. McCallum, "Information extraction with HMM structures learned by stochastic optimization," in Proceedings of National Conf. on Artificial Intelligence (AAAI-2000), 2000. W. Jin and H. Ho, "A novel lexicalized HMM-based learning framework for web opinion mining," in Proceedings of International Conference on Machine Learning (ICML-2009), 2009. W. Jin and H. Ho, "OpinionMiner: a novel machine learning system for web opinion mining and extraction," in Proceedings of ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2009), 2009. T. Wilson, J. Wiebe, P. Hoffman, "Recognizing contextual polarity in phraselevel sentiment analysis," in Proceedings of HLT/EMNLP, 2005. B. Liu, "Sentiment Analysis and Opinion Mining," Synthesis Lectures on Human Language Technologies, 2012. Liang-Chih Yu, Jheng-Long Wu, Pei-Chann Chang, Hsuan-Shou Chu, "Using a contextual entropy model to expand emotion words and their intensity for the sentiment classification of stock market news," Knowledge-Based Systems, vol. 41, pp. 89-97, 2013. Michael Hagenau, Michael Liebmann, Dirk Neumann, "Automated news reading: Stock Price Prediction based on Financial News Using Context-Capturing Features," Decision Support Systems, 2013. Tao Xu, Qinke Peng, Yinzhao Cheng, "Identifying the semantic orientation of terms using S-HAL for sentiment analysis," Knowledge-Based Systems, vol. 35, pp. 279-289, 2012. Isa Maks, Piek Vossen, "A lexicon model for deep sentiment analysis and opinion mining applications," Decision Support Systems, vol. 53, pp. 680-688, 2012.


[66] [67] [68] [69] [70]

[71] [72]

[73]

[74] [75] [76] [77]

[78] [79]

[80]

[81] [82] [83]

[84] [85]

Mikalai Tsytsarau, Themis Palpanas, "Survey on mining subjective data on the web," Data Min Knowledge Discovery, vol. 24, pp. 478–514, 2012. Bo Pang and Lillian Lee, "Opinion mining and sentiment analysis," Found. Trends Information Retrieval, 2008. E. Cambria, B. Schuller, Y. Xia, andC. Havasi, "New Avenues in Opinion Mining and Sentiment Analysis," IEEE Intelligent Systems, vol. 28, pp. 15-21, 2013. R. Feldman, "Techniques and Applications for Sentiment Analysis," Communications of the ACM, vol. 56, pp. 82-89, 2013. Andrés Montoyo, Patricio Martínez-Barco, Alexandra Balahur, "Subjectivity and sentiment analysis: An overview of the current state of the area and envisaged developments," Decision Support Systems, vol. 53, pp. 675-679, 2012. W. Medhat, A. Hassan and H. Korashy, "Sentiment analysis algorithms and applications: A survey," Ain Shams Engineering Journal, 2014. Casey Whitelaw, Navendu Garg, Shlomo Argamon, "Using appraisal groups for sentiment analysis," in Proceedings of the ACM SIGIR Conference on Information and Knowledge Management (CIKM), 2005, pp. 625–631. Diana Maynard and Adam Funk, "Automatic detection of political opinions in Tweets,", in proceedings of the 8t international conference on the semantic web, ESWC’11, pp. 88-99, 2011. C. Cortes, V. Vapnik, "Support-vector networks," presented at the Machine Learning, 1995. V. Vapnik, "The Nature of Statistical Learning Theory," New York, 1995. T. Joachims, "Probabilistic Analysis of the Rocchio Algorithm with TFIDF for Text Categorization," presented at the ICML Conference, 1997. M. Aizerman, E. Braverman, L. Rozonoer, "Theoretical foundations of the potential function method in pattern recognition learning," Automation and Remote Control, pp. 821–837, 1964. M. Ruiz, P. Srinivasan, "Hierarchical neural networks for text categorization," presented at the ACM SIGIR Conference, 1999. Hwee Tou Ng, Wei Goh, Kok Low, "Feature selection, perceptron learning, and a usability case study for text categorization," presented at the ACM SIGIR, Conference, 1997. Rodrigo Moraes, João Francisco Valiati, Wilson P. Gavião Neto, "Documentlevel sentiment classification: An empirical comparison between SVM and ANN," Expert Systems with Applications, vol. 40, pp. 621-633, 2013. J. R. Quinlan, "Induction of Decision Trees," Machine Learning, vol. 1, pp. 81– 106, 1986. David D. Lewis, Marc Ringuette, "A comparison of two learning algorithms for text categorization," SDAIR, 1994. Soumen Chakrabarti, Shourya Roy, Mahesh V. Soundalgekar, "Fast and Accurate Text Classification via Multiple Linear Discriminant Projections," VLDB Journal, vol. 2, pp. 172–185, 2003. Y. Li, A. Jain, "Classification of text documents," The Computer Journal, vol. 41, pp. 537–546, 1998. Bing Liu, Wynne Hsu, Yiming Ma, "Integrating Classification and Association Rule Mining," presented at the ACM KDD Conference, 1998.


[86]

Youngjoong Ko, Jungyun Seo, "Automatic Text Categorization by Unsupervised Learning," in Proceedings of COLING-00, the 18th International Conference on Computational Linguistics, 2000. [87] Yulan He a, Deyu Zhou, "Self-training from labeled features for sentiment analysis," Information Processing and Management, vol. 47, pp. 606-616, 2011. [88] Read, J., & Carroll, J., "Weakly supervised techniques for domain-independent sentiment classification," in Proceeding of the 1st international CIKM workshop on topic-sentiment analysis for mass opinion, 2009, pp. 45–52. [89] María-Teresa Martín-Valdivia, Eugenio Martínez-Cámara, Jose-M. Perea-Ortega, L. Alfonso Ureña-López, "Sentiment polarity detection in Spanish reviews combining supervised and unsupervised approaches," Expert Systems with Applications, 2013. [90] Guang Qiu, Xiaofei He, Feng Zhang, Yuan Shi, Jiajun Bu, Chun Chen, "DASA: Dissatisfaction-oriented Advertising based on Sentiment Analysis," Expert Systems with Applications, vol. 37, pp. 6182-6191, 2010. [91] Fahrni A, Klenner M, "Old wine or warm beer: target-specific sentiment analysis of adjectives," In: Proceedings of the symposium on affective language in human and machine, AISB, 2008, pp. 60–63. [92] Leung CWK, Chan SCF, Chung FL, "Integrating collaborative filtering and sentiment analysis: a rating inference approach," In: ECAI’06 workshop on recommender systems, 2006, pp. 62–66. [93] S. Deerwester, S. Dumais, T. Landauer, G. Furnas, R. Harshman, "Indexing by Latent Semantic Analysis," JASIS, vol. 41, pp. 391–407, 1990. [94] Qing Cao, Wenjing Duan, Qiwei Gan, "Exploring determinants of voting for the “helpfulness” of online user reviews: A text mining approach," Decision Support Systems, vol. 50, pp. 511-521, 2011. [95] Mao-Yuan Pai, Hui-Chuan Chu, Su-Chen Wang, Yuh-Min Chen, "Electronic word of mouth analysis for service experience," Expert Systems with Applications, vol. 40, pp. 1993-2006, 2013. [96] Wenhao Zhang, Hua Xu, Wei Wan, "Weakness Finder: Find product weakness from Chinese reviews by using aspects based sentiment analysis," Expert Systems with Applications, vol. 39, pp. 10283-10291, 2012. [97] Igor A. Bolshakov and Alexander Gelbukh, Computational Linguistics (Models, Resources, Applications), 2004. [98] A. Moreo, M. Romero, J.L. Castro, J.M. Zurita, "Lexicon-based Commentsoriented News Sentiment Analyzer system," Expert Systems with Applications, vol. 39, pp. 9166-9180, 2012. [99] Hye-Jin Min, Jong C. Park, "Identifying helpful reviews based on customer’s mentions about experiences," Expert Systems with Applications, vol. 39, pp. 11830-11838, 2012. [100] Songbo Tan, Qiong Wu, "A random walk algorithm for automatic construction of domain-oriented sentiment lexicon," Expert Systems with Applications, pp. 12094-12100, 2011. [101] Livio Robaldo, Luigi Di Caro, "OpinionMining-ML," Computer Standards & Interfaces, 2012.


[102] Ester Boldrini, Alexandra Balahur, Patricio Martínez-Barco, Andrés Montoyo, "Using EmotiBlog to annotate and analyse subjectivity in the new textual genres," Data Mining Knowledge Discovery, vol. 25, pp. 603–634, 2012. [103] Josef Steinberger, Mohamed Ebrahim, Maud Ehrmann, Ali Hurriyetoglu, Mijail Kabadjov, Polina Lenkova, Ralf Steinberger, Hristo Tanev, Silvia Vázquez, Vanni Zavarella, "Creating sentiment dictionaries via triangulation," Decision Support Systems, vol. 53, pp. 689-694, 2012. [104] Albert Bifet and Eibe Frank, "Sentiment Knowledge Discovery in Twitter Streaming Data," 2010. [105] A. Go and R. Bhayani, "Exploiting the Unique Characteristics of Tweets for Sentiment Analysis," 2010. [106] M. Gamon, "Sentiment classification on customer feedback data: Noisy data, large feature vectors, and the role of linguistic analysis," in Proceedings of COLING'04, Geneva, Switzerland, 2004, pp. 841-847. [107] J. Perkins, Python Text Processing with NLTK 2.0 Cookbook. Birmingham: Packt Publishing Ltd., 2010. [108] Bo Pang and Lillian Lee, "Opinion mining and sentiment analysis," Foundations and Trends in Information Retrieval, vol. 2, pp. 1–135, 2008. [109] Marilyn A. Walker, Pranav Anand, Rob Abbott, Jean E. Fox Tree, Craig Martell, Joseph King, "That is your evidence?: Classifying stance in online political debate," Decision Support Systems, vol. 53, pp. 719-729, 2012. [110] A. Kennedy and D. Inkpen, "Sentiment Classification of Movie Reviews Using Contextual Valence Shifters," Computational Intelligence, 2006. [111] F.Benevenuto, G. Magno, T. Rodrigues, and V.Almeida, "Detecting Spammers on Twitter," presented at the CEAS 2010 Seventh annual Collaboration, Electronic messaging, AntiAbuse and Spam Conference, Redmond, Washington, US, 2010. [112] M. Wiegand, A. Balahur, B. Roth, D. Klakow, and A. Montoyo, "A Survey on the Role of Negation in Sentiment Analysis," in Proceedings of the Workshop on Negation and Speculation in Natural Language Processing, Uppsala, 2010, pp. 60–68. [113] A. Pak and P. Paroubek, "Twitter as a Corpus for Sentiment Analysis and Opinion Mining," in LREC, 2010. [114] E. Kouloumpis, T. Wilson and J. Moore, "Twitter sentiment analysis: The good the bad and the omg!," ICWSM, vol. 11, pp. 538-541, 2011. [115] R. Mehta, D. Mehta, D. Chheda, C. Shah and P. M. Chawan, "Sentiment analysis and influence tracking using twitter," International Journal of Advanced Research in Computer Science and Electronics Engineering (IJARCSEE), vol. 1, pp. -72, 2012. [116] A. Agarwal, B. Xie, I. Vovsha, O. Rambow, and R. Passonneau, "Sentiment analysis of twitter data," in Proceedings of the Workshop on Languages in Social Media, 2011, pp. 30-38. [117] K. Versteegh, C. Versteegh, The Arabic Language, Columbia University Press, 1997. [118] M. Korayem, D. Crandall, and M. Abdul-Mageed, Subjectivity and Sentiment Analysis of Arabic: A Survey, AMLTA 2012, CCIS 322, pp. 128–139, 2012.


[119] Amira Shoukry and Ahmed Refea, "Sentence-level Arabic sentiment analysis," presented at the IEEE international conference on Collaboration Technologies and Systems (CTS), 2012. [120] Muhammad Abdul-Mageeda, Mona Diab, and Mohammed Korayem, "Subjectivity and Sentiment Analysis of Modern Standard Arabic," in Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics:shortpapers, Portland, Oregon, 2011, pp. 587–591. [121] Muhammad Abdul-Mageeda, Mona Diab, and Sandra Kübler, "SAMAR: Subjectivity and sentiment analysisfor Arabic social media," Computer Speech and Language, 2013. [122] R. Al-Shalabi, G. Kanaan, Jihad M. Jaam, A. Hasnah, and E. Hilat, Stop-word removal algorithm for Arabic language, in 1st International Conference on Information and Communication Technologies: From Theory to Applications, Damascus, 2004, pp. 545-550. [123] Ibrahim Abu El-Khair, "Effects of stop words elimination for Arabic information retrieval: a comparative study," International Journal of Computing & Information Sciences, vol. 4, pp. 119-133, 2006. [124] A. Alajmi, E. M. Saad, and R. R. Darwish, Toward an ARABIC Stop-Words List Generation, International Journal of Computer Applications, vol. 46, 2012. [125] B. Santorini, "Part-of-speech tagging guidelines for the Penn Treebank Project," ed. University of Pennsylvania, School of Engineering and Applied Science, Dept. of Computer and Information Science., 1990. [126] F. Benamara, C. Cesarano, A. Picariello, D. Reforgiato, and V. S. Subrahmanian, "Sentiment analysis: Adjectives and adverbs are better than adjectives alone," in Proceedings of the International Conference on Weblogs and Social Media (ICWSM)'07, 2007. [127] T. Nasukawa and J. Yi, "Sentiment analysis: Capturing favorability using natural language processing," in Proceedings of the Conference on Knowledge Capture (K-CAP), 2003. [128] J. M. Wiebe, T. Wilson, R. Bruce, M. Bell, and M. Martin, "Learning subjective language," Computational Linguistics, vol. 30, pp. 277–308, 2004. [129] A. McCallum and K. Nigam, "A Comparison of Event Models for Naive Bayes Text Classification," presented at the AAAI Workshop on Learning for Text Categorization, 1998. [130] T. Wilson, J. Wiebe, and P. Hoffmann, "Recognizing contextual polarity in phrase-level sentiment analysis," presented at the HLT-EMNLP-2005, 2005. [131] Kotsiantis, B. Sotiris, I. D. Zaharakis, and P. E. Pintelas, "Supervised machine learning: A review of classification techniques," pp. 3-24, 2007. [132] Chawla, V. Nitesh, N. Japkowicz, and A. Kotcz, "Editorial: special issue on learning from imbalanced data sets," CM Sigkdd Explorations Newsletter, vol. 6, pp. 1-6, 2004. [133] David A. Cieslak, and N. V. Chawla, "Learning decision trees for unbalanced data.," Machine Learning and Knowledge Discovery in Databases, pp. 241-256, 2008.


[134] Ferri, César, P. Flach, and J. Hernández-Orallo, "Learning decision trees using the area under the ROC curve," ICML, vol. 2, 2002. [135] N. V. Chawla, "C4. 5 and imbalanced data sets: investigating the effect of sampling method, probabilistic estimate, and decision tree structure," in Proceedings of the ICML, 2003. [136] S. C. Wei Liu, David A. Cieslak and N. V. Chawla, "Robust Decision Tree Algorithm for Imbalanced Data Sets," SDM, 2010, pp. 766-777.

80

Appendix A: English Stopword List

all just being over both through yourselves its before herself had should to only under ours has do them his very they not during now him nor did this she each further where few because doing some are our ourselves out what for
while does above between t be we who were here hers by on about of against s or own into yourself down your from her their there been whom too themselves was until more himself that but don with than those he me
myself these up will below can theirs my and then is am it an as itself at have in any if again no when same how other which you after most such why a off i yours so the having once
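As an illustration only (not code taken from the thesis), the short Python sketch below shows how a list such as the one above can be applied to drop stopwords from a review before feature extraction; the abbreviated stopword set, the helper name remove_stopwords, and the sample sentence are assumptions made for brevity.

    # Minimal sketch of stopword removal with a list like the one in Appendix A.
    ENGLISH_STOPWORDS = {
        "all", "just", "being", "over", "both", "through", "its", "before",
        "had", "should", "to", "only", "has", "do", "them", "his", "very",
        "not", "this", "the", "a", "an", "is", "are", "of", "on", "in", "as",
    }  # abbreviated here; the full list is given above

    def remove_stopwords(text):
        """Lowercase the text, split on whitespace, strip punctuation, and drop stopwords."""
        tokens = [t.strip('.,!?;:"()') for t in text.lower().split()]
        return [t for t in tokens if t and t not in ENGLISH_STOPWORDS]

    print(remove_stopwords("This movie is not as good as the reviews suggested."))
    # -> ['movie', 'good', 'reviews', 'suggested']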

Appendix B: Corpus-based Arabic Stopword List

لمصر فمصر انا أنا وانا وأنا فانا فأنا جامد ده دة دا وده ودة ودا فده فدة فدا بده بدة بدا لده لدة لدا بس وبس فبس احمد أحمد واحمد وأحمد

‫فى‬ ‫في‬ ‫وفى‬ ‫وفي‬ ‫و‬ ‫الفيلم‬ ‫فيلم‬ ‫بالفيلم‬ ‫بفيلم‬ ‫للفيلم‬ ‫لفيلم‬ ‫والفيلم‬ ‫وفيلم‬ ‫فالفيلم‬ ‫ففيلم‬ ‫من‬ ‫ومن‬ ‫فمن‬ ‫بمن‬ ‫لمن‬ ‫على‬ ‫علي‬ ‫وعلى‬ ‫وعلي‬ ‫فعلى‬ ‫فعلي‬ ‫ان‬ ‫أن‬ ‫إن‬ ‫وان‬ ‫وأن‬

‫وإن‬ ‫فان‬ ‫فأن‬ ‫فإن‬ ‫بان‬ ‫بأن‬ ‫بإن‬ ‫الن‬ ‫ألن‬ ‫إلن‬ ‫كان‬ ‫كأن‬ ‫وكان‬ ‫وكأن‬ ‫فكان‬ ‫فكأن‬ ‫لكان‬ ‫لكأن‬ ‫جدا‬ ‫جدا‬ ‫الفيل‬ ‫فيل‬ ‫والفيل‬ ‫وفيل‬ ‫بالفيل‬ ‫بفيل‬ ‫للفيل‬ ‫لفيل‬ ‫مصر‬ ‫ومصر‬ ‫بمصر‬ ‫‪83‬‬

‫والي‬ ‫وإلى‬ ‫وإلي‬ ‫فالى‬ ‫فالي‬ ‫فإلى‬ ‫فإلي‬ ‫انه‬ ‫أنه‬ ‫انة‬ ‫أنة‬ ‫وانه‬ ‫وأنه‬ ‫وانة‬ ‫وأنة‬ ‫فانه‬ ‫فأنه‬ ‫فانة‬ ‫فأنة‬ ‫بانه‬ ‫بأنه‬ ‫بانة‬ ‫بأنة‬ ‫النه‬ ‫ألنه‬ ‫النة‬ ‫ألنة‬ ‫تعجب‬ ‫وتعجب‬ ‫فتعجب‬ ‫حلو‬ ‫مشهد‬ ‫المشهد‬ ‫ومشهد‬ ‫والمشهد‬ ‫فمشهد‬

‫فاحمد‬ ‫فأحمد‬ ‫حلمى‬ ‫حلمي‬ ‫وحلمى‬ ‫وحلمي‬ ‫فحلمى‬ ‫فحلمي‬ ‫صنع‬ ‫اللى‬ ‫واللى‬ ‫فاللى‬ ‫باللى‬ ‫هو‬ ‫وهو‬ ‫فهو‬ ‫االزرق‬ ‫ازرق‬ ‫األزرق‬ ‫أزرق‬ ‫واالزرق‬ ‫وازرق‬ ‫واألزرق‬ ‫وأزرق‬ ‫فاالزرق‬ ‫فازرق‬ ‫فاألزرق‬ ‫فأزرق‬ ‫اوى‬ ‫اوي‬ ‫أوى‬ ‫أوي‬ ‫تحفة‬ ‫تحفه‬ ‫عن‬ ‫وعن‬

‫فعن‬ ‫ما‬ ‫وما‬ ‫فما‬ ‫بما‬ ‫الرواية‬ ‫رواية‬ ‫الروايه‬ ‫روايه‬ ‫والرواية‬ ‫ورواية‬ ‫والروايه‬ ‫وروايه‬ ‫فالرواية‬ ‫فرواية‬ ‫فالروايه‬ ‫فروايه‬ ‫ضحك‬ ‫الضحك‬ ‫مش‬ ‫ومش‬ ‫فمش‬ ‫كريم‬ ‫وكريم‬ ‫فكريم‬ ‫السينما‬ ‫سينما‬ ‫والسينما‬ ‫وسينما‬ ‫فالسينما‬ ‫فسينما‬ ‫الى‬ ‫الي‬ ‫إلى‬ ‫إلي‬ ‫والى‬ ‫‪84‬‬

‫فعالميه‬ ‫عبد‬ ‫وعبد‬ ‫فعبد‬ ‫العزيز‬ ‫عزيز‬ ‫كل‬ ‫وكل‬ ‫بكل‬ ‫فكل‬ ‫لكل‬ ‫هذا‬ ‫وهذا‬ ‫فهذا‬ ‫لهذا‬ ‫بهذا‬ ‫االفالم‬ ‫افالم‬ ‫األفالم‬ ‫أفالم‬ ‫واالفالم‬ ‫وافالم‬ ‫واألفالم‬ ‫وأفالم‬ ‫فاالفالم‬ ‫فافالم‬ ‫فاألفالم‬ ‫فأفالم‬ ‫لالفالم‬ ‫الفالم‬ ‫لألفالم‬ ‫ألفالم‬ ‫الصاوى‬ ‫الصاوي‬ ‫الحرب‬ ‫والحرب‬

‫وال‬ ‫المصرية‬ ‫مصرية‬ ‫المصريه‬ ‫مصريه‬ ‫والمصرية‬ ‫ومصرية‬ ‫والمصريه‬ ‫ومصريه‬ ‫فالمصرية‬ ‫فمصرية‬ ‫فالمصريه‬ ‫فمصريه‬ ‫التى‬ ‫التي‬ ‫والتى‬ ‫والتي‬ ‫فالتى‬ ‫فالتي‬ ‫بالتى‬ ‫بالتي‬ ‫فيه‬ ‫فية‬ ‫وفيه‬ ‫وفية‬ ‫العالمية‬ ‫عالمية‬ ‫العالميه‬ ‫عالميه‬ ‫والعالمية‬ ‫وعالمية‬ ‫والعالميه‬ ‫وعالميه‬ ‫فالعالمية‬ ‫فعالمية‬ ‫فالعالميه‬ ‫‪85‬‬

‫فالمشهد‬ ‫عالمة‬ ‫العالمة‬ ‫عالمه‬ ‫العالمه‬ ‫وعالمة‬ ‫والعالمة‬ ‫وعالمه‬ ‫والعالمه‬ ‫فعالمة‬ ‫فالعالمة‬ ‫فعالمه‬ ‫فالعالمه‬ ‫كده‬ ‫كدة‬ ‫كدا‬ ‫وكده‬ ‫وكدة‬ ‫وكدا‬ ‫فكده‬ ‫فكدة‬ ‫فكدا‬ ‫بكده‬ ‫بكدة‬ ‫بكدا‬ ‫بجد‬ ‫وجد‬ ‫فجد‬ ‫لم‬ ‫فلم‬ ‫ولم‬ ‫مع‬ ‫ومع‬ ‫فمع‬ ‫ال‬ ‫فال‬

‫وقصه‬ ‫فالقصة‬ ‫فقصة‬ ‫فالقصه‬ ‫فقصه‬ ‫بعد‬ ‫وبعد‬ ‫فبعد‬ ‫او‬ ‫أو‬ ‫المشاهد‬ ‫مشاهد‬ ‫والمشاهد‬ ‫ومشاهد‬ ‫فالمشاهد‬ ‫فمشاهد‬ ‫الثالثة‬ ‫الثالثه‬ ‫ثالثة‬ ‫ثالثه‬ ‫حتى‬ ‫حتي‬ ‫وحتى‬ ‫وحتي‬ ‫فحتى‬ ‫فحتي‬ ‫بشكل‬ ‫وبشكل‬ ‫فبشكل‬ ‫حامد‬ ‫زى‬ ‫زي‬ ‫وزى‬ ‫وزي‬ ‫وزى‬ ‫وزي‬

‫الكتر‬ ‫ألكتر‬ ‫ولكن‬ ‫لكن‬ ‫فلكن‬ ‫كبيرة‬ ‫كبيره‬ ‫ابتسامة‬ ‫ابتسامه‬ ‫إبتسامة‬ ‫إبتسامه‬ ‫أفضل‬ ‫افضل‬ ‫رائع‬ ‫الرائع‬ ‫غير‬ ‫وغير‬ ‫فغير‬ ‫بغير‬ ‫لغير‬ ‫الذى‬ ‫الذي‬ ‫والذى‬ ‫والذي‬ ‫بالذى‬ ‫بالذي‬ ‫فالذى‬ ‫فالذي‬ ‫الفيل_االزرق‬ ‫القصة‬ ‫قصة‬ ‫القصه‬ ‫قصه‬ ‫والقصة‬ ‫وقصة‬ ‫والقصه‬ ‫‪86‬‬

‫فالحرب‬ ‫جميل‬ ‫كانت‬ ‫وكانت‬ ‫فكانت‬ ‫لكانت‬ ‫حاجة‬ ‫حاجه‬ ‫وحاجة‬ ‫وحاجه‬ ‫فحاجة‬ ‫فحاجه‬ ‫بحاجة‬ ‫بحاجه‬ ‫لحاجة‬ ‫لحاجه‬ ‫احلى‬ ‫احلي‬ ‫أحلى‬ ‫أحلي‬ ‫خالد‬ ‫وخالد‬ ‫فخالد‬ ‫كنت‬ ‫وكنت‬ ‫فكنت‬ ‫لكنت‬ ‫يا‬ ‫اكتر‬ ‫أكتر‬ ‫واكتر‬ ‫وأكتر‬ ‫فاكتر‬ ‫فأكتر‬ ‫باكتر‬ ‫بأكتر‬

‫بكله‬ ‫بكلة‬ ‫لكله‬ ‫لكلة‬ ‫بين‬ ‫وبين‬ ‫فبين‬ ‫كتير‬ ‫وكتير‬ ‫بكتير‬ ‫النهاية‬ ‫النهايه‬ ‫نهاية‬ ‫نهايه‬ ‫والنهاية‬ ‫والنهايه‬ ‫ونهاية‬ ‫ونهايه‬ ‫فالنهاية‬ ‫فالنهايه‬ ‫فنهاية‬ ‫فنهايه‬ ‫اما‬ ‫أما‬ ‫واما‬ ‫وأما‬ ‫فاما‬ ‫فأما‬ ‫بعض‬ ‫وبعض‬ ‫فبعض‬ ‫لبعض‬ ‫ببعض‬ ‫يكون‬ ‫ويكون‬ ‫فيكون‬

‫مسخرة‬ ‫مسخره‬ ‫دى‬ ‫دي‬ ‫ودى‬ ‫ودي‬ ‫فدى‬ ‫فدي‬ ‫ديه‬ ‫وديه‬ ‫فديه‬ ‫شوفته‬ ‫شوفتة‬ ‫وشوفته‬ ‫وشوفتة‬ ‫فشوفته‬ ‫فشوفتة‬ ‫مروان‬ ‫ومروان‬ ‫فمروان‬ ‫يحيى‬ ‫يحيي‬ ‫ويحيى‬ ‫ويحيي‬ ‫فيحيى‬ ‫فيحيي‬ ‫احسن‬ ‫أحسن‬ ‫لما‬ ‫ولما‬ ‫فلما‬ ‫اى‬ ‫أى‬ ‫اي‬ ‫أي‬ ‫فاى‬

‫فأى‬ ‫فاي‬ ‫فأي‬ ‫باى‬ ‫بأى‬ ‫باي‬ ‫بأي‬ ‫واى‬ ‫وأى‬ ‫واي‬ ‫وأي‬ ‫الى‬ ‫ألى‬ ‫الي‬ ‫ألي‬ ‫مراد‬ ‫ومراد‬ ‫فمراد‬ ‫الدور‬ ‫دور‬ ‫والدور‬ ‫ودور‬ ‫فالدور‬ ‫فدور‬ ‫عشان‬ ‫علشان‬ ‫وعشان‬ ‫وعلشان‬ ‫فعشان‬ ‫فعلشان‬ ‫كله‬ ‫كلة‬ ‫فكله‬ ‫فكلة‬ ‫وكله‬ ‫وكلة‬ ‫‪87‬‬

‫فآخر‬ ‫باخر‬ ‫بأخر‬ ‫بآخر‬ ‫الخر‬ ‫ألخر‬ ‫آلخر‬ ‫االخر‬ ‫األخر‬ ‫اآلخر‬ ‫الممثلين‬ ‫ممثلين‬ ‫والممثلين‬ ‫وممثلين‬ ‫فالممثلين‬ ‫فممثلين‬ ‫الموسيقى‬ ‫الموسيقي‬ ‫موسيقى‬ ‫موسيقي‬ ‫والموسيقى‬ ‫والموسيقي‬ ‫وموسيقى‬ ‫وموسيقي‬ ‫بالموسيقى‬ ‫بالموسيقي‬ ‫بموسيقى‬ ‫بموسيقي‬ ‫تماما‬ ‫تماما‬ ‫سيتى‬ ‫شخصية‬ ‫الشخصية‬ ‫شخصيه‬ ‫الشخصيه‬ ‫وشخصية‬

‫ليكون‬ ‫بيكون‬ ‫عمل‬ ‫العمل‬ ‫وعمل‬ ‫والعمل‬ ‫فعمل‬ ‫فالعمل‬ ‫بعمل‬ ‫بالعمل‬ ‫لعمل‬ ‫اللعمل‬ ‫كبير‬ ‫ممكن‬ ‫وممكن‬ ‫فممكن‬ ‫نيللى‬ ‫ونيللى‬ ‫فنيللى‬ ‫اداء‬ ‫أداء‬ ‫واداء‬ ‫وأداء‬ ‫فاداء‬ ‫فأداء‬ ‫باداء‬ ‫بأداء‬ ‫الداء‬ ‫ألداء‬ ‫اشاهد‬ ‫أشاهد‬ ‫واشاهد‬ ‫وأشاهد‬ ‫الشاهد‬ ‫ألشاهد‬ ‫اول‬

‫أول‬ ‫واول‬ ‫وأول‬ ‫فاول‬ ‫فأول‬ ‫الول‬ ‫ألول‬ ‫باول‬ ‫بأول‬ ‫روعة‬ ‫روعه‬ ‫سالمة‬ ‫سالمه‬ ‫وسالمة‬ ‫وسالمه‬ ‫فسالمة‬ ‫فسالمه‬ ‫بسالمة‬ ‫بسالمه‬ ‫شوفت‬ ‫وشوفت‬ ‫فشوفت‬ ‫هى‬ ‫هي‬ ‫وهى‬ ‫وهي‬ ‫فهى‬ ‫فهي‬ ‫اخر‬ ‫أخر‬ ‫آخر‬ ‫واخر‬ ‫وأخر‬ ‫وآخر‬ ‫فاخر‬ ‫فأخر‬ ‫‪88‬‬

‫ألن‬ ‫والن‬ ‫وألن‬ ‫فالن‬ ‫فألن‬ ‫مرة‬ ‫مره‬ ‫ومرة‬ ‫ومره‬ ‫فمرة‬ ‫فمره‬ ‫مصرى‬ ‫مصري‬ ‫ومصرى‬ ‫ومصري‬ ‫فمصرى‬ ‫فمصري‬ ‫مول‬ ‫ومول‬ ‫فمول‬ ‫اال‬ ‫إال‬ ‫فاال‬ ‫فإال‬ ‫واال‬ ‫وإال‬ ‫االحداث‬ ‫األحداث‬ ‫واالحداث‬ ‫واألحداث‬ ‫فاالحداث‬ ‫فاألحداث‬ ‫الحرب_العالمية_الثال‬ ‫ثة‬ ‫العيد‬ ‫عيد‬

‫والشخصية‬ ‫وشخصيه‬ ‫والشخصيه‬ ‫فشخصية‬ ‫فالشخصية‬ ‫فشخصيه‬ ‫فالشخصيه‬ ‫عالمى‬ ‫عالمي‬ ‫العالمى‬ ‫العالمي‬ ‫وعالمى‬ ‫وعالمي‬ ‫والعالمى‬ ‫والعالمي‬ ‫عليه‬ ‫علية‬ ‫وعليه‬ ‫وعلية‬ ‫فعليه‬ ‫فعلية‬ ‫الزم‬ ‫والزم‬ ‫فالزم‬ ‫لو‬ ‫ولو‬ ‫فلو‬ ‫واحد‬ ‫وواحد‬ ‫فواحد‬ ‫بواحد‬ ‫لواحد‬ ‫المخرج‬ ‫مخرج‬ ‫والمخرج‬ ‫ومخرج‬

‫فالمخرج‬ ‫فمخرج‬ ‫انى‬ ‫اني‬ ‫وانى‬ ‫واني‬ ‫فانى‬ ‫فاني‬ ‫بانى‬ ‫باني‬ ‫النى‬ ‫الني‬ ‫ألنى‬ ‫ألني‬ ‫به‬ ‫بة‬ ‫وبه‬ ‫وبة‬ ‫فبه‬ ‫فبة‬ ‫جيد‬ ‫دخولى‬ ‫ودخولى‬ ‫فدخولى‬ ‫بدخولى‬ ‫لدخولى‬ ‫شريف‬ ‫وشريف‬ ‫فشريف‬ ‫فعال‬ ‫فعال‬ ‫وفعال‬ ‫وفعال‬ ‫كوميدى‬ ‫كوميدي‬ ‫الن‬ ‫‪89‬‬

‫لفقط‬ ‫قد‬ ‫وقد‬ ‫فقد‬ ‫بقد‬ ‫لقد‬ ‫اطفال‬ ‫االطفال‬ ‫أطفال‬ ‫األطفال‬ ‫لالطفال‬ ‫لألطفال‬ ‫واطفال‬ ‫واالطفال‬ ‫وأطفال‬ ‫واألطفال‬ ‫فاطفال‬ ‫فاالطفال‬ ‫فأطفال‬ ‫فاألطفال‬ ‫مبسوط‬ ‫مفيش‬ ‫ومفيش‬ ‫فمفيش‬ ‫ميرى‬ ‫ميري‬ ‫اخراج‬ ‫االخراج‬ ‫إخراج‬ ‫اإلخراج‬ ‫واخراج‬ ‫واالخراج‬ ‫وإخراج‬ ‫واإلخراج‬ ‫فاخراج‬ ‫فاالخراج‬

‫فمثل‬ ‫بمثل‬ ‫لمثل‬ ‫اكثر‬ ‫أكثر‬ ‫واكثر‬ ‫وأكثر‬ ‫فاكثر‬ ‫فأكثر‬ ‫باكثر‬ ‫بأكثر‬ ‫الكثر‬ ‫ألكثر‬ ‫الناس‬ ‫ناس‬ ‫والناس‬ ‫وناس‬ ‫فالناس‬ ‫فناس‬ ‫جوازة‬ ‫وجوازة‬ ‫فجوازة‬ ‫حد‬ ‫وحد‬ ‫فحد‬ ‫بحد‬ ‫لحد‬ ‫حيث‬ ‫وحيث‬ ‫فحيث‬ ‫بحيث‬ ‫لحيث‬ ‫صنع_فى_مصر‬ ‫فقط‬ ‫وفقط‬ ‫بفقط‬ ‫‪90‬‬

‫والعيد‬ ‫وعيد‬ ‫فالعيد‬ ‫فعيد‬ ‫صراحة‬ ‫الصراحة‬ ‫صراحه‬ ‫الصراحه‬ ‫بصراحة‬ ‫بالصراحة‬ ‫بصراحه‬ ‫بالصراحه‬ ‫وصراحة‬ ‫والصراحة‬ ‫وصراحه‬ ‫والصراحه‬ ‫دخولها‬ ‫ودخولها‬ ‫فدخولها‬ ‫بدخولها‬ ‫لدخولها‬ ‫رغم‬ ‫ورغم‬ ‫فرغم‬ ‫برغم‬ ‫طبعا‬ ‫طبعا‬ ‫وطبعا‬ ‫وطبعا‬ ‫فطبعا‬ ‫فطبعا‬ ‫قبل‬ ‫وقبل‬ ‫فقبل‬ ‫مثل‬ ‫ومثل‬

‫تالتة‬ ‫رائعة‬ ‫الرائعة‬ ‫رضا‬ ‫ورضا‬ ‫فرضا‬ ‫شكرا‬ ‫شكرا‬ ‫وشكرا‬ ‫وشكرا‬ ‫شوية‬ ‫شويه‬ ‫وشوية‬ ‫وشويه‬ ‫بشوية‬ ‫بشويه‬ ‫فشوية‬ ‫فشويه‬ ‫فكرة‬ ‫فكره‬ ‫وفكرة‬ ‫وفكره‬ ‫بفكرة‬ ‫بفكره‬ ‫فيها‬ ‫وفيها‬ ‫قوى‬ ‫قوي‬ ‫ليس‬ ‫وليس‬ ‫فليس‬ ‫مبروك‬ ‫ومبروك‬ ‫مستوى‬ ‫مستوي‬ ‫ومستوى‬

‫حلوة‬ ‫حلوه‬ ‫دخلته‬ ‫دخلتة‬ ‫ودخلته‬ ‫ودخلتة‬ ‫فدخلته‬ ‫فدخلتة‬ ‫ذلك‬ ‫وذلك‬ ‫فذلك‬ ‫بذلك‬ ‫لذلك‬ ‫عمرو‬ ‫وعمرو‬ ‫فعمرو‬ ‫عندما‬ ‫وعندما‬ ‫فعندما‬ ‫كما‬ ‫وكما‬ ‫فكما‬ ‫بكما‬ ‫لكما‬ ‫له‬ ‫لة‬ ‫وله‬ ‫ولة‬ ‫فله‬ ‫فلة‬ ‫محمد‬ ‫ومحمد‬ ‫فمحمد‬ ‫استفهام‬ ‫االستفهام‬ ‫التالتة‬ ‫‪91‬‬

‫فإخراج‬ ‫فاإلخراج‬ ‫التصوير‬ ‫تصوير‬ ‫والتصوير‬ ‫وتصوير‬ ‫فالتصوير‬ ‫فتصوير‬ ‫العفريت‬ ‫عفريت‬ ‫والعفريت‬ ‫وعفريت‬ ‫انت‬ ‫أنت‬ ‫وانت‬ ‫وأنت‬ ‫فانت‬ ‫فأنت‬ ‫انها‬ ‫أنها‬ ‫وانها‬ ‫وأنها‬ ‫فانها‬ ‫فأنها‬ ‫بانها‬ ‫بأنها‬ ‫النها‬ ‫ألنها‬ ‫تانى‬ ‫تاني‬ ‫وتانى‬ ‫وتاني‬ ‫لتانى‬ ‫لتاني‬ ‫فتانى‬ ‫فتاني‬

‫فممدوح‬ ‫هما‬ ‫وهما‬ ‫فهما‬ ‫بهما‬ ‫لهما‬ ‫ياسمين‬ ‫وياسمين‬ ‫فياسمين‬ ‫يبقى‬ ‫يبقي‬ ‫ويبقى‬ ‫ويبقي‬ ‫فيبقى‬ ‫فيبقي‬ ‫بتاع‬ ‫بتاعه‬ ‫بتوع‬ ‫بتاعة‬ ‫بتاعنا‬ ‫بتاعها‬ ‫بتاعهم‬ ‫بتاعى‬ ‫بتاعي‬ ‫وبتاع‬ ‫وبتاعه‬ ‫وبتوع‬ ‫وبتاعة‬ ‫وبتاعنا‬ ‫وبتاعها‬ ‫وبتاعهم‬ ‫وبتاعى‬ ‫وبتاعي‬

‫ومستوي‬ ‫فمستوى‬ ‫فمستوي‬ ‫يعنى‬ ‫يعني‬ ‫ويعنى‬ ‫ويعني‬ ‫فيعنى‬ ‫فيعني‬ ‫يوم‬ ‫ويوم‬ ‫بيوم‬ ‫فيوم‬ ‫اعلى‬ ‫أعلى‬ ‫اعلي‬ ‫أعلي‬ ‫واعلى‬ ‫وأعلى‬ ‫واعلي‬ ‫وأعلي‬ ‫فاعلى‬ ‫فأعلى‬ ‫فاعلي‬ ‫فأعلي‬ ‫باعلى‬ ‫بأعلى‬ ‫باعلي‬ ‫بأعلي‬ ‫العلى‬ ‫ألعلى‬ ‫العلي‬ ‫ألعلي‬ ‫انك‬ ‫أنك‬ ‫وانك‬

‫وأنك‬ ‫بانك‬ ‫بأنك‬ ‫فانك‬ ‫فأنك‬ ‫النك‬ ‫ألنك‬ ‫بداية‬ ‫البداية‬ ‫وبداية‬ ‫والبداية‬ ‫فبداية‬ ‫فالبداية‬ ‫بسبب‬ ‫وبسبب‬ ‫فبسبب‬ ‫جديد‬ ‫الجديد‬ ‫جديدة‬ ‫الجديدة‬ ‫دوره‬ ‫دورة‬ ‫فهمى‬ ‫وفهمى‬ ‫ففهمى‬ ‫اسف‬ ‫االسف‬ ‫لالسف‬ ‫ممثل‬ ‫الممثل‬ ‫وممثل‬ ‫والممثل‬ ‫فممثل‬ ‫فالممثل‬ ‫ممدوح‬ ‫وممدوح‬ ‫‪92‬‬
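The list above is corpus-based: it was drawn from the collected review texts themselves, which is why domain words such as الفيلم and المخرج appear in it alongside ordinary function words. Purely as a hedged illustration (not the thesis's exact procedure), the Python sketch below shows one common way such a list can be derived, by ranking tokens by raw frequency over the corpus and keeping the most frequent ones as candidate stopwords; the function name, the cut-off, and the toy corpus are assumptions.

    # Illustrative sketch: deriving corpus-based stopword candidates by frequency.
    from collections import Counter

    def corpus_based_stopwords(documents, top_n=50):
        """Count whitespace-separated tokens over the corpus and return the top_n most frequent ones."""
        counts = Counter()
        for doc in documents:
            counts.update(doc.split())
        return [word for word, _ in counts.most_common(top_n)]

    sample_docs = ["الفيلم ده جميل اوى", "الفيلم مش وحش بس الموسيقى احلى"]  # assumed toy corpus
    print(corpus_based_stopwords(sample_docs, top_n=3))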

Appendix C: General Egyptian Dialect Stopword List

لدة لدا بس وبس فبس اللى واللى فاللى باللى هو وهو فهو اوى اوي أوى أوي عن وعن فعن ما وما فما بما مش ومش فمش الى الي إلى إلي

‫إلن‬ ‫كان‬ ‫كأن‬ ‫وكان‬ ‫وكأن‬ ‫فكان‬ ‫فكأن‬ ‫لكان‬ ‫لكأن‬ ‫جدا‬ ‫جدا‬ ‫انا‬ ‫أنا‬ ‫وانا‬ ‫وأنا‬ ‫فانا‬ ‫فأنا‬ ‫ده‬ ‫دة‬ ‫دا‬ ‫وده‬ ‫ودة‬ ‫ودا‬ ‫فده‬ ‫فدة‬ ‫فدا‬ ‫بده‬ ‫بدة‬ ‫بدا‬ ‫لده‬ ‫‪93‬‬

‫فى‬ ‫في‬ ‫وفى‬ ‫وفي‬ ‫و‬ ‫من‬ ‫ومن‬ ‫فمن‬ ‫بمن‬ ‫لمن‬ ‫على‬ ‫علي‬ ‫وعلى‬ ‫وعلي‬ ‫فعلى‬ ‫فعلي‬ ‫ان‬ ‫أن‬ ‫إن‬ ‫وان‬ ‫وأن‬ ‫وإن‬ ‫فان‬ ‫فأن‬ ‫فإن‬ ‫بان‬ ‫بأن‬ ‫بإن‬ ‫الن‬ ‫ألن‬

‫وكانت‬ ‫فكانت‬ ‫لكانت‬ ‫كنت‬ ‫وكنت‬ ‫فكنت‬ ‫لكنت‬ ‫يا‬ ‫اكتر‬ ‫أكتر‬ ‫واكتر‬ ‫وأكتر‬ ‫فاكتر‬ ‫فأكتر‬ ‫باكتر‬ ‫بأكتر‬ ‫الكتر‬ ‫ألكتر‬ ‫لكن‬ ‫ولكن‬ ‫فلكن‬ ‫غير‬ ‫وغير‬ ‫فغير‬ ‫بغير‬ ‫لغير‬ ‫الذى‬ ‫الذي‬ ‫والذى‬ ‫والذي‬ ‫بالذى‬ ‫بالذي‬ ‫فالذى‬ ‫فالذي‬ ‫بعد‬ ‫وبعد‬

‫فكدا‬ ‫بكده‬ ‫بكدة‬ ‫بكدا‬ ‫لم‬ ‫فلم‬ ‫ولم‬ ‫مع‬ ‫ومع‬ ‫فمع‬ ‫ال‬ ‫فال‬ ‫وال‬ ‫التى‬ ‫التي‬ ‫والتى‬ ‫والتي‬ ‫فالتى‬ ‫فالتي‬ ‫بالتى‬ ‫بالتي‬ ‫فيه‬ ‫فية‬ ‫وفيه‬ ‫وفية‬ ‫كل‬ ‫وكل‬ ‫بكل‬ ‫فكل‬ ‫لكل‬ ‫هذا‬ ‫وهذا‬ ‫فهذا‬ ‫لهذا‬ ‫بهذا‬ ‫كانت‬ ‫‪94‬‬

‫والى‬ ‫والي‬ ‫وإلى‬ ‫وإلي‬ ‫فالى‬ ‫فالي‬ ‫فإلى‬ ‫فإلي‬ ‫انه‬ ‫أنه‬ ‫انة‬ ‫أنة‬ ‫وانه‬ ‫وأنه‬ ‫وانة‬ ‫وأنة‬ ‫فانه‬ ‫فأنه‬ ‫فانة‬ ‫فأنة‬ ‫بانه‬ ‫بأنه‬ ‫بانة‬ ‫بأنة‬ ‫النه‬ ‫ألنه‬ ‫النة‬ ‫ألنة‬ ‫كده‬ ‫كدة‬ ‫كدا‬ ‫وكده‬ ‫وكدة‬ ‫وكدا‬ ‫فكده‬ ‫فكدة‬

‫فبعض‬ ‫لبعض‬ ‫ببعض‬ ‫يكون‬ ‫ويكون‬ ‫فيكون‬ ‫ليكون‬ ‫بيكون‬ ‫ممكن‬ ‫وممكن‬ ‫فممكن‬ ‫اول‬ ‫أول‬ ‫واول‬ ‫وأول‬ ‫فاول‬ ‫فأول‬ ‫الول‬ ‫ألول‬ ‫باول‬ ‫بأول‬ ‫هى‬ ‫هي‬ ‫وهى‬ ‫وهي‬ ‫فهى‬ ‫فهي‬ ‫عليه‬ ‫علية‬ ‫وعليه‬ ‫وعلية‬ ‫فعليه‬ ‫فعلية‬ ‫الزم‬ ‫والزم‬ ‫فالزم‬

‫بأي‬ ‫واى‬ ‫وأى‬ ‫واي‬ ‫وأي‬ ‫الى‬ ‫ألى‬ ‫الي‬ ‫ألي‬ ‫عشان‬ ‫علشان‬ ‫وعشان‬ ‫وعلشان‬ ‫فعشان‬ ‫فعلشان‬ ‫كله‬ ‫كلة‬ ‫فكله‬ ‫فكلة‬ ‫وكله‬ ‫وكلة‬ ‫بكله‬ ‫بكلة‬ ‫لكله‬ ‫لكلة‬ ‫بين‬ ‫وبين‬ ‫فبين‬ ‫اما‬ ‫أما‬ ‫واما‬ ‫وأما‬ ‫فاما‬ ‫فأما‬ ‫بعض‬ ‫وبعض‬ ‫‪95‬‬

‫فبعد‬ ‫حتى‬ ‫حتي‬ ‫وحتى‬ ‫وحتي‬ ‫فحتى‬ ‫فحتي‬ ‫زى‬ ‫زي‬ ‫وزى‬ ‫وزي‬ ‫وزى‬ ‫وزي‬ ‫دى‬ ‫دي‬ ‫ودى‬ ‫ودي‬ ‫فدى‬ ‫فدي‬ ‫ديه‬ ‫وديه‬ ‫فديه‬ ‫لما‬ ‫ولما‬ ‫فلما‬ ‫اى‬ ‫أى‬ ‫اي‬ ‫أي‬ ‫فاى‬ ‫فأى‬ ‫فاي‬ ‫فأي‬ ‫باى‬ ‫بأى‬ ‫باي‬

‫بحيث‬ ‫لحيث‬ ‫فقط‬ ‫وفقط‬ ‫بفقط‬ ‫لفقط‬ ‫قد‬ ‫وقد‬ ‫فقد‬ ‫بقد‬ ‫لقد‬ ‫انت‬ ‫أنت‬ ‫وانت‬ ‫وأنت‬ ‫فانت‬ ‫فأنت‬ ‫انها‬ ‫أنها‬ ‫وانها‬ ‫وأنها‬ ‫فانها‬ ‫فأنها‬ ‫بانها‬ ‫بأنها‬ ‫النها‬ ‫ألنها‬ ‫تانى‬ ‫تاني‬ ‫وتانى‬ ‫وتاني‬ ‫لتانى‬ ‫لتاني‬ ‫فتانى‬ ‫فتاني‬ ‫ذلك‬

‫وألن‬ ‫فالن‬ ‫فألن‬ ‫مرة‬ ‫مره‬ ‫ومرة‬ ‫ومره‬ ‫فمرة‬ ‫فمره‬ ‫اال‬ ‫إال‬ ‫فاال‬ ‫فإال‬ ‫واال‬ ‫وإال‬ ‫قبل‬ ‫وقبل‬ ‫فقبل‬ ‫مثل‬ ‫ومثل‬ ‫فمثل‬ ‫بمثل‬ ‫لمثل‬ ‫اكثر‬ ‫أكثر‬ ‫واكثر‬ ‫وأكثر‬ ‫فاكثر‬ ‫فأكثر‬ ‫باكثر‬ ‫بأكثر‬ ‫الكثر‬ ‫ألكثر‬ ‫حيث‬ ‫وحيث‬ ‫فحيث‬ ‫‪96‬‬

‫لو‬ ‫ولو‬ ‫فلو‬ ‫واحد‬ ‫وواحد‬ ‫فواحد‬ ‫بواحد‬ ‫لواحد‬ ‫انى‬ ‫اني‬ ‫أنى‬ ‫أني‬ ‫وانى‬ ‫واني‬ ‫وأنى‬ ‫وأني‬ ‫فانى‬ ‫فاني‬ ‫فأنى‬ ‫فأني‬ ‫النى‬ ‫الني‬ ‫ألنى‬ ‫ألني‬ ‫به‬ ‫بة‬ ‫وبه‬ ‫وبة‬ ‫فبه‬ ‫فبة‬ ‫بيه‬ ‫وبيه‬ ‫فبيه‬ ‫الن‬ ‫ألن‬ ‫والن‬

‫وحد‬ ‫فحد‬ ‫بحد‬ ‫لحد‬ ‫ايه‬ ‫وايه‬ ‫فايه‬ ‫بايه‬ ‫لسه‬ ‫ولسه‬ ‫فلسه‬ ‫بتاع‬ ‫بتاعه‬ ‫بتوع‬ ‫بتاعة‬ ‫بتاعنا‬ ‫بتاعها‬ ‫بتاعهم‬ ‫بتاعى‬ ‫بتاعي‬ ‫وبتاع‬ ‫وبتاعه‬ ‫وبتوع‬ ‫وبتاعة‬ ‫وبتاعنا‬ ‫وبتاعها‬ ‫وبتاعهم‬ ‫وبتاعى‬ ‫وبتاعي‬ ‫خالص‬ ‫ليه‬ ‫وليه‬ ‫فليه‬ ‫ازاى‬ ‫إزاى‬ ‫وازاى‬

‫وأعلى‬ ‫واعلي‬ ‫وأعلي‬ ‫فاعلى‬ ‫فأعلى‬ ‫فاعلي‬ ‫فأعلي‬ ‫باعلى‬ ‫بأعلى‬ ‫باعلي‬ ‫بأعلي‬ ‫العلى‬ ‫ألعلى‬ ‫العلي‬ ‫ألعلي‬ ‫انك‬ ‫أنك‬ ‫وانك‬ ‫وأنك‬ ‫بانك‬ ‫بأنك‬ ‫فانك‬ ‫فأنك‬ ‫النك‬ ‫ألنك‬ ‫هما‬ ‫وهما‬ ‫فهما‬ ‫بهما‬ ‫لهما‬ ‫بشكل‬ ‫وبشكل‬ ‫فبشكل‬ ‫او‬ ‫أو‬ ‫حد‬ ‫‪97‬‬

‫وذلك‬ ‫فذلك‬ ‫بذلك‬ ‫لذلك‬ ‫عندما‬ ‫وعندما‬ ‫فعندما‬ ‫كما‬ ‫وكما‬ ‫فكما‬ ‫بكما‬ ‫لكما‬ ‫له‬ ‫لة‬ ‫وله‬ ‫ولة‬ ‫فله‬ ‫فلة‬ ‫شوية‬ ‫شويه‬ ‫وشوية‬ ‫وشويه‬ ‫بشوية‬ ‫بشويه‬ ‫فشوية‬ ‫فشويه‬ ‫فيها‬ ‫وفيها‬ ‫ليس‬ ‫وليس‬ ‫فليس‬ ‫اعلى‬ ‫أعلى‬ ‫اعلي‬ ‫أعلي‬ ‫واعلى‬

‫وحضرتها‬ ‫وحضرتهم‬ ‫فحضرتك‬ ‫فحضرته‬ ‫فحضرتها‬ ‫فحضرتهم‬ ‫لحضرتك‬ ‫لحضرته‬ ‫لحضرتها‬ ‫لحضرتهم‬ ‫ماكنش‬ ‫مكنش‬ ‫وماكنش‬ ‫ومكنش‬ ‫فماكنش‬ ‫فمكنش‬ ‫اومال‬ ‫زيه‬ ‫زيها‬ ‫زيك‬ ‫زيهم‬ ‫وزيه‬ ‫وزيها‬ ‫وزيك‬ ‫وزيهم‬ ‫كام‬ ‫وكام‬ ‫بكام‬ ‫لكام‬ ‫يال‬ ‫ويال‬ ‫فيال‬

‫ومالوش‬ ‫ومالهوش‬ ‫فملوش‬ ‫فملهوش‬ ‫فمالوش‬ ‫فمالهوش‬ ‫دول‬ ‫ودول‬ ‫فدول‬ ‫بدول‬ ‫لدول‬ ‫مفهوش‬ ‫مفيهوش‬ ‫مافهوش‬ ‫مافهوش‬ ‫ومفهوش‬ ‫ومفيهوش‬ ‫ومافهوش‬ ‫ومافهوش‬ ‫فمفهوش‬ ‫فمفيهوش‬ ‫فمافهوش‬ ‫فمافهوش‬ ‫بعدين‬ ‫وبعدين‬ ‫فبعدين‬ ‫لبعدين‬ ‫حضرتك‬ ‫حضرته‬ ‫حضرتها‬ ‫حضرتهم‬ ‫وحضرتك‬ ‫وحضرته‬


‫وإزاى‬ ‫فازاى‬ ‫فإزاى‬ ‫برضو‬ ‫برضه‬ ‫بردو‬ ‫برده‬ ‫وبرضو‬ ‫وبرضه‬ ‫وبردو‬ ‫وبرده‬ ‫فبرضو‬ ‫فبرضه‬ ‫فبردو‬ ‫فبرده‬ ‫اه‬ ‫واه‬ ‫فاه‬ ‫أه‬ ‫وأه‬ ‫فأه‬ ‫محدش‬ ‫ومحدش‬ ‫فمحدش‬ ‫كمان‬ ‫وكمان‬ ‫فكمان‬ ‫ملوش‬ ‫ملهوش‬ ‫مالوش‬ ‫مالهوش‬ ‫وملوش‬ ‫وملهوش‬
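Because the lists above include Egyptian-dialect forms, the thesis reports merging the dialect list with a Modern Standard Arabic stopword list before filtering the social-network corpora. The hedged Python sketch below only illustrates that combination on a toy comment; the MSA entries shown, the variable names, and the sample sentence are assumptions for illustration, not part of the thesis resources.

    # Illustrative sketch: merging an MSA stopword list with the Egyptian-dialect
    # list of Appendix C, then filtering a sample comment (assumed data).
    MSA_STOPWORDS = {"في", "من", "على", "ان", "هذا", "التي", "الذي"}      # assumed sample MSA entries
    DIALECT_STOPWORDS = {"ده", "بس", "مش", "اوى", "كده", "عشان", "اللى"}  # sample entries from the list above

    combined = MSA_STOPWORDS | DIALECT_STOPWORDS

    def filter_tokens(tokens):
        """Drop any token found in the combined stopword set."""
        return [t for t in tokens if t not in combined]

    print(filter_tokens("الفيلم ده جميل اوى بس النهاية مش حلوة".split()))
    # -> ['الفيلم', 'جميل', 'النهاية', 'حلوة']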

جامعة عين شمس - كلية الهندسة
قسم هندسة الحاسبات والنظم

التنقيب فى البيانات النصية من المواقع الإجتماعية بإستخدام تقنيات معالجة اللغات الطبيعية

الرسالة
تم تحريرها للحصول على درجة الدكتوراة فى الهندسة الكهربية (قسم الحاسبات والنظم)

المهندسه
والء محمد مدحت عسل

تحت اشراف
الأستاذ الدكتور / هدى قرشي محمد
قسم هندسة الحاسبات والنظم، كلية الهندسة، جامعة عين شمس
الدكتور / احمد حسن يوسف
قسم هندسة الحاسبات والنظم، كلية الهندسة، جامعة عين شمس

القاهرة 2014


‫ملخص الرسالة‬ ‫التنقيب فى البيانات النصية من المواقع اإلجتماعية بإستخدام تقنيات معالجة اللغات الطبيعية‬ ‫والء محمد مدحت عسل‬ ‫تم تحريرها للحصول على درجة الدكتوراة فى الهندسة الكهربية (قسم الحاسبات والنظم)‬ ‫كلية الهندسة ‪ -‬جامعة عين شمس – ‪1024‬‬ ‫أصبحت مواقع الشبكة العنكبوتية من اهم مصادر المعلومات مؤخرا النها اصبحت منصة للكتابة و‬ ‫القراءة‪ .‬الزيادة الهائلة فى مواقع شبكات التواصل االجتماعى‪ ،‬مواقع مشاركة الفيديو‪ ،‬المواقع‬ ‫االخبارية‪ ،‬مواقع المشاركات‪ ،‬مواقع المنتديات و المدونات جعلت المحتوى الذى ينتجه المستخدم فى‬ ‫صورة بيانات نصية غير منظمة تسترعى االنتباه الهميتها للعديد من االعمال‪ .‬يستخدم الشبكة‬ ‫العنكبوتية متحدثين للعديد من اللغات الطبيعية‪ .‬لم تعد مقتصرة على متحدثى اللغة االنجليزية فقط‪.‬‬ ‫أصبح التنقيب فى البيانات النصية ضروريا من أجل إستخراج المعلوملت و اكتشاف المعرفة من هذا‬ ‫الكم الهائل من المعلومات‪ .‬اننا نحتاج إلى فهم أفضل للنص من أجل التعامل الجيد مع البيانات النصية‬ ‫فتقنيات معالجة اللغات الطبيعية تساعد على ذلك فهى تساعد على الفهم الجيد للنص‪.‬‬ ‫يعد تحليل المشاعر واحد من تقنيات التنقيب فى النص‪ .‬هو الدراسة الحاسوبية الراء و مواقف و‬ ‫مشاعر الناس تجاه اشياء او احداث او موضوعات من خالل المشاركات او االخبار‪ .‬يهدف تحليل‬ ‫المشاعر الى العثور على االراء و تحديد المشاعر التى تعبر عنها ثم تصنيف قطبيتها‪.‬‬ ‫تقترح هذه الرسالة إطارا إلعداد المكانز من مواقع شبكات التواصل االجتماعي ومواقع المشاركات‬ ‫واستخدامها في عملية تحليل المشاعر‪ .‬ويتكون اإلطار من ثالث مراحل‪ .‬المرحلة األولى تبدأ‬ ‫بتجهيز البيانات التي تم جمعها وتنظيفها‪ ،‬ثم تليها عملية تحديد ميل المشاعر فى البيانات‪ .‬المرحلة‬ ‫الثانية هي تطبيق تقنيات مختلفة فى معالجة النص بما في ذلك إزالة الكلمات الهامشية ‪ ،‬استبدال‬ ‫الكلمات المنفية لتحل محلها مضاد كل كلمة منها‪ ،‬واستخدام كلمات انتقائية لإلشارة الى جزء من‬ ‫الكالم كعالمات مميزة (الصفات واألفعال)‪ .‬المرحلة الثالثة هي تصنيف النص باستخدام أسلوب‬ ‫تصنيف بايز الساذج وشجرة القرارات كما يتم استخدام كلمة واحدة أو كلمتين كأسلوب للتعرف على‬ ‫مميزات الجملة‪ .‬و قد تم تحليل مكونات هذا االطار عند كل مرحلة‪ .‬من المهم تحليل مكونات الطار‬ ‫‪100‬‬

‫الستكشاف اى سيناريو هو االفضل لكل مكنز‪ .‬و لقد عزز هذا التحليل بتطبيق مكونات االطار على‬ ‫مكنز مشاركات االفالم المستخدم كعالمة باالضافة الى المكانز المعدة من مواقع شبكات التواصل‬ ‫االجتماعى و موقع المشاركات‪.‬‬ ‫يوجد نقص فى مصادر اللغة بالنسبة للغة العربية فمعظمها تحت التطوير‪ .‬و لكى نستخدم اللغة‬ ‫العربية مع االطار فهناك بعض المصادر الضرورية كقوائم الكلمات الهامشية او شبكة الكلمات‬ ‫العربية او اإلشارة الى جزء من الكالم كعالمات مميزة للغة العربية‪ .‬تكمن المشكلة في أن قوائم‬ ‫الكلمات الهامشية المعدة مسبقا كانت على اللغة العربية الفصحى و هي ليست اللغة المعتادة‬ ‫المستخدمة في شبكات التواصل االجتماعي‪ .‬لقد قمنا بإنشاء قائمة الكلمات الهامشية باللهجة‬ ‫المصرية و أخرى معتمدة على المكنز الستخدامها مع مكانز شبكات التواصل االجتماعي‪ .‬و قمنا‬ ‫بمقارنة كفاءة تصنيف النص عند استخدام القوائم المعدة مسبقا على اللغة العربية الفصحى و بعد‬ ‫إدماج القائمة باللهجة المصرية مع القائمة باللغة العربية الفصحى‪ .‬طبق تصنيف النص باستخدام‬ ‫أسلوب تصنيف بايز الساذج وشجرة القرارات كما يتم استخدام كلمة واحدة أو كلمتين كأسلوب‬ ‫للتعرف على مميزات الجملة‪ .‬المصادر االخرى مازالت تحت التطوير‪ .‬و لقد انشئوا ليوفوا باللغة‬ ‫العربية الفصحى و ليس للهجات العربية و هى اللغة المستخدمة بواسطة مستخدمى مواقع شبكات‬ ‫التواصل االجتماعى‪.‬‬ ‫لقد طبق و اختبر االطار بكل مراحله على المكانز باللغة االنجليزية‪ .‬اما بالنسبة للغة العربية فقد تم‬ ‫تطبيق ازالة الكلمات الهامشية فقط كتقنية لمعالجة النص‪ .‬تبين التجارب أن بيانات مواقع شبكات‬ ‫التواصل االجتماعى غير متوازنة للغاية‪ .‬بينت النتائج أن تطبيق جميع تقنيات معالجة النص تحسن‬ ‫دقة تصنيف مصنف بايز الساذج كما أنها تقلل من وقت التدريب ألسلوبي التصنيف‪ .‬و لقد تم تقييم‬ ‫االداء عن طريق معيار الدقة و متوسط معيار ‪ F‬و قياس وقت التدريب‪ .‬تظهر النتائج أيضا أن‬ ‫مصنف شجرة القرارات يعطى نتائج افضل للبيانات الغير متزنة فى اللغتين‪ .‬بينت التجارب على‬ ‫مكانز اللغة العربية أن قوائم الكلمات الهامشية التي تحتوى على اللهجة المصرية تعطى أداء أفضل‬ ‫من استخدام قوائم باللغة العربية الفصحى فقط‪.‬‬

