Many people face a difficulty of writing correct spelling of a particular word especially on writing blogs, comment on social networking sites, emails etc. Incorrect ...
INTERNATIONAL JOURNAL OF INFORMATION AND COMPUTING SCIENCE
ISSN NO: 0972-1347
Evaluation and Classification of Positive and Negative Words from Noisy Data *Mohammad Shabaz, Dr. Urvashi Department of CSE, Chandigarh University
Abstract Many people face a difficulty of writing correct spelling of a particular word especially on writing blogs, comment on social networking sites, emails etc. Incorrect spellings cannot be detected easily especially when performing sentimental analysis after tokenization of sentences into words. Therefore in order to detect these incorrect words, the word tokenization on their minimum character size (MCS) is required. This paper reveals the procedure of how the word tokenization occurs so that positive and negative words can be evaluated and classified. This helps us to perform sentimental analysis on unambiguous or noisy data set. Keywords: Positive words, Sentimental Analysis, Negative words, Tokenization, Incorrect Spellings. Introduction Basic methodology to perform sentimental analysis on text data can be done using AS Sentimental approach in which the sentences or paragraphs are tokenize into words [9].The problem occur when a particular word is not correctly spelled, thus we are not able to apply any filter for classifying these incorrect spelled words into positive a nd negative repository[1]. Before classifying these incorrect or noisy data, we have to establish two repositories which consist of a Positive and Negative words. For our experimentation we have used the defined repository of AS [9]. Firstly stemming on a particular sentence is done which results in the formation of adjectives and adverbs from the sentence complete the process of tokenization of words and target those words from the repository which matches most of the characters of targeted words [2]. The process of correcting the incorrect spelled words helps us to better understand what the speaker wants to convey. In communication process it is very important to recognize the message of sender [3]. Let us discuss the applications of the Classification of Positive and Negative words from Noisy Data in different fields. 1. Pattern Recognition and Machine Learning: To acquire knowledge from Noisy data, it is required to clean and transform it into meaningful information than only we are able to recognize pattern [4]. To classify the data set into Positive and Negative words helps us to find the patterns of what sender wants to convey or communicate. To apply the filter of word tokenization helps us to recognize the patterns and helps to detect emotions/sentiments [5].
Volume 5, Issue 6, June 2018
333
http://ijics.com/
INTERNATIONAL JOURNAL OF INFORMATION AND COMPUTING SCIENCE
ISSN NO: 0972-1347
2. Gathering Information from Semi-Structured Data: Semi-Structured data such as emails or blogs are written with free English where there is lots of incorrect spelling mistakes occur. To collect the information from this data, it is required to apply this classification and separate the Positive and Negative words from it [6]. 3. Extracting Information from Research Articles: In order to remove plagiarism, some authors spell the words incorrectly, which is unethical and creates difficulty for reader to understand what the author wants to convey [7]. Our approach also helps in the same to correct the particular word after matching correct words on the basis of minimum character size (MCS). 4. Word Discovery and Repository Establishment: While matching the words, if a certain word is not matched with any of the word from defined repository, we can create a junk repository in which these unmatched words are thrown [8]. We can then manually check the validity of that particular word and put it into the desired Positive and Negative words repository. Literature Review and Related Work Mohammad Shabaz and Ashok Kumar [9]: The authors perform sentimental analysis using a novel approach of AS. They extract the data from social networking sites and tokenize them into words and compare these words with the defined repository of positive and negative words and then using association rule mining to evaluate the sentiment count in numeric form. The main limitation of this article is the validity of word. If a certain word is incorrectly spelled, the algorithm is unable to classify it and throws the word out. Mohammad Arif, Mandeep Singh and Rajdavinder Boparai [10]: The authors uses cross platform languages to detect sentiments. This paper mentioned the difficulty of incorrect spelling of words and works hard to remove all such language barriers which act as speed breakers in the smooth running of the process of finding sentiments. The main limitation of this article is Unicode matching. Different regional languages have different syntax thus to classify them into particular repository is difficult to achieve. Kathuria Anchal , Upadhyay Saurav [11]: The authors uses different approaches such as Machine Learning approach, Support Vector Machine, Lexical-based approach and classify the nature of data-set whether it is supervised or unsupervised. This article reveals that the results achieved on performing sentimental analysis are better while using Machine Learning approach. Carlo Strapparava and Rada Mihalcea [12]: The authors have found that problems faced by the students includes load of subjects, assignments, secession tests, sleeping deficiency. Later on, they performed a review on other approaches executed on different datasets like Movies, novels, elections etc. Most work has been performed to detect sentiments is on English language. The main limitation in the article is noisy nature of data-set. Wiebe, J. [13]: The author states about the English language which is universally accepted and much of the work of sentimental analysis is done using it.
Volume 5, Issue 6, June 2018
334
http://ijics.com/
INTERNATIONAL JOURNAL OF INFORMATION AND COMPUTING SCIENCE
ISSN NO: 0972-1347
Proposed work and Methodology The methodology for classifying the positive and negative words from noisy data has been discussed in the following points: 1. Get the data in any format and save it on any file (say text-file). 2. Set the Positive (p) and Negative (n) word repository. 3. Set the value of minimum character size (MCS) say mcs. 4. Tokenize the word into mcs say x. E.g “happy” becomes “hap”. 5. Start traversing in Positive (p) and Negative (n) word repository. 6. If (x matches with p) 7. Put x in Positive word repository. 8. If(x matches with n) 9. Put x in Negative word repository. 10. If (x not matches with either n or p) 11. Put x in junk file. 12. Repeat the process until end of file. The data flow diagram with input and output can be discussed in fig 1 which reveals the complete procedure for the proposed methodology.
Volume 5, Issue 6, June 2018
335
http://ijics.com/
INTERNATIONAL JOURNAL OF INFORMATION AND COMPUTING SCIENCE
ISSN NO: 0972-1347
Results The proposed methodology is implemented on r-studio to tokenize the words using “quanteda” package. The following results are generated which are shown in Table 1. S.No
Text
1 2
He is a very good boy and happey person Asifa has been brutaeally rapied and murdurered.
3 4 5
Lovea is blind Shutup you Mentally sick person. Internet is a resourceful place for information and Knowledge.
Volume 5, Issue 6, June 2018
336
Minimum character size (MCS) 3 3
Positive word 2 0
Negative word 0 3
3 4 5
1 0 1
0 2 0
http://ijics.com/
INTERNATIONAL JOURNAL OF INFORMATION AND COMPUTING SCIENCE
ISSN NO: 0972-1347
Table 1: Results obtained on Evaluation and Classification of Positive and Negative words from Noisy Text. Conclusion Incorrect spellings cannot be detected easily especially when performing sentimental analysis after tokenization of sentences into words. This article reveals the methodology of tokenization of words on the basis of minimum character size (MCS) which helps to reduce the spelling mistake barrier that we are facing while performing sentimental analysis on noisy text data. This article results in the classification of positive and Negative word from noisy data which helps us to better understand the communication between the sender and receiver and their sentiments for each other. Acknowledgement I would like to thank with whole heartedly Dr. Sanjeet Singh and Dr. S.S. Seghal for their continue support and encouragement. References [1] Pang B, Lee L,” A sentimental education: Sentiment analysis using subjectivity summarization based on minimum cuts”, Association for Computational Linguistics, Stroudsburg, PA, USA, 2004. https://doi.org/10.3115/1218955.1218990. [2] R. Piryani, D. Madhavi, and V. K. Singh, “Analytical mapping of opinion mining and sentiment analysis research during 2000–2015,” Information Processing & Management, vol. 53, no. 1, 2017, pp. 122–150. https://doi.org/10.1016/j.ipm.2016.07.001. [3] Medhat Walaa, Hassan Ahmed, Korashy Hoda, “Sentiment analysis algorithms and applications: A survey”,Ain Shams Engineering Journal, Elsevier, Vol. 5, May 2014. [4] Z. Nanli, Z. Ping, L. Weiguo and C. Meng, "Sentiment analysis: A literature review", 2012 International Symposium on Management of Technology (ISMOT), 2012. https://doi.org/10.1109/ISMOT.2012.6679538. [5] Moreo A, Romero M, Castro JL, Zurita JM. “Lexicon-based com- mentsoriented news sentiment analyzer system” Expert Syst Appl, 39:9166–80, 2012. https://doi.org/10.1016/j.eswa.2012.02.057. [6] Muhammaad Zubair, Aurangzeb Khan, Shakeel Ahmad, Fazal-MasudKundi and Asghar, 2014.A Review of Feature Extraction in Sentiment Analysis. ISSN 2090-4304 Journal of Basic and Applied Scientific Research.
Volume 5, Issue 6, June 2018
337
http://ijics.com/
INTERNATIONAL JOURNAL OF INFORMATION AND COMPUTING SCIENCE
ISSN NO: 0972-1347
[7] Walter, K., & Mihaela, V., “Sentiment analysis for hotel reviews”, computational linguisticsapplications, Jacharanka Conference, 2011. [8] Jiang L, Yu M, Zhou M, Liu X, Zhao T,” Target-dependent twitter sentment classification” Association for Computational Linguistics: Human Language Technologies-Volume 1, 2011, 151–160 [9] Mohammad Shabaz and Ashok Kumar, “AS: a novel Sentiment analysis approach”, International Journal of Engineering and Technology, vol. 2.27, pp. 46-49, 2018. [10] Mohammad Arif, Mandeep Singh and Rajdavinder Boparai, ”Emotion detection on cross platform languages ”, International Journal of Engineering and Technology, vol. 2.27, pp. 2731, 2018. [11] Kathuria Anchal , Upadhyay Saurav, ”A Novel Review of various Sentimental Analysis Techniques”, International Journal of Com-puter Science and Mobile Computing, Vol. 6 No. 4, 2017. https://doi.org/10.4304/jait.5.2.53-57. [12] Carlo Strapparava, Rada Mihalcea, “Learning to Identify Emotions in Text”, SAC’08 Fortaleza, Brazil 2008 https://doi.org/10.1145/1363686.1364052. [13] Wiebe, J., “Learning subjective adjectives from corpora”, Proceed- ings of the 17th National Conference on Artificial Intelligence (AAAI-2000), Austin, Texas, 2000
Volume 5, Issue 6, June 2018
338
http://ijics.com/