
Highlighting the Most Important Issue of Social Media in Bangladesh using Data Mining

N. Ranjan Dev1 and M. S. Hossen2

1,2 Department of Information & Communication Technology (ICT), Comilla University, Bangladesh
1 [email protected], 2 [email protected]

Abstract. In this modern era, we are surrounded by large collections of data, most of which are not necessary for us. Finding the essential information in such big datasets is a major challenge, which makes data mining vital. The importance of text mining and sentiment analysis is growing rapidly. People spend much of their time on social media sharing their thoughts and attitudes. By paying attention to social media discussion of issues such as terrorist attacks, we can be alerted to certain situations and address upcoming problems. Focusing on social media can therefore help make modern life more secure. In this paper, we extract keywords that are important in the present situation from social datasets by applying text mining and sentiment analysis techniques. We use KNIME (Konstanz Information Miner) as our simulation tool, which has an extensible, flexible, and user-friendly graphical user interface.

Keywords: Data mining tools, Social Media, Text Pre-processing, Simulation

1 Introduction and Related Work

Social media has become a part of our lives. We connect with other people by updating statuses and by uploading pictures, videos, and other details of our lives. While people spend a great deal of time on social media for socializing, communication, education, and meeting new people, there is a dark side too, because social media is also used by criminals for illegal activities. Interest in sentiment analysis, a special field of text mining, is growing day by day. As the scope of natural language processing widens, interest in sentiment analysis rises with it. There are various types of text mining studies. General text mining deals with classification problems, in which textual data is classified into predefined classes. Textual data can also be classified as subjective or objective. The goal of this kind of study is to determine whether an input sentence is an opinion or not. Such studies fall into two categories: supervised learning and unsupervised learning. Most techniques are supervised and use naïve Bayes, decision trees, and other learner and predictor algorithms. The main challenge in this field is training the machine with training data.

Section 2 discusses our simulation tool, KNIME (Konstanz Information Miner), and its system workflow. Section 3 describes our proposed model. Results and discussion are presented in Section 4. Finally, Section 5 gives concluding remarks and the future work of this research.

2 Simulation Tool

Various data mining tools are available. Among these, we compare three of the most popular, namely Rapid Miner, Weka, and KNIME (Konstanz Information Miner), to show why we chose KNIME as our simulation tool. Section 2.1 briefly discusses the reasons for this choice.

2.1 Reason for Choosing KNIME

Since Rapid Miner is proprietary software, most researchers cannot use it easily. It also does not concentrate on deep learning and several machine-learning algorithms. Weka, on the other hand, is very popular, but it has some disadvantages too. Firstly, it is slow for more resource-demanding implementations. Secondly, it supports a large number of procedures but falls short in its graphical user interface. Finally, it is not user-friendly for clustering and descriptive statistics methods. KNIME, in contrast, is flexible and provides an extensive extension repository with a user-friendly graphical interface.

2.2 System Workflow

Thiel and Berthold (2012) explain that text mining with KNIME follows six suggested steps:

Figure 1. Work Flow

2.2.1 IO

First, the text documents have to be parsed. The text and the structure, if available, have to be extracted and represented in a data structure that text-processing nodes can handle easily. The KNIME data types DocumentCell and DocumentValue represent a document internally.

2.2.2 Enrichment

In the enrichment step, semantic information is added by named entity recognition and tagging. The Enrichment category contains nodes that assign part-of-speech tags and recognize standard named entities. Examples are names of persons, organizations, or locations, and biomedical named entities such as names of genes or proteins and chemical structures. All enrichment nodes require an input data table containing exactly one column of DocumentCells. The output data table also consists of one column of DocumentCells, in which the recognized entities in the documents have been tagged.

2.2.3 Preprocessing

In the pre-processing stage, several pre-processing nodes are used. They process the DocumentCells and, if required, append the processed DocumentCells to the original DocumentCells.

2.2.4 Frequencies

After pre-processing is finished, the frequencies of terms in each document and in the complete corpus can be computed.

2.2.5 Transformation

Before regular KNIME nodes can be applied to the documents, the textual data has to be transformed into numerical data. This can be done by the "Document vector" or "Term vector" node. These nodes create a binary or numerical vector representation for each term or document, based on the filtered bag-of-words input data table. The vector representation can then be used by standard data mining methods.

2.2.6 Visualization

KNIME supports several output formats, e.g., plot, graph, chart, output document, and tag cloud.
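The frequency and transformation steps can be illustrated outside KNIME with a minimal Python sketch. It is not the KNIME implementation: the function names `term_frequencies` and `document_vectors` and the two sample documents are assumptions made for illustration, and documents are assumed to be already tokenized and filtered.

```python
from collections import Counter


def term_frequencies(docs):
    """Return one Counter of term counts per tokenized document."""
    return [Counter(tokens) for tokens in docs]


def document_vectors(docs, binary=False):
    """Build one vector per document over the sorted corpus vocabulary.

    With binary=True each entry is 1/0 (term present or absent);
    otherwise each entry is the raw count of the term in the document.
    """
    vocab = sorted({term for tokens in docs for term in tokens})
    vectors = []
    for counts in term_frequencies(docs):
        if binary:
            vectors.append([1 if counts[term] > 0 else 0 for term in vocab])
        else:
            vectors.append([counts[term] for term in vocab])
    return vocab, vectors


# Hypothetical, already pre-processed documents (bags of words).
docs = [
    ["bangladesh", "india", "agreement", "bangladesh"],
    ["rohingya", "refugee", "bangladesh"],
]
vocab, vectors = document_vectors(docs)
_, binary_vectors = document_vectors(docs, binary=True)
```

Each row of `vectors` has one entry per vocabulary term, so the rows can be passed to any standard data mining method that expects numerical input.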

3 Description of the Proposed Model

The intention behind this study is to build a model that collects social data from a social media site with the permission of the site's authority, processes the data using text mining techniques, and produces an image as output showing the important keywords.


Figure 2. Proposed Model

3.1 Social Data Collection

First, we collect social data from a social networking site. Here we use Twitter, and therefore the Twitter API Connector node. To configure it, we fill in the following fields: API key, API secret, Access token, and Access token secret. We then gather data by searching Twitter with an appropriate query, so the data collected depends on that query. At this stage, we also need information about the users whose tweets the query returned; the node that retrieves it has to be connected to both the Twitter API Connector and Twitter Search nodes. After gathering the data, we store it locally using the Table Writer node, for which we specify the location where the new table file will be stored.

3.2 Data input

The stored file is then parsed by KNIME nodes. The Table Reader node parses the data file, and the Column Filter node removes unwanted columns from it.

3.3 Preprocessing

Next, we clean the gathered data using several preprocessing nodes. The POS Tagger node attaches a part-of-speech tag to each term. The N Chars Filter removes words of unwanted length. The Number Filter removes all unwanted numbers from the documents. The Case Converter converts all characters from upper case to lower case or vice versa. The Stop Word Filter removes the stop words specified in the node. The Porter Stemmer reduces words, such as inflected verbs, to their base form. The Tag Filter filters terms according to the tags assigned to them. This is how we obtain a clean bag of words, which is essential for a good result.
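As a rough illustration of this filter chain, the following Python sketch applies the same kinds of steps in the same order to a single sentence. It is only an approximation under stated assumptions: the stop word list is a hypothetical minimal one, `naive_stem` is a crude suffix stripper standing in for the Porter algorithm, and POS tagging and tag filtering are omitted because they need a trained tagger.

```python
import re

# Hypothetical, deliberately tiny stop word list for illustration only.
STOP_WORDS = {"the", "is", "in", "of", "and", "a", "to", "are"}


def naive_stem(word):
    """Strip a few common suffixes. A stand-in, NOT the Porter algorithm."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text, min_chars=3):
    """Return a cleaned bag of words for one document."""
    tokens = re.findall(r"[A-Za-z0-9@#]+", text)
    tokens = [t for t in tokens if len(t) >= min_chars]   # N Chars Filter
    tokens = [t for t in tokens if not t.isdigit()]       # Number Filter
    tokens = [t.lower() for t in tokens]                  # Case Converter
    tokens = [t for t in tokens if t not in STOP_WORDS]   # Stop Word Filter
    return [naive_stem(t) for t in tokens]                # stemming stand-in


bag = preprocess("People are talking about 2017 camps in Bangladesh")
```

A real pipeline would replace `naive_stem` with a proper Porter stemmer implementation and use a full stop word list for the language of the tweets.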

3.4 Output

Here we use the Keygraph keyword extractor to extract keywords from the documents and assign each a numerical value relative to its document. The Color Manager node specifies one color for the highest numerical value and one for the lowest; all other terms are assigned a color between these two. The Tag Cloud node shows a limited number of terms (as configured) in the output, each in its assigned color. It builds the cloud by taking the tags with the highest numerical values first.
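The coloring and selection logic described above can be sketched in Python as follows. This is a sketch under assumptions, not the behavior of the KNIME nodes: colors are interpolated linearly in RGB, font size is scaled linearly as well, the value range is taken over the displayed terms only, and the scores, colors, and size limits are made-up illustration values.

```python
def interpolate_color(low_rgb, high_rgb, fraction):
    """Linearly blend two RGB colors; fraction=0 gives low, 1 gives high."""
    return tuple(
        round(lo + (hi - lo) * fraction) for lo, hi in zip(low_rgb, high_rgb)
    )


def build_tag_cloud(scores, max_terms=3, low_rgb=(0, 0, 255),
                    high_rgb=(255, 0, 0), min_font=10, max_font=40):
    """Pick the top-scoring terms and assign each a color and font size."""
    top = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:max_terms]
    if not top:
        return []
    values = [value for _, value in top]
    lowest, highest = min(values), max(values)
    span = (highest - lowest) or 1.0  # avoid division by zero for equal scores
    cloud = []
    for term, value in top:
        fraction = (value - lowest) / span
        cloud.append({
            "term": term,
            "color": interpolate_color(low_rgb, high_rgb, fraction),
            "font_size": round(min_font + (max_font - min_font) * fraction),
        })
    return cloud


# Hypothetical keyword scores, not taken from the paper's results.
scores = {"bangladesh": 1.0, "india": 0.6, "rohingya": 0.5, "camp": 0.2}
cloud = build_tag_cloud(scores)
```

With `max_terms=3`, the lowest-scoring term is dropped, and the remaining terms run from the low color and smallest font up to the high color and largest font.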

4 Result and Discussion

Our goal was to find the buzzwords that people are talking about on social media. As output, we get an image that highlights buzzwords in order of priority. Variation in the color and font size of the words expresses their importance and their frequency of occurrence in the documents. Anyone who examines the output image can tell which words are most active on social media at that time for a specific subject.

In the output image, the keyword "bangladesh" has a bigger font size than any other keyword. This is because we collected tweets using the query "bangladesh", so our model shows it as the most important keyword. The keyword india has the next largest font size. As Bangladesh and India have signed several bilateral agreements, it is understandable that many people are talking about India in relation to Bangladesh.

Some keywords, such as rohingya, refuge, genocide, and muslim, also have a larger font size. People all over the world are talking about the genocide occurring in Myanmar and are requesting the Government of Bangladesh to accommodate the Rohingya refugees and provide for their basic living needs. As most of them express their sentiments and comments on social media, our model identified these keywords as important buzzwords relating to Bangladesh and highlights them in the output image with a larger font size.

Sajeeb Wazed, son of the current Prime Minister of Bangladesh, Sheikh Hasina, serves as advisor on information and communication technology to the Government of Bangladesh. He is very active on various social media; as a result, his name appears in the output in a larger font. The ICT Division of the Government of Bangladesh is currently arranging various camps to spread ICT knowledge among the people in order to build a digital Bangladesh, and it circulates information about these camps on social media. As a result, @ictdivision, camp, and digit are also highlighted here.

Figure 3. Simulation Output

5 Conclusion and Future Works

In this paper, we have proposed a model to highlight the current issues on social media in Bangladesh. After implementing the proposed model in KNIME (Konstanz Information Miner) as a simulation tool, we observed a number of keywords with higher importance. This is just the beginning of our research work. In the near future, we would like to use social media for social security related purposes. To do this, we will design a model that identifies positive words, negative words, abusive words, harassing words, offensive words, and words that incite terrorism in our collected social data, and that identifies people who are suspicious.

References

Saravanapriya K. (2014). A Study on Free Open Source Data mining Tools, International Journal Of Engineering And Computer Science, 3, 9450-9452.
Christa S., Madhuri K. L., Suma V. (2012). A Comparative Analysis of Data Mining Tools in Agent Based Systems, International Conference on Systemics, Cybernetics and Informatics, Hyderabad.
Chauhan N., Gautam N. (2015). Comparison of Data Mining Tools, International Journal of Advanced Technology in Engineering and Science, 3, 291-298.
Ohsawa, Benson, Yachida (1998). KeyGraph: Automatic Indexing By Co-occurrence Graph Based on Building Construction Metaphor, IEEE International Forum on Research and Technology Advances in Digital Libraries, 12-18.
Matsuo, Ishizuka (2004). Keyword Extraction From a Single Document using Word Co-occurrence Statistical Information, International Journal on Artificial Intelligence Tools, 13, 157-169.
Berthold M. R., Cebron N., Dill F., Gabriel T. R., Tobi. KNIME – The Konstanz Information Miner Version 2.0, Nycomed Chair for Bioinformatics and Information Mining, 11, 26-31.
Thiel K., Berthold M. (2012). Technical Report - The KNIME Text Processing Feature: An Introduction.