Characterizing User-Generated Text Content Mining: a Systematic Mapping Study of the Portuguese Language

Ellen Souza¹, Dayvid Castro¹, Douglas Vitório¹, Ingryd Teles¹, Adriano L. I. Oliveira², Cristine Gusmão³

¹ MiningBR Research Group, Federal Rural University of Pernambuco (UFRPE), Serra Talhada, PE, Brazil
[email protected], [email protected], [email protected], [email protected]
² Centro de Informática, Federal University of Pernambuco (CIn-UFPE), Recife, PE, Brazil
{eprs,alio}@cin.ufpe.br
³ Programa de Pós-graduação em Engenharia Biomédica, Centro de Tecnologia e Geociências, Federal University of Pernambuco (CTG-UFPE), Recife, PE, Brazil
[email protected]

Abstract. Unstructured data accounts for more than 80% of enterprise data and is growing at an annual exponential rate of 60%. Text mining refers to the process of discovering new, previously unknown and potentially useful information from a variety of unstructured data, including user-generated text content (UGTC). Given that Portuguese is one of the most common languages in the world and the second most frequent language on Twitter, the goal of this work is to plot the landscape of current studies that relate the application of text mining to UGTC in the Portuguese language. The systematic mapping review method was applied to search for, select, and extract data from the included studies. Our manual and automated searches retrieved 6075 studies published up to 2014, of which 35 were included in the study. Text classification accounts for 79% of all text mining tasks, with Naïve Bayes as the main classifier and Twitter as the main data source.

Keywords: Text Mining, Text Classification, Opinion Mining, User-Generated Content, Portuguese Language.

© Springer International Publishing Switzerland 2016. Á. Rocha et al. (eds.), New Advances in Information Systems and Technologies, Advances in Intelligent Systems and Computing 444, DOI 10.1007/978-3-319-31232-3_96

1 Introduction

The growth of social media and user-generated content (UGC) on the Internet provides a huge quantity of information that allows discovering the experiences, opinions, and feelings of users or customers [1]. The volume of data generated in social media has grown from terabytes to petabytes. According to [2], about 80% of corporate data is stored in unstructured form, mainly as text, and is growing at an annual exponential rate of 60%. However, unstructured texts cannot be simply processed by machines, which typically handle text as simple sequences of character strings. Specific processing methods, techniques and algorithms are required in order to extract knowledge from text [3].

Text mining, or knowledge discovery from text (KDT), was first mentioned in 1995 by Feldman and Dagan as a machine-supported analysis of text. It is the process of extracting knowledge from a large amount of unstructured data, and it is also defined as an extension of data mining. However, in contrast to data mining, text mining focuses on the extraction of knowledge from a large number of documents written in natural language from various data sources, including UGC.

According to the Organization for Economic Co-operation and Development, User-Generated or User-Created Content is defined as: i) content made publicly available over the internet, ii) which reflects a certain amount of creative effort, and iii) which is created outside of professional routines and practices. Types of UGC are: text, novel and poetry; photo and images; music and audio; video and film; citizen journalism; educational content; mobile and virtual content. In this article, we focus on texts that are generated by users, that is, user-generated text content (UGTC).

Whereas data mining is largely language independent, text mining involves a significant language component, which justifies studying it in association with one target language. Most text mining tools focus on processing English documents [4], but many other languages, including Spanish and Portuguese, have also been considered. Portuguese is among the most spoken languages in the world, with almost 270 million people¹ speaking some variant of the language, and research interest in Portuguese processing is shared mainly between Portugal and Brazil [5]. The growing interest in Portuguese is also related to the fact that it is the second most used language on Twitter, which is one of the main sources of UGTC [6].
Thus, combining the guidelines for performing Systematic Mapping [7] and Systematic Review studies [8], the goal of this article is to characterize current research that reports the use of text mining for UGTC in the Portuguese language, driven by the following general research question (RQ): What is the current state of text mining in the Portuguese language for UGTC?

The automated and manual search procedures retrieved 6075 papers published up to the year 2014, of which 35 were included in this study. The data² extracted from the primary studies were systematically structured and analyzed to answer the historical, descriptive and classificatory research questions presented below:

RQ1: what is the evolution in the number of publications up to year 2014?
RQ2: which individuals, organizations, and countries are the main contributors in the research area?
RQ3: what are the adopted text mining tasks?
RQ4: what are the techniques, algorithms, methods and tools applied?
RQ5: what are the characteristics of UGTC data sources and how were they evaluated?

The remainder of this article is structured as follows: Section 2 provides the related work. Section 3 details the systematic mapping study protocol. In Section 4, a comprehensive set of results is presented. Section 5 discusses the results, limitations and threats to validity. Finally, Section 6 contains the conclusions and directions for future work. Due to lack of space, the list of primary studies was not included in this article; it is available online².

¹ Brazil (202.656.788), Mozambique (24.692.144), Angola (24.300.000), Portugal (10.813.834), Guinea-Bissau (1.693.398), East Timor (1.201.542), Equatorial Guinea (722.254), Macau (587.914), Cabo Verde (538.535) and São Tomé e Príncipe (190.428). Data extracted from US/CIA - The World Factbook (July, 2014).
² Data is available at: http://bit.ly/1MX58hY

2 Related Work

Although we made an extensive search, we did not find any systematic mapping of text mining for the Portuguese language, and more specifically for UGTC. However, we found several language-independent text mining surveys [2, 4, 9], a paper [5] describing the computational linguistics area in Brazil, a survey [10] of automatic term extraction for Brazilian Portuguese, and a systematic review [11] of user-generated content (UGC) applied to tourism and hospitality.

In [5], an overview of computational linguistics or natural language processing (CL/NLP) in Brazil is presented. According to the authors, research in Brazil is varied and deals mainly with Portuguese, English and Spanish processing. They also state that research on text mining is mostly carried out not by computational linguistics researchers but by researchers from the general artificial intelligence and database areas. They estimate that Brazil has about 250 researchers in the CL/NLP area. The largest CL/NLP research group in Brazil is the Interinstitutional Center for Research and Development in Computational Linguistics (NILC), which includes researchers mainly from the University of São Paulo, the Federal University of São Carlos and the State University of São Paulo. The authors also state that the Brazilian Symposium on Information and Human Language Technology (STIL) is the main event in South America and that the International Conference on Computational Processing of Portuguese Language (PROPOR) is the main conference focused on the Portuguese language, giving equal space to research on text and speech processing.

In [10], a survey of the state of the art in automatic term extraction (ATE) for the Brazilian Portuguese language is presented. According to the authors, there are still several gaps to be filled, for instance, the lack of consensus regarding the formal definition of the meaning of ‘term’. Such gaps are larger for Brazilian Portuguese than for other languages, such as English, Spanish, and French. Examples of gaps for Brazilian Portuguese include the lack of a baseline ATE system and of the use of more sophisticated linguistic information, such as the WordNet and Wikipedia knowledge bases.

In [11], a systematic review was conducted to examine how UGC data have been used in empirical tourism and hospitality research. A total of 122 articles were systematically surveyed. The main sources of UGC data are consumer review websites and blogs. Twitter was classified as a blog. Texts were the dominant UGC data type.

3 Review Method

Secondary studies review all the primary studies relating to a specific research question with the aim of integrating and synthesizing evidence related to a specific subject [8]. The systematic mapping study, also referred to as a scoping study, provides a structure of the type of research reports and results that have been published by categorizing them. It often gives a visual summary, the map, of its results [7].

Fig. 1 shows the adopted systematic mapping process. The first step comprises the definition of the research protocol. The second, third, and fourth steps encompass the identification, selection, and evaluation of primary studies in accordance with the inclusion and exclusion criteria established in the review protocol. In the fifth step, data from the included studies is extracted and synthesized in order to answer the research questions.

We searched the literature for full papers (primary studies) that reported text mining applications for UGTC in the Portuguese language. Primary studies that met at least one of the following exclusion criteria were removed from the study:
(i) written in a language other than English or Portuguese;
(ii) not available in online scientific libraries;
(iii) keynote speeches, workshop reports, books, theses, and dissertations;

Fig. 1 Systematic Mapping Process based on [7]

3.3. Data Sources and Search Strategy

Automated and manual search processes were combined to achieve high coverage. The automated search was constructed from two search terms extracted from the general research question presented in Section 1 (see Fig. 2). Synonyms for both terms were extracted from the literature and, as we were also looking for primary studies written in Portuguese, the Portuguese translations of the terms were included in the final query. This search retrieved studies on all kinds of text sources, from which we selected only those generated by users, that is, UGTC.

Fig. 2 Generic Search String
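The construction described above (two search terms, each expanded with synonyms and Portuguese translations) can be sketched as a small boolean-query builder. This is only an illustration: the function name `build_query` and the synonym lists are hypothetical placeholders, and the string actually used in the study is the one in Fig. 2, which is not reproduced in the text.

```python
def build_query(term_groups):
    """Join synonyms with OR inside each group and join the groups with AND."""
    clauses = []
    for synonyms in term_groups:
        quoted = ['"{}"'.format(s) for s in synonyms]
        clauses.append("(" + " OR ".join(quoted) + ")")
    return " AND ".join(clauses)


# Hypothetical synonym lists, for illustration only (not the study's terms)
TEXT_MINING_TERMS = ["text mining", "mineração de texto"]
LANGUAGE_TERMS = ["Portuguese", "português"]

query = build_query([TEXT_MINING_TERMS, LANGUAGE_TERMS])
print(query)
```

Keeping each concept in its own list makes it easy to add a synonym or a translation without touching the boolean structure, which may help when the same query has to be adapted to the differing search rules of several digital libraries.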

Primary studies published up to year 2014 were analyzed using the same procedure for both search strategies. Six researchers divided into three groups applied the inclusion and exclusion criteria to all retrieved papers after reading the title, abstract and keywords. For the 661 potentially relevant studies, the researchers reapplied the inclusion and exclusion criteria after reading the full paper. This resulted in a list of 203 studies, of which 35 relate to the use of text mining for UGTC in Portuguese. Table 1 contains the details of the manual (M) and automated (A) data sources.

Table 1. Manual and Automated Data Sources

Data Source | Type | Retrieved Studies | Included Studies | UGTC
International Conference on the Computational Processing of Portuguese (PROPOR) | M | 217 | 22 | 1
Text Mining and Applications (TEMA) track of Portuguese Conference on Artificial Intelligence | M | 34 | 6 | 1
Brazilian Workshop of Social Network Analysis and Mining (BRASNAM) | M | 99 | 11 | 8
Brazilian Symposium on Information and Human Language Technology (STIL) | M | 251 | 44 | 3
ACM symposium on Document engineering (DocEng) | M | 273 | |
Linguateca Database (www.linguateca.pt) | M | 1312 | |
Message Understanding Conferences (MUC) | M | 159 | |
Text Analysis Conference (TAC) | M | 322 | |
Text REtrieval Conference (TREC) | M | 1715 | |
Document Understanding Conference (DUC) | M | 167 | |
IEEE Xplore Digital Library | A | 306 | |
ACM Digital Library | A | 277 | |
Science Direct | A | 159 | |
Scopus | A | 552 | |
Portal de Periódicos Capes | A | 229 | |
SciELO Scientific Electronic Library Online | A | 2 | |
TOTAL | | 6075 | 203 | 35

Values for the remaining sources (DocEng to SciELO), in row order with empty cells omitted: Included Studies 1, 30, 19, 29, 4, 21, 15, 1; UGTC 1, 6, 11, 1, 2, 1.
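The selection funnel reported above (6075 retrieved, 661 potentially relevant, 203 kept after full-text reading, 35 on UGTC in Portuguese) can be turned into retention rates with a few lines of arithmetic. The sketch below uses only the counts stated in the text; the stage labels and the function name `retention` are illustrative.

```python
# Counts reported in the text for each selection stage
FUNNEL = [
    ("retrieved by manual and automated search", 6075),
    ("potentially relevant after title/abstract/keywords", 661),
    ("kept after full-text reading", 203),
    ("about UGTC in Portuguese", 35),
]


def retention(funnel):
    """Return (stage, count, share of first stage, share of previous stage)."""
    rows = []
    total = funnel[0][1]
    previous = total
    for stage, count in funnel:
        rows.append((stage, count, count / total, count / previous))
        previous = count
    return rows


for stage, count, of_total, of_previous in retention(FUNNEL):
    print(f"{count:>5}  {of_total:7.2%} of retrieved  {of_previous:7.2%} of previous  {stage}")
```

Read this way, roughly one retrieved paper in nine survived the title, abstract and keyword screening, and well under 1% of the retrieved papers ended up in the final set.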

4 Results

In this section, we present the main findings of our review, organized according to the five specific research questions.

4.1 RQ1: what is the evolution in the number of publications up to year 2014?

As shown in Fig. 3, the first three primary studies were published in 2009, and the number of studies has grown over the years, despite the drop in 2010 and 2011. Primary studies were classified according to the Portuguese language variant: European Portuguese (from Portugal) represents 6%, while Brazilian Portuguese comprises 77% of all studies; 6% use text written in both Brazilian and European Portuguese, and 11% did not provide the variant information. Table 2 lists the Portuguese variant of the dataset used in each primary study.

Fig. 3 Temporal distribution of primary studies

Table 2. List of primary studies according to the Portuguese language variant

Variant | Primary Studies
Both | UGC09, UGC10
Brazilian | UGC01, UGC02, UGC03, UGC05, UGC07, UGC08, UGC11, UGC12, UGC13, UGC14, UGC15, UGC16, UGC17, UGC18, UGC20, UGC21, UGC23, UGC24, UGC25, UGC26, UGC27, UGC28, UGC29, UGC30, UGC31, UGC33, UGC34
European | UGC06, UGC32
N/A | UGC04, UGC19, UGC22, UGC35
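The variant percentages quoted for RQ1 follow directly from the number of study identifiers in each row of Table 2. A minimal check, with the counts taken from the table and the helper name `shares` chosen for illustration:

```python
# Number of primary studies per variant, counted from the rows of Table 2
VARIANT_COUNTS = {"Both": 2, "Brazilian": 27, "European": 2, "N/A": 4}


def shares(counts):
    """Return each category's share of the total as a rounded percentage."""
    total = sum(counts.values())
    return {name: round(100 * n / total) for name, n in counts.items()}


print(sum(VARIANT_COUNTS.values()))
print(shares(VARIANT_COUNTS))
```

The four rows add up to the 35 included studies, and the rounded shares match the 6%, 77%, 6% and 11% given in the text.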

4.2 RQ2: which individuals, organizations, and countries are the main contributors in the research area?

As expected from Fig. 3, Brazil has a greater number of researchers in the field. Renata Vieira from UNISINOS (Table 3) and UFMG (Table 4) appear as the main author and the main organization, respectively. In addition to Brazil (BR) and Portugal (PT), research interest in Portuguese processing is shared with other countries such as the USA and Canada, as some primary studies (UGC04, UGC05, UGC10, UGC12, UGC14, UGC24, UGC26, UGC35) propose multilanguage approaches.

Table 3. Number of articles published by main researchers

Quant. | Author | Institution
6 | Renata Vieira | UNISINOS-BR
4 | Wagner Meira Jr. | UFMG-BR
4 | Marlo Souza | UFRGS-BR
3 | Karin Becker | UFRGS-BR
3 | Larissa A. Freitas | PUCRS-BR
3 | Eugénio de Oliveira | Univer. of Porto-PT
3 | Adriano Veloso | UFMG-BR
3 | Luís Sarmento | Univer. of Porto-PT

Table 4. Number of researchers per organization

Quant. | Organization | Country
23 | UFMG | Brazil
15 | PUCRS | Brazil
11 | UFRJ | Brazil
9 | UP | Portugal
8 | UFRGS | Brazil
8 | Ulisboa | Portugal
8 | USP | Brazil

4.3 RQ3: what are the adopted text mining tasks?

Four primary studies each performed two different text mining tasks (e.g. classification and information extraction), resulting in 39 text mining task occurrences (Table 5). Text Classification appears as the main task for UGTC in the Portuguese language. Three primary studies (UGC02, UGC23, and UGC27) reported the use of balanced classes, while eleven (UGC01, UGC05, UGC06, UGC13, UGC17, UGC18, UGC21, UGC22, UGC28, UGC33, UGC35) used unbalanced classes. The Opinion Mining subtask, also known as Sentiment Analysis, represents 62% of all tasks. Two primary studies (UGC11, UGC12) also evaluated the variation of sentiment or opinion over time, also known as Sentiment Drift. Eighteen papers reported the use of a lexical resource to perform the sentiment analysis. The main lexical resources used were SentiLex-PT, SentiWordNet, OpLexicon and Sentimeter-BR.

Table 5. Text Mining tasks and subtasks

Task | % | Subtask | % | Primary Studies
Classification | 79 | Language Identification | 6.5 | UGC04, UGC10
Classification | 79 | Opinion Mining | 74 | UGC01, UGC02, UGC03, UGC06, UGC11, UGC12, UGC13, UGC15, UGC17, UGC18, UGC19, UGC21, UGC22, UGC25, UGC26, UGC27, UGC28, UGC29, UGC30, UGC32, UGC33, UGC34, UGC35
Classification | 79 | Others | 19.5 | UGC05, UGC08, UGC09, UGC14, UGC16, UGC23
Information Extraction | 13 | - | - | UGC07, UGC14, UGC15, UGC20, UGC24
Summarization | 2 | - | - | UGC31
Topic Tracking | 3 | - | - | UGC09
Visual Text Mining | 3 | - | - | UGC31
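To illustrate how a lexical resource can drive opinion mining, the following minimal sketch sums word polarities from a lexicon and maps the total to a class. It is not the method of any primary study: the lexicon entries are made up for illustration and are not taken from SentiLex-PT, SentiWordNet, OpLexicon or Sentimeter-BR, and the function names are hypothetical.

```python
import re

# Toy polarity lexicon: made-up entries for illustration, NOT taken from
# SentiLex-PT, SentiWordNet, OpLexicon or Sentimeter-BR.
TOY_LEXICON = {
    "bom": 1, "ótimo": 1, "adorei": 1,
    "ruim": -1, "péssimo": -1, "odiei": -1,
}


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def polarity(text, lexicon=TOY_LEXICON):
    """Sum the polarities of known words and map the total to a class label."""
    score = sum(lexicon.get(token, 0) for token in tokenize(text))
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


print(polarity("Adorei o filme, muito bom!"))
print(polarity("Atendimento péssimo"))
```

A word-counting approach like this ignores negation, ambiguity and irony, so it should be read as a baseline sketch rather than a usable sentiment analyzer.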

4.4 RQ4: what are the techniques, algorithms, methods and tools applied?

69% of all primary studies performed at least one type of Natural Language Processing (NLP) (see Table 6). The main tools used for text preprocessing and NLP were the Python NLTK, LingPipe and Freeling. Two primary studies reported the use of TreeTagger-PT for Part-Of-Speech (POS) tagging. For Named Entity Recognition (NER), the CRF tagger, FS-NER and GeoNames were adopted. Table 7 presents the algorithms and methods used in the text analysis step. Naïve Bayes and Weka appear as the most used classifier and the most used tool, respectively. Python and Java were the most used programming languages in this step.

Table 6. List of adopted pre-processing techniques used in primary studies

% | Applied | Primary Studies
69 | Stopword Removal | UGC01, UGC03, UGC09, UGC15, UGC16, UGC18, UGC21, UGC25, UGC29, UGC33
69 | Filtering | UGC01, UGC04, UGC09, UGC14, UGC16, UGC18, UGC25, UGC28
69 | Stemming | UGC01, UGC03, UGC09, UGC13, UGC18, UGC25, UGC33
69 | POS | UGC15, UGC19, UGC20, UGC25, UGC31, UGC32, UGC35
69 | NER | UGC06, UGC10, UGC13, UGC14, UGC24, UGC28
69 | Tokenization | UGC04, UGC10, UGC19, UGC31
69 | Sentence Splitter | UGC19, UGC22, UGC28, UGC31
69 | Lemmatization | UGC19, UGC20, UGC25, UGC35
69 | Chunk | UGC31
31 | N/A | UGC05, UGC07, UGC08, UGC11, UGC12, UGC17, UGC23, UGC26, UGC27, UGC30, UGC34
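The most frequent pre-processing steps in Table 6 (filtering, tokenization and stopword removal) can be sketched with the Python standard library alone. Several assumptions apply: "filtering" is interpreted here as removing URLs and @mentions, which the text does not specify; the stopword list is a tiny illustrative sample (a real one, such as the Portuguese list shipped with NLTK, is much longer); and stemming is left out because it needs a language-specific stemmer.

```python
import re

# Tiny illustrative stopword list; a real Portuguese list is much longer.
STOPWORDS = {"a", "o", "de", "do", "da", "e", "que", "em", "um", "uma", "no", "na"}

# Assumption: "filtering" means dropping URLs and @mentions from the message.
URL_OR_MENTION = re.compile(r"https?://\S+|@\w+")


def preprocess(text):
    """Filter URLs and @mentions, tokenize, and drop stopwords."""
    cleaned = URL_OR_MENTION.sub(" ", text.lower())
    tokens = re.findall(r"\w+", cleaned)
    return [t for t in tokens if t not in STOPWORDS]


print(preprocess("@maria O jogo de hoje foi ótimo http://t.co/abc"))
```

The order matters in this sketch: URLs and mentions are removed before tokenization so that their fragments never reach the token list.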

Table 7. List of algorithms and methods used in primary studies

Algorithms/Methods | % | Primary Studies
Naive Bayes | 43 | UGC01, UGC02, UGC03, UGC04, UGC10, UGC16, UGC18, UGC21, UGC25, UGC29, UGC30, UGC33, Multinomial Naive Bayes {UGC12, UGC25, UGC26}
SVM | 31 | UGC05, UGC09, UGC13, UGC21, UGC23, UGC32, UGC33, SMO {UGC01, UGC28, UGC29, UGC30}
Decision Tree | 14 | UGC29, C4.5 {UGC30}, RF {UGC16, UGC21, UGC23}
Rule-Based | 17 | UGC10, UGC11, UGC12, UGC19, UGC20, UGC31
Pattern-Based | 9 | UGC06, UGC20, UGC22
N-grams | 29 | UGC02, UGC04, UGC08, UGC10, UGC23, UGC28, UGC29, UGC30, UGC32, UGC33
Others | 51 | k-Nearest Neighbor {UGC09, UGC21}, Neural Network {UGC33, UGC21}, Filtered Space Saving {UGC09}, Hoeffding Adaptive Trees {UGC11}, Incremental Lazy Associative Classifier {UGC11}, Latent Semantic Indexing {UGC15}, MapReduce paradigm {UGC33}, OneR classification algorithm {UGC28}, Online Rule Extraction {UGC12}, Pareto-Efficient Selective Sampling {UGC11}, Topic Fuzzy Fingerprints {UGC09}, Zipping classifier {UGC04}, Genetic Algorithm {UGC21}, Regular Expression {UGC03, UGC23, UGC28}
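Since Naive Bayes is the most used classifier in Table 7, a compact version of the multinomial variant may help make the idea concrete. This is a generic textbook sketch with add-one smoothing, not the implementation used in any primary study (most of which relied on tools such as Weka); the class name and the toy training messages are made up for illustration.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial Naive Bayes over bag-of-words tokens, with add-one smoothing."""

    def fit(self, documents, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for tokens, label in zip(documents, labels):
            self.word_counts[label].update(tokens)
        self.vocabulary = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, tokens):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denominator = sum(counts.values()) + len(self.vocabulary)
            score = math.log(n_docs / total_docs)  # log prior of the class
            for token in tokens:
                if token in self.vocabulary:  # words never seen in training are ignored
                    score += math.log((counts[token] + 1) / denominator)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Made-up toy training data, for illustration only
TRAIN = [
    (["adorei", "filme"], "pos"),
    (["filme", "ótimo"], "pos"),
    (["odiei", "filme"], "neg"),
    (["atendimento", "péssimo"], "neg"),
]
model = NaiveBayes().fit([doc for doc, _ in TRAIN], [label for _, label in TRAIN])
print(model.predict(["filme", "ótimo"]))
```

Working in log space avoids numeric underflow when many word probabilities are multiplied, and the add-one smoothing keeps a word that appears in only one class from zeroing out the other class entirely.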

4.5 RQ5: what are the characteristics of UGTC data sources and how were they evaluated?

A total of 46 data sources were employed among the 35 primary studies. Social networks appear as the main sources of UGTC in Portuguese (Table 8). Twitter represents more than 50% of all data sources. The text domain is varied, but Politics, Sports and Technology attract the greatest interest. Two primary studies (UGC05, UGC10) reported the use of publicly available datasets, both containing Twitter data. The precision, recall and f-measure trio was used by almost half of the primary studies to evaluate their results. Eight primary studies reported the adoption of cross validation for estimating the classifier performance. Most primary studies (66%) built their gold standard manually.

Table 8. UGTC Data Sources

Quantity | Data Source
25 | Twitter
2 | Booking.com, Buscapé, Portuguese newspapers, Tripadvisor, Folha de São Paulo
1 | Apontador, Cinema com Rapadura, CinePlayer, e-bit, Emails, Facebook, Fórum, Google Play, MySpace, Omelete, Portuguese newspapers
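The precision, recall and f-measure trio mentioned above can be computed from a gold standard and a list of predictions as follows. The sketch uses the standard definitions for a single class treated as positive; the label sequences are made up for illustration.

```python
def precision_recall_f1(gold, predicted, positive):
    """Compute precision, recall and F1 for one class treated as positive."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Made-up gold labels and predictions, for illustration only
gold = ["pos", "pos", "neg", "neg", "pos"]
predicted = ["pos", "neg", "neg", "pos", "pos"]

p, r, f = precision_recall_f1(gold, predicted, positive="pos")
print(f"precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

With unbalanced classes, which eleven primary studies reported, these per-class measures are more informative than plain accuracy because a classifier that always predicts the majority class would score zero recall on the minority class.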

5 Discussion

We observed an increasing interest in opinion mining, partly due to its potential applications, such as marketing, public relations and political campaigns. Portuguese is spoken mainly in Portugal and Brazil, with Brazil having approximately 20 times the population of Portugal. For a randomly chosen tweet in Portuguese, there is a 95% chance that it originated in Brazil [12]. Facebook and Twitter are important sources of UGTC; however, the former is less used in text classification, as it often contains pictures and the analysis of the text by itself is not effective [13].

As most UGTC in Portuguese comes from social networks, more than 90% of the text is short and written informally, with grammatical errors and spelling mistakes, as well as ambiguity and irony. Although 69% of the works reported the use of NLP, none reported the use of word sense disambiguation. Therefore, the most used term weighting scheme, TF-IDF (term frequency – inverse document frequency), is considered less discriminative for text classification [14].

Even when good results are achieved, the datasets used are rarely published. This makes it difficult to implement improvements, as well as to compare which technique performs better on a particular dataset. Therefore, less than 50% of all 35 primary studies fully answered the five research questions. Important data for comparison, such as text domain and type, class details and language variant, were not available. We found no studies reporting the use of clustering for UGTC in Portuguese, nor a single tool covering all mining tasks.

There are some threats to validity that are worthy of note: (i) it is possible that some relevant studies were missed during the search process. This threat was mitigated by performing an extensive search, as well as double-checking by two researchers; (ii) as studies were classified based on personal judgment, it is possible that some studies were incorrectly classified. To mitigate this threat, the classification step was executed by more than one researcher; (iii) digital databases do not have compatible search rules and show some instability when presenting results. We mitigated this threat by having different researchers run the search in several digital databases more than once.
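For readers unfamiliar with the TF-IDF weighting mentioned in the discussion above, the following sketch computes it for a handful of tokenized messages. It assumes one common variant (raw term frequency multiplied by log(N / df)); other formulations exist, and the text does not say which one the primary studies used. The sample messages are made up.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per tokenized document.

    Assumes raw term frequency and idf = log(N / df); other variants exist.
    """
    n_docs = len(documents)
    doc_freq = Counter(term for doc in documents for term in set(doc))
    weights = []
    for doc in documents:
        tf = Counter(doc)
        weights.append({t: tf[t] * math.log(n_docs / doc_freq[t]) for t in tf})
    return weights


# Made-up tokenized messages, for illustration only
DOCS = [["filme", "ótimo"], ["filme", "péssimo"], ["jogo", "ótimo", "ótimo"]]

for doc_weights in tf_idf(DOCS):
    print({term: round(weight, 3) for term, weight in doc_weights.items()})
```

In very short messages almost every term occurs once, so the term-frequency factor carries little information and the weight is driven mostly by the inverse document frequency.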

6 Conclusion

This paper plots the landscape of current studies relating to the application of text mining techniques to UGTC in the Portuguese language. The strength of this paper is that it promotes the growth of text mining research for the Portuguese language. We think that the data reported in this paper may help researchers and practitioners to discover what has been achieved and where the gaps are in this field. The lack of some relevant data and of published datasets makes further analysis in the research area difficult. This work is part of a broader ongoing research effort, as shown in the general research question (Section 1). We are mapping the use of text mining techniques not only for UGTC in the Portuguese language, but for all kinds of text. To increase coverage, we plan to apply snowballing techniques to the included primary studies.

Acknowledgment

Ellen Souza is supported by FACEPE (IBPG-0765-1-0311).

References

1. Marine-Roig, E., Anton Clavé, S.: Tourism analytics with massive user-generated content: A case study of Barcelona. J. Destin. Mark. Manag. 1–11 (2015).
2. Delen, D., Crossland, M.D.: Seeding the survey and analysis of research literature with text mining. Expert Syst. Appl. 34, 1707–1720 (2008).
3. Hotho, A., Andreas, N., Paaß, G., Augustin, S.: A Brief Survey of Text Mining. (2005).
4. Tan, A.: Text Mining: The state of the art and the challenges Concept-based. Proc. PAKDD 1999 Work. Knowl. Discovery from Adv. Databases. 65–70 (1999).
5. Pardo, T., Gasperin, C., Caseli, H., Nunes, M. das G. V.: Computational Linguistics in Brazil: an overview. Proc. NAACL HLT 2010 Am. 1–7 (2010).
6. Poblete, B., Garcia, R., Mendoza, M., Jaimes, A.: Do All Birds Tweet the Same? Characterizing Twitter Around the World. Society. 1025–1030 (2011).
7. Petersen, K., Feldt, R., Mujtaba, S., Mattsson, M.: Systematic Mapping Studies in Software Engineering. (2007).
8. Kitchenham, B., Charters, S.: Guidelines for performing Systematic Literature Reviews in Software Engineering. Tech. Rep. EBSE-2007-01, (2007).
9. Hotho, A., Nürnberger, A., Paaß, G.: A Brief Survey of Text Mining. Ldv Forum. (2005).
10. da Silva Conrado, M., Felippo, A., Salgueiro Pardo, T., Rezende, S.: A survey of automatic term extraction for Brazilian Portuguese. J. Brazilian Comput. Soc. 20, 12 (2014).
11. Lu, W., Stepchenkova, S.: User-Generated Content as a Research Mode in Tourism and Hospitality Applications: Topics, Methods, and Software. J. Hosp. Mark. Manag. (2015).
12. Laboreiro, G., Bošnjak, M., Sarmento, L., Rodrigues, E.M., Oliveira, E.: Determining language variant in microblog messages. In: Proceedings of the 28th Annual ACM Symposium on Applied Computing, p. 902. ACM Press, USA (2013).
13. Evangelista, T.R., Padilha, T.P.P.: Monitoramento de Posts Sobre Empresas de E-Commerce em Redes Sociais Utilizando Análise de Sentimentos. (2013).
14. Takçı, H., Güngör, T.: A high performance centroid-based classification approach for language identification. Pattern Recognit. Lett. 33, 2077–2084 (2012).
