A DATA MINING-BASED METHODOLOGY FOR BRAND MONITORING IN SOCIAL NETWORKS
Leandro A. Silva, Orlando Bisacchi Coelho, Bruno Okamoto, Maurilio Santos, and Rodrigo Sakami
Mackenzie Presbyterian University (UPM)
1 Introduction
Social networks are important channels for producing and disseminating people’s opinions on different subjects (Boyd & Ellison, 2007). The positive or negative opinions expressed there are used for evaluating organizations, products and services (Pang & Lee, 2008). Therefore, many organizations actively monitor their brands’ reputations, collecting their customers’ opinions and actions as expressed in social networks (Aggarwal, 2011). Organizations need to react to market changes and trends in the most timely and creative manner (Turban et al., 2011). Their decision-making process has to be swift, sometimes operating in near real time, all the more so when the threats and opportunities are strategic.
The current work studies an actual business case: the way the Coca-Cola company in Brazil reacted to a threat to the perceived quality of its main product, namely the spread on Twitter of a report that a dead rat had been found inside a factory-sealed Coca-Cola bottle. Using data mining, this work monitors the dissemination of related messages in the network and identifies messages whose content is positive or negative for the company. In doing so, it advances a brand monitoring methodology that any organisation can use to track its standing on social networks.
2 Methodology
The proposed methodology is outlined in Figure 1. Brand-relevant posts are collected either directly from social networks or from aggregators, i.e. websites that collect content from multiple social networks (Boyd & Ellison, 2007; Ellison et al., 2009). The structure of each retrieved HTML document is used to identify the relevant part of the message. The messages are then preprocessed: HTML tags are removed, and tokenization, stopword removal and stemming are performed (Khan et al., 2010). The number of posts per day and the frequency of each term in the vocabulary composed of all the extracted words are then computed. This generates a time series in which transition points between periods of lower and higher activity can be identified. These points correspond to changes in the dynamics of the relevant discussion on the social networks, and they call for analysis and interpretation. Two techniques support that:
• Word clouds: Based on the frequency with which each term in the vocabulary appears in the posts, the word cloud allows the most used terms to be visualised. From this the analyst can decide whether the discussion has a positive or negative bias and assess whether a marketing campaign is succeeding.
• Association rules: This technique (Pang & Lee, 2008; Pak & Paroubek, 2010; Thelwall et al., 2010; Aggarwal, 2011) detects which words appear together in the same posts. This helps in understanding the positive or negative associations between concepts reflected in the posts, and in focusing future marketing campaigns.
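The association-rule step can be illustrated with a minimal sketch. It assumes the standard definitions (support is the share of posts containing both words, confidence is the share of posts containing the first word that also contain the second); the function name `rule_metrics` and the toy posts are hypothetical and are not taken from the study, which does not name the algorithm or tool it used.

```python
def rule_metrics(posts, antecedent, consequent):
    """Return (support, confidence) for the rule antecedent -> consequent.

    posts: iterable of token collections, one per preprocessed post.
    Assumes the usual definitions; the paper's exact tooling is not stated.
    """
    post_sets = [set(tokens) for tokens in posts]
    n_posts = len(post_sets)
    if n_posts == 0:
        return 0.0, 0.0
    # Posts that contain the antecedent word at all.
    n_antecedent = sum(1 for s in post_sets if antecedent in s)
    # Posts that contain both words of the rule.
    n_both = sum(1 for s in post_sets if antecedent in s and consequent in s)
    support = n_both / n_posts
    confidence = n_both / n_antecedent if n_antecedent else 0.0
    return support, confidence


# Toy posts, invented for illustration only.
toy_posts = [
    ["rato", "coca", "cola"],
    ["coca", "cola", "natal"],
    ["rato", "coca"],
    ["rato", "garrafa"],
]
support, confidence = rule_metrics(toy_posts, "rato", "coca")
print(f"support={support:.2f} confidence={confidence:.2f}")
```

Computing a single named rule keeps the sketch short; discovering rules without knowing the words in advance would require a frequent-itemset algorithm such as Apriori, which is beyond this illustration.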
Figure 1. Scheme for the proposed methodology. [Diagram with three stages: Data Collection (capturing data from social networks); Preprocessing (data integration, data cleansing, data structuring); Analysis of results (post monitoring, word cloud visualisation, association rule discovery).]
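As a rough illustration of the preprocessing and counting stages, the sketch below strips HTML tags, tokenizes, removes stopwords, applies a stemming step, and then counts posts per day and term frequencies (the input a word cloud needs). Everything named here is an assumption made for the example: the tiny stopword list, the `crude_stem` helper (a placeholder for a real Portuguese stemmer), the ISO date strings, and the toy posts.

```python
import html
import re
from collections import Counter

# Tiny illustrative stopword list (assumption); a real run would use a
# complete Portuguese list.
STOPWORDS = {"a", "o", "de", "da", "do", "e", "na", "no", "um", "uma", "com", "que"}

TAG_RE = re.compile(r"<[^>]+>")      # matches HTML tags
TOKEN_RE = re.compile(r"[^\W\d_]+")  # runs of letters, accents included


def crude_stem(token):
    """Hypothetical stand-in for a real stemmer: drops a trailing plural 's'."""
    return token[:-1] if len(token) > 3 and token.endswith("s") else token


def preprocess(raw_html):
    """Remove tags, lowercase, tokenize, drop stopwords, stem."""
    text = html.unescape(TAG_RE.sub(" ", raw_html)).lower()
    tokens = TOKEN_RE.findall(text)
    return [crude_stem(t) for t in tokens if t not in STOPWORDS]


def summarise(posts):
    """posts: iterable of (day, raw_html). Returns (posts_per_day, term_frequency)."""
    posts_per_day = Counter()
    term_frequency = Counter()
    for day, raw_html in posts:
        posts_per_day[day] += 1
        term_frequency.update(preprocess(raw_html))
    return posts_per_day, term_frequency


# Toy posts, invented for illustration only.
toy_posts = [
    ("2013-09-17", "<p>Um rato na garrafa de <b>Coca</b>-Cola</p>"),
    ("2013-09-17", "<p>Coca Cola e ratos?</p>"),
    ("2013-09-18", "<p>Natal com Coca-Cola</p>"),
]
per_day, freq = summarise(toy_posts)
print(per_day)
print(freq.most_common(3))
```

The per-day counter corresponds to the time series used for post monitoring, and the term counter can be passed to any word cloud renderer.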
3 Results
Texts referring to Coca-Cola were collected from the Brazilian subset of Twitter in the period from 17/09/2013 to 29/12/2013. Three major peaks in the number of tweets were identified in the time series (see Figure 2):
• the first period of increased activity: from 17/09/2013 to 23/09/2013;
• the second period: from 26/09/2013 to 28/09/2013; and
• the third period: from 22/12/2013 to 29/12/2013.
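The text does not say how the periods of increased activity were delimited. One simple possibility, shown here purely as an assumption, is to flag runs of consecutive days whose tweet count exceeds the mean by more than k standard deviations. The function name `active_periods`, the threshold rule, and the toy counts are all hypothetical.

```python
from statistics import mean, pstdev


def active_periods(daily_counts, k=1.0):
    """Return (start_index, end_index) runs where count > mean + k * std.

    This thresholding rule is an illustrative assumption, not the
    procedure used in the study.
    """
    if not daily_counts:
        return []
    threshold = mean(daily_counts) + k * pstdev(daily_counts)
    periods = []
    start = None
    for i, count in enumerate(daily_counts):
        if count > threshold and start is None:
            start = i  # a run of high activity begins
        elif count <= threshold and start is not None:
            periods.append((start, i - 1))  # the run ended on the previous day
            start = None
    if start is not None:
        periods.append((start, len(daily_counts) - 1))  # run reaches the last day
    return periods


# Toy daily tweet counts, invented for illustration only.
toy_counts = [10, 12, 11, 90, 95, 12, 10, 9, 11, 80, 10, 12, 11, 10]
print(active_periods(toy_counts))
```

Indices refer to positions in the daily series, so they can be mapped back to calendar dates once the series is aligned with the collection period.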
Figure 2. Post monitoring. [Plot of the number of tweets involving the Coca-Cola brand published per day (x-axis: days), with the 1st, 2nd and 3rd periods marked.]
The first activity peak occurred after the Brazilian press broke the news that a rat had been found inside an unopened Coca-Cola bottle. Table 1 presents word cloud visualisations of the most frequent words, together with the association rule found for each period. In the first period the most frequent words were “coca”, “cola” and “rato” (the Portuguese word for “rat”). The association algorithm identified the rule rato -> coca: 50% of all the posts mentioned both “rato” and “coca” (support of 0.50), and the probability that a person who used the word “rato” also used the word “coca” was 99% (confidence of 0.99).
The second peak of activity happened when Coca-Cola Brazil released a video on YouTube showing its manufacturing process, claiming that the quality of the process would prevent a rat from ending up inside a factory-sealed bottle. During this period the share of posts matching the rule fell to 38% (support of 0.38), although the confidence remained the same.
The third peak is related to the beginning of Coca-Cola’s Christmas advertising campaign. Both the word cloud and the association rule show that the share of tweets including “rato” decreased (support of 0.18); the confidence also decreased (to 0.89), which suggests that Brazilians were forgetting the issue, probably as a result of the brand’s marketing campaigns.

Period   Rule           Support   Confidence
1st      rato -> coca   0.50      0.99
2nd      rato -> coca   0.38      0.99
3rd      rato -> coca   0.18      0.89

Table 1. Analysis of the results: association rule discovery per period. [The word cloud visualisation for each period is not reproduced here.]
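The figures in Table 1 can be read through the standard definitions of support and confidence, which are assumed here because the text does not state them formally. With N the number of posts in a period, the rule's support and confidence, and the implied share of posts mentioning “rato” at all, work out as follows (the last column is derived from the rounded table values, so it is approximate):

```latex
\begin{align*}
\operatorname{supp}(\text{rato} \rightarrow \text{coca})
  &= \frac{\lvert\{\text{posts containing both ``rato'' and ``coca''}\}\rvert}{N}, \\
\operatorname{conf}(\text{rato} \rightarrow \text{coca})
  &= \frac{\operatorname{supp}(\text{rato} \rightarrow \text{coca})}{\operatorname{supp}(\text{rato})}
  \quad\Longrightarrow\quad
  \operatorname{supp}(\text{rato})
  = \frac{\operatorname{supp}(\text{rato} \rightarrow \text{coca})}{\operatorname{conf}(\text{rato} \rightarrow \text{coca})}, \\[4pt]
\text{1st period:}\quad \operatorname{supp}(\text{rato}) &\approx \frac{0.50}{0.99} \approx 0.51, \\
\text{2nd period:}\quad \operatorname{supp}(\text{rato}) &\approx \frac{0.38}{0.99} \approx 0.38, \\
\text{3rd period:}\quad \operatorname{supp}(\text{rato}) &\approx \frac{0.18}{0.89} \approx 0.20.
\end{align*}
```

Under these assumptions, roughly half of the posts mentioned “rato” in the first period and roughly a fifth in the third, which is consistent with the decline described in the text.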
4 References
Aggarwal, C. C. (2011). An introduction to social network data analytics (pp. 1-15). Springer US.
Baeza-Yates, R. & Ribeiro-Neto, B. (1999). Modern information retrieval (Vol. 463). New York: ACM Press.
Boyd, D. & Ellison, N. B. (2007). Social network sites: Definition, history, and scholarship. Journal of Computer-Mediated Communication, 13(1), 210-230.
Chowdhury, G. (2010). Introduction to modern information retrieval. Facet Publishing.
Ellison, N. B., Lampe, C. & Steinfield, C. (2009). FEATURE Social network sites and society: current trends and future possibilities. interactions, 16(1), 6-9.
Khan, A., Baharudin, B., Lee, L. H. & Khan, K. (2010). A Review of Machine Learning Algorithms for Text-Documents Classification. Journal of Advances in Information Technology, 1(1).
Pak, A. & Paroubek, P. (2010, May). Twitter as a Corpus for Sentiment Analysis and Opinion Mining. Proceedings of the International Conference on Language Resources and Evaluation, LREC 2010, Valletta, Malta.
Pang, B. & Lee, L. (2008). Opinion mining and sentiment analysis. Foundations and Trends in Information Retrieval, 2(1-2), 1-135.
Singhal, A. (2001). Modern information retrieval: A brief overview. IEEE Data Eng. Bull., 24(4), 35-43.
Thelwall, M., Wilkinson, D. & Uppal, S. (2010). Data mining emotion in social network communication: Gender differences in MySpace. Journal of the American Society for Information Science and Technology, 61(1), 190-199.
Turban, E., Sharda, R., Aronson, J. E. & King, D. (2013). Business Intelligence: A Managerial Perspective. 3rd ed. Prentice-Hall.