Mgr inż. Aleksander Wawer
Zespół Inżynierii Lingwistycznej, Instytut Podstaw Informatyki PAN
ul. Ordona 21, 01-237 Warszawa
[email protected]
Mgr inż. Radosław Nielek
Polsko-Japońska Wyższa Szkoła Technik Komputerowych
ul. Koszykowa 86, 02-008 Warszawa
[email protected]
Application of automated sentiment extraction from text to modeling of public opinion dynamics

Abstract

The article presents an application of a computer system for automated analysis of emotional attitude in Polish-language texts to modeling changes in support for political parties, based on information collected from Internet media during the campaign preceding the October 21st parliamentary elections in Poland. The data, a few hundred news articles every 3 hours, were downloaded by dedicated crawling software. Emotions in the texts were computed using a "bag of words" method applied to clause contexts around sets of keywords. The impact of the computed emotions on public opinion was validated against the numbers of visitors of the web portals. The obtained results were combined with public opinion survey data. In order to create a model of the impact of media messages on opinion changes, system identification methods were applied.

Keywords: public opinion, system identification, natural language processing
Introduction

The exponential growth of information published on the web and the growing number of entities that not only search for but also create content through Web 2.0 services (such as blogs, forums and social-networking websites) render a manual analysis of published content impossible, not only ex ante but even ex post. In order to solve this problem, one turns towards automated data extraction applications. The application of computers to analyze texts, especially their semantics, enables the analysis of tens of thousands of articles in real time.
The problem of automated natural language processing is strictly localized, bound to the specific rules of a national language. This situation is not typical when compared to the majority of problems posed in Computer Science, which are usually universal. Thus, parliamentary elections and the campaign period are a good opportunity to test and validate language tools, because of their national character, the large amounts of other than neutral (not to say emotionally charged) linguistic data and, last but not least, frequent opinion polls. The discussion of the directions and mechanisms of media influence on public opinion is rather a domain of sociologists, political scientists or philosophers. Some, like Chomsky [4], argue that the media mobilize group interests and protect elites; others, like Ban Ki-moon [10], emphasize the fundamental role of an independent press for peace and democracy. This article, like the pioneering work of David Fan [2], stays on the agnostic grounds of a technical and pragmatic paradigm of modeling and studying the influence of the media, rather than considering its axiological or ethical aspects.
Public opinion as a MIMO system

Simulation of public opinion can be seen as a multiple-input, multiple-output (MIMO) system with feedback. In the case of analyzing support for political parties in a pre-election period, the input at a moment t_n can be represented as a vector s = (s_1, s_2, s_3, s_4, s_5, s_6) of support for every party active in a society and an n x m matrix, where columns represent active political parties and rows represent information sources (such as web portals, TV news or newspapers). The short period of time taken into account in our research allows us to ignore long-term demographic and sociological processes, driven for example by an aging society, GDP growth or migrations, which are important for the acceptance of the political manifestos presented by each party. In fact, political support is continuous in time, as new information appears in the media. Every single piece of information can affect support for one or more political parties. Because of the limitations of computer-aided social simulations and the periodical nature of crawling (every three hours), the system has been considered in the discrete domain. Although the theoretical account proposed in the previous paragraphs is transparent and appealing, any attempt to implement it directly has to face a serious difficulty. The level of public support for political parties, represented as the vector s at a moment t_n, is unknown for the majority of n. In fact, the vector s can be known only on the day of elections. During the remaining time only opinion polls, which can be seen as more or less imprecise approximations of s, are available. In this paper, an estimation of public support for political parties was computed as the average of opinion polls conducted by three independent research institutes (Gemius, PBS and SMG/KRC). A different kind of problem is related to the matrix M, which should contain all political news items present in the media landscape. Theoretically, it is possible to collect all appearing information in real time, but in practice its amount (and the lack of an appropriate hierarchy or ontology) renders such a collection nearly impossible to implement in a working system. Thus, we applied a special algorithm, described in the next paragraphs, aimed at reducing the complexity of the input data by aggregating the information for every political party.
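To fix notation, the discrete-time setup described in this section can be summarized compactly as below; f denotes the unknown transition mapping that the system identification methods applied later are meant to approximate (this compact notation is a sketch of ours, not a formula taken from the data):

\[
  s(t_{n+1}) = f\bigl(s(t_n),\, M(t_n)\bigr), \qquad
  s(t_n) = \bigl(s_1(t_n), s_2(t_n), \ldots, s_6(t_n)\bigr)
\]

where M(t_n) is the matrix of media messages available at moment t_n (rows: information sources, columns: parties).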
Natural language processing and sentiment analysis

The proposed approach is based on sentiment analysis and content analysis methods [9]. The main purpose of the designed application was to measure occurrences of emotionally charged (non-neutral) vocabulary around mentions of politicians or parties. Such a method does not distinguish speech acts citing a politician's negatively (or positively) charged utterance, where politician X is the author of an emotionally charged statement, from negative utterances dealing with politician X as the object (or patient, in the grammatical sense) about whom something negative is being stated. In other words, the method treats statements depreciating X in the same way as statements in which X uses negative vocabulary himself. In the first step, the sentiment analysis algorithm proceeds by computing contexts around keywords. The context-marking function processed chunks of text as delimited by punctuation, determining context boundaries in an unambiguous way by allowing politician or party names of only one party into a single context. The method we applied can be classified as bag of words, because it abstracted from the grammatical order of the analyzed sentences. Instead, it counted occurrences of certain emotionally loaded lexeme categories and multiword units (sequences of lexemes) occurring in contexts of a predefined set of keywords: names of politicians and parties. The key concept can be described as counting not lexeme occurrences, but occurrences of categories of lexemes. Such an abstraction stems from content analysis systems, most notably the classic General Inquirer [9]. The described application used dictionaries of emotionally loaded words and expressions created by Zetema, consisting of nearly 2000 negative words and multiword units and 1500 positive ones. The Zetema dictionaries have been validated against the IPI PAN corpus [7].
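A minimal sketch of this counting step (in Python), assuming already lemmatized input text (morphological normalization of Polish word forms is omitted here); the dictionaries below are tiny illustrative stand-ins for the Zetema lexicons, and the per-party keyword sets follow the ontology described in the next section:

import re

# Tiny illustrative dictionaries; the real Zetema lexicons contain roughly
# 2000 negative and 1500 positive words and multiword units.
POSITIVE = {"sukces", "zaufanie", "poparcie"}
NEGATIVE = {"skandal", "kłamstwo", "afera"}

def party_emotions(text, keywords, positive, negative):
    # Count positive and negative lexemes in clause contexts around party keywords.
    # Clauses are delimited by punctuation; a clause is assigned to a party only
    # when keywords of exactly one party occur in it, as described above.
    counts = {party: {"pos": 0, "neg": 0} for party in keywords}
    for clause in re.split(r"[.,;:!?]", text.lower()):
        tokens = clause.split()
        mentioned = [p for p, kw in keywords.items() if kw & set(tokens)]
        if len(mentioned) != 1:
            continue
        party = mentioned[0]
        counts[party]["pos"] += sum(t in positive for t in tokens)
        counts[party]["neg"] += sum(t in negative for t in tokens)
    return counts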
Ontology

The tracked set of keywords, or ontology, consisted of six distinct groups of main politician names and name variants of parties (a keyword-dictionary sketch based on this list follows below):
PO – 16 names, e.g. PO, Tusk, Komorowski, Graś, Schetyna, Zdrojewski [...];
PiS – 13 names, e.g. PiS, Kaczyński, Ziobro, Kuchciński, Gosiewski [...];
PSL – 7 names, e.g. PSL, Pawlak, Jan Bury, Kalinowski, Piechociński [...];
LiD – 13 names, e.g. LiD, SLD, Olejniczak, Kwaśniewski, Szmajdziński [...];
LPR – 9 names, e.g. LPR, Giertych, Orzechowski, Wierzejski [...];
Samoobrona – 9 names, e.g. Samoobrona, Lepper, Maksymiuk, Filipek [...].
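For illustration, the groups above can be stored as a plain keyword dictionary that plugs into the counting sketch from the previous section; only the names listed above are included (the full ontology contains 7-16 variants per party), and multiword variants such as Jan Bury would additionally require phrase matching rather than single-token lookup:

# Illustrative subset of the tracked ontology (lower-cased for matching);
# the full keyword sets contain 7-16 name variants per party.
ONTOLOGY = {
    "PO":         {"po", "tusk", "komorowski", "graś", "schetyna", "zdrojewski"},
    "PiS":        {"pis", "kaczyński", "ziobro", "kuchciński", "gosiewski"},
    "PSL":        {"psl", "pawlak", "kalinowski", "piechociński"},
    "LiD":        {"lid", "sld", "olejniczak", "kwaśniewski", "szmajdziński"},
    "LPR":        {"lpr", "giertych", "orzechowski", "wierzejski"},
    "Samoobrona": {"samoobrona", "lepper", "maksymiuk", "filipek"},
}

# Usage with the counting sketch above (article_text would come from the crawler):
# counts = party_emotions(article_text, ONTOLOGY, POSITIVE, NEGATIVE)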
Crawling

The analyzed articles were downloaded every three hours from the 16 main Polish news portals between 1 and 21 October 2007. The crawler used the focused-crawling technique [1,6,8], traversing the web recursively with the depth set to 2. From the downloaded pages only the content of articles was extracted and analyzed; comments, static parts of pages and ads were filtered out. Additionally, to avoid irrelevant content, the main portal sections with domestic political news were preselected as starting points for crawling. A focused-crawling-with-human-assistance paradigm [7] was used to acquire news from the web portals. The patterns of links to be followed by the crawler were prepared manually, to limit the number of unnecessarily downloaded pages and to improve the quality of the obtained and analyzed texts. Changes in the structure of the crawled portals were monitored on a regular basis and the necessary modifications of the link patterns were applied.
The number of crawled pages varied slightly depending on the web portals' structure. On average, about 250 pages were downloaded during one crawl, which amounts to roughly 1750 pages per day and 36 000 pages in the three pre-election weeks.
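A minimal sketch of such a focused crawler, assuming a hypothetical seed URL and link pattern; the real system used manually prepared patterns for 16 portals and portal-specific rules for stripping comments, static page elements and ads:

import re
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

# Hypothetical starting point and link pattern; the real crawler used manually
# prepared patterns, one set per portal, revised whenever a portal changed.
SEEDS = ["https://example-portal.pl/wiadomosci/polityka/"]
LINK_PATTERN = re.compile(r"/wiadomosci/polityka/.+\.html$")

def crawl(seeds, link_pattern, max_depth=2):
    # Focused crawl: follow only links matching the manually prepared pattern,
    # down to a fixed depth (2 in the described setup).
    seen = set(seeds)
    frontier = [(url, 0) for url in seeds]
    articles = []
    while frontier:
        url, depth = frontier.pop()
        html = requests.get(url, timeout=10).text
        soup = BeautifulSoup(html, "html.parser")
        if depth > 0:  # seeds are section pages, not articles
            # Only article body text is kept; comments, static page elements and
            # ads would be filtered out here with portal-specific rules.
            articles.append(" ".join(p.get_text(" ", strip=True)
                                     for p in soup.find_all("p")))
        if depth < max_depth:
            for a in soup.find_all("a", href=True):
                link = urljoin(url, a["href"])
                if link_pattern.search(link) and link not in seen:
                    seen.add(link)
                    frontier.append((link, depth + 1))
    return articles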
Results: modeling opinion changes under the media influence

Further considerations have to be preceded by an essential methodological remark. The measured values are incidental; we have no data concerning previous elections and no time series of support changes obtained through systematic monitoring. Therefore it is not possible to validate the proposed models using the most common paradigm of splitting the data into training and test sets. Because of the small number of observations, one cannot seriously consider regression models, often used for system identification. The support data from 1.10.2007, when the crawling began, marked as P(t1.10), is the average support reported by the three polls which measured public opinion at that time, namely the polls conducted by SMG/KRC (dated 29.9), Gemius (1.10) and PBS (27-29.9). The data marked as P(t21.10), from 21.10, are the official election results published by the National Election Committee (Państwowa Komisja Wyborcza). For each party p, emot_p has been calculated as the ratio of the summed neg_p to pos_p, where neg_p is the number of negatively loaded words and phrases which occurred in contexts of party p, and pos_p the number of positively loaded words and phrases which occurred in contexts of party p:
\[
  emot_p = \frac{neg_p}{pos_p}
\]
Predicted support values for each party on the day of elections (21.10) were calculated with the formulas predA and predB discussed below, using two independent variables, namely the support on 1.10, when the analysis started, and emot. The predicted support according to the predA formula was calculated as:

\[
  P_{predA}(t_1) = P(t_0)\, e^{(w_1 - emot)\, w_2}
\]

where t_0 denotes 1.10 and t_1 denotes 21.10.
The optimal values of the w1 and w2 coefficients were established using a grid search over the range [-6; 6] as w1 = 0.6 and w2 = 1.73. The inverse error function, depending on w1 and w2, computed as the sum over parties of 1/|P_predA(t_21.10) - P(t_21.10)|, is presented in Figure 1:
Figure 1. The inverse error function for the predA formula, over w1 and w2 coefficients in [-6; 6].
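A minimal sketch of this search, using the observed values from Table 1; for simplicity it minimizes the summed absolute prediction error instead of maximizing the inverse error function, and the grid resolution is an assumption:

import math

# Observed values from Table 1: support on 1.10, emot, and election results of 21.10.
P0   = {"PO": 32.23, "PiS": 31.27, "LiD": 13.27, "Samoobrona": 2.0, "LPR": 2.0, "PSL": 4.67}
EMOT = {"PO": 0.45,  "PiS": 0.57,  "LiD": 0.64,  "Samoobrona": 1.29, "LPR": 1.09, "PSL": 0.37}
P21  = {"PO": 41.51, "PiS": 32.11, "LiD": 13.15, "Samoobrona": 1.5,  "LPR": 1.5,  "PSL": 8.91}

def pred_a(p0, emot, w1, w2):
    # predA formula: P(t0) * exp((w1 - emot) * w2)
    return p0 * math.exp((w1 - emot) * w2)

best = None
grid = [round(-6 + 0.01 * i, 2) for i in range(1201)]  # 0.01 resolution is an assumption
for w1 in grid:
    for w2 in grid:
        err = sum(abs(pred_a(P0[p], EMOT[p], w1, w2) - P21[p]) for p in P0)
        if best is None or err < best[0]:
            best = (err, w1, w2)
print(best)  # should land near the reported w1 = 0.6, w2 = 1.73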
Due to the inapplicability of regression models, we decided to compare the results with a different method of system identification, rarely used but recently gaining in popularity, namely genetic programming [5], belonging to the broadly understood artificial intelligence or soft computing approaches. In order to compute a predictive formula for support values under the influence of emot, we used the GPalta toolbox [3]. The best tree (formula), obtained over 30 000 generations, is presented below as the predB formula:

IF (1/P(t0)) > emot THEN
    IF P(t0) > 25.4 THEN
        RETURN PpredB(t1) = (P(t0)*2/7.2) + 32.5;
    ELSE
        RETURN PpredB(t1) = (P(t0)*2/7.2) + 7.6;
ELSE
    IF (P(t0)*2) > 25.4 THEN
        IF P(t0) > 25.4 THEN
            RETURN PpredB(t1) = 32.1;
        ELSE
            RETURN PpredB(t1) = (73.9 + P(t0)) / 50.4;
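Purely to make the branching explicit, the tree can be transcribed into executable form as below; this is our direct transcription of the printed formula, not the original GPalta output, and the case in which both outer tests fail is not covered by the printed tree:

def pred_b(p0, emot):
    # Direct transcription of the predB tree printed above.
    if (1.0 / p0) > emot:
        if p0 > 25.4:
            return p0 * 2 / 7.2 + 32.5
        return p0 * 2 / 7.2 + 7.6
    if p0 * 2 > 25.4:
        if p0 > 25.4:
            return 32.1
        return (73.9 + p0) / 50.4
    return None  # branch not specified in the printed tree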
As is often the case with genetic programming, the tree (formula) suffers from over-specification (overfitting). Unfortunately, with such a small dataset one cannot devise an efficient means of dealing with this problem. Table 1 presents the support for each party on the two reported dates, at the beginning and at the end of the analysis, the emotional data gathered by the crawler and processed by the sentiment analysis engine (emot), and the attempts at predicting the support with the two formulas described above, predA and predB.
                       PO      PiS     LiD     Samoobrona  LPR     PSL     Sum of errors
P(t1.10)               32.23   31.27   13.27   2           2       4.67    -
P(t21.10)              41.51   32.11   13.15   1.5         1.5     8.91    -
P(t21.10) - P(t1.10)   +9.28   +0.84   -0.12   -0.50       -0.50   +4.24   -
emot                   0.45    0.57    0.64    1.29        1.09    0.37    -
PpredA(t21.10)         41.51   32.74   12.41   0.61        0.86    6.91    4.90
PpredB(t21.10)         32.1    32.1    13.0    1.5         1.5     1.56    81.7
Table 1. Observed and predicted support values for each party and emot.
Discussion

As can be seen in Table 1, in the contexts of all political parties except Samoobrona and LPR the dominating emotional overtones were positive. The lowest negative-to-positive vocabulary ratio was observed in the contexts of PSL politicians and party name mentions. The analysis of the relationship between P(t1.10) and emot as independent variables and P(t21.10) as the dependent variable reveals two facts. Firstly, the simplest rule that one can learn from the data is "the rule of 0.6": a party with an emot value lower than 0.6 gains support and, vice versa, a party with emot higher than 0.6 loses support. Secondly, the relationship is non-linear: the support for bigger parties is more prone to substantial gains and losses, as shaped by emot, than the support for smaller parties.
The results of predictions using both system identification formulas, predA and predB, prove the superiority of the exponential equation predA, which reached a summed error as low as about 5 percentage points across all 6 parties. The most important conclusion stemming from the presented results is that despite the incidental character of the observed phenomena and the small data size, we have indirectly confirmed and captured the intuitively perceived relationship between the emotional load of media messages and public opinion changes. The relationship can be modeled using system identification methods, allowing researchers and policy makers to anticipate future opinion changes based on current measurements of emotional overtones towards particular parties and politicians.
Bibliography

1. Chakrabarti S., van den Berg M., Dom B.: Focused crawling: a new approach to topic-specific Web resource discovery. In: Proceedings of the 8th International World Wide Web Conference (WWW8), 1999. http://citeseer.ist.psu.edu/chakrabarti99focused.html
2. Fan D. P.: Predictions of Public Opinion from the Mass Media: Computer Content Analysis and Mathematical Modeling. Greenwood Press, 1988.
3. GPalta Genetic Programming toolbox: http://gpalta.berlios.de/
4. Herman E. S., Chomsky N.: Manufacturing Consent: The Political Economy of the Mass Media. Pantheon, 1988.
5. Koza J.: Genetic Programming. MIT Press, 1992.
6. Lawrence S., Giles C. L.: Searching the World Wide Web. Science, 280: 98-100, 1998. http://citeseer.ist.psu.edu/lawrence98searching.html
7. Przepiórkowski A.: The IPI PAN Corpus: Preliminary version. IPI PAN, Warszawa, 2004.
8. Rungsawang A., Angkawattanawit N.: Learnable topic-specific web crawler. Journal of Network and Computer Applications, 28(2): 97-114, 2005.
9. Stone P. J., Dunphy D. C., Smith M. S., Ogilvie D. M.: The General Inquirer: A Computer Approach to Content Analysis. MIT Press, 1966.
10. United Nations, General Assembly, OBV/620, PI/1773: http://www.un.org/News/Press/docs/2007/obv620.doc.htm