Congress UPV 2nd International Conference on Advanced Research Methods and Analytics (CARMA 2018) The contents of this publication have been evaluated by the Scientific Committee according to the procedure described in the preface. More information at http://www.carmaconf.org/

Scientific Editors Josep Domenech María Rosalía Vicente Desamparados Blazquez

Publisher 2018, Editorial Universitat Politècnica de València www.lalibreria.upv.es / Ref.: 6447_01_01_01

ISBN: 978-84-9048-689-4 (print version) Print on-demand DOI: http://dx.doi.org/10.4995/CARMA2018.2018.8742

2nd International Conference on Advanced Research Methods and Analytics (CARMA 2018) This book is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International license Editorial Universitat Politècnica de València http://ocs.editorial.upv.es/index.php/CARMA/CARMA2018

2nd International Conference on Advanced Research Methods and Analytics (CARMA2018) Universitat Politècnica de València, València, 2018

Preface Domenech, Josep a; Vicente, María Rosalía b; Blazquez, Desamparados a a Dept. Economics and Social Sciences, Universitat Politècnica de València, Spain. b Dept. Applied Economics, Universidad de Oviedo, Spain

Abstract Research methods in economics and social sciences are evolving with the increasing availability of Internet and Big Data sources of information. As these sources, methods, and applications become more interdisciplinary, the 2nd International Conference on Advanced Research Methods and Analytics (CARMA) is an excellent forum for researchers and practitioners to exchange ideas and advances on how emerging research methods and sources are applied to different fields of social sciences as well as to discuss current and future challenges. Keywords: Big Data sources, Web scraping Social media mining, Official Statistics, Internet Econometrics, Digital transformation, global society.

This work is licensed under a Creative Commons License CC BY-NC-ND 4.0 Editorial Universitat Politècnica de València


1. Preface to CARMA2018
This volume contains the selected papers of the Second International Conference on Advanced Research Methods and Analytics (CARMA 2018), hosted by the Universitat Politècnica de València, Spain, on 12 and 13 July 2018. This second edition consolidated CARMA as a unique forum where Economics and Social Sciences research meets Internet and Big Data. CARMA provided researchers and practitioners with an ideal environment to exchange ideas and advances on how Internet and Big Data sources and methods contribute to overcoming challenges in Economics and Social Sciences, as well as on the changes in society brought about by the digital transformation.

The selection of the scientific program was directed by María Rosalía Vicente, who led an international team of 33 scientific committee members representing 28 institutions. Following the call for papers, the conference received 73 paper submissions from all around the globe. All submissions were reviewed by the scientific committee members under a double-blind review process. Finally, 40 papers were accepted for oral presentation during the conference. This represents an overall paper acceptance rate of 54%, ensuring a high-quality scientific program. It covers a wide range of research topics in Internet and Big Data, including nowcasting people mobility and economic indicators, applications of Big Data methods in retail and finance, and the use of search and social media data, among others.

CARMA 2018 also featured two special sessions on "Big Data for Central Banks" and "Using Big Data in Official Statistics," chaired by Juri Marcucci and Gian Luigi Mazzi, respectively. Both sessions gave a complementary institutional perspective on how to use Internet and Big Data sources and methods for public policy and official statistics. The perspective from the private sector was contributed by Norbert Wirth, who talked about "Data Science development at scale" in his keynote speech.

The conference organizing committee would like to thank all who made this second edition of CARMA a great success. In particular, thanks are due to the authors, scientific committee members, reviewers, invited speakers, session chairs, presenters, sponsors, supporters and all the attendees. Our final words of gratitude must go to the Faculty of Business Administration and Management of the Universitat Politècnica de València for supporting CARMA 2018.


2. Organizing Committee
General chair: Josep Domènech, Universitat Politècnica de València
Scientific committee chair: María Rosalía Vicente, Universidad de Oviedo
Local arrangements chair: Desamparados Blazquez, Universitat Politècnica de València

3. Sponsors
BigML
DevStat

4. Supporters
Universitat Politècnica de València
Facultad de Administración y Dirección de Empresas
Departamento de Economía y Ciencias Sociales

5. Scientific committee
Concha Artola, Banco de España
Nikolaos Askitas, IZA – Institute of Labor Economics
Jose A. Azar, IESE Business School
Silvia Biffignandi, University of Bergamo
Petter Bae Brandtzaeg, SINTEF
Jonathan Bright, Oxford Internet Studies
José Luis Cervera, DevStat
Piet Daas, Statistics Netherlands
Pablo de Pedraza, Universidad de Salamanca / University of Amsterdam
Giuditta de Prato, European Commission – JRC Directorate B
Rameshwar Dubey, Montpellier Business School
Enrico Fabrizi, DISES – Università Cattolica del S. Cuore
Juan Fernández de Guevara, IVIE and University of Valencia
Jose A. Gil, Universitat Politècnica de València


Felix Krupar, Max-Planck-Institute for Innovation and Competition
Caterina Liberati, University of Milano-Bicocca
Juri Marcucci, Bank of Italy
Rocio Martinez Torres, Universidad de Sevilla
Esteban Moro, Universidad Autónoma de Madrid / Universidad Carlos III
Michela Nardo, European Commission – Joint Research Centre
Enrique Orduña, Universitat Politècnica de València
Bulent Ozel, University of Zurich / Universitat Jaume I
Andrea Pagano, European Commission – Joint Research Centre
Ana Pont, Universitat Politècnica de València
Ravichandra Rao, Indian Statistical Institute
Pilar Rey del Castillo, Instituto de Estudios Fiscales
Anna Rosso, DEMM University of Milan
Vincenzo Spiezia, OECD
Pål Sundsøy, NBIM/Norway
Sergio L. Toral Marin, Universidad de Sevilla
Antonino Virgillito, Italian Revenue Agency
Sang Eun Woo, Purdue University
Zheng Xiang, Virginia Tech


Index

Full papers
Blockchain-backed analytics. Adding blockchain-based quality gates to data science projects ..... 1
Algorithmic Trading Systems Based on Google Trends ..... 11
How to sort out uncategorisable documents for interpretive social science? On limits of currently employed text mining techniques ..... 19
A proposal to deal with sampling bias in social network big data ..... 29
Big data analytics in returns management - Are complex techniques necessary to forecast consumer returns properly? ..... 39
Facebook, digital campaign and Italian general election 2018. A focus on the disintermediation process activated by the web ..... 47
Inferring Social-Demographics of Travellers based on Smart Card Data ..... 55
Relevance as an enhancer of votes on Twitter ..... 63
Big Data and Data Driven Marketing in Brazil ..... 71
Do People Pay More Attention to Earthquakes in Western Countries? ..... 79
From Twitter to GDP: Estimating Economic Activity From Social Media ..... 87
Has Robert Parker lost his hegemony as a prescriptor in the wine World? A preliminar inquiry through Twitter ..... 97


Digital Vapor Trails: Using Website Behavior to Nowcast Entrepreneurial Activity ..... 107
Evolution and scientific visualization of Machine learning field ..... 115
Fishing for Errors in an Ocean Rather than a Pond ..... 125
Validation of innovation indicators from companies' websites ..... 133
A combination of multi-period training data and ensemble methods to improve churn classification of housing loan customers ..... 141
Technical Sentiment Analysis: Measuring Advantages and Drawbacks of New Products Using Social Media ..... 145
Mining for Signals of Future Consumer Expenditure on Twitter and Google Trends ..... 157
Towards an Automated Semantic Data-driven Decision Making Employing Human Brain ..... 167
Limits and virtues of a web survey on political participation and voting intentions. Reflections on a mixed-method search path ..... 177
Italian general election 2018: digital campaign strategies. Three case studies: Movimento 5 Stelle, PD and Lega ..... 185
Access and analysis of ISTAC data through the use of R and Shiny ..... 193
Grassroots Market Research on Grass: Predicting Cannabis Brand Performance Using Social Media Scraping ..... 201
What should a researcher first read? A bi-relational citation networks model for strategical heuristic reading and scientific discovery ..... 209
Historical query data as business intelligence tool on an internationalization contex ..... 221
Big data and official data: a cointegration analysis based on Google Trends and economic indicators ..... 229
Using big data in official statistics: Why? When? How? What for? ..... 237

Abstracts
A Text-Based Framework for Dynamic Shopping-Cart Analysis ..... 247
Big Data Sources for Private Consumption Estimation: Evidence from Credit and Debit Card Records ..... 248
Measuring Retail Visual Cues Using Mobile Bio-metric Responses ..... 249


Identification of helpful and not helpful online reviews within an eWOM community using text-mining techniques ..... 250
Gender discrimination in algorithmic decision-making ..... 251
Estimating traffic disruption patterns with volunteer geographic information ..... 252
The educational divide in e-privacy skills in Europe ..... 253
'Whatever it takes' to change beliefs: Evidence from Twitter ..... 254
Automated Detection of Customer Experience through Social Platforms ..... 255
Transport-Health Equity Outcomes from mobile phone location data - a case study ..... 256
An Unconventional Example of Big Data: BIST-100 Banking Sub-Index of Turkey ..... 257
Google matrix analysis of worldwide football mercato ..... 258
Measuring Technology Platforms impact with search data and web scraping ..... 259
Fear, Deposit Insurance Schemes, and Deposit Reallocation in the German Banking System ..... 260
Macroeconomic Indicator Forecasting with Deep Neural Networks ..... 261
Financial Stability Governance and Communication ..... 262
Spread the Word: International Spillovers from Central Bank Communication ..... 263
The Catalonian Crises through Google Searches: A Regional Perspective ..... 264
The Sentiment Hidden in Italian Texts Through the Lens of A New Dictionary ..... 265
X11-like Seasonal Adjustment of Daily Data ..... 266
Empirical examples of using Big Internet Data for Macroeconomic Nowcasting ..... 267
Using big data at Istat: forecasting consumption ..... 268
Mining Big Data in statistical systems of the monetary financial institutions (MFIs) ..... 269


2nd International Conference on Advanced Research Methods and Analytics (CARMA2018) Universitat Politècnica de València, València, 2018 DOI: http://dx.doi.org/10.4995/CARMA2018.2018.8292

Blockchain-backed analytics: Adding blockchain-based quality gates to data science projects Herrmann, Markus; Petzold, Jörg and Bombatkar, Vivek Technology & Data, GfK SE, Germany

Abstract A typical analytical lifecycle in data science projects starts with the process of data generation and collection, continues with data preparation and preprocessing and heads towards project specific analytics, visualizations and presentations. In order to ensure high quality trusted analytics, every relevant step of the data-model-result linkage needs to meet certain quality standards that furthermore should be certified by trusted quality gate mechanisms. We propose “blockchain-backed analytics”, a scalable and easy-to-use generic approach to introduce quality gates to data science projects, backed by the immutable records of a blockchain. For that reason, data, models and results are stored as cryptographically hashed fingerprints with mutually linked transactions in a public blockchain database. This approach enables stakeholders of data science projects to track and trace the linkage of data, applied models and modeling results without the need of trust validation of escrow systems or any other third party. Keywords: Blockchain; Data Science; Data Management; Trusted Data; Trusted Analytics.

This work is licensed under a Creative Commons License CC BY-NC-ND 4.0 Editorial Universitat Politècnica de València


1. Trusted analytics
A typical analytical lifecycle in data science projects starts with the process of data creation and collection, continues with data preparation and pre-processing ("data wrangling") and heads towards project-specific analytics, visualization and presentation, i.e. the results. To enforce trusted analytics, every step of the data lifecycle and the applied analytics of a project needs to meet certain quality standards. While these standards may vary broadly across academia and industries, there is one common challenge for every field of trusted analytics: how to publicly document trusted data and analytics in an immutable way? And, if possible, without the involvement of any third party ensuring the trust.

Why is this considered a challenge? In academia, trusted analytics is achieved by peer-reviewed processes and bibliographical documentation. In data-driven industry sectors, on the other hand, the massive amount of decentralized data being generated daily and the huge number of data science and analytics projects worldwide cannot be evaluated and documented by any manual or human review system in a reasonable amount of time. With the recently matured possibilities of machine learning – and, in general, the field of artificial intelligence – the documentation of the data-model-result relationship will become more and more relevant and consequently requires a scalable and immutable data and information documentation solution as a quality gate.

A decentralized storage system based on blockchain technology is able to introduce such quality gates to data science projects. For this reason we propose "blockchain-backed analytics", whereby the data, applied methods and relevant results are stored as cryptographically hashed fingerprints in an immutable blockchain database. Delivering this scalable and easy-to-use generic approach means being able to track and trace the linkage of data, models and modeling results, without the need of involving escrow systems or any other third party.

Our work builds on existing research in the fields of blockchain-based data protection and identity management, where blockchain technology is applied to secure the management of digital identities and protect data ownership (Zyskind et al., 2015). Accordingly, blockchain-backed analytics is an extension of blockchain-based identity management techniques to data science projects and is therefore particularly relevant for data-driven academic research and industry projects.


2. Blockchain technology
To date, the application of blockchain technology is predominantly influenced by Satoshi Nakamoto's design of the cryptocurrency Bitcoin, which is based on consensus in a distributed system, achieved with a Proof-of-Work algorithm (Nakamoto, 2008). In this regard, a blockchain is a distributed database that continuously keeps records of transactions in a logical order and in sync across participants, i.e. instances. Multiple transactions are bundled and stored in a block, and new blocks are sequentially appended to the previous block(s), with each block containing a list of cryptographically signed transactions with timestamps. In order to ensure the integrity of the blockchain, each new block contains a pointer to a distinct hash value of the previous block and – in most cases – the root hash of the Merkle tree of all transactions of the previous block as well. This can be considered a hash chain of all transactions of the block (ibid.). The sequence of inter-linked blocks then forms a blockchain with the inherent feature that every block can be traced back to the initial first block of the chain. This also implies that any later modification or deletion of single transactions or entire blocks would result in a hash mismatch in hash pointers and Merkle trees and therefore break the chain.

A blockchain network can be private, where access and read/write permissions can be restricted, or public, with unrestricted access and read/write permissions. Although the most popular applications of public blockchains to date are cryptocurrencies, the technology is by far not limited to this use case (Davidson, 2016). The proposed public blockchain database approach seems to be most suitable for blockchain-backed analytics due to one of its core characteristics: the immutability of its records.

2.1. Immutability of the blockchain
The data stored in a blockchain database is immutable in the sense that once a record has been written, it cannot be modified or deleted afterwards. This can be put down to the process of validating transactions and adding them to a new block, which is commonly referred to as "mining". Among others, there are two popular categories of mining algorithms in public blockchains: Proof-of-Work (PoW), as applied in Bitcoin mining (Nakamoto, 2008), and Proof-of-Stake (PoS), as proposed for the cryptocurrency Ethereum (Buterin, 2017). Both categories of mining algorithms provide consensus among the distributed parties of the blockchain about the validity of the transactions and therefore the final commit to the database.


While each category of algorithms has its advocates and attack vectors1, both share characteristics that make it almost impossible to manipulate the consensus process for the purpose of fraud or self-interest, and which thereby ensure the immutability of existing blockchain records. Considering that a broad distribution of mining instances is crucial to reduce the risk of manipulation, it is recommended to use a large (i.e. widely distributed) public blockchain for blockchain-backed analytics. Alternatively, a private or permissioned blockchain can be applied, in particular for big consortiums that aim to retain control over configuration parameters of the blockchain, e.g., to reduce transaction costs (Davidson, 2016).

2.2. Blockchain database capabilities
Despite its database structure, a distributed blockchain database is not primarily intended to be used as traditional database storage, mostly due to its distributed technical design and mining process. Notably, with the traditional Bitcoin blockchain there are a number of known scalability limitations, such as the limited number of transactions per block, the limited throughput of transactions and the high latency until a transaction is confirmed (Croman et al., 2016). In addition, classical blockchains usually do not offer traditional querying capabilities (as opposed to RDBMS or NoSQL data stores) and in most cases only allow the lookup of existing – and thus valid – transactions. For this reason, public blockchain databases mostly serve as distributed ledgers (especially for cryptocurrencies) with the property of providing synchronized, auditable and verifiable transactional data across multiple users and distributed networks (without the need to involve third parties to validate transactions). They are not designed as data storage.
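To make the chaining and tamper-evidence described above concrete, the following minimal sketch (Python; the block structure and field names are our own simplification, not Bitcoin's or Ethereum's actual formats) links blocks by hashing each block together with the previous block's hash, so that any later modification of an old record breaks the chain.

import hashlib
import json

def block_hash(block: dict) -> str:
    # Hash the canonical JSON serialisation of a block (a simplified stand-in
    # for hashing a real block header).
    return hashlib.sha256(json.dumps(block, sort_keys=True).encode()).hexdigest()

def append_block(chain: list, transactions: list) -> None:
    # Each new block stores a pointer to the hash of the previous block.
    prev = block_hash(chain[-1]) if chain else "0" * 64
    chain.append({"prev_hash": prev, "transactions": transactions})

def verify_chain(chain: list) -> bool:
    # Recompute every hash pointer; a single altered transaction anywhere in
    # the history produces a mismatch further down the chain.
    for i in range(1, len(chain)):
        if chain[i]["prev_hash"] != block_hash(chain[i - 1]):
            return False
    return True

chain: list = []
append_block(chain, ["tx-a", "tx-b"])
append_block(chain, ["tx-c"])
print(verify_chain(chain))            # True
chain[0]["transactions"][0] = "tx-x"  # tamper with an old record
print(verify_chain(chain))            # False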

3. Blockchain-backed analytics
The idea of blockchain-backed analytics consists of creating an immutable linkage between the three core components of an analytics project: data, model and result. The data component can be any kind of data that has either been used to train (i.e. to build) a model, or to apply a model. The model component can be any kind of data science model represented as a function, script, library, binary executable, containerized application or even as a virtual machine image, whereas the format of the result is determined by the model.

1 In order to manipulate the PoW consensus, the computing power (i.e. the hash rate) of a fraudulent participant needs to exceed 51% of the hash rate of the overall network. To manipulate a PoS consensus, a complex randomized process would have to be controlled. Hence, the probability of successfully exploiting one of these attack vectors is negatively correlated with the size of the network and tends theoretically to zero in large blockchain networks (Buterin, 2017).

3.1 Blockchain signatures of components
Each component will be registered as a secure cryptographically hashed fingerprint as a transaction property in a public blockchain database, together with a pointer to the transaction identifier containing the component it continues from. The registration process consists of two steps:
1. Creation of the fingerprint (i.e. a secure cryptographic hash) of the component
2. Signature of the transaction to a public blockchain, consisting of:
   a. The hash of the component (hx)
   b. The transaction identifier (tx) of the linked component (optional for the data component)

The fingerprint should be created with a hash function in compliance with the Advanced Encryption Standard (AES)2 and have a key length of no less than 128 bit. The transaction properties can be submitted as a hexadecimal string, or ideally in JavaScript Object Notation (JSON) format (Fig. 1).

Figure 1: Transaction properties
{
  "properties": {
    "data":   [{"name": "data",   "hash": "hx(data)"}],
    "model":  [{"name": "model",  "hash": "hx(model)",  "data":  "tx(data)"}],
    "result": [{"name": "result", "hash": "hx(result)", "model": "tx(model)"}]
  }
}
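As a concrete illustration of these two registration steps, the sketch below (Python; the hashing helper and the payload layout follow Figure 1, while the actual submission step is left out because it depends on the chosen blockchain client, and the example component contents and transaction identifier are invented) computes a component fingerprint and assembles the transaction properties.

import hashlib
import json

def fingerprint(component: bytes, algorithm: str = "sha256") -> str:
    # Step 1: create the component's fingerprint; large files would be
    # streamed and hashed in chunks instead of being read into memory.
    return hashlib.new(algorithm, component).hexdigest()

def transaction_properties(name: str, component_hash: str,
                           linked_key: str = None, linked_tx: str = None) -> dict:
    # Step 2: assemble the transaction payload following the layout of Fig. 1;
    # the optional pointer links this component to the transaction that
    # registered the component it continues from.
    entry = {"name": name, "hash": component_hash}
    if linked_key and linked_tx:
        entry[linked_key] = linked_tx
    return {"properties": {name: [entry]}}

# Hypothetical example: register a model and link it to the data component.
model_hash = fingerprint(b"...serialized model...")
payload = transaction_properties("model", model_hash,
                                 linked_key="data", linked_tx="tx(data)")
print(json.dumps(payload, indent=2))
# Submitting the payload as raw transaction data (or via a smart contract) is
# specific to the chosen blockchain client and is not shown here.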

In short, this approach stores the linked chain of analytical components with an immutable public blockchain transaction, whereby each component can always be identified by its unique hash and transaction identifier. Records of the relationship of projects, hashes and transactions have to be kept separately.

3.2 Component linkage verification
In order to track a component, the blockchain can be queried by a given transaction or wallet identifier for a specific transaction that includes either data, model or result information as transaction data. The data-model-result relationship can then be traced by the linkage of components that is reflected in the result's transaction properties. Furthermore, such a verification procedure could also simply be used to ensure the source integrity of data, models and results on an individual basis, by retrieving the component's signature and verifying the fingerprints against the fingerprints of the original or linked component. Notably, filtering queries (e.g. finding all datasets a specific model has been applied with) are in general not possible without parsing the entire blockchain.
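The tracing procedure just described can be sketched as follows (Python; the in-memory dictionary merely stands in for whatever lookup the chosen blockchain client offers, and the transaction identifiers and hashes are invented): starting from the result's transaction, the linkage is followed back to the model and then to the data via the stored pointers, and each component's fingerprint can be re-checked along the way.

import hashlib

# Stand-in for blockchain lookups: transaction identifier -> registered properties.
CHAIN = {
    "tx(data)":   {"name": "data",   "hash": "hx(data)"},
    "tx(model)":  {"name": "model",  "hash": "hx(model)",  "data":  "tx(data)"},
    "tx(result)": {"name": "result", "hash": "hx(result)", "model": "tx(model)"},
}

def verify_component(tx_id: str, component: bytes) -> bool:
    # Source-integrity check: recompute the fingerprint and compare it with
    # the fingerprint registered in the transaction.
    return hashlib.sha256(component).hexdigest() == CHAIN[tx_id]["hash"]

def trace_linkage(result_tx: str) -> list:
    # Walk result -> model -> data via the pointers stored as transaction
    # properties and return the chain of (component name, transaction id).
    chain, tx = [("result", result_tx)], result_tx
    for key in ("model", "data"):
        tx = CHAIN[tx].get(key)
        if tx is None:
            break
        chain.append((key, tx))
    return chain

print(trace_linkage("tx(result)"))
# [('result', 'tx(result)'), ('model', 'tx(model)'), ('data', 'tx(data)')]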

2 Advanced Encryption Standard (AES), Retrieved May 12th, 2018, from: https://nvlpubs.nist.gov/nistpubs/FIPS/NIST.FIPS.197.pdf (doi:10.6028/NIST.FIPS.197).


3.3 Blockchain ecosystem
With the increasing popularity of cryptocurrencies, a vast set of blockchain implementations has emerged, with Bitcoin being the first in 2009. The blockchain technology best suited for a specific analytics project depends on individual requirements such as payload size, block time, transaction fees and the public availability of the blockchain. As a ledger for information verification, almost any blockchain technology that allows querying transactions and including transaction properties is in principle applicable for blockchain-backed analytics. Overall, we can recommend the Ethereum blockchain as an ecosystem for blockchain-backed analytics. Component hashes can either be stored as raw transactional data (i.e. as a transaction property) in hexadecimal format, or alternatively be integrated into a "smart contract", a programmatic feature of the Ethereum blockchain. Furthermore, as it is one of the largest public blockchain transaction networks worldwide, a widespread distribution of Ethereum nodes is guaranteed.

3.4 Costs analysis
Using a public blockchain network always involves costs to process a transaction, i.e. a fee must be paid before a transaction can be processed and validated. With respect to the Ethereum ecosystem, the total fee for a single transaction adds up the base transaction price (currently 21000 "gas") and the costs for additional payload (currently 4 gas for a zero byte, 68 gas for a non-zero byte).3 Considering that a 256-bit hash (equal to 32 bytes) can be expressed as a hexadecimal string with 64 characters, a complete data-model-result linkage documentation requires approximately 1 kilobyte of additional payload in hexadecimal format. In sum, the payload of all three components as additional hex-encoded raw transaction data of three transactions on the Ethereum blockchain resulted in about $0.20 total fees with a fast confirmation time (less than 30 seconds) in May 2018.4

3 Ethereum Homestead Documentation: Estimating transaction costs on the Ethereum blockchain, Retrieved May 12th, 2018, from: http://ethdocs.org/en/latest/contracts-and-transactions/accounttypes-gas-and-transactions.html.


In addition to transaction costs, blockchain-backed analytics involves computational costs for hashing the components. Since parallelizing the computation of a single hash is not possible, the computational cost of hashing a component varies only with the individual CPU performance, not with the number of physical CPUs or cores. Our own performance tests with two popular secure cryptographic hashing algorithms (BLAKE2 & SHA-256) using commodity hardware have shown that even large components with a one terabyte file size can be hashed in under 30 minutes, and smaller components with sizes of up to one gigabyte within just a few seconds (Table 1).

Table 1: Computational costs (time) for hashing different file sizes (CPU: Intel(R) Xeon(R) CPU E5-2698 v4 @ 2.20GHz)

Algorithm   1 MB     10 MB    100 MB   1 GB     10 GB     100 GB   1 TB
BLAKE2      0.004s   0.027s   0.181s   1.481s   15.584s   2m 25s   ~25m
SHA-256     0.015s   0.110s   0.622s   6.204s   63.167s   10m 7s   ~90m

Source: Own performance tests (2018).
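Measurements of this kind can be reproduced with a few lines of Python using hashlib, which provides both algorithms; the file name and chunk size below are illustrative, and timings will of course vary with the CPU used.

import hashlib
import time

def timed_hash(path: str, algorithm: str, chunk_size: int = 1 << 20) -> tuple:
    # Stream the file in 1 MiB chunks so memory use stays flat regardless of
    # file size, and measure the wall-clock time of the whole pass.
    h = hashlib.new(algorithm)
    start = time.perf_counter()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest(), time.perf_counter() - start

for algorithm in ("blake2b", "sha256"):   # BLAKE2 and SHA-256, as in Table 1
    digest, seconds = timed_hash("component.bin", algorithm)  # illustrative file name
    print(f"{algorithm}: {seconds:.3f}s  {digest[:16]}...")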

4. Discussion
Whilst we believe that blockchain-backed analytics is a scalable and easy-to-use approach to ensure trusted analytics, there are several considerations to be made. For example, the additional transaction costs for registering components in a public blockchain are not insignificant in aggregate, although they are very low for single transactions. It should be noted that – especially in the field of artificial intelligence – self-learning and self-evolving machine learning or deep learning models need to be tracked at every step of the model evolution process. However, choosing the right blockchain technology (e.g. with optimal block size and transaction costs) for the specific project requirements and consolidating multiple components into a single transaction can help to optimize the costs of a blockchain-backed analytics project.

4 Own registration of a result-component on the Ethereum blockchain on May 12th, 2018: https://etherscan.io/tx/0xd2749d1bcd7983769ba4801265c65fce8e92df7476f57df01bffcb148e5f0b32.


In terms of scalability and usability for big data applications, our approach is in theory applicable to any data size, but in practice it is limited by the costs of hashing the components. As described, the computational costs of hashing a component are not considered a significant overhead for component sizes of up to a few gigabytes, but they have to be taken into account for big data. However, in many big data environments data is mostly stored in distributed file systems, such as the Hadoop Distributed File System (HDFS), where the identification of distributed chunks of data is achieved by inherent file checksum mechanisms that are already applied during the data ingestion process.5 When applying blockchain-backed analytics with data or results stored in HDFS, the component does not need to be hashed again, because the available block checksums can be re-used as distinct block hashes in order to create a Merkle tree of all relevant blocks. A similar approach also applies to containerized applications with inherent hashing mechanisms, such as Docker images, where a fingerprint of the image is automatically created during the image build process.6 Consequently, it is possible to integrate parts of, or entire, analytical ecosystems into blockchain-backed analytics in the form of a Docker image digest as a distinct model component.

Following our approach, where the data itself is not stored on the blockchain, an additional overhead process of maintaining documentation of the relationship of projects, hashes and transactions has to be taken into account. However, recent database solutions with blockchain capabilities (e.g. decentralization and immutability) on top of traditional database capabilities (e.g. querying, indexing, search) could ease the adoption of blockchain-backed analytics, due to the omission of additional hashing procedures and documentation in off-chain references.7 A similar ease of use could also apply to current developments of distributed (file) system solutions that are directly attached to a blockchain.8
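To illustrate the idea of re-using already available checksums (for instance HDFS block checksums or Docker layer digests) instead of re-hashing the raw data, the following sketch (Python; the per-block checksums are invented for illustration) combines the existing digests pairwise into a Merkle root that could serve as the component's fingerprint.

import hashlib

def merkle_root(checksums: list) -> str:
    # Combine already-available block checksums pairwise until a single root
    # hash remains; the underlying data itself never needs to be re-read.
    level = [bytes.fromhex(c) for c in checksums]
    while len(level) > 1:
        if len(level) % 2 == 1:   # duplicate the last node on odd-sized levels
            level.append(level[-1])
        level = [hashlib.sha256(level[i] + level[i + 1]).digest()
                 for i in range(0, len(level), 2)]
    return level[0].hex()

# Hypothetical per-block checksums, e.g. as reported for the blocks of a file in HDFS.
block_checksums = [hashlib.sha256(f"block-{i}".encode()).hexdigest() for i in range(5)]
print(merkle_root(block_checksums))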

5 Hadoop Checksum, Retrieved May 12th, 2018, from: https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/FileSystemShell.html#checksum.

6 Docker Engine Reference, Docker image digests, Retrieved May 12th, 2018, from: https://docs.docker.com/engine/reference/commandline/images/#list-image-digests.

7 e.g. the solution "BigChainDB", Retrieved May 12th, 2018, from: https://www.bigchaindb.com.

8 e.g. the solution "The Interplanetary File System", Retrieved May 12th, 2018, from: https://ipfs.io.


With respect to continuous improvements in the development and integration of blockchain-based technologies, we are confident that our generic proposal of tracing the data-model-result linkage of analytical projects can be easily extended to broader ecosystems, such as continuous integration systems as part of the application lifecycle management.

References
Buterin, V. (2017). Proof of Stake FAQ. Ethereum Wiki. Retrieved May 12th, 2018, from: https://github.com/ethereum/wiki/wiki/Proof-of-Stake-FAQ.
Croman, K. et al. (2016). On Scaling Decentralized Blockchains. In: Clark J., Meiklejohn S., Ryan P., Wallach D., Brenner M., Rohloff K. (eds) Financial Cryptography and Data Security. FC 2016. Lecture Notes in Computer Science, vol 9604. Springer, Berlin, Heidelberg.
Davidson, S., et al. (2016). Economics of Blockchain. Retrieved May 12th, 2018, from: https://dx.doi.org/10.2139/ssrn.2744751.
Nakamoto, S. (2008). Bitcoin: A Peer-to-Peer Electronic Cash System. Retrieved May 12th, 2018, from: http://bitcoin.org/bitcoin.pdf.
Zyskind, G., et al. (2015). Decentralizing Privacy: Using Blockchain to Protect Personal Data. 2015 IEEE Security and Privacy Workshops, San Jose, CA, pp. 180-184.


2nd International Conference on Advanced Research Methods and Analytics (CARMA2018) Universitat Politècnica de València, València, 2018 DOI: http://dx.doi.org/10.4995/CARMA2018.2018.8295

Algorithmic Trading Systems Based on Google Trends Gómez-Martínez, Raúl; Prado-Román, Camilo; De la Orden de la Cruz, María del Carmen Departamento de Economía de la Empresa, Universidad Rey Juan Carlos de Madrid

Abstract In this paper we analyze five big data algorithmic trading systems based on artificial intelligence models that use as predictors statistics from Google Trends for dozens of financial terms. The systems were trained using monthly data from 2004 to 2017 and have been tested in a prospective way from January 2017 to February 2018. The performance of these systems shows that Google Trends is a good metric for global investors' mood. The systems for the Ibex and Eurostoxx are not profitable, but the Dow Jones, S&P 500 and Nasdaq systems have been profitable using long and short positions during the period studied. This evidence opens a new field for the investigation of trading systems based on big data instead of chartism. Keywords: Big data, behavioral finance, investors' mood, artificial intelligence, Bayesian network, Google Trends.

This work is licensed under a Creative Commons License CC BY-NC-ND 4.0 Editorial Universitat Politècnica de València


1. Introduction
Algorithmic trading systems invest in financial markets in an unattended and constant way, sending buy and sell orders to the market for a financial instrument according to a complex mathematical algorithm. Most of the trading systems operating nowadays follow chartist rules, but the irruption of big data in asset management has opened a new approach for algorithmic trading.

There are numerous studies demonstrating that investor mood is affected by multiple factors, changes over time and may be conditioned by experience or training (Cohen and Kudryavtsev, 2012). These changes in mood provide evidence of anomalies in the behavior of stock markets (Nofsinger, 2005). Corredor, Ferrer and Santamaría (2013) claim that investor mood has a significant effect on stock performance. Weather affects stock market returns (Hirshleifer and Shumway, 2003; Jacobsen and Marquering, 2008), as sunny climates are associated with an optimistic mood and thus positive returns. Seasonal patterns like vacations imply the "sell in May and go away" or "Halloween" effect (Bouman and Jacobsen, 2002; Marshall, 2010), meaning that securities market yields should be greater from November to April than from May to October. Even the Moon (Yuan, Zheng and Zhu, 2006) implies different returns according to its phases, with differences of 3% to 5% in yield from one phase to another.

Sports results are another factor that modifies investor mood. Edmans, García and Norli (2007) studied the results of football, cricket, rugby and basketball, and others have focused on the NFL (Chang, Chen, Chou and Lin, 2012), football (Berument, Ceylan and Gozpinar, 2006; Kaplanski and Levy, 2010) and cricket (Mishra and Smyth, 2010). Gómez and Prado (2014) performed a statistical analysis of stock market session returns following national team football matches. The results obtained show that after a defeat of the national team we should expect negative and lower-than-average prices on the country's stock market, with the opposite occurring in the case of a victory.

At this stage, if investor mood varies and affects financial markets and their liquidity (Liu, 2015), the challenge that arises is how to measure mood in order to predict market trend (Hilton, 2001), which leads us to consider a big data approach: Wu et al. (2013) use big data to predict market volatility, Moat et al. (2013) use the frequency of use of Wikipedia to determine investor feelings, whereas Gómez (2013) elaborated a "Risk Aversion Index" based on Google Trends statistics for certain economic and financial terms that relate to market growth. Through an econometric model, he shows that Google Trends provides relevant information on the growth of financial markets and may generate investment signals that can be used to predict the growth of major European stock markets.


According to this approach, we could create an algorithmic trading system that issues buy and sell orders by measuring the level of risk aversion, since an increase in tolerance towards risk implies a bull market and an increase in risk aversion a bear market. In this paper we describe big data algorithmic trading systems that, instead of chartist rules, use artificial intelligence (AI) models based on Google Trends to predict the evolution of the main world stock indexes.

2. Methodology and Hypothesis
The following statistics are mainly used to measure the performance of an algorithmic trading system (Leshik and Crall, 2011):
- Profit/Loss: the total amount generated by the system from its transactions over a certain period.
- Success rate: percentage of successful transactions out of the total transactions; if the percentage is above 50%, the system is profitable, and the higher the percentage, the better the system.
- Profit Factor: this rate shows the relationship between earnings and losses, dividing total earnings by total losses. A rate higher than 1 implies positive returns and the higher the rate, the better.
- Sharpe Ratio: relates profitability to volatility; the higher the ratio, the better the performance of the system (Sharpe, 1994).
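As an illustration of how these statistics can be computed from a record of closed trades (a sketch assuming per-trade profit/loss figures in euros and, for the Sharpe ratio, per-period returns with a zero risk-free rate; the numbers are invented):

from statistics import mean, stdev

def performance_stats(trade_pnl: list, period_returns: list) -> dict:
    # Profit/Loss: sum of per-trade results over the evaluation period.
    profit_loss = sum(trade_pnl)
    # Success rate: share of winning trades among all trades.
    wins = [p for p in trade_pnl if p > 0]
    success_rate = len(wins) / len(trade_pnl)
    # Profit factor: total earnings divided by total (absolute) losses.
    losses = sum(-p for p in trade_pnl if p < 0)
    profit_factor = sum(wins) / losses if losses else float("inf")
    # Sharpe ratio: mean period return over its standard deviation
    # (zero risk-free rate assumed for simplicity).
    sharpe = mean(period_returns) / stdev(period_returns)
    return {"profit_loss": profit_loss, "success_rate": success_rate,
            "profit_factor": profit_factor, "sharpe": sharpe}

# Illustrative numbers, not the systems' actual trades.
print(performance_stats(trade_pnl=[120.0, -80.0, 60.0, -40.0, 150.0],
                        period_returns=[0.01, -0.02, 0.03, 0.015, -0.005]))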

InvestMood1 developed in January 2017 algorithmic trading systems for the following indexes: Ibex 35, Eurostoxx 50, Dow Jones, S&P 500 and Nasdaq. According to Gómez (2013), the volume of searches registered in Google for financial terms has explanatory and predictive capacity regarding the evolution of the markets. Since 2004, when Google Trends began to publish these statistics, it has been observed that bearish markets imply high levels of searches for terms such as crash, recession or short selling, while bull markets imply low levels of these searches. Bearing this in mind, InvestMood has created big data algorithmic trading systems that open long or short positions following an artificial intelligence model in which the predictors are Google Trends statistics, while the target variable is the next movement of those indexes (up/down). The process of the algorithm is the following: on the first day of every month these systems train the artificial intelligence models using a monthly sample of Google Trends data for dozens of economic-financial terms, and issue a prediction for next month's trend.

1 For more information visit: http://www.investmood.com/


The Google Trends queries have been limited to financial matters and have no restriction by location. The system maintains an open long or short position until there is a new prediction in the opposite direction. From this point, the hypothesis to study is the following:

H1: A big data algorithmic trading system based on artificial intelligence models of investors' mood can generate positive returns.

We will validate this hypothesis if we find three pieces of evidence:
1. The Profit/Loss amount is positive, including license costs and trading commissions.
2. The success rate is higher than 50%.
3. The profit factor is higher than 1.

3. Data
Google Trends2 has historical data available from 2004 on a monthly basis. As the first models were trained on January 1st, 2017, they were trained using 156 observations. The prospective analysis of this paper starts in January 2017 and ends in February 2018, so we have 14 months for the study and therefore 14 different models, one for each month. All the quotes and statistics used in this study have been provided by Trading Motion. Trading Motion3 is a fintech firm that allows users of 23 brokers all over the world to operate in an unattended way using algorithmic trading systems developed by 74 professional developers. All these developers follow chartist rules except InvestMood, which, together with Rey Juan Carlos University, has developed its systems using AI on investors' mood. The trading systems studied in this paper have been running on Trading Motion since January 2017. After three months of testing, the systems were made available to clients in April 2017.
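The monthly walk-forward procedure described above (train on all monthly observations available up to month t, predict the index direction for month t+1, and roll forward over the 14 months of the prospective period) can be sketched as follows. The actual systems use Bayesian networks built with dVelox; this sketch substitutes a simple scikit-learn Bayesian classifier and randomly generated stand-in data purely to show the loop, so the search terms, data and model choice are illustrative assumptions.

import numpy as np
import pandas as pd
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
months = pd.date_range("2004-01-01", "2018-02-01", freq="MS")

# Stand-in for monthly Google Trends levels of financial terms and the
# next-month index direction (1 = up, 0 = down); real data would come from
# Google Trends and the index quotes.
trends = pd.DataFrame(rng.random((len(months), 3)),
                      index=months, columns=["crash", "recession", "short selling"])
direction = pd.Series(rng.integers(0, 2, len(months)), index=months)

predictions = {}
for t in pd.date_range("2017-01-01", "2018-02-01", freq="MS"):
    past = trends.index < t          # 156 observations for the first model (Jan 2004 - Dec 2016)
    model = GaussianNB().fit(trends[past], direction[past])
    predictions[t] = model.predict(trends.loc[[t]])[0]   # go long (1) or short (0) for month t

hits = np.mean([predictions[t] == direction[t] for t in predictions])
print(f"14 monthly predictions, success rate: {hits:.0%}")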

4. Results
The URLs with the performance statistics of the systems studied in this paper are:
Ibex 35: https://www.tradingmotion.com/explore/System/PerformanceSheet?Id=17652
Esx 50: https://www.tradingmotion.com/explore/System/PerformanceSheet?Id=17705

2 Visit: https://trends.google.com/trends/
3 For more information visit: https://www.tradingmotion.com/


DJ: https://www.tradingmotion.com/explore/System/PerformanceSheet?Id=17651
S&P 500: https://www.tradingmotion.com/explore/System/PerformanceSheet?Id=17654
Nasdaq: https://www.tradingmotion.com/explore/System/PerformanceSheet?Id=17653

The models created for the trading systems have been trained using the algorithms of dVelox, a data mining tool developed by the IT firm Apara4. These algorithms build a Bayesian network (Bayes, 1763) like the following one, used for the Nasdaq model trained on February 1st, 2018:

Figure 1. Bayesian Network for Nasdaq trading system. Source: dVelox (2018).

Table 1 sums up the performance of each of the five trading systems that have been running from January 2017 to February 2018. In this table we observe that the systems for the Ibex and Eurostoxx are not profitable, so we cannot validate the H1 hypothesis for these indexes. Notwithstanding, the systems created for the Dow Jones, S&P 500 and Nasdaq have been profitable, they have a success rate higher than 50% and a profit factor higher than 1, so we can validate H1 for these American indexes.

4 For more information visit: http://www.apara.es/es/


Table 1. Performance of Big Data trading algorithmic systems on Investors' Mood

Index          Profit/Loss    Success rate   Profit Factor   Sharpe Ratio
Ibex 35          -824,00 €        47,80%          0,99          -0,64
Eurostoxx 50     -560,00 €        50,50%          1,00          -0,12
Dow Jones      16.919,00 €        58,20%          1,36           1,52
S&P 500        20.803,00 €        60,70%          1,44           1,86
Nasdaq         34.628,00 €        62,40%          1,54           2,43

Source: Trading Motion (2018)

5. Conclusions
In this study, we used an innovative approach to check the capability of behavioral finance and investors' mood to predict the evolution of financial markets. The study is based on big data and uses artificial intelligence to predict the evolution of the Ibex 35, Eurostoxx 50, Dow Jones, S&P 500 and Nasdaq indexes. We find that these "pure investors' sentiment" systems can be profitable for the American indexes, while the results are poor for the European ones.

The first conclusion is that Google Trends is a good investor sentiment metric for the American indexes studied, closer to a global sentiment measure given that the Google Trends queries were not limited by location. The poor results for the Ibex and Eurostoxx suggest a limitation of the Google Trends statistics for these models, to be addressed in further research.

The second conclusion of this study is that trading systems can be developed using an alternative approach to common systems based on technical analysis. This study has shown that the trading systems for the Dow Jones, S&P 500 and Nasdaq, based on the predictions of an artificial intelligence model that uses investors' mood from Google Trends, are capable of generating positive returns in a long/short strategy. All this opens an interesting field of research in the development of algorithmic trading.

References
Bayes, T. (1763). An Essay towards solving a Problem in the Doctrine of Chances. Philosophical Transactions of the Royal Society of London, 53, 370-418. doi:10.1098/rstl.1763.0053


Berument, H., Ceylan, N. B., and Gozpinar, E. (2006). Performance of soccer on the stock market: Evidence from Turkey. The Social Science Journal, 43(4), 695-699. doi:10.1016/j.soscij.2006.08.021
Bouman, S., and Jacobsen, B. (2002). The Halloween indicator, "sell in May and go away": Another puzzle. The American Economic Review, 92(5), 1618-1635.
Chang, S., Chen, S., Chou, R. K., and Lin, Y. (2012). Local sports sentiment and returns of locally headquartered stocks: A firm-level analysis. Journal of Empirical Finance, 19(3), 309-318. doi:10.1016/j.jempfin.2011.12.005
Cohen, G., and Kudryavtsev, A. (2012). Investor Rationality and Financial Decisions. Journal of Behavioral Finance, 13(1), 11-16.
Corredor, P., Ferrer, E., and Santamaría, R. (2013). El sentimiento del inversor y las rentabilidades de las acciones. El caso español. Spanish Journal of Finance and Accounting, 42(158), 211-237.
Edmans, A., García, D., and Norli, Ø. (2007). Sports sentiment and stock returns. The Journal of Finance, 62(4), 1967-1998.
Gómez Martínez, R. (2013). Señales de inversión basadas en un índice de aversión al riesgo. Investigaciones Europeas de Dirección y Economía de la Empresa, 19(3), 147-157. doi:10.1016/j.iedee.2012.12.001
Gómez Martínez, R., and Prado Román, C. (2014). Sentimiento del inversor, selecciones nacionales de fútbol y su influencia sobre sus índices nacionales. Revista Europea de Dirección y Economía de la Empresa, 23(3), 99-114. doi:10.1016/j.redee.2014.02.001
Hilton, D. J. (2001). The Psychology of Financial Decision-Making: Applications to Trading, Dealing, and Investment Analysis. Journal of Psychology and Financial Markets, 2(1), 37-53.
Hirshleifer, D., and Shumway, T. (2003). Good day sunshine: Stock returns and the weather. The Journal of Finance, 58(3), 1009-1032. Retrieved from http://www.jstor.org/stable/3094570
Jacobsen, B., and Marquering, W. (2008). Is it the weather? Journal of Banking and Finance, 32(4), 526-540. doi:10.1016/j.jbankfin.2007.08.004
Kaplanski, G., and Levy, H. (2010). Exploitable predictable irrationality: The FIFA World Cup effect on the U.S. stock market. The Journal of Financial and Quantitative Analysis, 45(2), 535-553. Retrieved from http://www.jstor.org/stable/27801494


Leshik, E., and Crall, J. (2011). An Introduction to Algorithmic Trading: Basic to Advanced Strategies. Wiley.
Liu, S. (2015). Investor Sentiment and Stock Market Liquidity. Journal of Behavioral Finance, 16(1), 51-67.
Marshall, P. S. (2010). Sell in May and go away? Probably still good investment advice! In Kaynak, E. H. (Ed.). Hummelstown, PA: International Management Development Association (IMDA).
Mishra, V., and Smyth, R. (2010). An examination of the impact of India's performance in one-day cricket internationals on the Indian stock market. Pacific-Basin Finance Journal, 18(3), 319-334. doi:10.1016/j.pacfin.2010.02.005
Moat, H., Curme, C., Avakian, A., Kenett, D. Y., Stanley, H. E., and Preis, T. (2013). Quantifying Wikipedia usage patterns before stock market moves. Scientific Reports. Retrieved from http://www.nature.com/srep/2013/130508/srep01801/pdf/srep01801.pdf
Narayan, S., and Narayan, P. K. (2017). Are Oil Price News Headlines Statistically and Economically Significant for Investors? Journal of Behavioral Finance, 18(3), 258-270.
Nofsinger, J. R. (2005). Social Mood and Financial Economics. Journal of Behavioral Finance, 6(3), 144-160.
Sharpe, W. F. (1994). The Sharpe Ratio. Journal of Portfolio Management, 21(1), 49-58.
Yuan, K., Zheng, L., and Zhu, Q. (2006). Are investors moonstruck? Lunar phases and stock returns. Journal of Empirical Finance, 13(1), 1-23. doi:10.1016/j.jempfin.2005.06.001
Wu, K., Bethel, W., Gu, M., Leinweber, D., and Ruebel, O. (2013). A Big Data Approach to Analyzing Market Volatility. Algorithmic Finance, 2(3-4), 241-267. Available at SSRN: https://ssrn.com/abstract=2274991 or http://dx.doi.org/10.2139/ssrn.2274991


2nd International Conference on Advanced Research Methods and Analytics (CARMA2018) Universitat Politècnica de València, València, 2018 DOI: http://dx.doi.org/10.4995/CARMA2018.2018.8301

How to sort out uncategorisable documents for interpretive social science? On limits of currently employed text mining techniques Philipps, Axel Institute of Sociology & Leibniz Center for Science and Society (LCSS), Leibniz University Hanover, Germany.

Abstract Current text mining applications statistically work on the basis of linguistic models and theories and certain parameter settings. This enables researchers to classify, group and rank a large textual corpus – a useful feature for scholars who study all forms of written text. However, these underlying conditions differ from the way in which interpretively-oriented social scientists approach textual data. They aim to understand the meaning of text by heuristically using known categorisations, concepts and other formal methods. More importantly, they are primarily interested in documents that are incomprehensible with our current knowledge, because these documents offer a chance to formulate new empirically-grounded typifications, hypotheses, and theories. In this paper, therefore, I propose a text mining technique with different aims and procedures. It includes a shift away from methods of grouping and clustering the whole text corpus to a process that sorts out uncategorisable documents. Such an approach will be demonstrated using a simple example. While more elaborate text mining techniques might become tools for more complex tasks, the given example just presents the essence of a possible working principle. As such, it supports social inquiries that search for and examine unfamiliar patterns and regularities. Keywords: text mining; interpretive social science; qualitative research; standardised and non-standardised methods; social science.

This work is licensed under a Creative Commons License CC BY-NC-ND 4.0 Editorial Universitat Politècnica de València


1. Introduction
Before starting to answer the title of the paper, the exact nature of text mining needs to be identified. Text mining is a combination of statistical and linguistic approaches to text analysis that has lately gained attention in the field of digital humanities. An important forerunner was the Italian literary scholar Franco Moretti (2007) with his concept of "distant reading". He proposed that scholars who are used to employing in-depth interpretations (close reading) are unable to read and study the ever-increasing amount of data that is produced worldwide. Because of this, he recommends a different approach. In contrast to printed books, Moretti accesses digitally-accessible texts and identifies patterns in large corpora. This kind of distant reading includes a growing number of visualisations such as maps, graphs, and trees (Jänicke et al., 2015). Such visualisations usually show relations between such things as actors, names and places; text mining tools, in contrast, concentrate on linguistically small units: words and phrases. Text mining can be defined as a set of "computer-based methods for a semantic analysis of text that help to automatically, or semi-automatically, structure text, particular very large amounts of text" (Heyer, 2009: 2). So, such applications practically count, relate, rank, cluster, and classify single and groups of words in large text corpora and present the outcomes in frequency graphs, word clusters, and networks.

In recent years there has also been a growing interest in text mining for social science research. Various works (e.g. DiMaggio, Nag & Blei, 2013; Marres, 2017; Philipps, Zerr & Herder, 2017) present mostly exploratory studies using algorithmic information extraction approaches to demonstrate the power of such tools for text analysis in the social sciences. Proponents of these computer-based methods primarily address qualitatively-oriented social scientists for two reasons (e.g. Evans & Aceves, 2016; Wiedemann, 2013). Firstly, such tools help researchers, who mainly work with textual data, to deal with the increasing number of digitally-accessible texts. Secondly, it is argued that, in a similar way to the grounded theory approach (Glaser & Strauss, 1967), text mining is employed to identify patterns. However, these propositions are slightly misleading. This is a rather unbalanced representation of qualitative and interpretive social research and might explain, to some extent, why (semi-)automatic analysis of textual data has, up to now, been widely ignored in the interpretive social sciences (for more details see Philipps, 2018).

This paper therefore primarily takes a closer look at how text mining analyses textual data and in what respect that analysis differs from methods commonly employed by interpretively-oriented social scientists. In this respect, I suggest a different aim and operating procedure for text mining which is more appropriate for interpretive social science. It includes a shift from standardised procedures of classification and clustering of large text corpora to detecting documents that do not fit the applied constructed concepts. To demonstrate this approach, I present an exemplary working principle of low complexity.


Later, adapted text mining techniques might become tools for more complex tasks. These would support an interpretive social science that examines unfamiliar patterns and the regularities of socially-produced meanings.

2. Analysing textual data with text mining and in interpretive social science
Text mining techniques comprise a wide range of methods, from frequency and co-occurrence analysis to sentiment analysis and on to more complex approaches such as topic models and machine learning (Marres, 2017; Wiedemann, 2016, 2013). While frequency and co-occurrence analysis counts and identifies the use of words and the relationships between groups of words in large text corpora, topic models and machine learning transform words into numbers and compute statistical inferences in textual data. These methods can indeed be successfully employed to detect thematic shifts or networks of knowledge structures on a trans-textual level in social research studies (e.g. Adam & Roscigno, 2005; Blei & Lafferty, 2006). However, applying text mining requires the setting of some parameters before research is started. For frequency and co-occurrence analyses, for example, researchers need to determine relevant words or groups of words in advance. For a sentiment analysis they have to define classes, ranging from extremely negative to extremely positive. In addition, most machine-learning algorithms demand supervised training (intermediate results are controlled and evaluated by analysts during processing), and even for unsupervised topic models (without interference of external data or human control) researchers have to determine the exact number of clusters to be computed. Hence, current text mining methods have certain characteristics in common; before analysis, researchers define, even to the smallest degree, what is relevant and can potentially be found in textual data. Based on these (standardised) parameter settings, whole text corpora are classified, ranked, or grouped.

However, standardised approaches are, for many interpretively-oriented social scientists, the opposite of how they were trained. For the most part, they learned and share the basic premise of interpretive social science of working with non-standardised methods. This means that a researcher should approach their object of investigation with an open mind and be prepared for surprises. Hence, these researchers seek to situationally understand meanings produced in interactional settings – being ready to overcome previous classifications and schemes. They aim to generate assumptions based on identified content-related, functional and formal aspects of the examined empirical material (for more details see Soeffner, 1999). Nonetheless, while these interpretively-oriented social researchers avoid standardised settings, they employ heuristic models to interpret textual data. They work with commonly-known (scientific) classifications and typifications in order to see how useful this knowledge is for understanding the meaning of given textual data and, at the same time, they search for unfamiliar regularities and patterns.

21

Sorting out uncategorisable documents

the same time, they search for unfamiliar regularities and patterns. Thus, these researchers translate and describe the world of the observed “into one that we find comprehensible” (Abbott, 2004: 31) and only if they discover so far incomprehensible phenomena do they seek to grasp the underlying working principle and meaning in the form of new but empirically-grounded typifications, hypotheses, and theories. Against this background, I presume that currently-operating text mining applications for classification and information extraction are often insufficient to be “complementing techniques” (Wiedemann, 2013: no page) for most social scientists with special training in interpretive methods. Under certain circumstances text mining might enable qualitativelyoriented researchers to learn about the variety and development of relevant categories. It is also reasonable to assume that machine learning algorithms which demonstrate knowledge about statistical characteristics of language and text-external knowledge manually coded by analysts (e.g. categories or example sets) will help to retrieve or annotate information in unknown material. However, in all these cases text mining is used to classify and group the entire textual data based on determined parameter settings. We therefore need to think of additional text mining strategies more adjusted to interpretative social science and its basic premise.
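To make the point about predefined parameters concrete, the sketch below shows how even an "unsupervised" topic model forces the analyst to fix the number of topics (and the vocabulary filtering) before seeing any result. It is a minimal illustration using scikit-learn, not part of the study described here; the toy corpus and parameter values are invented.

```python
# Minimal sketch (not from the paper): even unsupervised topic models need
# parameters fixed in advance, e.g. the number of topics.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

documents = [            # toy corpus, purely illustrative
    "funding proposal research innovation",
    "interview transcript meaning interpretation",
    "patent technology advantage drawback",
]

# The analyst must decide vocabulary filtering and the topic count up front.
vectorizer = CountVectorizer(stop_words="english", min_df=1)
dtm = vectorizer.fit_transform(documents)

lda = LatentDirichletAllocation(n_components=2, random_state=0)  # 2 topics, chosen a priori
doc_topics = lda.fit_transform(dtm)

terms = vectorizer.get_feature_names_out()
for term_weights in lda.components_:
    top_terms = [terms[i] for i in term_weights.argsort()[::-1][:3]]
    print("topic:", top_terms)
```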

3. Adjusting text mining for interpretive social science
Text mining applications might become more relevant for interpretive social science, I suppose, if they enable researchers to divide a large corpus of documents into those with and without comprehensible patterns and components. Such information would play to the strength of interpretive social inquiry: interpretively exploring hidden patterns and unveiling unfamiliar meaning. The working principle of such a search strategy is perhaps best described with Max Weber's (1949) limiting concept of ideal types: "It is a conceptual construct (Gedankenbild) which is neither historical reality nor even the 'true' reality. It is even less fitted to serve as a schema under which a real situation or action is to be subsumed as one instance. It has the significance of a purely ideal limiting concept with which the real situation or action is compared and surveyed for the explication of certain of its significant components" (Weber, 1949: 93, italics in the original work). Thus, ideal types are not the final outcome of empirical investigations but are used as a heuristic limiting concept to identify the significant aspects of real situations or actions. Practically, if an ideal type has not fully grasped all aspects of the social phenomenon, the researcher will pay full attention to this and mark it for further interpretation. In his book Economy and Society (2013), Weber, for example, applied ideal types in a "procedure of the 'imaginary experiment'" (10), comparing a purely rationally constructed course of action with the concrete course of events: "By comparison with this it is possible to understand the ways in which actual action is influenced by irrational factors of all sorts, such as affects and errors, in that they account for the deviation from the line of conduct which would be expected on the hypothesis that the action were purely rational" (Weber, 2013: 6). Thus, he intellectually constructs an ideal type of pure rationality to grasp favouring or hindering circumstances which are devoid of subjective meaning "if they cannot be related to action in the role of means or ends" (Weber, 2013: 7). Generally speaking, with ideal types as limiting concepts he describes a common strategy among interpretive social scientists for approaching their object of investigation: one employs conceptual constructs to understand social phenomena while paying attention to unfamiliar regularities and patterns (in Weber's terms: deviations). The latter phenomena are of special interest because their interpretation offers a chance to broaden or even to rewrite established scientific knowledge. However, one has to note that Weber was interested in understanding and explaining social action motivationally. The construction of ideal types is thus not restricted to a rational course of action.
Applying this search strategy to text mining, a modified variant might become central for interpretive social research working with large digitally-accessible text corpora. In contrast to currently operating mining techniques, which classify and group an entire text corpus, an adaptation would use constructed concepts to identify documents which show characteristics assumed in the formulated concept and those that do not fit. Therefore, in contrast to present computer-based applications working with linguistic models and theories, an adjusted text mining technique would operate with preliminary ideas and assumptions formulated by interpretively-oriented researchers. In particular, for a large corpus of documents the latter will come up with a constructed concept after analysing some selected documents and heuristically employ this to sort out documents that display conceptually anticipated features and relations. In the next step, researchers examine and interpret the specificity of the remaining documents. In this process they might adjust existing concepts or formulate others. In addition, from the perspective of the humanities one could also say that such a modified text mining technique mimics the hermeneutic circle (see Gadamer, 2004). Suggestions formulated in a first round of interpreting textual data are used to identify what is comprehensible and what is not. Incomprehensible textual data will be analysed in further interpretive rounds, producing altered or additional suggestions which become the basis for more interpretive sequences. The process will come to an end with working interpretations (constructed concepts) to understand the textual data of interest. Nonetheless, like the hermeneutic circle, the process can never be definitively finished, as other researchers might find more appropriate readings for understanding certain textual data in the future.


4. An example for sorting out uncategorisable documents Often interpretively-oriented social scientists work with and interpret a small number of documents. However, sometimes they are confronted with a large corpus of textual data such as an archive of interview transcripts, protocols, letters and other forms of written documents. There are various ways of dealing with such conditions. With Merkens (2004), one might select some documents according to specific characteristics (i.e. relevant for the research goal) and concentrate on these cases or apply the theoretical sampling strategy starting with a few documents and selecting further documents for interpretation based on minimal and maximal contrasts. Theoretical sampling comes to an end if additional analyses of documents reveal no further information. However, there always remain documents that are not interpreted and may contain unexplored patterns and meanings. Under such circumstances an adapted version of text mining technique would offer an opportunity to search these documents for deeper analyses. In the following paragraphs, I present an instance of low complexity to give an idea of how such an variant of text mining can support interpretively-oriented research projects. It does not involve a reprogrammed text mining application but rather it demonstrates a possible working principle. The case in point is an investigation of applied approaches to promote unconventional ideas in 93 grant proposals sent to a major research-funding organisation in Germany in 2013 (for more detail on method and findings see Philipps, forthcoming). The study started by skimming through the textual data and selecting proposals for deeper analysis. Without any predefined assumptions about specific approaches to unconventional ideas, I began to read a number of grant proposals to get an idea of these. Based on a preliminary impression of the material, I then employed closer readings in a contrastive manner. Using maximal and minimal contrast cases, I searched for specific structural and rhetorical patterns in the rationales of the grant proposals. My interpretation of research proposals continued until typical approaches to unconventional ideas could be identified and separated. After scrutinising 20 proposals and skimming through further applications I came up with a typology of distinct approaches. In an additional and laborious step the typology of identified argumentative patterns was separated into segments and described in a codebook. After a group of interpreters applied segment descriptions to a randomly selected sample of proposals and discussed disagreements and questions, amended codes were used by the author to annotate all 93 research grant proposals. Finally, the manual coding process enabled us to categorise all documents and search for cases with different argumentative patterns or other aspects. Especially for studies with a greater corpus, automatic text mining would be another option searching for empirically-identified patterns before establishing a codebook and manually annotating the remaining documents. However, such a search strategy requires a limiting concept to sort out documents that show conceptually-suggested patterns and those that do
not. In my study, such a concept might, for example, be typical wordings that appear with the identified approaches. Applicants who promoted ideas for solving practical problems typically discussed "drawbacks" or "disadvantages" of earlier solutions and the "benefit" or "advantage" their solution offers in contrast. Concentrating on these wordings is, of course, one-sided and does not fully capture all possible variants and other typical aspects. However, by producing two groups of documents (with and without these wordings), one can reduce the number of proposals demanding deeper analysis. In the case of this research project, a simple retrieval of these terms shows that 48 grant proposals used at least one of the terms, if not all of them. Combining this result with the already examined proposals (n = 20), 33 uncategorisable documents remain. Hence, this procedure already condenses the number of non-examined documents from 73 down to 33. Apart from applying additional limiting concepts to further reduce the number of these documents, it should be clear that such a search strategy assists interpretively-oriented social scientists in singling out documents for further examination.
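A working principle of this kind is easy to prototype. The sketch below is a minimal illustration, not the tooling used in the study: it splits a corpus into proposals that match a keyword-based limiting concept (here the example wordings discussed above) and those that remain uncategorisable and therefore deserve closer interpretive attention. The file path and the exact keyword list are assumptions for the example.

```python
# Minimal sketch (assumptions: plain-text proposals in ./proposals, keyword
# list taken from the example wordings discussed above).
import re
from pathlib import Path

LIMITING_CONCEPT = {"advantage", "advantages", "benefit", "benefits",
                    "drawback", "drawbacks", "disadvantage", "disadvantages"}

def matches_concept(text: str) -> bool:
    """True if the document contains at least one of the concept wordings."""
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    return bool(tokens & LIMITING_CONCEPT)

categorised, uncategorisable = [], []
for path in Path("proposals").glob("*.txt"):
    text = path.read_text(encoding="utf-8")
    (categorised if matches_concept(text) else uncategorisable).append(path.name)

print(f"{len(categorised)} proposals show the anticipated wordings")
print(f"{len(uncategorisable)} remain for deeper interpretive analysis")
```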

5. Conclusion
In this paper, I discussed how the standardising procedures of current text mining techniques differ from the methodological premises commonly employed by interpretively-oriented social scientists. Without question, text mining features such as ranking, grouping or classifying textual data are useful for many research questions in the social sciences. However, I presume that an adjusted mining technique will greatly support interpretive social science if it shifts from standardised procedures of classification and clustering of large text corpora to detecting documents that do not fit applied constructed concepts. It is also important to note that such a mining technique would not be based on linguistic theories and information management concepts but on suggestions offered by interpretively-oriented social scientists. As demonstrated at a low level of complexity, such an approach can help interpretive social inquiries to single out documents and examine them for unfamiliar patterns and regularities of socially-produced meanings. Nonetheless, as the complexity of the topic shows, there is still a long way to go in translating the methodological premises of the interpretive social sciences into working additional text mining techniques.

References Abbott, A. (2004). Methods for Discovery. Heuristics for the Social Sciences. New York: W. W. Norton & Company, Inc. Adams, J., & Roscigno, V. J. (2005). White Supremacists, Oppositional Culture and the World Wide Web. Social Forces, 84(2), 759–778.
Blei, D. M., & Lafferty, J. D. (2006). Dynamic Topic Models. Proceedings of the 23rd International Conference on Machine Learning, ACM, 113–120. DiMaggio, P., Nag, M., & Blei, D. (2013). Exploiting Affinities Between Topic Modeling and the Sociological Perspective on Culture: Application to Newspaper Coverage of U.S. Government Art Fundings. Poetics, 41(6), 570–606. Evans, J. A., & Aceves, P. (2016). Machine Translation: Mining Text for Social Theory. Annual Review of Sociology, 42, 21–50. Gadamer, H.-G. (2004). Truth and Method. 2nd rev. ed. Trans. J. Weinsheimer & D. G. Marshall. New York: Crossroad. Glaser, B. G., & Strauss, A. L. (1967). The Discovery of Grounded Theory: Strategies for Qualitative Research. New Brunswick & London: AldineTransaction. Heyer, G. (2009). Introduction to TMS 2009. In G. Heyer (Ed.), Text Mining Services. Building and Applying Text Mining Based Service Infrastructures in Research and Industry. Proceedings of the Conference on Text Mining Services 2009 at Leipzig University (pp.1–14). Leipzig: LIV. Jänicke, S., Franzini, G., Cheema, M. F., & Scheuermann, G. (2015). On Close and Distant Reading in Digital Humanities: A Survey and Future Challenges. In R. Borgo, F. Ganovelli & I. Viola (Eds.), Eurographics Conference on Visualization (EuroVis)STARs. The Eurographics Association. Marres, N. (2017). Digital Sociology: The Reinvention of Social Research. Hoboken: John Wiley & Sons. Merkens, H. (2004). Selection Procedures, Sampling, Case Construction. In U. Flick, E. von Kardoff, & I. Steinke (Eds.), A Companion to Qualitative Research (pp. 165–171). London: Sage. Moretti, F. (2007). Graphs, Maps, Trees. Abstract Models for Literary History. London: Verso. Philipps, A. (forthcoming). Wissenschaftliche Orientierungen. Empirische Rekonstruktionen an einer Ressortforschungseinrichtung. München & Weinheim: Juventa. Philipps, A. (2018). Text Mining-Verfahren als Herausforderung für die rekonstruktive Sozialforschung. Sozialer Sinn. Zeitschrift für hermeneutische Forschung, 19(1): 191– 210. Philipps, A., Zerr, S., & Herder, E. (2017). The Representation of Street Art on Flickr. Studying Reception with Visual Content Analysis. Visual Studies, 32(4), 382–393. Soeffner, H.-G. (1999). Verstehende Soziologie und sozialwissenschaftliche Hermeneutik. In R. Hitzler, J. Reichertz, & N. Schröer (Eds.), Hermeneutische Wissenssoziologie (pp. 39-49). Konstanz: UVK. Weber, M. (2013). Economy and Society: An Outline of Interpretive Sociology. (Translation of Wirtschaft und Gesellschaft, 4th ed., 1956). Berkeley, Los Angeles, & London: University of California Press.
Weber, M. (1949). Objectivity in Social Science and Social Policy. (Translation of Die ‘Objektivität’ sozialwissenschaftlicher Erkenntnis, 1904) In E. Shils, & H. Finch (Eds.), The Methodology of the Social Sciences (pp. 49–112). Glencoe: The Free Press. Wiedemann, G. (2016). Text Mining for Qualitative Data Analysis in the Social Sciences: A Study on Democratic Discourse in Germany. Wiesbaden: Springer. Wiedemann, G. (2013). Opening up to Big Data: Computer-Assisted Analysis of Textual Data in Social Sciences [54 paragraphs], in: Forum Qualitative Sozialforschung/Forum Qualitative Research, 14, Art. 13, http://nbn-resolving.de/urn:nbn:de:0114-fqs1302231.


2nd International Conference on Advanced Research Methods and Analytics (CARMA2018) Universitat Polit`ecnica de Val`encia, Val`encia, 2018 DOI: http://dx.doi.org/10.4995/CARMA2018.2018.8302

A proposal to deal with sampling bias in social network big data Iacus, Stefano Maria a; Porro, Giuseppe b; Salini, Silvia a and Siletti, Elena a a Department of Economics, Management and Quantitative Methods, Università degli Studi di Milano, Italy, b Department of Law, Economics and Culture, Università degli Studi dell'Insubria, Italy

Abstract Selection bias is the bias introduced by the non-random selection of data; it leads one to question whether the sample obtained is representative of the target population. There are generally different types of selection bias, but when one manages web surveys or data from social networks such as Twitter or Facebook, one mostly needs to deal with sampling and self-selection bias. In this work we propose to use official statistics to anchor and remove the sampling bias and the unreliability of the estimations due to the use of social network big data, following a weighting method combined with a small area estimation (SAE) approach. Keywords: Big data; Well-being; Social indicators; Sentiment analysis; Self-selection bias; Small area estimation.

This work is licensed under a Creative Commons License CC BY-NC-ND 4.0 Editorial Universitat Polit`ecnica de Val`encia


1. Introduction
Although social media users could be thought of as the world's largest focus group, and be analysed as such (Hofacker et al., 2016), when one deals with such data it seems that one cannot take the selection bias into account. Indeed, dealing with Twitter data (as in Iacus et al. (2017)), it is obvious that the sample is made up of people who have Internet access, who have decided to open an account on Twitter and who are active users. In the statistical literature, studies largely address this bias using the propensity score (PS) approach (Rosenbaum & Rubin, 1983) or the Heckman approach (Heckman, 1979). Both methods attempt to match the self-selected intervention group with a control group that has the same propensity to select the intervention, but they both rely on information that is not available when dealing with data such as Twitter data. In the web-survey context, these issues have been addressed by some strategies based on weighting procedures and model-based approaches (Bethlehem & Biffignandi, 2012); nevertheless, these proposals also rely on the availability of unit-level information from big data sources, which is nowadays still a mirage and which, when dealing with aggregated big data, is impossible to achieve.

2. Our proposal
We propose to manage the sampling bias due to the use of aggregated data from social networks by combining a weighting method with a small area estimation model. Our proposal starts from this consideration: SAE models have traditionally been used to check and remove unreliability from direct estimations. Since direct estimations from Twitter data can suffer from the selection bias introduced above, we first apply a weighting method and then check and remove the remaining unreliability using SAE models. In the big data context, SAE models have recently been used by employing this new kind of data as covariates when official statistics are missing or poor. Porter et al. (2014) use Google Trends searches as covariates in a spatial FH model, while in Falorsi et al. (2017) the query-share time series, again extracted from Google Trends, is used as a covariate to improve the SAE model estimates of regional youth unemployment in Italy. Marchetti et al. (2015) use big data on mobility as covariates in an FH model to estimate poverty indicators; to account for the presence of measurement error they follow the Ybarra & Lohr (2008) approach. Moreover, the same authors have proposed the use of official data to verify and remove the self-selection bias due to the use of big data, but beyond the suggestion no concrete proposal has been made. Finally, Marchetti et al. (2016) use data coming from Twitter as a covariate to estimate Italian households' share of food consumption expenditure. In order to proceed in our direction we have to take some issues into account. When we deal with big data, we often do not have real unit-level data to use for direct estimations.


To overcome this problem we can consider different hierarchical levels of aggregation. As an example, we can think of Italian provinces as the unit level for regions. In this way, it should be clear that the use of small-sample techniques is suitable. Going back to the example: even if we manage millions of tweets, if we consider provinces as the statistical units, this number will always be very small. A good and desirable property of big data is its high time frequency; however, this feature is often missing for official statistics. In this work we consider data with the same frequency, but the opportunity to use data with different time frequencies could be an interesting methodological challenge for the future. Lastly, dealing with temporal and spatial information, we should take both time and space correlations into account. Following these considerations, we now present our method step by step, and we propose an application as a toy example.

2.1. The method

Regarding the SAE model, we consider area-level models, because we assume to have area-level covariates. Furthermore, because these data are available for several periods of time T and for D domains, and in order to also consider possible time and space correlations, we have chosen the spatio-temporal Fay-Herriot (STFH) model proposed by Marhuenda et al. (2013). Thus, for domain $d$ and time period $t$, let $\mu_{dt}$ be the target parameter. The STFH model, as all FH models, has two stages. In the first stage, the "sampling model" is defined as follows:

$$\hat{\bar{Y}}_{dt} = \mu_{dt} + e_{dt}, \qquad e_{dt} \sim N\big(0, \operatorname{var}(\hat{\bar{Y}}_{dt})\big), \qquad d = 1,\dots,D, \; t = 1,\dots,T, \tag{1}$$

where the $e_{dt}$ are the sampling errors, which are assumed to be independent and normally distributed, and $\operatorname{var}(\hat{\bar{Y}}_{dt})$ is the sampling variance of the direct estimator. In particular, we consider as direct estimator the regional sampling mean, weighted by some characteristics to overcome the non-sampling structure of our data:

$$\hat{\bar{Y}}_{dt} = \frac{\sum_{i=1}^{n_{dt}} w_{idt}\, y_{idt}}{\sum_{i=1}^{n_{dt}} w_{idt}}, \tag{2}$$

where $n_{dt}$ is the number of provinces in region $d$ at time $t$, $y_{idt}$ are the province-level values and $w_{idt}$ are the weights used. For the sampling variance we use the same weights.

In the second stage, the "linking model" is

$$\mu_{dt} = \mathbf{x}_{dt}^{\top} \boldsymbol{\beta} + u_{1d} + u_{2dt}, \qquad d = 1,\dots,D, \; t = 1,\dots,T, \tag{3}$$

which relates all areas through the regression coefficients: $\mathbf{x}_{dt}$ is a column vector containing the aggregated values of the $k$ covariates for the $d$-th area in the $t$-th period, and $\boldsymbol{\beta}$ is the vector of coefficients. The $u_{1d}$ are the area effects, which follow a first-order spatial autocorrelation, SAR(1), process with variance $\sigma_1^2$, spatial autocorrelation parameter $\rho_1$ and proximity matrix $W$ of dimension $D \times D$. Specifically, $W$ is a row-standardised matrix obtained from an initial proximity matrix $W^{I}$, whose diagonal elements are equal to zero and whose remaining entries are equal to one when the two domains are neighbours and zero otherwise. Normality of the random effects is required for the mean squared error estimation, but not for point estimation. Furthermore, the $u_{2dt}$ represent the area-time random effects, which are i.i.d. across areas $d$ and follow a first-order autoregressive, AR(1), process over time with autocorrelation parameter $\rho_2$ and variance parameter $\sigma_2^2$. The final model is then defined as

$$\hat{\bar{Y}}_{dt} = \mathbf{x}_{dt}^{\top} \boldsymbol{\beta} + u_{1d} + u_{2dt} + e_{dt}. \tag{4}$$

Then $\boldsymbol{\theta} = (\sigma_1^2, \rho_1, \sigma_2^2, \rho_2)$ is the vector of unknown parameters involved in the STFH model. Marhuenda et al. (2013) give the empirical best linear unbiased estimator (EBLUE) of $\boldsymbol{\beta}$ and the empirical best linear unbiased predictors (EBLUPs) of $u_{1d}$ and $u_{2dt}$. Both are obtained by replacing a consistent estimator $\hat{\boldsymbol{\theta}}$ in, respectively, the BLUE and the BLUPs introduced by Henderson (1975). Also due to Marhuenda et al. (2013) is the parametric bootstrap procedure for the estimation of the mean squared error (MSE) of the EBLUPs, which for $B$ bootstrap replicates takes the form

$$\operatorname{mse}(\hat{\mu}_{dt}) = \frac{1}{B} \sum_{b=1}^{B} \big(\hat{\mu}_{dt}^{*(b)} - \mu_{dt}^{*(b)}\big)^2. \tag{5}$$

In this way the point estimates $\hat{\mu}_{dt}$ of $\mu_{dt}$ can be supplemented with (5) as a measure of uncertainty.
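As an illustration of the building blocks above, the sketch below computes the weighted regional direct estimates of equation (2) from province-level data and row-standardises an initial contiguity matrix into W. It is a minimal sketch under assumed column names and a toy neighbour list, not the authors' actual code.

```python
# Minimal sketch (assumed column names; toy neighbour list).
import numpy as np
import pandas as pd

# Province-level data: region, quarter, indicator value y and weight w
# (e.g. broadband coverage combined with the Twitter rate); all values invented.
prov = pd.DataFrame({
    "region":  ["A", "A", "B", "B", "B"],
    "quarter": ["2014Q4"] * 5,
    "y":       [71.0, 68.5, 74.2, 70.1, 69.3],
    "w":       [0.8, 0.6, 0.9, 0.7, 0.5],
})

# Weighted regional direct estimator, eq. (2): sum(w*y) / sum(w) per region and quarter.
sums = (prov.assign(wy=prov["y"] * prov["w"])
            .groupby(["region", "quarter"])[["wy", "w"]].sum())
direct = (sums["wy"] / sums["w"]).rename("direct_estimate")
print(direct)

# Row-standardised proximity matrix W from an initial 0/1 contiguity matrix W_I.
regions = ["A", "B", "C"]
neighbours = {("A", "B"), ("B", "C")}          # toy contiguity relation
W_I = np.zeros((len(regions), len(regions)))
for i, r in enumerate(regions):
    for j, s in enumerate(regions):
        if (r, s) in neighbours or (s, r) in neighbours:
            W_I[i, j] = 1.0
W = W_I / W_I.sum(axis=1, keepdims=True)       # each row sums to one
print(W)
```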

3. Application
The assessment of well-being, especially at the local level, is an important task for policy makers, who increasingly need to target their policies and actions not at the nation as a whole but at local domains. Unfortunately, data at this level are very often lacking, all the more so if high-frequency data are of interest. To fill this gap, the use of data from social networks can be considered a good option to improve our knowledge of well-being. In this section, considering a well-being index from Twitter data and some official statistics, we implement the proposed approach to check and, if necessary, remove the unreliability of the estimations.
3.1. A Subjective Well-being Index with Twitter Data
Since 2012, Iacus et al. (2015) have proposed applying the iSA (integrated Sentiment Analysis; Ceron et al., 2016) method to derive a composite index of subjective well-being that captures different aspects and dimensions of individual and collective life. This index, named the Social Well-Being Index (SWBI), monitors the subjective well-being expressed by society through social networks; in particular, in Iacus et al. (forthcoming) the SWBI index is provided for the Italian provinces from 2012 to 2016 and combined with the "Il Sole 24 Ore Quality of Life" index. SWBI is not the result of some aggregation of individual well-being measurements; rather, it directly measures the aggregate composition of sentiment throughout society at the province or regional level. For this reason, as weights we cannot use individual user characteristics, as is traditionally done, but only characteristics aggregated at the area level. SWBI was inspired by the definitions introduced by the think-tank NEF (New Economics Foundation) for its Happy Planet Index (New Economics Foundation, 2012), and it is defined as a manifold, dynamic combination of different features, with indicators which look beyond single-item questions and capture more than simply life satisfaction. The eight SWBI dimensions concern three different well-being areas: personal well-being, social well-being and well-being at work. The data source is tweets written in Italian and posted from Italy, accessed through Twitter's public API. A small part of these data (around 1 to 5% each day) contains geo-reference information, which allows the SWBI indicator to be built at the province level in Italy.

Figure 1. Twitter counts, on the left, and Twitter rates, on the right, in 2014-Q4.

In the application presented here, we consider SWBI quarterly data, with area aggregation at the Italian provincial and regional levels. We now briefly describe the dimensions of these data. To compute the SWBI index we consider more than two hundred million tweets (201,496,621) over 22 quarters from 2012 to June 2017. Despite this huge total number, we have to report a decrease in the number of tweets for all four quarters of 2015 and the first quarter of 2016; we stress, however, that even for these quarters the counts were still in the thousands (minimum count equal to 1,727 tweets in 2016-Q1 for Basilicata). To have a more realistic view of the situation, we consider a Twitter rate: the ratio between the number of analysed tweets and the number of inhabitants of the region in the same period. Considering the simple counts, we would conclude that most of the SWBI information comes from Lombardia, Piemonte, Emilia and Toscana, while taking the resident population into account as well, we note that information is also substantial in particular or small regions such as Friuli, Sardegna, Valle d'Aosta and Molise; for the last two we also remark their large variability during the period, while the dispersion for big regions like Lazio and Lombardia seems to be smaller. Overall, we can conclude that during our observational period the average Twitter rate is equal to about 20% of the population, with a mean value always greater than 9% (minimum for Campania) for all regions. Looking at Figure 1, for example, it is clear that, when considering the Twitter rate (on the right), all the Italian regions are homogeneously observed, with the exception of Valle d'Aosta, which has a higher and somewhat anomalous rate.
3.2. The implemented model
To implement the proposed model, as a toy example, we consider only the wor (well-being at work) dimension of the SWBI index, at quarterly level, from 2012-Q1 to 2017-Q2. We compute regional values following (2), using as weights the broadband coverage and the Twitter rate. The broadband coverage is provided by "Il Sole 24 Ore" and Infratel Italia; this coverage can be considered as the opportunity to access the Internet in the different provinces, while the Twitter rate, computed for each period at the province level, can be a proxy of the use of Twitter. The weighted quality of job has remained stable, with very little variability between the regions (the distributions are very compressed), until the second half of 2015. From 2015-Q3 the weighted wor grows and, especially from the second half of 2016, this dimension attained values greater than 80. The differences between the regions also become more evident: the distributions are less compressed. We remark that the shapes of the quality of work series, weighted or not, computed as simple means, are quite similar. It seems that differences in the weighted wor index among regions are small, whereas they are greater over time. Considering the different rankings obtained by the two indices, in 29% of the cases there are no differences and only in 14% of the cases is there a difference (Δ) in the rankings greater than 5 positions. The regions with the greatest Δ are Trentino, Campania, Marche, and Sardegna: for the first two we remark the greatest improvement in position, while for the last two we remark the greatest worsening. Focusing on time, we recorded the largest Δ in 2014-Q1, 2015-Q1 and both of the considered 2017 quarters. The mean of the ranking Δ is equal to 2.01 (SD = 2.41). Referring to model (4), we use as direct estimator of the regional quality of job the weighted wor and its sampling variance. Because in one Italian region, Valle d'Aosta, there is only one province, we decided to drop this region from our data. In the fitted STFH model, data are available for T = 22 time instants, from the first quarter of 2012 to the second quarter of 2017, and the domains are the D = 19 considered Italian regions. The considered area-level auxiliary variables were, before any selection process, in the job context: the unemployment and inactivity rates, computed both in relation to the labour force, as is traditional, and to the resident population; and in the socio-demographic context: the birth rate, the mortality rate and the natural increase rate. All the covariates come from official statistics distributed by ISTAT (http://istat.it/), representative of all the Italian regions at the quarterly level. The row-standardised proximity matrix W of dimension D × D has been obtained from an initial proximity matrix WI, whose diagonal elements are equal to zero
and the remaining entries are equal to one when the two regions share a common border, and zero otherwise. Since in Italy two regions are islands, for them we take as neighbours the regions with direct naval connections.
3.3. Results and discussion
After fitting the model, the selected covariates were the unemployment rate, traditionally calculated by dividing by the labour force, and the mortality rate. The coefficients were both negative: regions with larger unemployment and mortality rates have a lower quality of job. The estimated spatial autocorrelation is significant, with a small negative value of about -0.02, while the temporal autocorrelation parameter is also significant and larger, with a positive value of about 0.86. The parameter estimated at zero is coherent with the analysis of the distributions discussed above: quality of job changes over time but little or not at all between regions. Comparing the EBLUPs obtained by fitting the STFH model with the direct estimates, weighted or not, we can conclude that the direct weighted estimates are approximately design unbiased. Looking at the rankings, what changes if we use the EBLUP estimates instead of the direct estimates, weighted or not? Comparing the rankings obtained with the simple means of wor and those with the EBLUP estimates, we find that in 31% of the cases the position is the same and in 14% of the cases the position Δ is greater than 4; the regions and time periods involved are the same as when we compared the simple means with the weighted means above. The mean of the ranking Δ is equal to 1.97 (SD = 2.3). Comparing instead the rankings obtained with the weighted means of wor and those with the EBLUP estimates, the situation is very different: in 88% of the cases the position is the same and in less than 1% of the cases the position Δ is greater than 4. Only in one case do we have a large ranking Δ (Marche in 2015-Q3, Δ = 7). The average of the differences in this case is equal to 0.16 (SD = 0.54). This means that, moving from the weighted estimates to the EBLUP estimates, the rankings stay almost the same. In the SAE literature, coefficients of variation (CV) are traditionally used to analyse the efficiency gain of the EBLUP estimates. National statistical institutes are committed to publishing statistics with a minimum level of reliability ( 0.182 and 0.52 > 0.32 for R&D and IP respectively), which indicates good nomological validity and that there are no mono-method biases. The first heterotrait-heteromethod value is low and not significant (r = -0.17; p-value > 0.05) while the other is moderate and significant (r = 0.294; p-value < 0.05). However, and more importantly, these correlations are lower than the corresponding values found in the validity diagonal, which shows good discriminant validity. All the conditions are satisfied under the original guidelines proposed by Campbell and Fiske (1959), and therefore no risk of potential biases is induced within the methods, the traits or a combination of both. The results based on this methodology suggest that our web mining indicators reflect the importance given to innovation factors such as R&D and intellectual property.
Table 2. MTMM matrix for RD_INDEX and IP_INDEX

Traits                          Method 1 (Web)          Method 2 (Questionnaire)
                                RD         IP           RD (RD_INDEX)   IP (IP_INDEX)
Method 1 (Web)
  RD                            N/Aa
  IP                            -0.182     N/Aa
Method 2 (Questionnaire)
  RD (RD_INDEX)                 0.419**    0.294*       N/Aa
  IP (IP_INDEX)                 -0.17      0.52**       0.32**          N/Aa

Note: a All traits from Method 1 are measured by single items, so no reliability statistic can be calculated; all traits from Method 2 are measured by a formative index and thus a reliability statistic is irrelevant. * p < .05. ** p < .01.
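The MTMM logic itself is straightforward to reproduce: compute the correlations between each web-mined index and each questionnaire index, then check that the validity diagonal exceeds the heterotrait values. The sketch below is a hypothetical illustration with invented column names and data, not the study's actual analysis code.

```python
# Minimal MTMM sketch (hypothetical data frame with one row per firm;
# column names and values are assumptions for the example).
import pandas as pd

df = pd.DataFrame({
    "web_rd":   [0.12, 0.30, 0.05, 0.22],   # web-mined R&D keyword index
    "web_ip":   [0.02, 0.15, 0.08, 0.11],   # web-mined IP keyword index
    "quest_rd": [3.0, 6.0, 2.0, 5.0],       # questionnaire RD_INDEX
    "quest_ip": [2.0, 4.0, 3.0, 4.5],       # questionnaire IP_INDEX
})

corr = df.corr()                              # full MTMM correlation matrix

validity_rd = corr.loc["web_rd", "quest_rd"]  # monotrait-heteromethod (validity diagonal)
validity_ip = corr.loc["web_ip", "quest_ip"]
hetero_mono_web = corr.loc["web_rd", "web_ip"]        # heterotrait-monomethod
hetero_mono_quest = corr.loc["quest_rd", "quest_ip"]

# Campbell & Fiske (1959) style check: the validity diagonal should dominate.
print(corr.round(3))
print("convergent validity:", validity_rd > 0, validity_ip > 0)
print("discriminant check:", validity_rd > hetero_mono_web,
      validity_ip > hetero_mono_quest)
```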


4. Limitations and future research
Obviously, more data would make our research more robust. Another limitation of our methodology is the fact that we did not take into account the context around our keywords, possibly leading to multiple false positives. The addition of machine learning techniques, such as recurrent neural networks, natural language processing or bag-of-words models, is a promising avenue to improve the level of precision by adding the necessary context around keywords. Moreover, we started with theoretical factors for the conceptual framework, then identified the keywords related to these factors, and finally mined the websites for these specific keywords. An interesting alternative would be to do this the other way around, i.e., to start with the website content and to identify the factors that can be found naturally via unsupervised machine learning algorithms. The term frequency-inverse document frequency technique (TF-IDF) could be used to provide insight into the importance of keywords relative to the rest of a document. In a nutshell, our methodology seems to be a valid approach to provide data for future innovation and technology management studies on the relative importance given to factors such as R&D and IP, and to test the validity of the measures thus created. In most questionnaire-based surveys, that information is gathered using 1-to-7 Likert scale questions. If the goal of a study is to determine the degree of importance of core factors such as R&D or IP for a firm, the use of web mining indicators is reasonable. However, if the goal is to gather more specific information, such as the precise actions undertaken by a firm, these web mining indicators may lack the necessary context to behave as expected. The importance given by a company to certain types of activities reflects which activities are supported and encouraged by the culture of the company (Herzog, 2011). Therefore, it is possible that our methodology suggests a novel way to measure innovation culture quantitatively. Of course, company websites are willingly structured in a cooperative and agreeable manner toward whomever is seeking information concerning products, services, activities, and so on. The self-reporting bias induced by this methodology is inevitable. However, it is important to note that questionnaire-based surveys and most national official public directories are subject to self-reporting biases as well. Fortunately, the bias induced by the web mining technique is as much a quality as a flaw, in that it provides insight into how the company wants to be perceived. Indeed, companies write on their websites about what they care about, what is important to them and who they are as a company. This qualitative information represents the essence of the company. Future research is needed to determine whether this qualitative information could be used as a proxy to understand a company's culture, for instance. Furthermore, future research will be performed to assess how these indicators can be used in actual regressions to understand innovation patterns. It will be especially interesting to assess whether these web indicators tend to be substitutes for or complements to the traditional measures used in innovation management studies. This will be performed in the coming months with a sample of 1700 companies.
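As an illustration of the TF-IDF idea mentioned above, the sketch below weights website terms by how distinctive they are across a small collection of pages. It is a minimal example with invented texts, not part of the reported study.

```python
# Minimal TF-IDF sketch (toy website texts, purely illustrative).
from sklearn.feature_extraction.text import TfidfVectorizer

pages = [
    "our laboratory invests heavily in research and development",
    "patents and trademarks protect our intellectual property portfolio",
    "contact us for sales, support and career opportunities",
]

vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(pages)

terms = vectorizer.get_feature_names_out()
for row in tfidf.toarray():
    top = sorted(zip(terms, row), key=lambda t: t[1], reverse=True)[:3]
    print([term for term, weight in top if weight > 0])
```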


complements to the traditional measures use in innovation management studies. This will be performed in the coming months with a sample of 1700 companies.

References Campbell, D. T., & Fiske, D. W. (1959). Convergent and discriminant validation by the multitrait-multimethod matrix. Psychological Bulletin, 56(2), 81–105. https://doi.org/10.1037/h0046016 Cenfetelli, R. T., & Bassellier, G. (2009). Interpretation of Formative Measurement in Information Systems Research. MIS Quarterly, 33(4), 689–707. Cohen, J. (1988). Statistical power analysis for the behavioral sciences. Hillsdale, N.J.: L. Erlbaum Associates. Domenech, J., Rizov, M., and Vecchi, M. (2015). The Impact of Companies’ Websites on Competitiveness and Productivity Performance (Conference Paper: First International Conference on Advanced Research Methods and Analytics). Diamantopoulos, A. (1999). Viewpoint – Export performance measurement: reflective versus formative indicators. International Marketing Review, 16(6), 444–457. https://doi.org/10.1108/02651339910300422 Diamantopoulos, A., & Siguaw, J. A. (2006). Formative Versus Reflective Indicators in Organizational Measure Development: A Comparison and Empirical Illustration. British Journal of Management, 17(4), 263–282. https://doi.org/10.1111/j.14678551.2006.00500.x Diamantopoulos, A., & Winklhofer, H. M. (2001). Index Construction with Formative Indicators: An Alternative to Scale Development. Journal of Marketing Research, 38(2), 269–277. https://doi.org/10.1509/jmkr.38.2.269.18845 Gök, A., Waterworth, A., & Shapira, P. (2015). Use of web mining in studying innovation. Scientometrics, 102(1), 653–671. https://doi.org/10.1007/s11192-014-1434-0 Haziza, D., & Beaumont, J.-F. (2007). On the construction of imputation classes in surveys. International Statistical Review, 75(1), 25–43. Herzog, P. (2011). Innovation culture. In Open and Closed Innovation (pp. 59–82). Springer. Retrieved from http://link.springer.com/chapter/10.1007/978-3-8349-6165-5_3 Little, R. J. A. (1986). Survey Nonresponse Adjustments for Estimates of Means. International Statistical Review / Revue Internationale de Statistique, 54(2), 139–157. https://doi.org/10.2307/1403140 Nelson, P. R. C., Taylor, P. A., & MacGregor, J. F. (1996). Missing data methods in PCA and PLS: Score calculations with incomplete observations. Chemometrics and Intelligent Laboratory Systems, 35(1), 45–65. https://doi.org/10.1016/S01697439(96)00007-X OECD, & Eurostat. (2005). Oslo Manual. Paris: Organisation for Economic Co-operation and Development. Retrieved from http://www.oecdilibrary.org/content/book/9789264013100-en Petter, S., Straub, D., & Rai, A. (2007). Specifying Formative Constructs in Information Systems Research. MIS Quarterly, 31(4), 623–656. Särndal, C. E., Swensson, B., & Wretman, J. (1992). Model assisted survey sampling Springer. New York. Thomsen, I. (1973). A note on the efficiency of weighting subclass means to reduce the effects of nonresponse when analyzing survey data. Statistisk Tidskrift, 4, 278–283.


2nd International Conference on Advanced Research Methods and Analytics (CARMA2018) Universitat Polit`ecnica de Val`encia, Val`encia, 2018 DOI: http://dx.doi.org/10.4995/CARMA2018.2018.8334

A combination of multi-period training data and ensemble methods to improve churn classification of housing loan customers Seppälä, Tomi; Thuy, Le Department of Information and Service Management, Aalto University, School of Business, Finland

Abstract Customer retention has been the focus of customer relationship management in the financial sector during the past decade. The first, and important, step in customer retention is to classify the customers into possible churners, those likely to switch to another service provider, and non-churners. The second step is to take action to retain the most probable churners. The main challenge in churn classification is the rarity of churn events. In order to overcome this, two aspects are found to improve the churn classification model: the training data and the algorithm. The recently proposed multi-period training data approach is found to outperform single period training data thanks to the more effective use of longitudinal data. Regarding the churn classification algorithms, the most advanced and widely employed is the ensemble method, which combines multiple models to produce a more powerful one. Two popularly used ensemble techniques, random forest and gradient boosting, are found to outperform logistic regression and decision trees in separating churners from non-churners. The study uses data on housing loan customers from a Nordic bank. The key finding is that models combining the multi-period training data approach with ensemble methods perform the best. Keywords: churn prediction, ensemble methods, random forest, gradient boosting, multiple period training data, housing loan churn

This work is licensed under a Creative Commons License CC BY-NC-ND 4.0 Editorial Universitat Polit`ecnica de Val`encia


1. Introduction
Customer retention has been the focus of customer relationship management research in the financial sector during the past decade (Zoric, 2016). Retaining existing customers is argued to be more economical over the long run for companies than acquiring new ones (Gur Ali & Ariturk, 2014). Van den Poel & Lariviere (2004), in their attempt to translate the benefits of retaining customers over a period of 25 years into monetary terms, conclude that an additional percentage point in the customer retention rate contributes to an increase in revenue of approximately 7%. The first step in customer retention is to classify the customers into binary groups of possible churners, i.e. customers that are likely to switch to another service provider, and non-churners, i.e. those that are likely to stay with the current provider. The second step in customer retention is to take action to retain the most probable churners in order to either minimize costs or maximize benefits. As a result, churn classification is an important first step in customer retention. However, the main challenge in churn classification is the extreme rarity of churn events (Gur Ali & Ariturk, 2014). For example, the churn rate in the banking industry is usually less than 1%. In order to overcome this rarity issue, a great deal of research has aimed to improve the two main aspects of a churn classification model: the training data and the algorithm (Ballings & Van den Poel, 2012). Regarding the training data, the recently proposed multi-period training data approach has been found to outperform single period training data thanks to the more effective use of longitudinal data on churn behavior (Gur Ali & Ariturk, 2014). Regarding the churn classification algorithms, the most advanced and widely employed is the ensemble method, which combines multiple models to produce a more powerful one (Yaya, et al., 2009). Two popularly used ensemble techniques are random forest and gradient boosting (Breiman, 2001), both of which have been found to outperform logistic regression and decision tree methods in classifying churners and non-churners.

2. Research questions
To the best of the authors' knowledge, the proposed multi-period training data approach has not been applied together with ensemble methods in a churn classification model. As a result, in this study we examine whether the multi-period training data approach, when employed together with ensemble methods in a churn classification model, produces better churn prediction than with logistic regression and decision tree approaches.


The research problem is broken down into the following research questions:
1. In models that employ logistic regression and decision trees, does the multi-period training data approach improve churn classification performance compared to the single period training data approach?
2. In models that employ single period training data, do random forest and gradient boosting improve churn classification performance compared to logistic regression and decision trees?
3. Do models that employ both the multi-period training data approach and ensemble methods perform better in churn classification than those in the first question?
4. What are the best churn predictors in the housing loan context?

3. Methods and Data
In order to answer the research questions, four methods are employed in this study as churn classification algorithms: Logistic Regression, Decision Tree, Random Forest, and Gradient Boosting. All four methods are employed with both multi-period training data and single period training data to create competing models. Specifically, logistic regression and decision trees are used to run the baseline models with single period training data. The other models are then compared with the baseline models in order to answer the research questions. This study uses empirical data on housing loan customers from a Nordic bank. The data were collected and analysed for the time period between August 1, 2015 and March 31, 2016, and were divided into observation and performance periods. The predictors employed in this study include information from all four main groups recommended in the literature: demography, customer behavior, characteristics of the customer relationship and macroenvironmental factors. The following variable selection methods are employed in SAS Enterprise Miner before running the logistic regression models: a decision tree with the CHAID method, stepwise selection, and the variable selection node using the R-square procedure. The churn models are evaluated based on three criteria: misclassification rate, Receiver Operating Characteristics (ROC) index and top decile lift.
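The multi-period training data idea can be made concrete with a small sketch: instead of a single observation/performance split, several consecutive windows are stacked so that historical churn events also enter the training set. The code below is a hypothetical illustration (invented column names, monthly snapshots), not the bank's actual data pipeline or the SAS workflow used in the study.

```python
# Minimal sketch of multi-period training data (assumed monthly snapshot table
# with columns: customer_id, month, churned_next_period, plus feature columns).
import pandas as pd

def single_period(snapshots, month):
    """Classic approach: one observation period, one performance period."""
    return snapshots[snapshots["month"] == month]

def multi_period(snapshots, months):
    """Stack several observation periods so rare churn events accumulate."""
    frames = [snapshots[snapshots["month"] == m].assign(period=m) for m in months]
    return pd.concat(frames, ignore_index=True)

# Usage sketch (file name is hypothetical):
# snapshots = pd.read_csv("loan_snapshots.csv")
# train_single = single_period(snapshots, "2015-12")
# train_multi = multi_period(snapshots, ["2015-09", "2015-10", "2015-11", "2015-12"])
# print(train_single["churned_next_period"].mean(), train_multi["churned_next_period"].mean())
```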

4. Results and discussion This study validates that both multi-period training data and ensemble methods actually improve the churn classification performance compared to their counterparts in the housing loan context. More importantly, when employed together, the models with the combination
of the proposed multi-period training data approach and ensemble methods such as random forest and gradient boosting have the best performance among all the created models based on the misclassification rate, ROC index and top decile lift. A Type II error refers to misclassifying churners as non-churners, and it is more severe than misclassifying non-churners as churners (Type I error), since potential churners will be highly likely to churn without receiving any retention action. Using multi-period training data, the best models are produced with random forest, with a reduction of more than 10% in the Type II error rate compared with the worst performing models, which employ single period training data and logistic regression without variable selection. Such an improvement in the classification of churn events can prevent the bank from a considerable loss of customers who would otherwise receive no retention action. The improvement is mainly due to the more effective use of churn events, which are usually scarce in real-life data. Specifically, in contrast to the single period training data approach, which captures churn only in a specific period of time and discards the churn events that happened prior to that period, the multi-period training data approach allows the employment of historical churn events, providing the models with more churn events and mitigating the rarity issue in churn prediction. Therefore the imbalance between the classes is not as severe a problem when using the multi-period training data approach. Consequently, the authors highly recommend that other studies in churn classification employ the multi-period training data approach together with ensemble methods to achieve the best possible classification models. Regarding the last research question, this study shows that the most important churn predictors belong to the demographic group, in which the number of family members has the most significant effect on churning. It makes sense that a change in the number of family members will considerably impact decisions related to a housing loan.
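For readers who want to reproduce this kind of evaluation, the sketch below computes a misclassification rate, ROC AUC, top decile lift and Type II error rate for a fitted classifier. It is a generic scikit-learn illustration on synthetic data, not the SAS Enterprise Miner workflow or the bank data used in the study.

```python
# Generic evaluation sketch (synthetic data; not the study's actual workflow).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, weights=[0.97], random_state=0)  # rare positives
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

model = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)
proba = model.predict_proba(X_te)[:, 1]
pred = (proba >= 0.5).astype(int)

misclassification = np.mean(pred != y_te)
auc = roc_auc_score(y_te, proba)

# Top decile lift: churn rate among the top 10% highest-scored customers vs overall.
cutoff = np.quantile(proba, 0.9)
lift = y_te[proba >= cutoff].mean() / y_te.mean()

# Type II error rate: share of actual churners predicted as non-churners.
type2 = np.mean((pred == 0) & (y_te == 1)) / y_te.mean()

print(f"misclassification={misclassification:.3f} AUC={auc:.3f} "
      f"top-decile lift={lift:.2f} typeII={type2:.3f}")
```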

References Breiman, L.(2001). Random Forests. Machine Learning, 45(1), 5-32. Ballings, M. & Van den Poel, D. (2012). Customer Event History for Churn Prediction How Long Is Long Enough?. Expert Systems with Applications, 39 (18), 13517-13522. Gur Ali, Ö. & Ariturk, U. (2014). Dynamic Churn Prediction Framework with More Effective Use of Rare Event Data: The Case of Private Banking. Expert Systems with Applications, 41(17). 7889-7903. Van den Poel, D. & Lariviere, B. (2004). Customer Attrition Analysis for Financial Services Using Proportional Hazard Models. European Journal of Operational Research, 157 (1), 196-217. Yaya, X., Xiu, L., E.W.T., N. & Weiyun, Y. (2009). Customer Churn Prediction Using Improved Balanced Random Forests. Expert Systems with Applications, 36(3),5445– 5449. Zoric, A. B. (2016). Predicting Customer Churn in Banking Industry Using Neural Networks. Interdisciplinary Description of Complex Systems, 14(2). 116-124.


2nd International Conference on Advanced Research Methods and Analytics (CARMA2018) Universitat Polit`ecnica de Val`encia, Val`encia, 2018 DOI: http://dx.doi.org/10.4995/CARMA2018.2018.8336

Technical Sentiment Analysis: Measuring Advantages and Drawbacks of New Products Using Social Media
Chiarello, Filippo a; Bonaccorsi, Andrea a; Fantoni, Gualtiero a; Ossola, Giacomo a; Cimino, Andrea b and Dell'Orletta, Felice b
a Department of Energy, Systems, Territory, and Construction Engineering, University of Pisa, Italy. b Institute for Computational Linguistics of the Italian National Research Council (ILC-CNR)

Abstract In recent years, social media have become ubiquitous and important for social networking and content sharing, yet the content generated by these websites remains largely untapped. Some researchers have proved that social media can be a valuable source to predict the future outcomes of some events, such as box-office movie revenues or political elections. Social media are also used by companies to measure the sentiment of customers about their brand and products. This work proposes a new social-media-based model to measure how users perceive new products from a technical point of view. This model relies on the analysis of advantages and drawbacks of products, which are both important aspects evaluated by consumers during the buying decision process. The model is based on a lexicon developed in a related work (Chiarello et al., 2017) to analyse patents and detect advantages and drawbacks connected to a certain technology. The results show that when a product has a certain technological complexity and fuels a more technical debate, advantages and drawbacks analysis is more efficient than sentiment analysis in producing technical-functional judgements. Keywords: Social media; Twitter; Sentiment analysis; Product Success

This work is licensed under a Creative Commons License CC BY-NC-ND 4.0 Editorial Universitat Polit`ecnica de Val`encia


1. Introduction
Nowadays, social media have become an inseparable part of modern life, providing a vast record of mankind's everyday thoughts, feelings and actions. For this reason, there has been increasing interest in research exploiting social media as an information source, although extracting a valuable signal is not a trivial task, since social media data are noisy and must be filtered before proceeding with the analysis. In this domain, sentiment analysis, which aims to determine the sentiment content of a text unit, is considered one of the best data mining methods. It relies on different approaches (Collomb et al., 2013) and it has been used to answer research questions in a variety of fields, including the measurement of customers' perception of new products (Mirtalaie et al., 2018). In this work, we try to understand whether sentiment analysis is really the best available method to analyse consumers' perception of products, especially when we want to measure the perception of the technical content of the product. Thus we compare state-of-the-art sentiment analysis techniques with a lexicon of advantages and drawbacks related to products. This tool relies on a lexicon developed by Chiarello (2017) to extract advantages and drawbacks of inventions from patents. Our work started with the selection of an event able to polarise Twitter users' attention and of the products to analyse. In particular, we chose a premiere tradeshow for the video game industry and two video game consoles disclosed during the event. We collected about 7 million tweets about the products, published before, during and after the tradeshow. Since social media data are noisy (for example, they may contain spam and advertising), before proceeding with the analyses we filtered our dataset. In particular, after removing too-short and non-English tweets, we manually classified a randomly extracted subset of posts to train a classifier which provided us with the cleansed dataset. Then we conducted a sentiment analysis of the tweets using state-of-the-art machine learning techniques, classifying each tweet as positive, negative or neutral. At this point we applied our lexicon, identifying advantage tweets and drawback tweets. Finally, we compared the outputs of the two analyses for the two product-related clusters of tweets. We found consistent differences between the extractions. The results show that when a product has a certain technological complexity and fuels a more technical debate, advantages and drawbacks analysis is more able than sentiment analysis to produce technical-functional judgements. For this reason, we think that the proposed methodology performs better than standard sentiment analysis techniques when a product has a certain technological complexity and fuels a more technical social media discourse.
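The core of the pipeline, matching tweets against an advantage/drawback lexicon after sentiment classification, can be sketched as follows. The lexicon entries and tweets below are invented placeholders; the actual lexicon is the patent-based one by Chiarello et al. (2017) and is not reproduced here.

```python
# Minimal sketch of the advantage/drawback tagging step (invented lexicon and tweets;
# the real lexicon from Chiarello et al. (2017) is not reproduced here).
ADVANTAGE_TERMS = {"improve", "faster", "benefit", "advantage", "efficient"}
DRAWBACK_TERMS = {"drawback", "slow", "overheat", "disadvantage", "expensive"}

def tag_tweet(text: str) -> str:
    words = set(text.lower().split())
    has_adv = bool(words & ADVANTAGE_TERMS)
    has_drw = bool(words & DRAWBACK_TERMS)
    if has_adv and has_drw:
        return "both"
    if has_adv:
        return "advantage"
    if has_drw:
        return "drawback"
    return "neutral"

tweets = [
    "the new console is faster and more efficient",
    "main drawback: it tends to overheat",
    "can't wait for the show tonight",
]
for t in tweets:
    print(tag_tweet(t), "|", t)
```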


2. State of the art
We provide an overview of the studies on social media forecasting (Tables 1 and 2). Researchers have especially focused on economics (stock market, marketing, sales) and politics (election outcomes). In economics, predicting fluctuations in the stock market has been by far the most studied topic. Early work focused largely on predicting whether aggregate stock measures such as the Dow Jones Industrial Average (DJIA) would rise or fall on the next day, but forecasting can also involve making more detailed predictions, e.g., forecasting market returns or making predictions for individual stocks. The simplest task for stock market prediction is deciding whether the following day will see a rise or fall in stock prices. Comparison between studies is complicated by the fact that stock market volatility, and thereby the difficulty of prediction, may vary over time. High accuracy (87.6%) on this task was reported by Bollen (2012). However, slight deviations from their methodology have seen much less success, indicating that the method itself may be unreliable (Xu, 2014). A very good result is achieved by Cakra (2015), who uses linear regression to build a prediction model based on the output of sentiment analysis and a previous stock price dataset. Social media have also been used to study the ability of online projects to successfully crowdfund through websites like Kickstarter. Li (2016) predicts whether a project will eventually succeed by making use of features relevant to the project itself (e.g., the fundraising goal), as well as social activity features (e.g., the number of tweets related to the project) and social graph measures (e.g., the average number of followers of project promoters). Using all of these features for only the first 5% of the project duration achieved an AUC of 0.90, reflecting very high classification performance. Many studies have analysed the predictive power of social media to improve or replace traditional and expensive polling methods. The simplest technique is measuring tweet volume (tweets mentioning a political party = votes). Chung (2010) and Tumasjan (2010) employed this method, obtaining mixed results. Razzaq (2014), Skoric (2012) and Prasetyo (2015) improved this method by taking into account the mood of the posts, i.e. considering whether a candidate or a party is mentioned in a positive or negative manner.
Table 1: Summary of studies in economics. Data source: T = Twitter, F = Facebook, K = Kickstarter, O = blogs, other. Task: MDA = Mean Directional Accuracy, MAPE = Mean Absolute Percentage Error.



Table 1: Summary of studies in economics. Data source: T = Twitter, F = Facebook, K = Kickstarter, O = blogs, other. Task: MDA = Mean Directional Accuracy, MAPE = Mean Absolute Percentage Error.

Article | Topic | Data source | Data size | Observation time | Success rate
Xu (2014) | Stock market | T | 100K tweets | 42 days |
Crone (2014) | Exchange rates | T, F, O | N/A | N/A |
Kordonis (2016) | Stock market | T | N/A | N/A |
Cakra (2015) | Stock market | T | N/A | 2 weeks |
Bollen (2015) | Stock market | T | 9.8M | 10 months |
Brown (2012) | Stock market | T | 13K | 9 days |
Rao (2012) | Stock market | T | 4M | 14 months |
Kim (2014) | Hit songs | T | 31.6M | 68 days |
Korolov (2015) | Donations | T | 15M | 10 days |
Le (2015) | Sports book | T | 1.2M | 30 days |
Tuarob (2013) | Smartphone sales | T | 800M | 19 months |
Asur (2010) | Movie revenues | T | 2.8M | 3 months |
Ahn (2014) | Car sales | T, F, O | 26K posts | N/A | (Sedan A); (Sedan B)
Chen (2015) | Advertising | T | 5.9K users | N/A | 66% gain (click rate); 87% gain (follow rate)
Li (2016) | Crowdfunding success rate | T, F, K | 106K tweets | 6 months |

Researchers employ different tools and methods for social media mining, ranging from simple to more complex. The most employed tool is sentiment analysis (with its various approaches: knowledge-based techniques, statistical methods, and hybrid approaches), which usually achieves good results. Other researchers use more complex tools, such as neural networks or a combination of techniques. From the analysis of the state of the art we are able to identify some best practices: (i) implementing suitable techniques to deal with noisy data, (ii) evaluating statistical biases in social media data, (iii) collecting data from heterogeneous sources, (iv) incorporating domain-specific knowledge to improve statistical models.



Table 2: Summary of studies in politics. Data source: T = Twitter. Task: Acc. = Accuracy, MAE = Mean Absolute Error.

Article | Topic | Data source | Data size | Observation time | Success rate
Chung (2010) | Renewal of US senate | T | 235K tweets | 7 days | Acc. 41% - 47%
Tumasjan (2010) | German federal election | T | 104K tweets | 36 days | MAE 1.65%
Razzaq (2014) | Pakistani election | T | 613K tweets | N/A | Acc. 50%
Skoric (2012) | Political election | T | 7M tweets | 36 days | MAE 6.1%
Prasetyo (2015) | Indonesian political election | T | 7M tweets | 83 days | MAE 0.62% (state level)

3. Methodology

3.1 Selection of a triggering event and products

We chose the Electronic Entertainment Expo as the event able to polarise users' attention. Commonly referred to as E3, it is a premier trade event for the video game industry, presented by the Entertainment Software Association (ESA). We chose two new video game consoles, disclosed at E3 2017, as the products whose success or failure we aim to predict. The first is Xbox One X, a new high-end version of Xbox One with upgraded hardware; the other product is New Nintendo 2DS XL, a streamlined version of the handheld console New Nintendo 3DS XL.

3.2 Data collection

Twitter provides two possible ways to gather tweets: the Streaming Application Programming Interface (API) and the Search API. The first one allows users to obtain real-time access to tweets matching an input query. The user first requests a connection to a stream of tweets from the server; then the server opens a streaming connection and tweets are streamed to the user as they occur. However, there are a few limitations of the Streaming API. First, the language is not specifiable, resulting in a stream that contains tweets in all languages, including some in non-Latin alphabets, which complicates further analysis. The Twitter Search API, instead, is a Representational State Transfer API which allows users to request specific queries of recent tweets. It allows filtering based on language, region, geolocation, and time. Unfortunately, using the Search API is expensive and there is a rate limit associated with each query. Because of these issues, we decided to go with the Twitter Streaming API. For each product, we detected related hashtags and keywords and constructed a query to download relevant tweets. We chose to collect tweets not only after the tradeshow, but also before. For this reason, we initially identified some product keywords based on their provisional names and updated them at a later stage. Tweets were downloaded at CNR (Consiglio Nazionale delle Ricerche, Istituto di Informatica e Telematica, Area di Pisa) from 11th June 2017, 10:00, to 31st July 2017, 15:00.

3.3 Data filtering

The initial dataset turned out to be very noisy, containing tweets written in different languages, advertising, and posts related to different products or subjects. We considered only English tweets, because both the sentiment and the advantages/drawbacks lexicons are in this language. The dataset was filtered by removing tweets with fewer than five words and non-English posts identified with a language classifier. We obtained 7,165,216 filtered tweets. At this point we created a gold standard set of relevant tweets to train a Support Vector Machine (SVM) classifier able to recognize relevant and irrelevant tweets. We defined the characteristics that make a tweet: (i) relevant (posted by users or containing words or opinions related to our products of interest and their functionalities), (ii) irrelevant (tweets containing advertisements, links to e-commerce websites or messages related to other products or subjects). A researcher manually classified a randomly extracted subset of 6,500 tweets, finding 105 relevant and 6,395 irrelevant posts. The SVM model was then trained on this dataset and computed, for each tweet, a probability of being relevant or irrelevant. A threshold of 0.7 was chosen to label a tweet as relevant. The final dataset of filtered tweets is made up of 66,796 posts. We clustered tweets using product-related keywords. Clustering posts allowed us to further filter the final dataset, which contained a small number of irrelevant tweets (Table 3).
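As a rough illustration of this relevance-filtering step, the sketch below trains a classifier on the manually annotated subset and applies the 0.7 probability threshold. The TF-IDF representation, the scikit-learn implementation and the function names are assumptions, since the paper does not specify the features or library actually used.

```python
# Hypothetical sketch of the relevance filtering described in Section 3.3.
# Feature representation (TF-IDF) and library choice (scikit-learn) are assumptions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC

def train_relevance_filter(labeled_texts, labels):
    """Train on the manually annotated subset (1 = relevant, 0 = irrelevant)."""
    vectorizer = TfidfVectorizer(lowercase=True, min_df=2)
    X = vectorizer.fit_transform(labeled_texts)
    clf = SVC(kernel="linear", probability=True, class_weight="balanced")
    clf.fit(X, labels)
    return vectorizer, clf

def keep_relevant(texts, vectorizer, clf, threshold=0.7):
    """Keep tweets whose predicted probability of being relevant exceeds the threshold."""
    X = vectorizer.transform(texts)
    relevant_col = list(clf.classes_).index(1)
    proba = clf.predict_proba(X)[:, relevant_col]
    return [t for t, p in zip(texts, proba) if p >= threshold]
```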

Table 3. Clusters of tweets

Cluster | N° of tweets | % of tweets
Xbox One X | 64,885 | 97.14%
New N2DS | 1,706 | 2.55%
Irrelevant tweets | 198 | 0.30%

Table 4. Sentiment analysis classification

Cluster | Positive | Negative | Neutral
Xbox One X | 35.99% | 4.65% | 59.37%
New N2DS | 52.99% | 1.58% | 45.43%
Overall | 36.42% | 4.57% | 59.01%

3.4 Sentiment analysis

Table 4 presents the results of the sentiment analysis. We classified each tweet according to its sentiment as positive, negative, or neutral. We used an established methodology developed by Cimino (2016). We pre-processed the tweets by removing mentions (@ character), URLs, product hashtags, emoticons and single characters. As a result, for each tweet we obtained a probability of belonging to each mood class. After a manual analysis, we used a class prediction probability threshold of 0.6 to filter out low-confidence predictions, i.e. tweets that cannot be classified as positive or negative with high confidence are classified as neutral instead.

3.5 Advantages and drawbacks analysis

To extract technical advantages and drawbacks from tweets we used the lexicon developed in Chiarello (2017), which contains 657 advantage words and 297 drawback clues. These words were searched for in our dataset, finding different percentages of tweets containing lexicon words in the two product-related clusters of tweets. Table 5 reports the results.
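The matching and labelling steps can be summarised in a short sketch. The lexicon file layout, the tokenisation and the single-token matching shown here are illustrative assumptions rather than the authors' implementation (the lexicon may also contain multi-word clues).

```python
# Hypothetical sketch of the advantage/drawback matching of Section 3.5 and its
# adaptation to a three-way label (Section 4). Lexicon loading and tokenisation
# details are assumptions.
import re

def load_lexicon(path):
    """One term per line, e.g. an 'advantages' file with 657 entries and a 'drawbacks' file with 297."""
    with open(path, encoding="utf-8") as f:
        return {line.strip().lower() for line in f if line.strip()}

def label_tweet(text, advantages, drawbacks):
    tokens = set(re.findall(r"[a-z']+", text.lower()))
    has_adv = bool(tokens & advantages)
    has_drw = bool(tokens & drawbacks)
    if has_adv and not has_drw:
        return "positive"   # only advantage words
    if has_drw and not has_adv:
        return "negative"   # only drawback words
    return "neutral"        # no lexicon words, or both (controversial)

def cluster_percentages(tweets, advantages, drawbacks):
    """Shares of positive/negative/neutral tweets in a product cluster (cf. Table 5)."""
    labels = [label_tweet(t, advantages, drawbacks) for t in tweets]
    return {lab: 100 * labels.count(lab) / len(labels)
            for lab in ("positive", "negative", "neutral")}
```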

Table 5: Percentages of tweets containing or not containing words from our lexicon.

Cluster | Tweets with adv | Tweets with drw | Tweets with adv & drw | Tweets with no adv or drw | Tweets with adv or drw
Xbox One X | 8.84% | 3.74% | 0.37% | 87.05% | 12.95%
New N2DS XL | 6.62% | 0.94% | 0.00% | 92.44% | 7.56%

4. Results: Comparison between Sentiment Analysis and Technical Advantages and Drawbacks Extraction

We adapted the advantages & drawbacks analysis to output a classification of each tweet. We classified the data coming from this analysis as follows: (i) positive (tweets containing only advantage words), (ii) negative (tweets containing only drawback words), (iii) neutral (tweets with no words from our lexicon, or controversial tweets). As we can see in Figure 1, sentiment analysis is more able to polarise tweets: with this analysis we found lower levels of neutral tweets, respectively 59.37% for Xbox One X and 45.43% for the New Nintendo 2DS XL.



Figure 1: Comparison between sentiment analysis and advantages/drawbacks analysis (shares of positive, negative and neutral tweets for Xbox One X and New Nintendo 2DS XL under each analysis).

This was an expected result, since this kind of analysis is designed to deal with colloquial language, while our lexicon is technical, being derived from patent analysis. What surprised us is the different polarisation of the products that we see when comparing the two analyses. In fact, while with sentiment analysis Nintendo achieves lower percentages of neutral tweets, with the advantages and drawbacks analysis the opposite holds, since Xbox tweets are more polarised. We also noted that we found more tweets with words of our lexicon in the Xbox subset than in the Nintendo one (Table 5). We hypothesised that the differences between the percentages of tweets containing lexicon words for each product, and the differences in polarisation between the two analyses, depend on the different marketing focus, target customers, and technological complexity of the two new video game consoles. Xbox One X targets hard-core gamers who really want a premium experience1. With its marketing campaign, Microsoft pushed the technical supremacy of its new machine over the competitors' products, fuelling a debate about its technical features amongst potential users. As a result, the campaign produced a more technical social discourse that allowed us to achieve better results. Instead, the new Nintendo handheld console has been developed targeting children and families, providing a model that falls somewhere in the middle of the 3DS console line2.

1 http://www.businessinsider.com/why-xbox-one-x-costs-500-2017-6?IR=T (last access: 17/11/2017)

We initially checked our hypothesis using Google Trends to compare users' search interest in technical reviews of the two products during the data collection period (Figure 2). Then, we analysed the number of technical articles related to the new products published by the 25 most popular video game and technology websites in the U.S., according to the ranking of SimilarWeb, a digital marketing intelligence company which publishes insights about websites. We entered the queries reported in Table 6 into the Google search engine to retrieve technical articles within the previously identified web domains: we obtained 1,117 articles about Xbox and only 52 about Nintendo, showing that the technical debate concerning Xbox is broader. This is evidence of the fact that when a product has a certain technological complexity and fuels a more technical debate, the advantages and drawbacks analysis is better able than sentiment analysis to produce technical-functional judgements. The greater number of neutral tweets found with the advantages and drawbacks analysis can also be explained with the means-end chain model (Reynolds, 1995): consumers express themselves based on the personal consequences linked with product use or on the personal values satisfied by the product itself. For these reasons, tweets contain a more colloquial language, which sentiment analysis is better able to interpret than our lexicon-based tool.


Figure 2: Google Trends comparison of the search terms "Xbox One X review" and "New Nintendo 2DS XL review" during the data collection period, from 11 June 2017 to 31 July 2017. Values on the vertical axis depict search interest relative to the highest point in the graph during the observation time; a value of 100 is the peak popularity for the term. On average, users searched for Xbox reviews approximately five times more frequently.
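A search-interest comparison of this kind can be reproduced, for instance, with the unofficial pytrends package; the package choice and parameters below are assumptions and not part of the original study, which only states that Google Trends was used.

```python
# Hypothetical sketch: comparing Google Trends interest for the two review queries
# over the data collection window (pytrends is an assumed, unofficial client).
from pytrends.request import TrendReq

pytrends = TrendReq(hl="en-US")
pytrends.build_payload(
    kw_list=["Xbox One X review", "New Nintendo 2DS XL review"],
    timeframe="2017-06-11 2017-07-31",
    geo="",  # worldwide
)
interest = pytrends.interest_over_time()  # DataFrame indexed by date, values 0-100
ratio = interest["Xbox One X review"].mean() / interest["New Nintendo 2DS XL review"].mean()
print(f"Average Xbox/Nintendo search-interest ratio: {ratio:.1f}")
```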

2 http://www.nintendolife.com/news/2017/05/reggie_explains_the_reasoning_behind_the_new_2ds_xl (last access: 17/11/2017)



Table 6: Queries entered into the Google search engine to search for technical articles within the selected web domains. We selected keywords related to technical features of the products. The examples report the queries used for one of the analysed websites: ign.com

Xbox One X | allintitle: (4k OR hdr OR hardware OR graphics OR review OR resolution OR fps OR fast OR comparison OR frame OR enhanced OR performance OR cpu OR gpu OR ram) AND ("xbox one x") site: ign.com
New Nintendo 2DS XL | allintitle: (graphics OR review OR screen OR comparison OR enhanced OR performance OR cpu OR gpu OR ram OR battery OR weight) AND "new nintendo 2ds xl" site: ign.com

5. Conclusion

Among methods and techniques for social media mining, sentiment analysis is one of the most appreciated tools amongst researchers, having a very good reputation in the informatics field. Big companies also make use of it, because it can be a rich source of information to adjust marketing strategies and improve campaign success, advertising messages, and customer service. Nevertheless, sentiment analysis is designed to extract feeling-related sentiment polarity from users' tweets and not other kinds of polarity, such as the polarity related to the technical advantages and drawbacks of the products users are experiencing. In this paper we showed how using a technical lexicon to analyse the technical polarity of tweets is a more effective approach for producing technical-functional judgements about a product with respect to state-of-the-art sentiment analysis techniques. This is particularly true when a product has a certain technological complexity.

References

Ahn H., and Spangler W. S. (2014). "Sales prediction with social media analysis". SRII Global Conference. IEEE, 2014.
Asur S., and Huberman B. A. "Predicting the Future With Social Media". In IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology. IEEE, 2010.
Bollen J., Mao H., and Zeng X. (2015). "Twitter mood predicts the stock market". Ref: http://arxiv.org/abs/1010.3003
Brown, Eric D., "Will Twitter Make You a Better Investor? A Look at Sentiment, User Reputation and Their Effect on the Stock Market" (2012). SAIS 2012 Proceedings. 7.
Cakra Y. E., and Trisedya B. D. "Stock price prediction using linear regression based on sentiment analysis". In 2015 International Conference on Advanced Computer Science and Information Systems (ICACSIS). IEEE, 2015.



Chen J., Haber E., Kang R., Hsieh G., and Mahmud J. "Making use of derived personality: The case of social media ad targeting". In Proceedings of the International AAAI Conference on Web and Social Media (ICWSM), 2015.
Chiarello F., Fantoni G., Bonaccorsi A. (2017). Product description in terms of advantages and drawbacks: Exploiting patent information in novel ways. ICED 2017.
Chung J., and Mustafaraj E. "Can collective sentiment expressed on twitter predict political elections?". In Twenty-Fifth AAAI Conference on Artificial Intelligence. AAAI, 2010.
Cimino A., Dell'Orletta F. (2016). "Tandem LSTM-SVM Approach for Sentiment Analysis". In Proceedings of EVALITA '16, Evaluation of NLP and Speech Tools for Italian, 7 December, Napoli, Italy.
Collomb A., Costea C., Brunie L. (2013). A Study and Comparison of Sentiment Analysis Methods for Reputation Evaluation.
Crone S. F., and Koeppel C. "Predicting exchange rates with sentiment indicators". In 2014 IEEE Conference on Computational Intelligence for Financial Engineering & Economics (CIFEr). IEEE, 2014.
Kim Y., Suh B., and Lee K. "#nowplaying the future billboard: mining music listening behaviors of twitter users for hit song prediction". In Proceedings of the First International Workshop on Social Media Retrieval and Analysis, pages 51-56. ACM, 2014.
Kordonis J., Symeonidis S., and Arampatzis A. "Stock price forecasting via sentiment analysis on Twitter". In Proceedings of the 20th Pan-Hellenic Conference on Informatics, article no. 36. ACM, 2016.
Korolov R., Peabody J., Lavoie A., Das S., Magdon-Ismail M., and Wallace W. "Actions are louder than words in social media". In IEEE/ACM International Conference on Advances in Social Network Analysis and Mining. IEEE, 2015.
Le L., Ferrara E., and Flammini A. "On predictability of rare events leveraging social media: a machine learning perspective". In Proceedings of the 2015 ACM on Conference on Online Social Networks. ACM, 2015.
Li Y., Rakesh V., and Reddy C. K. "Project success prediction in crowdfunding environments". In Proceedings of the Ninth ACM International Conference on Web Search and Data Mining, pages 247–256. ACM, 2016.
Mirtalaie M. A., Hussain O. K., Chang E., Hussain F. K. (2018). Sentiment Analysis of Specific Product's Features Using Product Tree for Application in New Product Development. Lecture Notes on Data Engineering and Communications Technologies.
Prasetyo N. D., and Hauff C. "Twitter-based election prediction in the developing world". In Proceedings of the 26th ACM Conference on Hypertext & Social Media, pages 149-158. ACM, 2015.
Rao T., and Srivastava S. "Analyzing stock market movements using twitter sentiment analysis". In Proceedings of the 2012 International Conference on Advances in Social Networks Analysis and Mining (ASONAM 2012), pages 119–123. IEEE Computer Society, 2012.
Razzaq M. A., Qamar A. M., and Bilal H. S. M. "Prediction and Analysis of Pakistan Election 2013 based on Sentiment Analysis". In IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM 2014). IEEE, 2014.
Reynolds T. J., Gengler C. E., and Howard D. J. (1995). A Means-End Analysis of Brand Persuasion through Advertising. International Journal of Research in Marketing, Vol. 12, No. 3, October, pp. 257–266.



Sang E. T. K., and Bos J. "Predicting the 2011 Dutch senate election results with Twitter". In Proceedings of the Workshop on Semantic Analysis in Social Media, pages 53-60. ACM, 2012.
Skoric M., and Poor N. "Tweets and Votes: A Study of the 2011 Singapore General Election". IEEE, 2012.
Tuarob S., and Tucker C. S. "Fad or here to stay: predicting product market adoption and longevity using large scale, social media data". In ASME 2013 International Design Engineering Technical Conferences and Computers and Information in Engineering Conference, Volume 2B: 33rd Computers and Information in Engineering Conference, Portland, Oregon, USA, August 4–7, 2013.
Tumasjan A., Sprenger T. O., Sandner P. G., and Welpe I. M. "Predicting elections with twitter: what 140 characters reveal about political sentiment". In Proceedings of the Fourth International AAAI Conference on Weblogs and Social Media. AAAI, 2010.
Xu F., and Keselj V. "Collective sentiment mining of microblogs in 24-hour stock price movement prediction". In IEEE 16th Conference on Business Informatics. IEEE, 2014.


2nd International Conference on Advanced Research Methods and Analytics (CARMA2018) Universitat Politècnica de València, València, 2018 DOI: http://dx.doi.org/10.4995/CARMA2018.2018.8337

Mining for Signals of Future Consumer Expenditure on Twitter and Google Trends Pekar, Viktor Finance Department, Business School, University of Birmingham, United Kingdom.

Abstract

Consumer expenditure constitutes the largest component of Gross Domestic Product in developed countries, and forecasts of consumer spending are therefore an important tool that governments and central banks use in their policy-making. In this paper we examine methods to forecast consumer spending from user-generated content, such as search engine queries and social media data, which hold the promise of producing forecasts much more efficiently than traditional surveys. Specifically, the aim of the paper is to study the relative utility of evidence about purchase intentions found in Google Trends versus that found in Twitter posts, for the problem of forecasting consumer expenditure. Our main findings are that, firstly, the Google Trends indicators and the indicators extracted from Twitter are both beneficial for the forecasts: adding them as exogenous variables into the regression model produces improvements on the pure AR baseline, consistently across all the forecast horizons. Secondly, we find that the Google Trends variables seem to be more useful predictors than the semantic variables extracted from Twitter posts; the differences in performance are significant, but not very large.

Keywords: Google Trends and Search Engine data; Social media and public opinion mining; Internet econometrics; Machine learning econometrics; Consumer behavior, eWOM and social media marketing.

This work is licensed under a Creative Commons License CC BY-NC-ND 4.0 Editorial Universitat Politècnica de València



1. Introduction

Consumer expenditure constitutes the largest component of Gross Domestic Product in developed countries: in the US, it accounts for about 70% of GDP, in the UK 66%, in Germany 60% (Pistaferri, 2015). Significant changes to consumer spending are key to predicting the depth of a recession or the speed of recovery, and central banks use consumer spending forecasts as an important tool for monetary policy-making. Government institutions and market research agencies compile their consumer spending indices on a regular basis. Among the best-known examples are the University of Michigan Consumer Sentiment Index for the US, or the Household Final Consumption Expenditure by the UK Office for National Statistics. Currently, such indices are measured by market research surveys, but these have significant drawbacks: they are expensive to organize, they have sampling problems, and the amount of effort required to collect and compile the data often entails that the indices are out of date by the time they are published.

This paper examines the hypothesis that user-generated content, such as search engine queries or social media posts, offers a better alternative to traditional surveys when it comes to estimating consumer expenditure. Effective methods to extract signals about future consumer spending from this data may help to produce forecasts more efficiently, based on much larger data samples, and in near-real time. Previous work studied models of consumer spending trained on search engine data, based on the intuition that web searches for product names indicate intended purchases (Vosen and Schmidt, 2011; Scott and Varian, 2015; Wu and Brynjolfsson, 2015). Another direction of research has been to estimate the economic confidence and purchase intentions of consumers from social media using automatic sentiment analysis (O'Connor et al., 2010; Daas and Puts, 2014; Najafi and Miller, 2016). In this paper we study the relative utility of evidence about purchase intentions found in search engine queries versus that found in social media, for the problem of forecasting consumer expenditure.

2. Google Trends

The Google Trends (GT) site provides data on the volume of all Google queries based on geographic location and time, collected since 2004. The frequency of a query is not reported as the absolute number of actual queries, but as a normalized index, such that for any given retrieval criteria the index is always between 0 and 100, with 100 corresponding to the count of the most common query in the retrieved data.



GT contains data not only on individual queries, but also on categories of queries. In our study we use the data on search volumes of the 18 subcategories of the top-level "Shopping" category in GT. Examples of the subcategories are "Apparel", "Consumer Electronics", "Luxury Goods", "Ticket Sales". The search volume of each category is used as an exogenous variable in the Support Vector Regression model. The time period we analyse spans 43 months (from 1 January 2014 to 31 July 2017). Because GT returns only weekly search volumes in one request for periods longer than six months, we are able to retrieve only weekly search volumes for the entire time period. To obtain daily search volumes, we first retrieve daily data in separate queries for each six-month period. Then, within each such subset, we fit a linear regression on the monthly data and use it to obtain daily volumes for the entire 43-month dataset.
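The paper does not spell out this rescaling regression in full detail; the sketch below shows one plausible reading, in which each six-month window of daily volumes is mapped onto the scale of the full-period weekly series via a fitted linear relation. The use of pandas/numpy and the alignment strategy are assumptions.

```python
# Hypothetical sketch: putting per-window daily Google Trends volumes on the scale
# of the full-period weekly series. This is one plausible reading of the paper's
# description, not the authors' actual implementation.
import numpy as np
import pandas as pd

def rescale_window(daily: pd.Series, weekly_full: pd.Series) -> pd.Series:
    """daily: 0-100 index for one six-month window; weekly_full: 0-100 index for the whole period."""
    # Broadcast the full-period weekly values onto the window's days and fit a
    # linear mapping from the window's local 0-100 scale to the full-period scale.
    aligned = pd.DataFrame({
        "window": daily,
        "full": weekly_full.reindex(daily.index, method="ffill"),
    }).dropna()
    slope, intercept = np.polyfit(aligned["window"], aligned["full"], deg=1)
    return slope * daily + intercept

# daily_full = pd.concat([rescale_window(w, weekly_full) for w in daily_windows]).sort_index()
```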

3. Purchase intentions on Twitter

Our method aims to predict a consumer spending index from the mentions of purchase intentions in Twitter posts. The method consists of the following steps. First, tweets mentioning a purchase intention are identified. Second, noun phrases referring to the objects of the intended purchases are extracted and represented as semantic vectors using the word2vec method. Finally, a regression model of the consumer spending index is trained that uses the semantic vectors as explanatory variables. These steps are detailed below.

3.1. Detecting purchase intentions

To obtain tweets mentioning purchase intentions, we issue a set of queries to the tweet collection, which are meant to capture common ways to express an intention to buy something. They are created from combinations of (1) first-person pronouns ("I" and "we"), (2) verbs denoting intentions ("will", "'ll", "be going to", "be looking to", "want to", "wanna", "gonna"), and (3) verbs denoting purchase ("buy", "shop for", "get oneself"), thus obtaining queries such as "I will buy" or "we are going to buy". The text of each tweet is then processed with a part-of-speech tagger. PoS tag patterns are then applied to extract the head noun of the noun phrase following the purchase verb (e.g., "headphones" in "I am looking to buy new headphones"). After that, daily counts of the head nouns are calculated.
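For illustration, the sketch below generates the query strings and applies a crude regular expression in place of the PoS-tag patterns; the pattern details and function names are assumptions, not the authors' implementation.

```python
# Hypothetical sketch of Section 3.1: building purchase-intention queries and
# extracting the object of the intended purchase. The regex is a rough stand-in
# for the part-of-speech tag patterns used in the paper.
import itertools
import re

PRONOUNS = ["i", "we"]
INTENT_VERBS = ["will", "'ll", "am going to", "are going to", "am looking to",
                "want to", "wanna", "gonna"]
PURCHASE_VERBS = ["buy", "shop for", "get myself"]

# Naive cross product; ungrammatical combinations (e.g. "we am going to buy")
# would be filtered or conjugated per pronoun in a real implementation.
QUERIES = [" ".join(p) for p in itertools.product(PRONOUNS, INTENT_VERBS, PURCHASE_VERBS)]

def purchase_object(tweet):
    """Rough guess at the head noun following the purchase verb, if any."""
    m = re.search(
        r"\b(?:buy|shop for|get (?:myself|ourselves))\s+(?:(?:a|an|the|some|new)\s+)?([a-z]+)",
        tweet.lower(),
    )
    return m.group(1) if m else None

# purchase_object("I am looking to buy new headphones")  ->  "headphones"
```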



3.2. Semantic vectors

To represent the semantics of the nouns, we use the word2vec method (Mikolov et al., 2013), which has been proven to produce accurate approximations of word meaning in different NLP tasks. A word2vec model is a neural network that is trained to reconstruct the linguistic context of words. The model is built by taking a sequence of words as input and learning to predict the next word, using a feed-forward topology where a projection layer in the middle is taken to constitute a semantic vector for the word, after the connection weights have been learned. The semantic vector is a fixed-length, real-valued pattern of activations reaching the projection layer. For each word, the input originally has a dimensionality equal to the vocabulary size of the training corpus (typically millions of words), and the semantic modelling provides a reduction to the size of the vector (typically several hundred). For each date, we map each noun that was observed on that day to a semantic vector, using 100-dimensional word2vec vectors trained on a large corpus of Twitter posts. The semantic vectors of all the nouns for each day are then averaged to obtain a single vector. The components of the vectors are then used as exogenous variables in the regression models. To allow for some time between the stated purchase intention and the actual purchase, we experiment with the "intention lag", i.e. different numbers of days between the day on which intentions were registered and the day for which the value of the consumer spending index is predicted.
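A sketch of the daily averaging step follows, assuming pre-trained 100-dimensional vectors loadable with gensim's KeyedVectors; the vector file name is illustrative and not taken from the paper.

```python
# Hypothetical sketch: averaging word2vec vectors of the purchase-object nouns
# observed on a given day (Section 3.2). The vector file name is an assumption.
import numpy as np
from gensim.models import KeyedVectors

# 100-dimensional vectors pre-trained on a Twitter corpus (illustrative file name).
vectors = KeyedVectors.load_word2vec_format("twitter_w2v_100d.bin", binary=True)

def daily_semantic_vector(nouns, kv=vectors, dim=100):
    """Average the vectors of the day's nouns; unknown words are skipped."""
    known = [kv[w] for w in nouns if w in kv]
    return np.mean(known, axis=0) if known else np.zeros(dim)

# The components of the resulting vector become exogenous variables in the regression.
```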

4. Experiments

4.1. Data

Indicator of Consumer Expenditure. As the forecast variable in our model, we use the Gallup Consumer Spending Index (CSI)1. The index represents the average dollar amount Americans report spending on a daily basis. The eventual index is presented as 3-day and 14-day rolling averages of these amounts. In our evaluation, we used the 3-day values of the CSI, between January 1, 2014 and July 31, 2017, i.e. 1,310 days in total.

Twitter. For the same time period, we collected Twitter posts that originate from the US and that express intentions to buy, obtaining a total of 288,730 messages. Counts of nouns referring to purchases were extracted and rolling averages for each noun over three-day periods were calculated. To eliminate noisy data, we selected the 1,000 most common nouns to construct semantic vectors.

Google Trends. Also for the same period, we obtained the frequencies of searches in the 18 subcategories of the top-level Shopping category from the GT site, limiting the data to the US.

Train-validation-test split. The available data was divided into training, validation and test parts, in the proportion 60%-20%-20%.

1 http://www.gallup.com/poll/112723/gallup-daily-us-consumer-spending.aspx



4.2. Modelling strategies

We experiment with four methods to ensure stationarity of the time series data: differencing, detrending, seasonal adjustment, and detrending with seasonal adjustment. Detrending and seasonal adjustment are performed using the STL method (Cleveland et al., 1990). Before evaluating the quality of forecasts on test data, the forecasts are de-differenced, and the trend and the seasonal component estimated on training data are added back to the forecasts.

4.3. Support vector regression

The Support Vector Machines learning algorithm (Cortes and Vapnik, 1995) is one of the most popular machine learning methods for supervised learning. In our experiments we use Support Vector Regression (SVR), a version of SVM adapted for regression. During evaluation, we experimentally determine the free parameters of SVR (the cost parameter, the gamma parameter and the kernel type) on a validation dataset using the grid search technique. The model with the best parameter configuration is then evaluated on the test set.

4.4. Evaluation method

Once a model was trained on the training set and its parameters optimized on the validation set, it was evaluated on the test set using dynamic forecasting: given the first day t of the test set and the forecast horizon h, the model predicted h days into the future; for each day from the second to the h-th, the values predicted by the model for the previous days were input as endogenous variables. In the following, we report results for h = 7, 14, and 28. As evaluation metrics, we use the Root Mean Squared Error (RMSE) and the Mean Absolute Error (MAE). As the baselines, we use SVR models trained with the same algorithms but only on endogenous variables, i.e. lagged values of the CSI. Because the CSI displayed weekly seasonality, we used seven lagged variables in the baseline model.
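A compact sketch of these two steps is given below, assuming statsmodels' STL implementation and scikit-learn's SVR; the parameter grid is illustrative, and cross-validation here stands in for the paper's fixed validation split.

```python
# Hypothetical sketch of Sections 4.2-4.3: STL detrending/deseasonalising of the
# CSI series and a grid search over SVR parameters. Grids and period are illustrative.
import pandas as pd
from statsmodels.tsa.seasonal import STL
from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV

def detrend_deseasonalise(csi: pd.Series, period: int = 7):
    """Remove trend and weekly seasonality; return components to add back to forecasts."""
    res = STL(csi, period=period).fit()
    return csi - res.trend - res.seasonal, res.trend, res.seasonal

def fit_svr(X_train, y_train):
    grid = {
        "kernel": ["rbf", "linear"],
        "C": [0.1, 1, 10, 100],
        "gamma": ["scale", 0.01, 0.1, 1],
    }
    search = GridSearchCV(SVR(), grid, scoring="neg_root_mean_squared_error", cv=3)
    search.fit(X_train, y_train)
    return search.best_estimator_
```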

5. Results and discussion

5.1. Modelling strategies

An inspection of the correlogram of the CSI time series suggested that it is likely to have weekly seasonality. Furthermore, considering the long time period the data covers, the data may contain a trend. Therefore we first examined the effect of different techniques to "whiten" the time series on the quality of the forecasts. Table 1 details the results (the baseline refers to the raw original data).



Table 1. Forecast accuracy for different time series transformation methods.

Transformation | h=7 RMSE | h=7 MAE | h=14 RMSE | h=14 MAE | h=28 RMSE | h=28 MAE
Baseline | 12.78 | 9.9 | 14.42 | 11.23 | 14.42 | 11.38
Differencing | 12.11 | 9.57 | 21.51 | 18.0 | 11.77 | 9.04
Detrending | 10.64 | 8.18 | 11.49 | 8.9 | 11 | 8.42
Deseasonalizing | 12.62 | 9.85 | 14.57 | 11.36 | 14.35 | 11.33
Detrend+Deseason | 10.5 | 7.96 | 11.48 | 8.64 | 10.96 | 8.23

These results show that applying both detrending and seasonal adjustment consistently produced the best forecasting results, for all forecasting horizons. Therefore, in the subsequent experiments, the CSI data was detrended and deseasonalized.

5.2. Effect of Google Trends variables

We next examined the effect of supplying GT data as exogenous variables to the regression model, in addition to the autoregressive variables. The results are shown in Table 2. We find that the GT variables do often perform better than the purely endogenous baseline, for all three forecast horizons. It also appears that shorter intention lags (between 0 and 4) produce better-quality models. The best-performing model is the lag-one model, which beats the baseline by 3-7% across all the horizons.



Table 2. Forecast accuracy with GT variables.

Intention lag | h=7 RMSE | h=7 MAE | h=14 RMSE | h=14 MAE | h=28 RMSE | h=28 MAE
0 | 10.49 | 8.03 | 10.9 | 8.19 | 10.57 | 8.0
1 | 10.21 | 7.69 | 10.81 | 8.06 | 10.52 | 7.85
2 | 10.4 | 7.64 | 10.78 | 7.97 | 10.7 | 7.98
3 | 10.56 | 8.0 | 10.78 | 8.02 | 10.68 | 8.02
4 | 10.84 | 8.3 | 11.15 | 8.34 | 11.5 | 8.63
5 | 10.57 | 8.06 | 11.52 | 8.65 | 11.62 | 8.73
6 | 10.46 | 7.99 | 11.48 | 8.65 | 10.9 | 8.24
7 | 11.19 | 8.58 | 11.28 | 8.39 | 11.09 | 8.33

5.3. Effect of Twitter variables

Table 3 presents results on the effect of the semantic variables extracted from Twitter posts. As with the GT variables, improvements are found across all the horizons. However, the baseline is consistently outperformed only when the intention lag is 0, and the improvements are more modest, ranging between 1 and 5%. Comparing the performance of the models with GT variables and with Twitter variables, we observe that the GT model tends to fare better, but the gain over the Twitter model is not more than 0.3 points (3.6%) in either RMSE or MAE. Still, the differences in forecasts between the two types of models at corresponding horizons are statistically significant.

6. Conclusions

In this paper we presented a study comparing indicators of purchase intentions obtained from Google Trends to those obtained from Twitter using NLP analysis of the messages, on the task of forecasting consumer expenditure. Our main findings are that, firstly, both kinds of purchase intention indicators are beneficial for the forecasts: the improvements on the baseline are consistent across all the forecast horizons and in terms of both evaluation metrics. Secondly, the study found that the Google Trends variables seem to be more useful predictors than the semantic variables extracted from Twitter posts, although the differences in performance are not very large.



Table 3. Forecast accuracy with Twitter variables.

Intention lag | h=7 RMSE | h=7 MAE | h=14 RMSE | h=14 MAE | h=28 RMSE | h=28 MAE
0 | 10.34 | 7.76 | 10.92 | 8.21 | 10.88 | 8.15
1 | 10.73 | 7.99 | 11.37 | 8.49 | 11.09 | 8.29
2 | 10.75 | 8.06 | 11.46 | 8.61 | 11.14 | 8.37
3 | 10.57 | 8.01 | 11.3 | 8.55 | 10.97 | 8.34
4 | 10.65 | 8.25 | 11.33 | 8.57 | 11.04 | 8.42
5 | 10.71 | 8.03 | 11.31 | 8.46 | 11.51 | 8.67
6 | 10.92 | 8.15 | 11.32 | 8.43 | 11.52 | 8.69
7 | 10.86 | 8.14 | 11.54 | 8.68 | 11.53 | 8.7

Future directions for this work may involve a further analysis of models that use Google Trends data, such as an analysis of more fine-grained Google Trends subcategories, automatic selection of the most relevant predictors among them, and their semantic clustering.

References

Cleveland R., Cleveland W., McRae J., & Terpenning I. (1990). STL: A Seasonal-Trend Decomposition Procedure Based on Loess. Journal of Official Statistics, 6(1), 3–73.
Cortes C., & Vapnik V. (1995). Support-vector networks. Machine Learning, 20(3), 273–297.
Daas P., & Puts M. (2014). Social media sentiment and consumer confidence. In Workshop on using Big Data for forecasting and statistics.
Mikolov T., Chen K., Corrado K., & Dean J. (2013). Efficient estimation of word representations in vector space. In Proceedings of CoRR.
Najafi H., & Miller D. (2016). Comparing analysis of social media content with traditional survey methods of predicting opening night box-office revenues for motion pictures. Journal of Digital and Social Media Marketing, 3(3), 262–278.



O'Connor B., Balasubramanyan R., Routledge R., & Smith N. (2010). From tweets to polls: Linking text sentiment to public opinion time series. In Proceedings of ICWSM.
Pistaferri L. (2015). Household consumption: Research questions, measurement issues, and data collection strategies. Journal of Economic and Social Measurement.
Scott S., & Varian H. (2015). Bayesian Variable Selection for Nowcasting Economic Time Series. In Economic Analysis of the Digital Economy, pages 119–135. University of Chicago Press.
Vosen S., & Schmidt T. (2011). Forecasting private consumption: survey-based indicators vs. Google trends. Journal of Forecasting, 30(6), 565–578.
Wu L., & Brynjolfsson E. (2015). The future of prediction: How Google searches foreshadow housing prices and sales. In Economic Analysis of the Digital Economy, pages 89–118. University of Chicago Press.


2nd International Conference on Advanced Research Methods and Analytics (CARMA2018) Universitat Politècnica de València, València, 2018 DOI: http://dx.doi.org/10.4995/CARMA2018.2018.8338

Towards an Automated Semantic Data-driven Decision Making Employing Human Brain Fensel, Anna Semantic Technology Institute (STI) Innsbruck, Department of Computer Science, University of Innsbruck, Austria

Abstract

Decision making is time-consuming and costly, as it requires direct intensive involvement of the human brain. The variety of expertise of highly qualified experts is very high, and the available experts are mostly not available on short notice: they might be physically remotely located, and/or not able to address all the problems they could address time-wise. Further, people tend to base more of their intellectual labour on rapidly increasing volumes of online data, content and computing resources, and the lack of corresponding scaling in the availability of human brain resources poses a bottleneck in intellectual labour. We discuss enabling direct interoperability between the Internet and the human brain, developing an "Internet of Brains", similar to the "Internet of Things", where one can semantically model, interoperate and control real-life objects. The Web, "Internet of Things" and "Internet of Brains" will be connected employing the same kind of semantic structures, and work in interoperation. Applying Brain Computer Interfaces (BCIs), psychology and behavioural science, we discuss the feasibility of a possible decision making infrastructure for the semantic transfer of human thoughts, thinking processes and communication directly to the Internet.

Keywords: Semantic Technology; Decision Making; Brain Computer Interface; Data Value Chain; Artificial Intelligence; Data Management.

This work is licensed under a Creative Commons License CC BY-NC-ND 4.0 Editorial Universitat Politècnica de València



1. Introduction

Decision making is time-consuming and costly, as it requires direct intensive involvement of the human brain. The variety of expertise of highly qualified experts is very high, and the available experts are mostly not available on short notice: they might be physically remotely located, and/or not able to address all the problems they could address time-wise. Generally, exchanging and managing data on the Internet in a dynamic and efficient manner are among the key challenges for the information systems requested nowadays by enterprises, institutions and citizens. People tend to base more of their intellectual labor on rapidly increasing volumes of online data, content and computing resources, and the lack of corresponding scaling in the availability of human brain resources poses a bottleneck for intellectual labor. Finally, communicating the results of intellectual labor requires further effort to put the outcomes into a commonly processible representation form, such as spoken words or written texts.

To approach the optimal data management of the future, we discuss the possibility of enabling direct interoperability between the Internet and the human brain, developing an "Internet of Brains", similar to the "Internet of Things". With the latter, one can semantically model, interoperate and control real-life objects, and the applications of the semantic Internet of Things are numerous, see e.g. the areas of smart homes or transport. The Web, the "Internet of Things" and the "Internet of Brains" will be connected employing the same kind of semantic structures, and work in interoperation. Applying Brain Computer Interfaces (BCIs), psychology and behavioral science, an infrastructure for the semantic transfer of human thoughts, thinking processes and communication directly to the Internet can be designed. This will facilitate intellectual labor and its representation in human- and machine-readable forms, and address aspects difficult to account for so far, e.g. non-verbal communication. Service-based enablers for the discovery of interdependencies across human reasoning and senses and heterogeneous datasets can be created for assisting humans in making decisions and changing their behavior and workflows, as well as for making these decisions and workflows more transparent and traceable. The latter can be performed taking into account currently existing developments and standards in the related fields, particularly semantic data licensing (Pellegrini et al., 2018) and smart contracts, as these are already exploited broadly in practice (Underwood, 2016).

The high-level results of the envisioned solution will include:

- a framework for applying human thoughts and senses in decision making, through semantic interfaces, and its concrete design and implementation, synthesizing concrete semantics from abstract thoughts and emotions, and
- the automation of intellectual labor (in the way robotics is replacing manual labor), with the employment of these capacities.

And the corresponding technical objectives are as follows:

- Design and development of a semantic infrastructure capturing the domain of human reasoning and senses, as well as the decision making and intellectual work processes that are based on them,
- Mapping the output of state-of-the-art BCIs to the semantic infrastructure, producing a corresponding mappings library,
- With the framework for streaming human brain activity online, enabling easier modeling of the data in both design time and run time of the digital workplace scenario, and eventually enabling organizations to create their own applications and workflows based on these models,
- Speeding up the velocity of the data flow in information systems that are currently bottlenecked by the slow speed of human decision making, or where decisions are even made with mistakes due to its imperfection (e.g. in scenarios connected with reporting),
- Making the decision processes transparent, traceable, and easier to optimize (e.g. it can be easily established which nodes are causing delays),
- Integrating new techniques facilitating easier data reuse, such as semantic information on how data and content can be licensed (a licenses library and tools can be applied out of our development in the DALICC project1),
- Visualization of the data and decisions in a form that is actionable for humans in a digital workplace scenario.

2. State of the Art and Progress Beyond It

The proposed solution will aim to advance the state of the art in the following areas: (1) semantic modelling and knowledge representation, (2) data-empowered reasoning and decision making, (3) sensor technology. Below, we give an overview of the state of the art in these fields and explain how the aimed results are expected to advance it.

1 DAta LIcenses Clearing Center: https://dalicc.net



2.1. Semantic Technology as a Communication Means on the Internet

"The Semantic Web is not a separate Web but an extension of the current one, in which information is given well-defined meaning, better enabling computers and people to work in cooperation." This statement by Tim Berners-Lee has gained even more relevance since the start of this century (Berners-Lee et al., 2001). The vision of the Semantic Web seminal paper came closer, starting with the appearance of the basic semantic languages such as RDF, RDFS, OWL, and semantic web service languages. Since the early days of the Semantic Web, the challenge of using semantics to facilitate human communication has been addressed, with the "semantic desktop" initiative and the NEPOMUK project (Decker and Frank, 2004) being among the first efforts. There now exist IT multi-stakeholder ecosystems and infrastructures to interoperate across different marketing data and content resources using Linked Data (Bizer et al., 2009) and semantic technologies (Domingue et al., 2011), enhancing the interoperability of distributed resources to allow meaningful searches and efficient information dissemination for humans as well as machines. Research and developments on combining human and computing resources are abundant, see e.g. developments such as "social machines" (Hendler and Mulvehill, 2016) and use cases of that kind in infrastructures such as Wikipedia (Smart et al., 2014); however, none of them yet comprises processing of direct input from the human brain.

In this development, it will become possible to skip the step involving natural language processing technology and to communicate the outcomes of human thinking to machines in a semantic form. This will be used in the first place by people with limited abilities, but also, for reasons of efficiency and scalability, by people who are able to represent their thinking in an intermediate representation format (spoken words, written texts, etc.). In research and commercial developments, methods and tools to extract semantics from intermediary communication means have been developed, for example extracting emotions and sentiments from the Web (Baldoni et al., 2012) as well as from natural language texts (Mathieu, 2005). Such methods already work in practice with relatively large success, but inherently presume the availability of intermediary knowledge representation sources. On the contrary, here, the extraction of emotions and sentiments from the human brain would take place directly. Some of the modeling and representation of sentiments and emotions from state-of-the-art research can be taken into account when modelling the framework, also including dedicated efforts to build the relevant ontologies (López et al., 2008; Borth et al., 2013).

2.2. Reasoning and Decision Making in the "Data Tsunami" Conditions

As is known, human reasoning and decision making normally involve "soft" and "hard" factors. For example, if someone is hiring an employee to work with, usually both sides are important: whether a potential employee has an adequate qualification and experience ("hard" factors), and whether he or she would fit well in the team ("soft" factors). Frameworks and models are currently starting to appear in the literature in application to various tasks and domains, e.g. forecasting (Bańbura & Rünstler, 2011), as well as approaches towards explaining human decision making (Rosenfeld and Kraus, 2018). Such works approach the possibilities to formalize the decision making and reasoning process semantically. In a world overflown by a "data tsunami", humans are standing at the edge of their decision making and behavior change capacities, and the need to overcome these limits is unavoidable. The reasons are as follows:

- a drastic increase in the amounts of data and information that can potentially be relevant for making the right decisions; de facto, the "hard" reasoning we currently perform is almost always "incorrect and incomplete", and methods and tools to address such reasoning (Fensel et al., 2008) have been developed, particularly in the EU LarKC project2,
- the limitations and restrictions of the human mind in taking decisions (such as its limited immediate storage capacity and the irrationality caused by bias (Boutang and De Lara, 2015)),
- the increase in dynamicity: often, situations change on the fly, and the data and workflow models used may need to be replaced, as well as the behavior changed; this, again, poses a challenge to the human brain in the choice of goals, methods, and implementation of tasks,
- effective decisions and human behavior changes are essential parts of success, in particular economic success; even more dramatically, in some areas such as climate change or energy efficiency, the change of human behavior may mean the difference between "to be" and "not to be" for humankind.

Given the ability to process and analyze large amounts of data, machines already arguably outperform humans when it comes to intellectual labor where only "hard" factors are involved. However, many decisions based solely on "hard" facts remain unviable in the real world, as they may run contrary to human senses, emotions, feelings, intuition, and eventually to the safety of humans and their acceptance. Drawing on emotions (fear, curiosity, enjoyment and many others), human brains are able to rather successfully filter out the "right" contexts and define new ones (Kahneman, 2011), i.e. they possess the "soft factor" capacity which machines do not.

2 Large Knowledge Collider: http://larkc.sti2.at (archived web site)



Online communication, on the other hand, is not trivial, as it still hides most, or at least a large part, of the semantics, e.g. that transferred over non-verbal channels in face-to-face communication. This is leaving aside the fact that a human, in order to communicate, needs to create a representation of a thought or an emotion, e.g. spoken words, images, or text, which is of course not 100% identical to the original thought or emotion. Here, we will pursue the elimination of the typical intermediate representation layer for human brain activity, create a precise machine- and human-readable semantic layer for it instead, and map the signals coming out of the bio-sensing equipment directly into this layer. In communication infrastructures, heterogeneous communities of stakeholders need to be addressed, and semantics is a very suitable instrument for this, as the essence of ontologies inseparably reflects the communities using them (Mika, 2005; Zhdanova, 2008). An additional challenge here is that humans also have a tendency to conceal the outcomes of their thinking, or even to communicate facts that do not correspond to them, if they feel they would gain an advantage, in particular a match to a desired limited resource on the market (Roth, 2015), or a better perception by society. Again, semantics has the potential to resolve this challenge and to test/simulate the realities which would take place under the conditions of humans expressing their actual thoughts and feelings.

2.3. Hardware and Sensors Availability

A better understanding of the human brain stands high on the European Commission's priority list: e.g. the Human Brain Project3 is ongoing as an H2020 FET Flagship Project which, since 2013 and with a duration of 10 years, strives to accelerate the fields of neuroscience, computing and brain-related medicine. On the side of technical development, BCIs have also been investigated, and forecasts and scenarios have been roadmapped, confirming the expected broad spread and varying spectrum of application scenarios where BCIs will be used (Brunner et al., 2015). Now is the right time to base the project on this technology, for the following technical reasons:

1) The BCI technology is becoming mature and available. Companies selling advanced consumer-oriented products in the sub-1000 dollar range include Emotiv (http://www.emotiv.com) and Neurosky (http://www.neurosky.com). There are also very inexpensive open source biosensing solutions, such as OpenBCI (http://openbci.com). Generally, tools for making technical connections to the brain currently cost as little as 30 dollars.

2) The user acceptance level of the technology is also becoming sufficient to expect that a system developed on the basis of a BCI will be used and usable. BCIs are already actively used beyond their typical medical scenarios and are, e.g., employed in games; however, also there a systematic approach to the related data management is missing (Gurkok et al., 2017). Thus, a systematic approach for integrating human thinking and reasoning activity on the Internet needs to be designed and developed practically.

3 Human Brain Project: https://www.humanbrainproject.eu/en/

3. Conclusions

We have described the initial principles and prerequisites of the direct integration of data stemming from the human brain into the decision making processes of the future. The main measurable success criterion of this work can be characterized as a transition step from Big Data to Smart Data. It typically involves enabling more efficient value-adding participation, i.e. increasing efficiency and/or the provisioning and take-up of new types of interactivity which bring benefits to the involved stakeholders (for example, faster decision making, or less effort to transform brain activity into intermediary communication formats such as spoken words or written text). Further, the societal externalities of Big Data use (Cuquet and Fensel, 2018) will be accounted for, even when humans are unaware of them. The approach is to be realized in chosen application domains, going beyond the current state of the art of ontology-based service interfacing, integration, and bio- and crowd-sensing. The results are to be evaluated within real scenarios, with real-life data and services, as well as with real end users. The evaluation outcomes are to confirm the technical feasibility of ontology-based enablement of intervention networked services, as well as its added value from the originated new usage scenarios and its acceptance by the end users.

Acknowledgements

This work has been partially funded by project DALICC, supported by the Austrian Research Promotion Agency (FFG) within the program "Future ICT".

References

Baldoni, M., Baroglio, C., Patti, V., & Rena, P. (2012). From tags to emotions: Ontology-driven sentiment analysis in the social semantic web. Intelligenza Artificiale, 6(1), 41-54.



Bańbura, M., & Rünstler, G. (2011). A look into the factor model black box: publication lags and the role of hard and soft data in forecasting GDP. International Journal of Forecasting, 27(2), 333-346.
Berners-Lee, T., Hendler, J., & Lassila, O. (2001). The semantic web. Scientific American, 284(5), 34-43.
Bizer, C., Heath, T., & Berners-Lee, T. (2009). Linked data - the story so far. Semantic Services, Interoperability and Web Applications: Emerging Concepts, 205-227.
Borth, D., Chen, T., Ji, R., & Chang, S. F. (2013, October). SentiBank: large-scale ontology and classifiers for detecting sentiment and emotions in visual content. In Proceedings of the 21st ACM International Conference on Multimedia, 459-460. ACM.
Boutang, J., & De Lara, M. (2015). The Biased Mind: How Evolution Shaped Our Psychology Including Anecdotes and Tips for Making Sound Decisions. Springer.
Brunner, C., Birbaumer, N., Blankertz, B., Guger, C., Kübler, A., Mattia, D., del R. Millán, J., Miralles, F., Nijholt, A., Opisso, E., Ramsey, N., Salomon, P., & Müller-Putz, G. R. (2015). BNCI Horizon 2020: towards a roadmap for the BCI community. BCI Journal. URL: http://bnci-horizon-2020.eu/roadmap
Cuquet, M., & Fensel, A. (2018). The societal impact of big data: A research roadmap for Europe. Technology in Society, Elsevier.
Decker, S., & Frank, M. (2004). The social semantic desktop. Digital Enterprise Research Institute, DERI Technical Report, May 2004.
Domingue, J., Fensel, D., & Hendler, J. A. (Eds.). (2011). Handbook of semantic web technologies (Vol. 1). Springer Science & Business Media.
Fensel, D., Van Harmelen, F., Andersson, B., Brennan, P., Cunningham, H., Della Valle, E., Fischer, F., Huang, Z., Kiryakov, A., Kyung-il Lee, T., Schooler, L., Tresp, V., Wesner, S., Witbrock, M., & Zhong, N. (2008). Towards LarKC: a platform for web-scale reasoning. In IEEE International Conference on Semantic Computing, 524-529, IEEE.
Gurkok, H., Nijholt, A., & Poel, M. (2017). Brain-Computer Interface Games: Towards a Framework. Handbook of Digital Games and Entertainment Technologies, 133-150.
Hendler, J., & Mulvehill, A. M. (2016). Social machines: the coming collision of artificial intelligence, social networking, and humanity. Apress.
Kahneman, D. (2011). Thinking, fast and slow. Macmillan.
López, J. M., Gil, R., García, R., Cearreta, I., & Garay, N. (2008, September). Towards an ontology for describing emotions. In World Summit on Knowledge Society, 96-104, Springer, Berlin, Heidelberg.
Mathieu, Y. Y. (2005, October). Annotation of emotions and feelings in texts. In International Conference on Affective Computing and Intelligent Interaction, 350-357, Springer, Berlin, Heidelberg.
Mika, P. (2005). Ontologies are us: a unified model of social networks and semantics. In Proceedings of the 4th International Semantic Web Conference, LNCS 3729, 522–536, Springer.
Pellegrini, T., Schönhofer, A., Kirrane, S., Steyskal, S., Fensel, A., Panasiuk, O., Mireles-Chavez, V., Thurner, T., Dörfler, M., & Polleres, A. (2018). A Genealogy and Classification of Rights Expression Languages – Preliminary Results. In: Trends and Communities of Legal Informatics - Proceedings of the 21st International Legal Informatics Symposion, IRIS 2018, 243-250, Salzburg, Austria.
Rosenfeld, A., & Kraus, S. (2018). Predicting Human Decision-Making: From Prediction to Action. Morgan and Claypool.
Roth, A. E. (2015). Who Gets What—and Why: The New Economics of Matchmaking and Market Design. Houghton Mifflin Harcourt.
Smart, P., Simperl, E., & Shadbolt, N. (2014). A taxonomic framework for social machines. In Social Collective Intelligence, 51-85, Springer, Cham.
Underwood, S. (2016). Blockchain beyond bitcoin. Communications of the ACM, 59(11), 15-17.
Zhdanova, A. V. (2008). Community-driven Ontology Construction in Social Networking Portals. International Journal on Web Intelligence and Agent Systems, 6(1), 93-121, IOS Press.


2nd International Conference on Advanced Research Methods and Analytics (CARMA2018) Universitat Politècnica de València, València, 2018 DOI: http://dx.doi.org/10.4995/CARMA2018.2018.8341

Limits and virtues of a web survey on political participation and voting intentions. Reflections on a mixed-method research path

Faggiano, Maria Paola
Department of Communication and Social Research (CoRiS), Sapienza University, Rome, Italy

Abstract

The Internet offers new opportunities for empirical research, especially considering that most citizens are now web users: on the one hand, we are seeing traditional methodologies transferred to the Internet; on the other hand, we are witnessing the development of new, innovative data collection and analysis tools. The study was conducted with a classical survey tool, the questionnaire, used as part of a web survey. Facebook was chosen as the platform because it is particularly suitable for the investigated topic (political participation and voting intentions): the campaign for the 2018 Italian general election took place, for all parties and candidate leaders, mainly on this social network. Two surveys were carried out, the first in September 2017 and the second in February 2018, reaching about 850 and 1,400 cases respectively, with similar percentages across the whole block of variables and with stable connections among them. The aim is to highlight the advantages and disadvantages of a web survey on the topic of political participation, paying particular attention to the strategic choices and decisions that positively affect data quality, according to a mixed-method approach.

Keywords: web survey; political participation; voting intentions; social network; mixed-method approach.



1. Old and new sociological survey tools in the Digital Age: A focus on the web survey

The rapid spread of digital technologies into every dimension of everyday life has inevitably produced changes in practices, styles, relationships and social interactions. The interest of researchers in the human sciences in the effects of the digitalization process is growing; at the same time, the Internet offers new opportunities for empirical research (Cipriani, Cipolla, Losacco, eds., 2013; Boccia Artieri, ed., 2015): on the one hand, we are witnessing the transfer to the Web (and the adaptation) of some traditional methodologies (primarily the questionnaire); on the other hand, we can observe the development of new, innovative data collection and analysis tools. The Internet represents a great opportunity for the study of political participation and many other topics of sociological interest, considering that the majority of citizens are now Internet users and are therefore easily reachable for the purpose of completing a questionnaire (today 67% of Italians surf the Web - Demopolis Report, 2017). However, a generational gap remains: almost all young people use the Internet, while the "digital divide" is only gradually being reduced for the elderly population through the mass adoption of mobile devices such as smartphones and tablets. At the same time, for the purpose of conducting surveys, it is increasingly difficult to draw up reliable population lists from which to extract random samples, or to acquire updated and complete telephone registers with which to conduct telephone or face-to-face interviews. Moreover, we cannot overlook an issue that applies both to online and offline surveys: in many cases the excessive circulation of investigations, opinion polls, market research and the like has made the citizen-voter elusive and uncooperative, ever less inclined to be interviewed. Voters are sometimes very suspicious of, or unavailable for, any initiative that requires them to answer a set of questions (and respondents sometimes show that they cannot distinguish a request to collaborate in a sociological survey from one coming from direct marketing activities). Online questionnaires are indeed everywhere: 1. prepared for the launch of a new product or service; 2. designed to record customer satisfaction with a service or to collect student feedback in a school or university; 3. addressed to citizens to learn their points of view and needs with respect to a collective service; 4. prepared by political parties to mobilize and get to know voters, etc. Focusing specifically on the Internet, although a self-selected sample poses well-known problems of statistical representativeness with respect to its population, the principle of maximum freedom granted to the respondent seems to produce positive effects in terms of the fidelity and quality of the collected data and of the success of a research initiative: subjects are free to choose whether or not to participate, "released" from the presence or voice of an interviewer at the time of completion, and free to choose the moment of completion within a larger time window. However, despite a careful selection of the platform and of the specific virtual space


that hosts a web survey, and despite continuous monitoring of the data collection phase by the research team, the participatory outcome of a study conducted on the web rests in the hands of a rather low number of respondents: the average yield of these surveys is normally less than 1% of the contacts reached. A limited number of completed questionnaires is recorded in many studies that used an online questionnaire (Faggiano, 2007; Peruzzi, 2017; Vaccari et al., 2013). The sample - more or less large and more or less close to the characteristics of the population - will therefore consist of those people who choose to join the research activity during the data collection period. Their participation depends on their available time, their sensitivity and interest, and their own familiarity with the topics treated. It is worth dwelling on a peculiarity of the online questionnaire connected with the typical interactivity of virtual environments: respondents can actively participate in the research experience beyond merely filling out their own questionnaire. Consider an online questionnaire published on a Facebook page: respondents can interact via the web with the posts containing the link to the questionnaire, for example by adding a positive or negative reaction towards the research initiative and/or the sponsoring institution. They can comment, start a debate with the research team and/or other respondents, share or discredit other people's arguments, write privately to the institution that initiated the investigation, and contribute to circulating the research initiative through online sharing.

2. Doing research to evaluate the survey approach: a voter-based study

To evaluate the chosen survey strategy in a way that is argued and supported by data, the study we present was conducted with a classic survey tool, the questionnaire, used as part of a web survey. More precisely, a post containing the link to the questionnaire was published on the institutional Facebook page of the Department of Communication and Social Research of Sapienza University of Rome. The post was sponsored beforehand in order to reach social profiles that were as heterogeneous as possible in terms of social background, professional activity, level of education, hobbies and interests. The post was also shared on other online platforms and channels. The choice of Facebook is not accidental: it is the most used platform in Italy (and beyond), with 33% of the population registered, followed by WhatsApp and Messenger (services also widely used by the research team to disseminate the questionnaire, in addition to Telegram, Instagram, LinkedIn, Twitter and mailing lists). Moreover, Facebook is particularly suitable for the topic treated (political participation and voting intentions), because the campaign for the 2018 Italian general election took place, for all parties and candidate leaders, mainly on this social network. On this research theme Italy has a long tradition of empirical studies, generally carried out with standardized methodologies for questioning voters (think of the activity of the Italian


Society of Electoral Studies or the Italian National Election Studies). On the eve of the 2018 general election, the dimensions under analysis were: values, sense of legality, ideas of social justice, trust in institutions, social resentment, social problems perceived as urgent, political orientation and electoral behaviour over time, traditional forms of political and social participation, forms of online political participation, and hybrid styles. Two fairly close surveys were carried out, one in September 2017 and the other in February 2018, reaching about 850 cases in the first and about 1,400 in the second, with similar percentages across the whole block of variables and with stable connections among them. The aim is to highlight the advantages and disadvantages of a web survey on the participation topic, paying particular attention to those strategic choices and decisions that have positive effects on data quality, within a mixed-method approach. Although the samples reached cannot be considered statistically representative (they are self-selected), we want to underline the particular attention devoted to data quality (especially in the design of the data collection tool, the pretest, and the analysis of the first survey, which was essential for calibrating the questionnaire in view of its second use and for improving the whole data collection strategy). The numbers allow us to describe and explore analytically the opinions, attitudes, values and social practices of the most significant Italian electoral targets (left area, right area, Five Star Movement, non-voting area, area of indecision), in addition to the typical socio-demographic and economic variables. Furthermore, the offline use of the questionnaire (approximately one fifth of the interviews in each survey) proved an effective solution (Kott, Chang, 2010) for reaching a marginal voter who still exists in Italian society (often elderly and with a low level of education): the non-user of the Web (today it is estimated that a quarter of the Italian population does not use the Internet at all).

3. Measures for improving data quality

It is useful to summarize the measures that the research team developed in order to obtain reliable data and to widen the pool of respondents as much as possible, not only from a strictly quantitative point of view (the numerical size of the sample), but also with reference to its heterogeneity on the variables strategic to the political participation theme (a problem of statistical coverage). The web data collection mode can be described as "device agnostic": the questionnaire used in the survey can be completed from a PC, a tablet or a mobile device (for example, the attitude scales originally prepared with scores from 0 to 10 were reduced to a 0-5 range, so that every response option would display optimally even on smartphones). Through a meticulous pre-test on a heterogeneous sample (by age and level of education), conducted both online and offline (with a paper questionnaire), we improved the wording and the


formulation of the precoded answers. Furthermore, we worked on the order and number of questions, as well as on the closure of some questions. We have already explained the reasons behind the choice of the most used social platform, Facebook, and the overall strategy of sharing the questionnaire on other channels (social networks and mailing lists) within the selected time frame (one month for each data collection; in the second case it coincided with the election campaign in the technical sense). With reference to sharing, after drawing up a long list of themes and keywords, we tried to identify Facebook groups connected with social activities and diverse interests, taking the electoral targets as a point of reference. It was not easy to get into these groups, especially because of increasingly restrictive rules on privacy and spamming; an element we had counted on for the success of the survey therefore did not help much. Surveys conducted in specific groups to which one belongs, and within which the research activity is declared explicitly, are a very different matter (case studies, surveys on circumscribed themes, etc.). Many social science undergraduates take advantage of their membership in youth groups (university, music, etc.) to collect data on the lifestyles of their peers; the success of these experiences depends both on their symmetrical role and on the composition of the target audience (homogeneity of hobby, age, etc.). Getting the attention of large and heterogeneous populations is more complicated, as some targets are inevitably more receptive than others. The sponsorship of the post, spread throughout the entire research timeframe with a moderate economic investment, helped contain the initial distortion of the sample introduced by a sharing mode based on a university circle and on the local context of the Lazio Region. This attempt to curb the problem made it possible to reach subjects scattered throughout the national territory, differentiated by educational qualification, age, gender, hobbies and interests. For the second survey, which totaled 1,400 respondents, the engagement statistics are as follows: more than 30,000 views, about 2,000 interactions with the post, about 100 reactions (especially likes) and hundreds of comments (expressions of distrust or annoyance towards the research initiative, outright invective against the Italian political system and its representatives, expressions of support for particular parties or statements, interactions and sometimes heated discussions among commentators of different political orientations, etc.). As mentioned, the careful evaluation of the first survey results allowed us to refine the questionnaire of the second survey, with the goal of being able to compare the data of the two research rounds. For example, we refined the text of questions and answers towards clarification and simplification; we sometimes changed the order of items and replies in the case of unreliable data; we eliminated items and response modes in cases of redundancy, excessively unbalanced data or distortions linked to social desirability, and also to make the instrument leaner and more agile. All the open and semi-open questions of the first survey were closed (one example for all concerns the question about the


motivations behind voting intentions). The post accompanying the link (making explicit the research theme, the institutional subjects involved, the time required for completion, etc.) was prepared very carefully. The post was pinned to the top of the Department page for the entire duration of the research, the same time span in which the team constantly monitored the incoming data and motivated individuals to participate. The form we prepared for completing the questionnaire has graphic characteristics that make it aesthetically pleasing; the number of questions is not excessive; and there are numerous instructions for filling in the questions correctly. Moreover, the technology in use prevents the various erroneous and partial forms of completion that cannot be contained with paper questionnaires: this makes online completion in many ways simpler and more fruitful (few errors and little missing data) than offline completion.

4. The advantages and disadvantages of online data collection

Among the advantages of a web survey (Lombi, 2015) we can count the substantial containment of research costs (Groves, 1989) on several levels: sampling tools, fewer human resources employed, savings on printing questionnaires and making phone calls, no need for wide territorial mobility, etc. The Internet also has great potential for spreading the data collection instrument across a wide and varied territorial context. The subjects involved in the survey are automatically recorded as rows of a data matrix; in other words, the data are immediately available for subsequent analyses. We have also already referred to the greater accuracy of the data collected (no data-entry errors, no compilation errors, containment of missing values), as well as their immediate availability in matrix form and the possibility of monitoring the results as they come in. Obviously, there are also disadvantages. The first is the statistical non-representativeness of the samples, due to the fact that not all of the population uses the Internet and Facebook; the second concerns the mechanisms of self-selection and the "snowball" effect triggered by the sharing system. The boundaries of the universe of Facebook subscribers are not defined and are constantly evolving; moreover, the social network world does not coincide with the universe of voters present in a given territory, not to mention that the basic characteristics of Facebook users are not known precisely. Finally, we cannot prevent multiple completions by the same subject, although we are able to identify and correct them. In fact, in a study of voting intentions during the electoral campaign, there is a series of distortions: those due to the technique applied (a standardized questionnaire published on the net) and others related to the topic (think of the distrust of and resentment towards politics in this historical moment, the suspicion of some voters, or the desire for privacy about personal voting intentions, etc.).
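To make the detection of multiple completions concrete, here is a minimal sketch in R; the data frame, its column names (an e-mail-like fingerprint and a timestamp) and the rule of keeping the earliest submission are invented for illustration and are not the procedure actually used in the study.

# Toy example data (invented): two completions share the same fingerprint
responses <- data.frame(
  fingerprint = c("a@example.it", "b@example.it", "a@example.it"),
  timestamp   = as.POSIXct(c("2018-02-10 10:00", "2018-02-11 09:30", "2018-02-12 18:15")),
  vote        = c("M5S", "PD", "Lega"),
  stringsAsFactors = FALSE
)
# Keep only the first completion per respondent fingerprint
responses <- responses[order(responses$timestamp), ]
cleaned   <- responses[!duplicated(responses$fingerprint), ]
cat("removed", nrow(responses) - nrow(cleaned), "duplicate completion(s)\n")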


In particular, thinking of a negative combination of "technique effect" and "theme effect", some electoral targets are difficult to contact and sometimes impossible to reach: the elderly, non-Internet users, right-wing and extremist voters (who show particular distrust, specific cultural and value-related characteristics, and are sometimes afraid to express themselves on ideas and practices they consider entirely personal), people with a low level of education, and foreigners. On the other hand, young people, highly educated subjects, left-wing voters, and people interested in politics and rather well informed are over-represented (a highly motivated "respondent type" emerges, sensitive to research initiatives and interested in politics). Obviously, some distortions in the sample composition, if contained, can easily be corrected through appropriate weighting.
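As an illustration of the weighting idea, the following is a minimal post-stratification sketch in R; the grouping variable, its categories and the population shares are invented for the example and are not the weights applied in this survey.

# Toy example data (invented categories and shares)
sample_data <- data.frame(age_group = c("young", "young", "adult", "elderly", "adult"))
pop_share   <- c(young = 0.30, adult = 0.45, elderly = 0.25)      # assumed population shares
sample_share <- prop.table(table(sample_data$age_group))          # observed sample shares
w <- pop_share[names(sample_share)] / as.numeric(sample_share)    # weight = population share / sample share
sample_data$weight <- as.numeric(w[as.character(sample_data$age_group)])
sample_data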

5. Concluding notes: Online data collection is not enough

It can be concluded that, without solving every critical issue that emerged but making the utmost effort towards the quality of the achievable empirical basis, the only viable perspective is the mixed-method approach (Dillman, Smyth, Christian, 2009; Amaturo, Punziano, 2016). Our study highlights the need to combine an offline data collection instrument with an online one, keeping the same basic characteristics for both (same questionnaire, same mode of administration), in order to address the problem of covering absent or under-represented targets. The e-mail addresses voluntarily provided by some of the people we reached online allowed us to conduct in-depth interviews on the anomalous aspects that emerged during the data analysis. The most collaborative subjects sometimes offered to help us reach other subjects poorly represented in the sample, with the same characteristics or not, in order to involve them in the investigation. In some respects, the statistical non-representativeness of the sample gives way to a strong representativeness on the substantive plane (Di Franco, 2010). With the aim of securing sufficient numbers for each of the electoral targets known to be present in Italy at this historical moment, we have committed ourselves in this survey (whose extended version is being prepared for publication) to obtaining the maximum heterogeneity on socio-demographic and economic variables, despite the distortions generated by the online publication of the questionnaire. All this was done in order to investigate accurately values, opinions, perceptions and behaviours in their synergy, with the purpose of identifying the electors' styles and profiles that are prevalent and recognizable in our society. The statistical non-representativeness of the sample certainly does not interfere with the theoretical deepening of the theme of political and social participation, nor with the testing and refining of the data collection instruments, with conceptualization and operationalization, or with the study of prevailing social trends through the identification of interconnections among variables at the multivariate level. On the other hand, an accurate description, typology and interpretation of the data is a valuable and


fundamental basis also for the preparation of explanatory models and for the identification of predictive factors. All this rests on a deep knowledge of the social and political context of reference, which also results from the ability to interconnect different and complementary data (for example, analysis of the electoral campaign; analysis of attitudes, voters' intentions and perceptions; aggregate analysis of the outcome of an election, etc.). As Mauceri (2003) puts it: "If (...) probabilistic sampling is irreplaceable in research situations in which the aim is to estimate precisely the numerical extent of the diffusion of certain traits within a given population, as in opinion polls, the assessment of the relations among variables may require privileging a comparative rather than a generalizing logic, aimed, for example, at comparing groups of subjects with opposite action orientations (...) and at establishing which contextual, relational and individual elements make their courses of action so different".

References

Amaturo, E., Punziano, G. (2016). I "Mixed-Methods" nella ricerca sociale, Roma: Carocci.
Boccia Artieri, G. (Ed.) (2015). Gli effetti sociali del web. Forme della comunicazione metodologie della ricerca sociale, Milano: Franco Angeli.
Cipriani, R., Cipolla, C., Losacco, G. (Eds.) (2013). La ricerca qualitativa fra tecniche tradizionali ed e-methods, Milano: Franco Angeli.
Di Franco, G. (2010). Il campionamento nelle scienze umane. Teoria e pratica, Milano: Franco Angeli.
Dillman, D.A., Smyth, J.D., Christian, L.M. (2009). Internet, mail and mixed-mode surveys. The tailored design method, San Francisco: Jossey-Bass.
Faggiano, M.P. (2007). La formazione sociologica nell'università della riforma. La domanda, i percorsi, l'offerta presso il CdL in Sociologia della Sapienza di Roma, in Fasanella, A. (Ed.) (2007), Milano: Franco Angeli.
Groves, R.M. (1989). Survey errors and survey costs, New York: Wiley and Sons.
Kott, P.S., Chang, T. (2010). Using calibration weighting to adjust for nonignorable unit nonresponse, Journal of the American Statistical Association, 105(491): 1265-1275, DOI: 10.1198/jasa.2010.tm09016.
Lombi, L. (2015). Le web survey, Milano: Franco Angeli. Gli effetti sociali del web. Forme della comunicazione e metodologie della ricerca online, Milano: Franco Angeli.
Mauceri, S. (2003). Per la qualità del dato nella ricerca sociale. Strategie di progettazione e conduzione dell'intervista con questionario, Milano: Franco Angeli.
Peruzzi, G. (Ed.) (2017). Le reti del Terzo Settore. Terzo Rapporto, Roma: Forum Nazionale del Terzo Settore.
Vaccari et al. (2013). A survey of Twitter users during the 2013 Italian general election, Rivista Italiana di Scienza Politica, Anno XLIII, n. 3: 381-410, DOI: 10.1426/75245.


2nd International Conference on Advanced Research Methods and Analytics (CARMA2018) Universitat Politècnica de València, València, 2018 DOI: http://dx.doi.org/10.4995/CARMA2018.2018.8343

Italian general election 2018: digital campaign strategies. Three case studies: Movimento 5 Stelle, PD and Lega

Calò, Ernesto Dario; Faggiano, Maria Paola; Gallo, Raffaella; Mongiardo, Melissa
Department of Communication and Social Research (CoRiS), Sapienza University, Rome, Italy.

Abstract

The advent of the Network Society has brought substantial transformations to politics, which, like other areas of society, is affected by important changes. The network, which regulates social relations, has become the place of political discussion, and it is where the most substantial part of the electoral campaign for the 2018 general election took place. The object of our research is the observation of the political propaganda of the Movimento 5 Stelle, the Partito Democratico and the Lega (the three most voted parties in the Italian elections) through the parties' institutional accounts on Facebook. Having collected a research sample of 1,397 posts officially published on the three monitored accounts, the aim of our analysis is to investigate the communication strategies of the parties in a phase of hybrid democracy marked by a deep crisis of political representation. Our analysis shows how the three political forces, which appeal to different electorates, organize their electoral propaganda, each according to its own strategy.

Keywords: Italian general election 2018; networked politics; digital campaign; Movimento 5 Stelle; Partito Democratico.



1. Introduction

On the eve of the vote for the 2018 general election, Italy, like the other mature Western democracies, is undergoing a deep crisis of political representation (Manin, 2016). This crisis is linked to several factors, including: the end of ideologies, which has progressively transformed the traditional forms of political participation (Fukuyama, 1992); the digitalization of the social sphere, which led to the birth of the Network Society (Castells, 2009) and of networked individualism (Rainie, Wellman, 2012); and the distrust of political parties and institutions, which operate in a political arena regulated by marketing and communication and which see in social networks the real places of political participation and discussion. It is a new society characterized by a hybrid environment, and it is problematic for traditional political forces, which are dealing with a metamorphosis of representative democracy, of its languages and of its communication (Chadwick, 2013). In addition, even from a strictly political point of view the scenario appears complex: the affirmation of the Movimento 5 Stelle as a post-ideological party has contributed to eroding the consensus of the traditional parties and has broken the traditional bipolarity regulated by the right/left alternation. The splitting of the Partito Democratico has led to a fragmentation of the left-wing area, unable to build a unitary political proposal. Abstentionism and non-voting represent a consistent block of voters which, in a consolidated democracy, is to be considered a real political force claiming its right not to choose (Manin, 2016). Italy, in line with international and European political trends, is experiencing a return of nationalist sentiments (Holtz-Bacha, 2016) and, in the European context, represents one of the most interesting case studies of populism, understood as a consequence of the conflict between the people and the elite (Diamanti, Lazar, 2018). Logically, all these events affect the tones and languages of political parties, in a phase defined as one "of permanent postmodern electoral campaign", characterized by the use of the network and by a fluid electorate that must be struck by the tones of a captivating communication (Norris, 2000). According to the law1, the 2018 election campaign is the first in which political parties receive no public funding from the State; it is therefore an electoral campaign that needs "zero cost" instruments and that sees in social networks the most democratic communication tool, equally available to all political parties.

1 Decree Law December 28, 2013, n. 149.


The electoral result left Italy in an apparent condition of ungovernability: a tripolar scenario characterized by very different political forces of an irreconcilable nature. Starting from this premise, our attention focuses on the analysis of the online campaigns of the Movimento 5 Stelle, the Lega and the Partito Democratico, observed and analyzed through the parties' institutional profiles on Facebook, with the aim of drawing a clear map of the communication strategies of each party and of observing the flow of communication addressed to voters without the mediation of third parties. This research interest originates from the hypothesis that the new mediated political scene, subject to substantial distortions and regulated by immediate communication, is the privileged seat of negative communication strategies and attacks on opponents, which aim to obtain consensus through strongly emotional and not very rational messages.

2. Research methodology

The object of our analysis is the online electoral campaign officially managed by the political parties that obtained the highest number of votes: the Movimento 5 Stelle, the Lega and the Partito Democratico. We chose to focus our attention on Facebook because it is the most widespread social network, the one that allows the most direct interaction between politicians and users, and a genuine social marketing tool, indispensable for communicating politics and its propaganda. Our monitoring consisted in collecting all posts and contents published by the political parties. It took place in two stages: the first and last weeks of the electoral campaign (5-11 February and 26 February-4 March). This preparatory part of the research returned a total corpus of 1,397 posts collected, as said, from the official Facebook pages of the three main political parties. After reviewing the existing literature on the subject, we constructed a data matrix to analyze each post in detail, according to a series of variables considered relevant for the analysis of the communication strategy of the three parties. All the data collected were treated first from a quantitative point of view, to describe the frequency and intensity of each party's posting activity, and then from a qualitative point of view, in order to investigate the kinds of material diffused, their contents and their function, with the aim of returning a clear picture of the communication strategies of each political force.
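Purely as an illustration of what such a post-level data matrix might look like, the sketch below invents a few rows and shorthand variable names in R; the actual variables and coding scheme are those described in the following sections, not these.

# Illustrative post-level data matrix (invented rows and variable names)
posts <- data.frame(
  post_id = 1:4,
  party   = c("Lega", "M5S", "PD", "Lega"),
  week    = c("Feb 5-11", "Feb 26-Mar 4", "Feb 5-11", "Feb 26-Mar 4"),
  type    = c("Photo", "Video", "Status", "Link"),
  funct   = c("Fear", "Data Declaration", "Political Program", "Current Events")
)
# Frequency of posting activity by party and week (the quantitative step)
table(posts$party, posts$week)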


3. The corpus of analysis and the intensity of the publication activities

The sample analyzed consists of 1,397 posts published by the Lega, the Movimento 5 Stelle (M5S) and the Partito Democratico (PD): 680 posts during 5-11 February and 717 posts during 26 February-4 March. It is evident (Table 1) that in the two weeks considered the volume of published posts is almost stable and reflects the same intensity of publication activity. The substantially stable percentages (with a slight increase in activity during the second week) show that the political parties devoted the same attention to the opening and closing of the electoral campaign: 48.68% vs. 51.32%.

Table 1. Number of posts published per week

Week                       %        Absolute number
February 5-February 11     48.68    680
February 26-March 4        51.32    717
Total                      100.00   1,397

Source: Our elaboration.

In the two weeks considered, the trend of post publication remains substantially stable, without significant variations. The same cannot be said of the communicative intensity of the three political parties. Looking at the posting activity of the Movimento 5 Stelle, the Lega and the Partito Democratico (Table 2), it is evident that the Lega accounts for 73.16% of the total volume of posts in the sample. The Lega published on average about 500 posts a week and about 70 posts a day, marking a substantial difference with respect to the other two parties. The Lega is distinguished by an "almost obsessive" posting activity that aims at constant contact with voters throughout the day; the same is not true of the Movimento 5 Stelle and the Partito Democratico, which recorded much lower percentages. In particular, the Partito Democratico accounts for 7.02% of the entire sample, which corresponds to a poor use of the network as a propaganda tool, oscillating only slightly between the first and second week of observation. The Movimento 5 Stelle, with a decidedly more contained trend than the Lega, increased its production of online electoral propaganda during the second week, making the most of the closing phase of the electoral campaign.
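The per-week and per-day figures follow from the shares in Tables 1 and 2; a quick check in R, using only the numbers reported in the text:

# Lega's posts implied by its share of the 1,397-post corpus
lega_posts <- 0.7316 * 1397   # about 1,022 posts over the two monitored weeks
lega_posts / 2                # about 511 posts per week
lega_posts / 14               # about 73 posts per day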


Table 2. Number of posts per week (% of the total number of posts)

Political Party    February 5-11    February 26-March 4    Total
Lega               37.44%           35.72%                 73.16%
M5S                8.23%            11.60%                 19.83%
PD                 3.01%            4.01%                  7.02%
Total              48.68%           51.32%                 100%

Source: Our elaboration.

4. Type of posts

To investigate the communication of each party, we developed a simple analysis tool aimed at identifying descriptive categories for the different types of post, in order to better describe the communicative style of each party. We built a variable with the following descriptive categories: Link (posts containing a link that refers to an external page), Photo (photographs and images of electoral propaganda), Video (live events, electoral spots, excerpts of television broadcasts), and Status (posts of written text only). The most used post type in the total sample (Table 3) is the image (45.45%) because, as happens in the offline electoral campaign, it is exploited for the immediacy of its communication, which can directly catch the attention of the voter. The use of images clearly increases between the first and second week of observation, posts linking to pages external to the social platform are drastically reduced, while a text-only post requires a greater activation effort from the user.

Table 3. Post publication style

Type of post    February 5-11    February 26-March 4    Total
Photo           41.32%           49.37%                 45.45%
Link            40.15%           23.85%                 31.78%
Video           18.24%           25.80%                 22.12%
Status          0.29%            0.98%                  0.64%
Total           100.00%          100.00%                100.00%

Source: Our elaboration.

The three political parties managed the online electoral campaign according to different communication styles (Table 4). The Movimento 5 Stelle, unlike the Lega and the Partito Democratico (which prefer the use of images), favours video in its communication (46.93% of its posts) to the detriment of the image (25.27%). While verbal communication (Status) requires greater interaction from the user, the Lega - which, as we have seen, is the most active party on the network - does not use this type of post at all, preferring the immediacy of images, which account for over 50% of its sample. Status posts, which represent the smallest part of the total sample with only 9 posts overall, were mainly made by the Partito Democratico, which, showing a limited command of the network, uses less immediate and captivating languages.

Table 4. Post publication styles adopted by the political parties

Type of post    Lega      M5S       PD
Photo           50.78%    25.27%    46.94%
Link            34.05%    27.44%    20.41%
Video           15.17%    46.93%    24.49%
Status          0         0.36%     8.16%
Total           100%      100%      100%

Source: Our elaboration.

5. Communication strategies

The communication strategy adopted by the political parties during the electoral campaign was investigated by elaborating a variable named "Post Function". The variable is articulated into three general macro-categories, describing three main strategies: "Negative campaign", which contains attacks on adversaries and denigration functions; "Political proposal", which illustrates the programme points and the actions carried out by the parties; and "Engagement", which aims to involve voters by inviting them to become "militants 2.0" who act in first person in the party's digital campaign. The variable "Post Function" (Table 5) confirms the hypothesis of a propensity to adopt a negative campaign strategy: about 48.8%2 of the published posts aim to persuade the voter by denigrating the political opponent and by leveraging negative feelings. About 11% of posts use clearly Negative and Negative Comparing modes; 9.4% denounce circumstances of political relevance through the use of statistical data; 8.8% generate fear and concern about the events reported; while 17.4% recall specific dramatic news events.

2 Cumulative percentage of the Negative, Negative Comparing, Data Declaration, Irony/Parody/Sarcasm, Fear and Current Events modes.


Table 5. Post Function

Post Function                  %
Negative                       8.4
Negative Comparing             2.9
Data Declaration               9.4
Irony/Parody/Sarcasm           1.9
Fear                           8.8
Current Events                 17.4
Past Political Achievements    1.0
Political Program              12.3
Political Issues               3.1
Identity Membership            1.5
Feeling Good                   3.6
Media Agenda                   5.0
Territorial Agenda             11.9
Online Mobilization            11.7
Fundraising                    1.0
Total                          100.0

Source: Our elaboration.
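The 48.8% figure cited in the text and in footnote 2 is the sum of the six negatively oriented modes listed in Table 5, as a one-line check in R shows:

# Negative, Negative Comparing, Data Declaration, Irony/Parody/Sarcasm, Fear, Current Events
sum(c(8.4, 2.9, 9.4, 1.9, 8.8, 17.4))   # 48.8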

Given the minor weight of the other two communication strategies, our attention focuses on the analysis of the prevailing negative one. The Lega is the party most oriented towards the negative campaign: it publishes posts that recall episodes of crime or illegality (about 19% of its posts) and that aim to shake voters emotionally by causing fear and worry (about 11% of its posts). The Movimento 5 Stelle also adopts the negative strategy, directly attacking the opposing parties in comparisons of political proposals (about 10% of its posts) and using statistical data to highlight what differentiates it from the others (about 13% of its posts). The Partito Democratico, the outgoing government party, is the only party that does not use negative strategies; it focuses its communication on the political proposal, aimed at explaining and illustrating the actions taken and the programme points (about 37% of its posts).


6. Conclusive observations

As emerged in the course of the discussion, the online electoral campaign of the three parties returns a composite and descriptive picture of their communicative peculiarities. The overall sample of materials collected in the two weeks of monitoring shows a consistent and unequivocal protagonism of the Lega; the Partito Democratico, by contrast, ran an online campaign that appears frankly insubstantial compared to the other two political forces and not fully in line with the languages and tones of immediacy imposed by the network. While the Lega and the Movimento 5 Stelle aim to strike the imagination of the electorate with images and videos of immediate use, the Partito Democratico produces long text posts that require voluntary activation by the user. Negative tones, as supposed in the introduction, play a key role. For the Lega, news events are the basis for the elaboration of negative political messages built on verbal and symbolic violence, without any programmatic political proposal; the targets of its negative propaganda are the left, the European Community and the immigration phenomenon, presented as a threat to national security. The negative tones of the Lega are full of ideological political references to xenophobia, nationalism and anti-Europeanism. The negative trend of the Movimento 5 Stelle is in line with its nature as a post-ideological political movement and aims to build its image through a strategy of denunciation of and contrast with its political adversaries; the Movimento 5 Stelle does not express any ideological value and does not take sides on the news events that strongly influenced the debate in the electoral campaign. While the Lega and the Movimento 5 Stelle propose an electoral campaign of attack, the Partito Democratico plays a defensive electoral campaign which, considering the actual electoral results and the weakness of its digital campaign, did not exert much appeal on voters.

References

Castells, M. (2009). Comunicazione e potere. Milano: Università Bocconi Editore.
Chadwick, A. (2013). The Hybrid Media System: Politics and Power. New York: Oxford University Press.
Diamanti, I., Lazar, M. (2018). Popolocrazia. Bari-Roma: Laterza.
Fukuyama, F. (1992). La fine della storia e l'ultimo uomo. Milano: Rizzoli.
Holtz-Bacha, C. (2016). Europawahlkampf 2014. Berlin: Springer VS.
Manin, B. (2016). Principi del governo rappresentativo. Bologna: Il Mulino.
Norris, P. (2000). A Virtuous Circle: Political Communications in Postindustrial Societies. Cambridge: Cambridge University Press.
Rainie, L., Wellman, B. (2012). Networked. Cambridge: MIT Press.


2nd International Conference on Advanced Research Methods and Analytics (CARMA2018) Universitat Politècnica de València, València, 2018 DOI: http://dx.doi.org/10.4995/CARMA2018.2018.8345

Access and analysis of ISTAC data through the use of R and Shiny

González-Martel, Christian a; Cazorla Artiles, José M. b; Pérez-González, Carlos J. c
a Departamento de Métodos Cuantitativos en Economía y Gestión, Universidad de Las Palmas de Gran Canaria, Spain; b Universidad de Las Palmas de Gran Canaria, Spain; c Departamento de Matemáticas, Estadística e Investigación Operativa, Universidad de La Laguna, Spain.

Abstract

The increasing availability of open data resources provides opportunities for research and data science, and makes it necessary to develop tools that take advantage of the full potential of these new information resources. In this work we developed istacr, an R package that, similarly to what the eurostat package offers for Eurostat data, provides a collection of functions to retrieve, download and manipulate the data sets available through the ISTAC BASE API of the Canary Institute of Statistics (ISTAC). In addition, a Shiny app was designed for a responsive visualization of the data. This development is part of the growing demand for open data and for ecosystems dedicated to reproducible research in computational social science and digital humanities. With this aim, the package has been included in rOpenSpain, a project that promotes transparent research methods, mainly through the use of free software and open data in Spain.

Keywords: Economic databases; R; package; Shiny; visualization.



1. Introduction

The Open Data initiative is a practice that seeks to ensure that certain data and information belonging to public administrations and organizations are accessible and available to everyone, without technical or legal restrictions. Ruijer et al. (2017) have studied a context-sensitive open data design that facilitates the transformation of raw data into meaningful information constructed collectively by public administrators and citizens. Thorsby et al. (2017) research the features and content of open data portals in American cities; their results show that, in general, the portals are at an early stage of development and need to improve user help and analysis features, as well as features that help citizens understand the data, such as more charting and analysis. Reproducible research, defined as complete analytical workflows, fully replicable and transparent, spanning from raw data to final publications, can benefit from the availability of algorithmic tools to access and analyse open data collections (Gandrud, 2013; Boettiger et al., 2015). Dimou et al. (2014) present a use case of publishing research metadata as linked open data and creating interactive visualizations to support users in analyzing data in a research context. However, the data provided in open access are not in a standardized format, and the need arises to adapt code to specific data sources to accommodate variations in raw data formats and access details, so that end users can avoid repetitive programming tasks and save time, allowing the simplification, standardization and automation of analysis workflows and facilitating reproducibility, code sharing and efficient data analytics. Following this idea, several packages have been created within the R ecosystem to work with data from the Food and Agriculture Organization (FAO) of the United Nations (FAOSTAT; Kao et al., 2015), the World Bank (WDI; Arel-Bundock, 2013; wbstats; Piburn, 2018) and OpenStreetMap (osmar; Eugster and Schlesinger, 2012), among others. The Canary Institute of Statistics (Instituto Canario de Estadística, ISTAC) provides a rich collection of data, including thousands of data sets on Canarian demography, health, employment, tourism and other topics, in an open data format. ISTAC is the central authority of the Canary statistical system and the official research office of the Government of the Canary Islands; among its functions are to provide statistical information and to coordinate the public statistical activity of the Canary Islands autonomous region. The main access to ISTAC data is the web-based graphical user interface (GUI), from which the data can be consulted and downloaded in alternative formats. This access method is fine for occasional use but is tedious for large selections and when the user must access the data


very frequently. The second method uses an Application Programming Interface (API) that can be embedded in computer code to extract data from ISTAC programmatically. We have developed an R package that integrates the API into the code and allows the downloaded data to be manipulated directly in R. Based on this package, we have also created a Shiny application that provides a visualization of ISTAC data. Visualization is one of the most important features when analyzing information from open data sources. Chen and Jin (2017) have recently proposed a data model and application procedure that can be applied to visualization evaluation and data analysis in human factors and ergonomics. Jones et al. (2016) research innovative data visualization and sharing mechanisms in the study of social science survey data on environmental issues, in order to allow participatory deliberation. Kao et al. (2017) show how to use a visualization analysis tool for open data with the aim of verifying whether there is a sensitive information leakage problem in the target datasets. This paper provides an overview of the core functionality in the current release version. Comprehensive documentation and source code are available via the package homepage on GitHub1. The package is part of rOpenSpain2, an initiative whose objective is to create R packages that exploit the open data available in Spain for reproducible research. This paper is structured as follows: first, we explain the data extraction procedure implemented in the R library and the workflow for visualizing the data; in Section 3 we explain the architecture of the visualizations with Shiny; finally, we present some concluding remarks.
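To give a flavour of the kind of responsive visualization layer mentioned above, here is a minimal, generic Shiny skeleton; it is a sketch under invented assumptions (a small pre-loaded data frame and a single year selector) and is not the architecture of the application described later in the paper.

library(shiny)

# Assumed example data; in practice this would come from istacr
istac_data <- data.frame(year = 2010:2017, value = rnorm(8))

ui <- fluidPage(
  titlePanel("ISTAC data explorer (illustrative sketch)"),
  sliderInput("from", "From year:", min = 2010, max = 2017, value = 2010),
  plotOutput("series")
)

server <- function(input, output) {
  output$series <- renderPlot({
    d <- subset(istac_data, year >= input$from)
    plot(d$year, d$value, type = "b", xlab = "Year", ylab = "Value")
  })
}

shinyApp(ui, server)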

2. The extraction routine in istacr

To install and load the latest release version of istacr, the user should install it from GitHub with the devtools package and then load it:

devtools::install_github("rOpenSpain/istacr")
library("istacr")

When the package is loaded, the metadata of every dataset available through the ISTAC BASE API are also loaded into the cache variable. It contains, among other fields, the title, topic and subtopic of each dataset and the URL used to access the JSON data. To search for a specific term, the istac_search() function is provided.
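A minimal usage sketch follows, relying only on the objects named in the text (the cache metadata and the istac_search() function); the search term and the idea of inspecting the returned metadata columns are illustrative assumptions rather than package documentation.

library(istacr)

# Browse the cached metadata loaded with the package
str(cache, max.level = 1)

# Search the metadata for datasets matching a term (term chosen for illustration)
resultados <- istac_search("turismo")
head(resultados)   # inspect titles, topics and API URLs of the matching datasets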

1 https://github.com/rOpenSpain/istacr
2 https://ropenspain.es/


busqueda.egt