FROM BIG DATA TO INFORMATION: STATISTICAL ISSUES THROUGH EXAMPLES
Silvia Biffignandi1, Serena Signorelli1
1 University of Bergamo (e-mail: [email protected], [email protected])
KEYWORDS: Big Data, quality, representativity.
1 Introduction on Big Data
In the last few years, the term Big Data has been used more and more in various fields, especially in statistics. There is no precise definition of Big Data, because it is a general concept that spans many disciplines and a wide range of different data. It is nevertheless possible to identify three main characteristics (Laney, 2001), known as the three Vs: Volume, the sheer amount of data available for analysis; Velocity, meaning both the speed at which data collection events can occur and the pressure of managing large streams of real-time data; and Variety, the complexity of the formats in which this kind of data can be presented. Others have added further features (AAPOR Report on Big Data, 2015): Variability (the inconsistency of the data across time), Veracity (the accuracy of the data) and Complexity (the need to link these data to multiple sources).
With these characteristics as background, research is needed to define more specific concepts and methodologies. One operational starting point is to consider the various types of data that fall under the general concept and to evaluate their advantages and problems with reference to the disciplines involved in collecting, treating and using them. In this paper we highlight some key issues from the statistical point of view, especially that of business and social statistics.
Big Data can be classified into various types (AAPOR Report on Big Data, 2015): social media data; personal data; sensor data; transactional data; and administrative data, although it is debatable whether this last category can be considered Big Data. In some cases, survey data collected quickly with technical tools from a large number of units could also be considered within the Big Data concept.
The paper first discusses some statistical and quality issues (par. 2), then introduces some recent empirical experiences (par. 3), and finally focuses on some critical points for the use of Big Data for statistical purposes by presenting an original overview and analyses of existing databases.
2 Statistical and quality issues
From a statistical point of view, it would be natural to think that a huge amount of data is always an advantage, but this is not necessarily true. Big Data, as the term suggests, carry a great quantity of data, but their quality must be examined before they are used for statistical purposes. Unlike traditional probability-based survey data, Big Data are not designed and collected for a specific statistical purpose, but are ‘harvested’ as they are. The first problem is therefore that the statistical tools traditionally used in survey data collection are not immediately applicable. Such data can contain errors of different kinds, which need to be investigated and handled using an appropriate error categorization and appropriate statistical methods. Both the categorization and the corresponding methods are still under study. Most errors are related to representativity. The paper focuses on this aspect, especially with reference to weighting and integration. The AAPOR Report on Big Data (2015) suggests the introduction of a Total Error Framework specific to Big Data, based on the existing Total Survey Error (TSE) framework (Biemer, 2010).
Another issue concerns the volatility and instability of these data. Big Data coming from social networks could become incomparable from one day to the next because of the recurring changes that social networks introduce to improve their structure and user experience. Moreover, transactional or administrative data could change in structure and in the way they are collected for operational and efficiency reasons.
Other problems are due to the high dimensionality of these data. Big Data are often used to detect correlations among variables, but not to explain why things happen (causal relationships). This can lead to the identification of spurious relationships, for example between two completely unrelated variables that happen to fluctuate in the same way. This is the first issue identified by Fan et al. (2014). The other two are noise accumulation and incidental endogeneity, in which the usual assumption that model covariates are uncorrelated with the residual error fails.
We have to keep in mind that even if these data are very large, they are not representative: they cover only a specific population, so a great effort is needed to make them representative of the whole population.
To assess the quality of Big Data, the UNECE Big Data Quality Task Team (2015) proposes a framework that assesses quality at three stages: Input, when the data are acquired; Throughput, when the data are transformed, analysed or manipulated; and Output, the reporting phase. This framework focuses on the specific quality requirements and challenges of using Big Data in official statistics.
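The spurious-correlation issue attributed above to Fan et al. (2014) can be made concrete with a small simulation. The sketch below is only an illustration under stated assumptions, not a procedure taken from the cited work: it draws a pure-noise target and a growing number of independent pure-noise variables, then reports the largest absolute sample correlation found. The function names, the sample size of 50 and the numbers of variables are hypothetical choices made for this example.

```python
import random
import statistics


def pearson(x, y):
    """Sample Pearson correlation of two equal-length sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def max_spurious_correlation(n_obs, n_vars, seed=0):
    """Largest |correlation| between a noise target and n_vars
    independent noise variables (no true relationship exists)."""
    rng = random.Random(seed)  # fixed seed so runs are reproducible
    target = [rng.gauss(0, 1) for _ in range(n_obs)]
    best = 0.0
    for _ in range(n_vars):
        candidate = [rng.gauss(0, 1) for _ in range(n_obs)]
        best = max(best, abs(pearson(target, candidate)))
    return best


if __name__ == "__main__":
    # Hypothetical settings: 50 observations, more and more variables.
    for p in (10, 100, 1000):
        print(p, round(max_spurious_correlation(n_obs=50, n_vars=p), 3))
```

Because the same seed is reused, the variables examined for a small number of variables are a subset of those examined for a larger number, so the reported maximum can only grow as variables are added. All variables are independent by construction, so any sizeable correlation found here is spurious.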
3 Some empirical studies
Up to now, Big Data have been analysed in different fields, in some cases in combination with traditional survey data. All of these experiences focus on the potential of this type of source; a huge amount of work is still needed before they can provide statistical data. The experiences considered in the paper are listed and critically analysed below.
In the medical field, the Centers for Disease Control and Prevention have used Big Data to study diabetes (Day H.R., Parker J.D., 2013). They linked data from the National Health Interview Survey (NHIS) with the Medicare Chronic Condition (CC) Summary file. The objective of the research was to verify whether the self-reported answers about diabetes in the survey correspond to the diabetes indicators contained in the CC file (where an algorithm flags the presence of a chronic disease based on Medicare data).
There are also applications in marketing, in particular in advertising. Two applications (Duong T., Millman S., 2014; Porter S., Lazaro C.G., 2014) added Big Data to traditional survey data in order to check the effectiveness of mobile ads and brands.
Regarding the use of social media data, the University of Michigan (Antenucci D., Cafarella M., Levenstein M.C., Ré C., Shapiro M.D., 2014) built the Social Media Job Loss Index to monitor the unemployment trend using Twitter data. Another social media application was produced by Statistics Netherlands (Daas, P.J.H., Puts, M.J., Buelens, B., van den Hurk, P.A.M., 2013) to analyse the content of Dutch Twitter messages and to conduct a sentiment analysis across the main social networks (Twitter, Facebook, Google+, etc.). The same authors also used traffic sensor data to predict traffic intensity (Daas, P.J.H., Puts, M.J., Buelens, B., van den Hurk, P.A.M., 2013). Bogomolov et al. (2014) used a combination of demographics and mobile data to predict crime.
A different kind of technique is Web scraping. Nathan et al. (2013) provide an application: they collected all the data and information they could find on English digital enterprises in order to revise the official classification of English enterprises (in which digital companies are underestimated).
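The comparison of survey self-reports with claims-based indicators mentioned for the diabetes study can be sketched as a two-step computation: link the two sources on a common identifier, then measure agreement between the two flags. The code below is a minimal illustration under stated assumptions and is not the method of the cited report. The record identifiers and values are invented toy data, the function names are hypothetical, and Cohen's kappa is chosen here only as one common agreement measure.

```python
def link_records(survey, claims):
    """Pair self-report and claims flags for ids present in both sources."""
    common_ids = sorted(set(survey) & set(claims))
    return [(survey[i], claims[i]) for i in common_ids]


def agreement_table(pairs):
    """2x2 counts keyed by (self_report, claims_flag)."""
    table = {(s, c): 0 for s in (True, False) for c in (True, False)}
    for s, c in pairs:
        table[(bool(s), bool(c))] += 1
    return table


def cohen_kappa(table):
    """Chance-corrected agreement between the two binary indicators."""
    n = sum(table.values())
    p_obs = (table[(True, True)] + table[(False, False)]) / n
    p_self = (table[(True, True)] + table[(True, False)]) / n
    p_claim = (table[(True, True)] + table[(False, True)]) / n
    p_exp = p_self * p_claim + (1 - p_self) * (1 - p_claim)
    if p_exp == 1.0:
        return 1.0  # both indicators constant and identical
    return (p_obs - p_exp) / (1 - p_exp)


if __name__ == "__main__":
    # Toy, invented records: id -> "has diabetes" flag in each source.
    survey = {1: True, 2: True, 3: False, 4: False, 5: False, 6: True}
    claims = {1: True, 2: False, 3: False, 4: False, 5: True, 7: True}
    pairs = link_records(survey, claims)   # ids 6 and 7 do not link
    table = agreement_table(pairs)
    print(table)
    print(round(cohen_kappa(table), 3))
```

Records that appear in only one source are dropped at the linkage step, which is itself a potential source of error: the linked subset may differ systematically from the full survey sample.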
4 Critical analysis
Different types of Big Data are described, and the research steps needed to obtain statistical data from them are discussed. The American Community Survey (ACS), social network data on reputation in the tourism sector, and a voluntary online panel are described, and approaches to their methodological treatment are explored. Special attention is given to representativity and error.
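As one possible illustration of the weighting approach to representativity, the sketch below applies simple post-stratification to a non-probability sample such as a voluntary online panel. This is a standard technique chosen for the example, not necessarily the treatment adopted in the paper. The strata, population shares and responses are invented toy values, and the function names are hypothetical.

```python
from collections import Counter


def poststratification_weights(sample_strata, population_shares):
    """Weight each unit so that weighted stratum shares match the
    known population shares; weights sum to the sample size."""
    n = len(sample_strata)
    counts = Counter(sample_strata)
    return [population_shares[s] * n / counts[s] for s in sample_strata]


def weighted_mean(values, weights):
    """Weighted average of the observed values."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


if __name__ == "__main__":
    # Toy, invented panel: young respondents are over-represented.
    strata = ["young", "young", "young", "old"]
    answers = [1, 1, 0, 0]                   # e.g. 1 = "yes" to some item
    shares = {"young": 0.5, "old": 0.5}      # assumed known population shares

    weights = poststratification_weights(strata, shares)
    print(sum(answers) / len(answers))       # unweighted estimate
    print(weighted_mean(answers, weights))   # post-stratified estimate
```

Post-stratification adjusts only for the auxiliary variables used to form the strata. If participation also depends on characteristics that are not observed, the weighted estimate can remain biased.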
References
AAPOR Report on Big Data, AAPOR (American Association For Public Opinion Research) Big Data Task Force, February 12, 2015.
ANTENUCCI, D., CAFARELLA, M., LEVENSTEIN, M.C., RÉ, C., SHAPIRO, M.D. 2014. Using Social Media to Measure Labor Market Flows. NBER Working Papers 20010, National Bureau of Economic Research, Inc.
BIEMER, P. 2010. Total Survey Error: Design, Implementation, and Evaluation. Public Opinion Quarterly, 74(5), 817-848.
BOGOMOLOV, A., LEPRI, B., STAIANO, J., OLIVER, N., PIANESI, F., PENTLAND, A. 2014. Once Upon a Crime: Towards Crime Prediction from Demographics and Mobile Data. Proceedings of the 16th International Conference on Multimodal Interaction, 427-434.
DAAS, P.J.H., PUTS, M.J., BUELENS, B., VAN DEN HURK, P.A.M. 2013. Big Data and Official Statistics. Paper for the 2013 New Techniques and Technologies for Statistics conference. Brussels, Belgium.
DAY, H.R., PARKER, J.D. 2013. Self-report of Diabetes and Claims-based Identification of Diabetes Among Medicare Beneficiaries. National Health Statistics Reports, 69.
DUONG, T., MILLMAN, S. 2014. Behavioral Data as a Complement to Mobile Survey Data in Measuring Effectiveness of Mobile Ad Campaign. Presented at the CASRO Digital Research Conference.
FAN, J., HAN, F., LIU, H. 2014. Challenges of Big Data analysis. National Science Review, 1(2), 293-314.
LANEY, D. 2001. 3-D Data Management: Controlling Data Volume, Velocity and Variety. META Group Research Note.
NATHAN, M., ROSSO, A., GATTEN, T., MAJMUDAR, P., MITCHELL, A. 2013. Measuring the UK’s digital economy with Big Data. National Institute of Economic and Social Research (NIESR).
PORTER, S., LAZARO, C.G. 2014. Adding Big Data Booster Packs to Survey Data. Presented at the CASRO Digital Research Conference.
TASK TEAM ON BIG DATA QUALITY. 2015. A Suggested Framework for National Statistical Offices for assessing the Quality of Big Data. Paper for the 2015 New Techniques and Technologies for Statistics conference. Brussels, Belgium.