Quality issues when using Big Data in Official Statistics
Paolo Righi, Giulio Barcaroli, Natalia Golini
Outline
Background
Description of Real Data Context
Simulation
Conclusion
Quality issues when using Big Data in Official Statistics P. Righi, G. Barcaroli, N. Golini.
Background
Big Data represent a concrete opportunity for improving official statistics while reducing costs and response burden.
The debate on these sources often focuses on volume, velocity, and variety, and on the IT capability to capture, store, process, and analyze Big Data for statistical production.
Background
NSOs usually face issues of veracity (data quality, in terms of selectivity and trustworthiness of the information) and validity (data that are correct and accurate for the intended use).
Veracity and validity affect the accuracy (bias and variance) of the estimates.
Background
The claim that a large amount of information from a Big Data source produces more reliable statistics than survey data is not sufficient in itself.
The question is: when does a large amount of data produce high-quality statistics?
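One standard way to make this question precise is the mean squared error decomposition. The notation below (an estimator $\hat{Y}$ of a target parameter $Y$) is an assumption introduced for illustration and is not taken from the slides.

```latex
\[
\mathrm{MSE}(\hat{Y}) = E\big[(\hat{Y} - Y)^2\big]
  = \mathrm{Var}(\hat{Y}) + \big[\mathrm{Bias}(\hat{Y})\big]^2,
\qquad
\mathrm{Bias}(\hat{Y}) = E(\hat{Y}) - Y .
\]
```

A larger source typically shrinks the variance term, while a bias caused by selectivity or measurement error does not shrink merely because more data are collected.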
Description of Real Data Context
Istat carries out a survey on the use of ICT by enterprises.
Target population: enterprises with 10-249 employees in specific economic activities (about 180,000 units).
Population with a website: about 132,000.
Sample size: about 23,000 units.
Description of Real Data Context
Section of the questionnaire focused on the use of the web
Simulation: Aims
Could the information be extracted by a web scraping procedure? Could powerful auxiliary variables be extracted by a web scraping procedure?
Istat has a list of URLs for accessing and scraping the websites.
The simulation is based on evidence from a parallel analysis on real data that combined web scraping, text processing, and machine learning.
Simulation: Data
General context
Target population: 132,400
Big Data: 100,996 (observable 68,676; unobservable 32,320)
Target population without Big Data: 31,404
Sample: 23,229 (respondents 13,863; non-respondents 9,366)
Target parameter: number of enterprises doing e-commerce (online ordering)
$Y = \sum_{U} y_k$, where $y_k = 1$ if enterprise $k$ does e-commerce
Simulation: Data
Expected probability of e-commerce, $\mathrm{prob}(y_k = 1)$

Domain   Big Data   No Big Data   Target population
M1 cl1   0.170      0.048         0.140
M1 cl2   0.154      0.023         0.120
M1 cl3   0.218      0.014         0.170
M1 cl4   0.333      0.000         0.261
M2 cl1   0.138      0.037         0.110
M2 cl2   0.124      0.027         0.110
M2 cl3   0.151      0.009         0.110
M2 cl4   0.222      0.000         0.181
M3 cl1   0.050      0.013         0.040
M3 cl2   0.026      0.004         0.020
M3 cl3   0.039      0.002         0.030
M3 cl4   0.025      0.000         0.020
M4 cl1   0.319      0.103         0.270
M4 cl2   0.379      0.081         0.320
M4 cl3   0.396      0.036         0.330
M4 cl4   0.396      0.000         0.321
Total    0.235      0.061         0.194

Overall, $\mathrm{prob}(y_k = 1) = 0.235$ within Big Data and $0.061$ outside Big Data.

Selectivity concerns
The URL list is compiled on a voluntary basis.
Assumptions:
1. If an enterprise actively uses its website for business (for instance, doing e-commerce), it has an interest in increasing its reachability, and therefore a higher probability of being in the URL list.
2. Enterprise size increases the probability of doing e-commerce conditional on being in the URL list, and decreases it otherwise.
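The selectivity effect can be checked with a little arithmetic. The sketch below uses only the population counts and overall probabilities reported on the Simulation: Data slides, and assumes the target-population rate is the size-weighted average of the two sub-population rates; the function and variable names are illustrative.

```python
# Figures reported on the slides: population counts and expected probabilities.
N_BIG_DATA = 100_996      # enterprises covered by the Big Data source
N_NO_BIG_DATA = 31_404    # enterprises not covered by it
P_BIG_DATA = 0.235        # prob(y_k = 1) within Big Data
P_NO_BIG_DATA = 0.061     # prob(y_k = 1) outside Big Data


def population_rate(n_bd, p_bd, n_rest, p_rest):
    """Size-weighted average of the two sub-population rates."""
    return (n_bd * p_bd + n_rest * p_rest) / (n_bd + n_rest)


p_target = population_rate(N_BIG_DATA, P_BIG_DATA, N_NO_BIG_DATA, P_NO_BIG_DATA)

# Relative overstatement if the Big Data rate were applied to the whole population.
relative_gap = (P_BIG_DATA - p_target) / p_target

print(f"target population rate: {p_target:.3f}")
print(f"relative gap of the Big Data rate: {relative_gap:.1%}")
```

The weighted average reproduces the 0.194 reported for the target population, so treating the Big Data rate as representative would overstate e-commerce by roughly a fifth, whatever the size of the source.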
Simulation: Web scraping and text processing
Simulated automatic scraping.
All texts were collected from the websites, and text mining and natural language processing techniques were applied.
Relevant terms were identified and used as predictors (examples: “add to cart”, “credit card”, “order”, etc.).
Two scenarios on the predictive power of the relevant terms:
1. weak dependence with the target variable (harmonic mean of precision and recall equal to 63%);
2. strong dependence with the target variable (harmonic mean of precision and recall equal to 96%).
Validity concerns: the 2016 ICT case is close to Scenario 1; in Scenario 2 the y variable is almost observed.
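The harmonic mean of precision and recall is commonly called the F1 score. A minimal sketch of the calculation follows; the confusion counts are hypothetical and chosen only so that the result matches the 63% of Scenario 1, and they are not taken from the study.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0 when both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts for a term-based e-commerce classifier (illustration only).
true_pos, false_pos, false_neg = 63, 37, 37

precision = true_pos / (true_pos + false_pos)  # share of predicted "yes" that are correct
recall = true_pos / (true_pos + false_neg)     # share of true "yes" that are detected
f1 = f1_score(precision, recall)

print(f"precision={precision:.2f} recall={recall:.2f} F1={f1:.2f}")
```

Because the harmonic mean is pulled towards the smaller of the two values, a high score such as the 96% of Scenario 2 requires both precision and recall to be high.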
Simulation: Design and estimator
Supervised approach: target variable is observed in the sample
Stratified simple random sampling: strata by size class, cl1 … cl4. The largest inclusion probabilities are assigned to the large enterprises.
Unit non-response (cl1 r.p. = 0.45; cl2 r.p. = 0.88; cl3 r.p. = 0.95; cl4 r.p. = 0.97).
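The effect of class-specific non-response on the weights can be sketched as follows. The code reads r.p. as the response probability of each size class (an assumption), and the stratum population and sample sizes are hypothetical, since the slides do not report them. It compares a single overall correction factor n/r with a correction applied within each class.

```python
# Response probabilities by size class, as listed on the slide (r.p.).
RESPONSE_PROB = {"cl1": 0.45, "cl2": 0.88, "cl3": 0.95, "cl4": 0.97}

# Hypothetical stratum population and sample sizes (not from the source);
# larger classes get larger inclusion probabilities n_h / N_h.
POP = {"cl1": 8000, "cl2": 1500, "cl3": 400, "cl4": 100}
SAMPLE = {"cl1": 800, "cl2": 300, "cl3": 200, "cl4": 100}


def estimated_sizes(pop, sample, response_prob, by_class):
    """Expected weighted respondent count per stratum under each adjustment."""
    respondents = {h: sample[h] * response_prob[h] for h in sample}
    overall_factor = sum(sample.values()) / sum(respondents.values())  # n / r
    sizes = {}
    for h in pop:
        base_weight = pop[h] / sample[h]  # design weight N_h / n_h
        adjustment = 1 / response_prob[h] if by_class else overall_factor
        sizes[h] = respondents[h] * base_weight * adjustment
    return sizes


by_class = estimated_sizes(POP, SAMPLE, RESPONSE_PROB, by_class=True)
overall = estimated_sizes(POP, SAMPLE, RESPONSE_PROB, by_class=False)

for h in POP:
    print(f"{h}: N={POP[h]:5d}  class-adjusted={by_class[h]:8.1f}  overall-adjusted={overall[h]:8.1f}")
```

With these illustrative sizes, the class-level adjustment recovers each stratum size in expectation, whereas the overall factor under-represents cl1 and over-represents cl4. Since the probability of e-commerce differs by size class, such a distortion would carry over to the estimated total.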
Expected values

Domain   Size        e-commerce   Type
M1 cl1    3,074.09     430.45     L
M1 cl2      845.37     101.37     L
M1 cl3      520.21      88.42     L
M1 cl4    1,681.42     438.14     L
M2 cl1      151.53      16.63     S
M2 cl2       56.36       6.19     VS
M2 cl3       43.34       4.78     VS
M2 cl4      210.66      38.07     L
M3 cl1      728.30      29.16     S
M3 cl2      103.50       2.06     VS
M3 cl3       47.08       1.43     VS
M3 cl4      115.53       2.34     VS
M4 cl1    3,371.07     910.32     L
M4 cl2      604.77     193.47     L
M4 cl3      395.37     130.51     L
M4 cl4    1,914.43     613.66     L
Total    13,863.04   3,007.00
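Simulation: Design and estimator

Estimator: Mod1
Expression: $\hat{Y} = \sum_{U_1 - r_1} y_k b_k + \sum_{r} y_k b_k$
Description: Predict BD population and direct estimate using predicted + sampled values for estimating the parameter on the residual target population
Weight: Adjusted by non-response rate

Estimator: Mod2
Expression: $\hat{Y} = \sum_{U_1 - r_1} y_k w_k + \sum_{r} y_k w_k$
Description: Predict BD population and direct estimate using predicted + sampled values for estimating the parameter on the residual target population
Weight: Adjusted by class

Estimator: Des1
Expression: $\hat{Y} = (n/r) \sum_{r} y_k b_k$
Description: Direct estimation based on the sampled values
Weight: Adjusted by non-response rate

Estimator: Des2
Expression: $\hat{Y} = \sum_{r} y_k w_k$
Description: Direct estimation based on the sampled values
Weight: Adjusted by class (ok)

Estimator: Comb1
Expression: $\hat{Y} = \sum_{U_1 - r_1} y_k + \sum_{r_1} y_k + (n/r) \sum_{r - r_1} y_k b_k$
Description: Predict BD population and direct estimate using the sampled values for the residual target population
Weight: Adjusted by non-response rate

Estimator: Comb2
Expression: $\hat{Y} = \sum_{U_1 - r_1} y_k + \sum_{r_1} y_k + \sum_{r - r_1} y_k w_k$
Description: Predict BD population and direct estimate using the sampled values for the residual target population
Weight: Adjusted by class (ok)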
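As an illustration of the blended idea behind Comb2, the sketch below adds three pieces: predicted values for Big Data units that did not respond to the survey, observed values for respondents covered by Big Data, and weighted observed values for respondents outside Big Data. This is one possible reading of the expression, not the authors' implementation: the `Unit` class, its field names, the toy data, and the use of predicted probabilities rather than 0/1 classes are all assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Unit:
    in_big_data: bool              # unit is covered by the Big Data source (assumed U_1)
    respondent: bool               # unit responded to the survey (assumed r)
    y_observed: Optional[int]      # survey answer, available for respondents only
    y_predicted: Optional[float]   # prediction from scraped text, Big Data units only
    weight: float = 1.0            # non-response adjusted weight (assumed w_k)


def blended_estimate(units):
    """Sum of predicted, observed, and weighted observed values (assumed Comb2 form)."""
    total = 0.0
    for u in units:
        if u.in_big_data and u.respondent:
            total += u.y_observed             # observed value, Big Data respondent
        elif u.in_big_data:
            total += u.y_predicted            # predicted value, Big Data non-respondent
        elif u.respondent:
            total += u.y_observed * u.weight  # weighted value, respondent outside Big Data
    return total


# Toy data (hypothetical): three Big Data units and three units outside Big Data.
units = [
    Unit(True, True, 1, 0.9),
    Unit(True, False, None, 0.8),
    Unit(True, False, None, 0.1),
    Unit(False, True, 1, None, weight=5.0),
    Unit(False, True, 0, None, weight=5.0),
    Unit(False, False, None, None),  # not observed by either source
]

print(f"blended estimate: {blended_estimate(units):.1f}")
```

Only the last term relies on survey weights, so under this reading the prediction error affects the part of the population covered by Big Data and the non-response correction affects the remainder.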
Simulation: Results – 1,000 Monte Carlo estimates

Estimator  Scenario    Domain type      CV   RBIAS   RRMSE
Mod1       Scenario 1  VS           112.90  629.80  632.17
Mod1       Scenario 1  S             10.54  313.08  313.26
Mod1       Scenario 1  L             25.42   74.46   77.97
Mod1       Scenario 1  Total          1.82   28.47   28.53
Mod1       Scenario 2  VS           111.24   85.35  135.34
Mod1       Scenario 2  S              8.63   44.75   45.36
Mod1       Scenario 2  L             25.54   74.72   77.43
Mod1       Scenario 2  Total          0.65   19.11   19.12
Mod2       Scenario 1  VS            65.47  628.42  630.26
Mod2       Scenario 1  S             10.51  342.75  342.91
Mod2       Scenario 1  L             14.75   70.70   70.82
Mod2       Scenario 1  Total          1.83   27.72   27.78
Mod2       Scenario 2  VS            64.65   90.87   99.44
Mod2       Scenario 2  S              8.79   55.11   55.66
Mod2       Scenario 2  L             14.86   25.63   26.03
Mod2       Scenario 2  Total          0.67   17.54   15.56
Des1       -           VS           142.30   61.61  153.85
Des1       -           S             18.08  -25.88   31.57
Des1       -           L             14.75   62.56   62.73
Des1       -           Total          1.62   -9.36    9.50
Des2       -           VS            89.33   -1.59   89.33
Des2       -           S             23.97   -1.68   24.03
Des2       -           L              9.25    0.39    9.25
Des2       -           Total          1.92   -0.02    1.92

Mod1, Mod2: bias, especially from selectivity concerns and wrong NR correction; prediction affects the bias; small variance.
Des1: wrong NR correction.
Des2: unbiased.
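The slides do not define the three statistics, so the sketch below uses common Monte Carlo definitions as an assumption: CV is the standard deviation of the estimates relative to their mean, RBIAS is the mean error relative to the true value, and RRMSE is the root mean squared error relative to the true value, all in percent. The input values are hypothetical.

```python
import math
import statistics


def mc_summary(estimates, true_value):
    """Return (CV, RBIAS, RRMSE) in percent for a set of Monte Carlo estimates."""
    mean_est = statistics.fmean(estimates)
    std_est = statistics.pstdev(estimates)
    squared_errors = [(e - true_value) ** 2 for e in estimates]

    cv = 100 * std_est / mean_est
    rbias = 100 * (mean_est - true_value) / true_value
    rrmse = 100 * math.sqrt(statistics.fmean(squared_errors)) / true_value
    return cv, rbias, rrmse


# Hypothetical replicated estimates of a total whose true value is 100.
unbiased = mc_summary([90.0, 100.0, 110.0], true_value=100.0)
biased = mc_summary([118.0, 120.0, 122.0], true_value=100.0)

print("unbiased: CV=%.2f RBIAS=%.2f RRMSE=%.2f" % unbiased)
print("biased:   CV=%.2f RBIAS=%.2f RRMSE=%.2f" % biased)
```

Under these definitions a small CV does not imply a small RRMSE: in the second example the estimates vary little, yet the RRMSE is dominated by the bias, which is the pattern visible for Mod1 and Mod2 at the total level.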
Simulation: Results – 1,000 Monte Carlo estimates

Estimator  Scenario    Domain type      CV   RBIAS   RRMSE
Comb1      Scenario 1  VS            83.27  391.99  399.29
Comb1      Scenario 1  S              9.95  156.88  157.19
Comb1      Scenario 1  L             12.79   41.97   42.78
Comb1      Scenario 1  Total          1.48    1.46    2.08
Comb1      Scenario 2  VS            81.99  101.40  130.36
Comb1      Scenario 2  S             10.16  -18.48   21.09
Comb1      Scenario 2  L            12.961   32.20   34.70
Comb1      Scenario 2  Total          1.13   -3.82    3.99
Comb2      Scenario 1  VS            81.94  368.97  373.27
Comb2      Scenario 1  S             12.74  165.38  165.80
Comb2      Scenario 1  L             12.59   25.61   26.26
Comb2      Scenario 1  Total          1.58    5.39    5.62
Comb2      Scenario 2  VS            80.64   63.43  102.59
Comb2      Scenario 2  S             12.91   13.79   17.72
Comb2      Scenario 2  L             12.71    7.90   14.97
Comb2      Scenario 2  Total          1.26    0.11    1.26
Des1       -           VS           142.30   61.61  153.85
Des1       -           S             18.08  -25.88   31.57
Des1       -           L             14.75   62.56   62.73
Des1       -           Total          1.62   -9.36    9.50
Des2       -           VS            89.33   -1.59   89.33
Des2       -           S             23.97   -1.68   24.03
Des2       -           L              9.25    0.39    9.25
Des2       -           Total          1.92   -0.02    1.92

Comb1, Comb2: bias, but small for the total; small CVs, especially for small domains; competitive RRMSE.
Simulation: Results – 1,000 Monte Carlo estimates
The anticipated variance as a measure for the accuracy of complex multisource statistics P. Righi, P.D. Falorsi– Itacosm 2017, June 14-16 Bologna
Conclusion
1. The simulation highlights that the quality of the estimates is related to the error properties of a Big Data source: selectivity and validity (weak and strong dependence).
2. The results show that estimators that depend heavily on a Big Data source can fail, even when the source covers more than 60% of the population and carries powerful information.
"Statistical agencies need, above all, sources of data that cover a known population with error properties that are reasonably well understood and that are not likely to change under their feet" (Citro, 2014).
Conclusion
3. Mod1 and Mod2 assume (incorrectly) that the distribution of the y variable within the Big Data population is the same as in the target population.
4. Comb1 and Comb2 relax this strong assumption by adopting a blended approach, which gives better estimates.
"[…] very large amounts of data, […], will tend to produce estimates of apparently very high precision, essentially because of strong explicit or implicit assumptions of at most weak dependence underlying such methods" (Cox, 2015).
Conclusion
5. The performance of the compared estimators varies across the domains.
6. An exhaustive analysis should consider the complete set of quality dimensions, such as cost and timeliness.
"Quality is not an absolute. It must be evaluated relative to the stated aims of the survey and the purpose to which it is put, and the investment (time and money) in obtaining the data" (Couper, 2013).
What else
Currently we are studying the use of sampling weights for estimating the model parameters of the predictive estimators Mod1, Mod2, Comb1 and Comb2, to deal with the bias due to non-response (NR).
What else
The Internet as a data source could be useful for other purposes.
In the parallel analysis we studied the incoherence between survey data and predicted values (applying a non-parametric machine learning technique).
Interactive analysis of the websites without concordance shows:
a) 49% of misclassifications depend on measurement errors (wrong answers in the questionnaire);
b) in 73% of the wrong answers, the value changes from $y_k = 1$ to $y_k = 0$.
Interactive analysis of the cases with concordance between observed and predicted values shows:
c) 2% of cases hide measurement errors.
References
Barcaroli, G. (2016). Machine learning and statistical inference: the case of Istat survey on ICT. Proceedings of the 48th Scientific Meeting of the Italian Statistical Society.
Barcaroli, G., et al. (2015). Internet as Data Source in the Istat Survey on ICT in Enterprises. Austrian Journal of Statistics, 44, pp. 31-43.
Citro, C. F. (2014). From multiple modes for surveys to multiple data sources for estimates. Survey Methodology, 40, pp. 137-161.
Couper, M. P. (2013). Is the sky falling? New technology, changing media, and the future of surveys. Survey Research Methods, 7, pp. 145-156.
Cox, D. R. (2015). Big data and precision. Biometrika, 102, pp. 712-716.
Valliant, R., Dorfman, A. H., Royall, R. M. (2000). Finite Population Sampling and Inference: A Prediction Approach. Wiley, New York.