Quality issues when using Big Data in Official Statistics

Paolo Righi, Giulio Barcaroli, Natalia Golini

Outline

• Background
• Description of Real Data Context
• Simulation
• Conclusion

Quality issues when using Big Data in Official Statistics P. Righi, G. Barcaroli, N. Golini.

Background

• Big Data represent a concrete opportunity to improve official statistics while reducing costs and response burden
• The debate on these sources often focuses on volume, velocity and variety, and on the IT capability to capture, store, process and analyze Big Data for statistical production


Background

• NSOs (National Statistical Offices) usually face issues of veracity (data quality in terms of selectivity and trustworthiness of the information) and validity (data that are correct and accurate for the intended use)
• Veracity and validity affect the accuracy (bias and variance) of the estimates


Background

• The claim that the large amount of information coming from a Big Data source produces more reliable statistics than survey data is not sufficient in itself
• The question is: when does a large amount of data produce high-quality statistics?


Description of Real Data Context

• Istat carries out a Survey on the use of ICT by Enterprises
• Target population: enterprises with 10-249 employees in specific economic activities (about 180,000 units)
• Population with a website: about 132,000
• Sample size: about 23,000 units


Description of Real Data Context

• Section of the questionnaire focused on the use of the web


Simulation: Aims

• Could the information be extracted by a web scraping procedure?
• Could powerful auxiliary variables be extracted by a web scraping procedure?
• Istat has a list of URLs for accessing and scraping the websites
• The simulation is based on the evidence observed from web scraping + text processing + machine learning performed in a parallel analysis on real data

Simulation: Data

• Target population: 132,400
• Sample: 23,229 (respondents 13,863; non-respondents 9,366)
• Big Data: 100,996 (observable 68,676; unobservable 32,320)
• Target population without Big Data: 31,404
• Target parameter: number of enterprises doing e-commerce (online ordering)

General context: $Y = \sum_{U} y_k$, with $y_k = 1$ if enterprise $k$ does e-commerce

Simulation: Data

Expected probability of e-commerce

Domain   Big Data   No Big Data   Target population
M1 cl1   0.170      0.048         0.140
M1 cl2   0.154      0.023         0.120
M1 cl3   0.218      0.014         0.170
M1 cl4   0.333      0.000         0.261
M2 cl1   0.138      0.037         0.110
M2 cl2   0.124      0.027         0.110
M2 cl3   0.151      0.009         0.110
M2 cl4   0.222      0.000         0.181
M3 cl1   0.050      0.013         0.040
M3 cl2   0.026      0.004         0.020
M3 cl3   0.039      0.002         0.030
M3 cl4   0.025      0.000         0.020
M4 cl1   0.319      0.103         0.270
M4 cl2   0.379      0.081         0.320
M4 cl3   0.396      0.036         0.330
M4 cl4   0.396      0.000         0.321
Total    0.235      0.061         0.194

• The URL list is compiled on a voluntary basis
• Assumptions:
1. if an enterprise actively uses its website for business (for instance, doing e-commerce), then it has an interest in increasing its reachability, and therefore a higher probability of being in the URL list;
2. enterprise size increases the probability of doing e-commerce conditional on being in the URL list, and decreases it otherwise

Selectivity concerns: $prob(y_k = 1) = .235$ in the Big Data population against $prob(y_k = 1) = .061$ in the target population without Big Data
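The selectivity gap can be checked with simple arithmetic. The sketch below is only an illustration, not the authors' code: it mixes the two expected probabilities by the population counts reported on the data slide and measures how far the Big Data rate alone would sit from the target population rate. The function name `mixture_probability` is a hypothetical helper.

```python
# Counts and expected probabilities as reported on the slides.
N_BIG_DATA = 100_996      # enterprises in the URL list (Big Data population)
N_NO_BIG_DATA = 31_404    # target population not covered by Big Data
P_BIG_DATA = 0.235        # expected prob(y_k = 1) inside Big Data
P_NO_BIG_DATA = 0.061     # expected prob(y_k = 1) outside Big Data


def mixture_probability(n1, p1, n2, p2):
    """Overall probability as a size-weighted mix of two subpopulations."""
    return (n1 * p1 + n2 * p2) / (n1 + n2)


p_target = mixture_probability(N_BIG_DATA, P_BIG_DATA, N_NO_BIG_DATA, P_NO_BIG_DATA)

# Relative bias (%) of applying the Big Data rate to the whole target population.
naive_rel_bias = 100 * (P_BIG_DATA - p_target) / p_target

print(f"target population rate: {p_target:.3f}")
print(f"relative bias of the naive Big Data rate: {naive_rel_bias:.1f}%")
```

The mixed rate agrees with the 0.194 total reported for the target population, which is why an estimator that treats the Big Data units as representative overstates e-commerce.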

Simulation: Web scraping and text processing

• Simulated automatic scraping
• Collected all texts from the websites and applied text mining and natural language processing techniques
• Identified relevant terms as predictors (examples: “add to cart”, “credit card”, “order”, etc.)
• Two scenarios on the prediction from the relevant terms:
1. weak dependence with the target variable (harmonic mean of the precision and recall indicators equal to 63%);
2. strong dependence with the target variable (harmonic mean of the precision and recall indicators equal to 96%)

Validity concerns
• The 2016 ICT survey is close to Scenario 1
• In Scenario 2 the y variable is almost observed
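The harmonic mean of precision and recall used to define the two scenarios is commonly called the F1 score. The sketch below shows how it is computed from classification counts; the counts are hypothetical and were chosen only so that the results match the 63% and 96% levels quoted in the scenarios.

```python
def precision_recall_f1(true_pos, false_pos, false_neg):
    """Return (precision, recall, F1) from confusion-matrix counts."""
    predicted_pos = true_pos + false_pos
    actual_pos = true_pos + false_neg
    precision = true_pos / predicted_pos if predicted_pos else 0.0
    recall = true_pos / actual_pos if actual_pos else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts (not from the slides) reproducing the two scenario levels.
weak = precision_recall_f1(true_pos=63, false_pos=37, false_neg=37)
strong = precision_recall_f1(true_pos=96, false_pos=4, false_neg=4)

print(f"Scenario 1 (weak dependence):   F1 = {weak[2]:.2f}")
print(f"Scenario 2 (strong dependence): F1 = {strong[2]:.2f}")
```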

Simulation: Design and estimator

• Supervised approach: target variable is observed in the sample

• Stratified simple random sampling: strata defined by size classes cl1, …, cl4
• Largest inclusion probabilities assigned to the large enterprises
• Unit non-response, with response probabilities (r.p.) by class: cl1 r.p. = 0.45; cl2 r.p. = 0.88; cl3 r.p. = 0.95; cl4 r.p. = 0.97

Expected values

Domain   Size        e-commerce   Type
M1 cl1   3,074.09    430.45       L
M1 cl2   845.37      101.37       L
M1 cl3   520.21      88.42        L
M1 cl4   1,681.42    438.14       L
M2 cl1   151.53      16.63        S
M2 cl2   56.36       6.19         VS
M2 cl3   43.34       4.78         VS
M2 cl4   210.66      38.07        L
M3 cl1   728.30      29.16        S
M3 cl2   103.50      2.06         VS
M3 cl3   47.08       1.43         VS
M3 cl4   115.53      2.34         VS
M4 cl1   3,371.07    910.32       L
M4 cl2   604.77      193.47       L
M4 cl3   395.37      130.51       L
M4 cl4   1,914.43    613.66       L
Total    13,863.04   3,007.00

Simulation: Design and estimator

In the expressions, $\hat{y}_k$ denotes a predicted value and $y_k$ an observed value.

Mod1
Expression: $\hat{Y} = \sum_{U_1 - r_1} \hat{y}_k b_k + \sum_{r} y_k b_k$
Description: predict BD population and direct estimate using predicted + sampled values for estimating the parameter on the residual target population
Weight: adjusted by non-response rate

Mod2
Expression: $\hat{Y} = \sum_{U_1 - r_1} \hat{y}_k w_k + \sum_{r} y_k w_k$
Description: as Mod1
Weight: adjusted by class

Des1
Expression: $\hat{Y} = (n/r) \sum_{r} y_k b_k$
Description: direct estimation based on the sampled values
Weight: adjusted by non-response rate

Des2
Expression: $\hat{Y} = \sum_{r} y_k w_k$
Description: direct estimation based on the sampled values
Weight: adjusted by class (ok)

Comb1
Expression: $\hat{Y} = \sum_{U_1 - r_1} \hat{y}_k + \sum_{r_1} y_k + (n/r) \sum_{r - r_1} y_k b_k$
Description: predict BD population and direct estimate using the sampled values for the residual target population
Weight: adjusted by non-response rate

Comb2
Expression: $\hat{Y} = \sum_{U_1 - r_1} \hat{y}_k + \sum_{r_1} y_k + \sum_{r - r_1} y_k w_k$
Description: as Comb1
Weight: adjusted by class (ok)
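To illustrate why the weight adjustment matters for the direct estimators, the sketch below compares Des1 (one overall n/r factor) with Des2 (adjustment within each size class). It is a first-order approximation under stated assumptions, not the authors' simulation: the response probabilities are those given for cl1 to cl4, the e-commerce rates are the M1 target population values, the stratum population and sample sizes are hypothetical, $b_k$ is assumed to be the design weight N_h/n_h, and expected respondent counts stand in for random ones.

```python
# Response probabilities by size class, as reported for the sampling design.
RESPONSE_PROB = {"cl1": 0.45, "cl2": 0.88, "cl3": 0.95, "cl4": 0.97}
# E-commerce rates: M1 rows of the target population column.
ECOMMERCE_RATE = {"cl1": 0.140, "cl2": 0.120, "cl3": 0.170, "cl4": 0.261}
# Hypothetical stratum population and sample sizes (not from the slides).
POP_SIZE = {"cl1": 10_000, "cl2": 3_000, "cl3": 1_500, "cl4": 500}
SAMPLE_SIZE = {"cl1": 1_000, "cl2": 600, "cl3": 500, "cl4": 400}


def expected_direct_estimates():
    """Return (true total, expected Des1, expected Des2)."""
    classes = list(POP_SIZE)
    true_total = sum(POP_SIZE[h] * ECOMMERCE_RATE[h] for h in classes)
    # Assumed base weight b_k: inverse inclusion probability N_h / n_h.
    base_w = {h: POP_SIZE[h] / SAMPLE_SIZE[h] for h in classes}
    exp_resp = {h: SAMPLE_SIZE[h] * RESPONSE_PROB[h] for h in classes}
    n = sum(SAMPLE_SIZE.values())
    r = sum(exp_resp.values())
    # Des1: base weights inflated by a single overall factor n / r.
    des1 = (n / r) * sum(
        exp_resp[h] * ECOMMERCE_RATE[h] * base_w[h] for h in classes
    )
    # Des2: weights w_k adjusted for non-response within each class.
    class_w = {h: base_w[h] * SAMPLE_SIZE[h] / exp_resp[h] for h in classes}
    des2 = sum(exp_resp[h] * ECOMMERCE_RATE[h] * class_w[h] for h in classes)
    return true_total, des1, des2


true_total, des1, des2 = expected_direct_estimates()
for name, est in (("Des1", des1), ("Des2", des2)):
    rel_bias = 100 * (est - true_total) / true_total
    print(f"{name}: relative bias {rel_bias:+.1f}%")
```

With these inputs Des2 recovers the true total, while Des1 underestimates it because the low-response class cl1 is not inflated enough by a single overall factor.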

Simulation: Results – 1,000 Monte Carlo estimates

Statistics by domain type (VS, S, L) and for the total

Estimator         Statistic   VS       S        L       Total
Mod1 Scenario 1   CV          112.90   10.54    25.42   1.82
Mod1 Scenario 1   RBIAS       629.80   313.08   74.46   28.47
Mod1 Scenario 1   RRMSE       632.17   313.26   77.97   28.53
Mod1 Scenario 2   CV          111.24   8.63     25.54   0.65
Mod1 Scenario 2   RBIAS       85.35    44.75    74.72   19.11
Mod1 Scenario 2   RRMSE       135.34   45.36    77.43   19.12
Mod2 Scenario 1   CV          65.47    10.51    14.75   1.83
Mod2 Scenario 1   RBIAS       628.42   342.75   70.70   27.72
Mod2 Scenario 1   RRMSE       630.26   342.91   70.82   27.78
Mod2 Scenario 2   CV          64.65    8.79     14.86   0.67
Mod2 Scenario 2   RBIAS       90.87    55.11    25.63   17.54
Mod2 Scenario 2   RRMSE       99.44    55.66    26.03   15.56
Des1              CV          142.30   18.08    14.75   1.62
Des1              RBIAS       61.61    -25.88   62.56   -9.36
Des1              RRMSE       153.85   31.57    62.73   9.50
Des2              CV          89.33    23.97    9.25    1.92
Des2              RBIAS       -1.59    -1.68    0.39    -0.02
Des2              RRMSE       89.33    24.03    9.25    1.92

• Mod1, Mod2: bias, especially from selectivity concerns + wrong NR correction; prediction affects the bias; small variance
• Des1: wrong NR correction
• Des2: unbiased
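The three statistics reported for each estimator can be computed from the Monte Carlo replicates as sketched below. The slides do not spell out the formulas, so the definitions here are assumptions: all three are expressed in percent of the true value, with CV based on the standard deviation of the replicates. The function name `accuracy_measures` and the toy replicates are hypothetical.

```python
import math
import statistics


def accuracy_measures(estimates, true_value):
    """CV, RBIAS and RRMSE (all in % of the true value) of Monte Carlo estimates."""
    estimates = list(estimates)
    mean_est = statistics.fmean(estimates)
    # Standard deviation over the replicates (population form, divisor = count).
    std_est = statistics.pstdev(estimates)
    mse = statistics.fmean([(e - true_value) ** 2 for e in estimates])
    return {
        "CV": 100 * std_est / true_value,
        "RBIAS": 100 * (mean_est - true_value) / true_value,
        "RRMSE": 100 * math.sqrt(mse) / true_value,
    }


# Toy replicates (hypothetical); a real run would hold 1,000 estimates.
measures = accuracy_measures([90.0, 100.0, 110.0, 120.0], true_value=100.0)
for name, value in measures.items():
    print(f"{name}: {value:.2f}")
```

Under these definitions RRMSE² = CV² + RBIAS² for a single domain, so an estimator with small CV can still have a large RRMSE when its bias is large.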

Simulation: Results – 1,000 Monte Carlo estimates

Statistics by domain type (VS, S, L) and for the total

Estimator          Statistic   VS       S        L        Total
Comb1 Scenario 1   CV          83.27    9.95     12.79    1.48
Comb1 Scenario 1   RBIAS       391.99   156.88   41.97    1.46
Comb1 Scenario 1   RRMSE       399.29   157.19   42.78    2.08
Comb1 Scenario 2   CV          81.99    10.16    12.961   1.13
Comb1 Scenario 2   RBIAS       101.40   -18.48   32.20    -3.82
Comb1 Scenario 2   RRMSE       130.36   21.09    34.70    3.99
Comb2 Scenario 1   CV          81.94    12.74    12.59    1.58
Comb2 Scenario 1   RBIAS       368.97   165.38   25.61    5.39
Comb2 Scenario 1   RRMSE       373.27   165.80   26.26    5.62
Comb2 Scenario 2   CV          80.64    12.91    12.71    1.26
Comb2 Scenario 2   RBIAS       63.43    13.79    7.90     0.11
Comb2 Scenario 2   RRMSE       102.59   17.72    14.97    1.26
Des1               CV          142.30   18.08    14.75    1.62
Des1               RBIAS       61.61    -25.88   62.56    -9.36
Des1               RRMSE       153.85   31.57    62.73    9.50
Des2               CV          89.33    23.97    9.25     1.92
Des2               RBIAS       -1.59    -1.68    0.39     -0.02
Des2               RRMSE       89.33    24.03    9.25     1.92

• Comb1, Comb2: bias, but small for the total
• Small CVs, especially for small domains
• Competitive RRMSE
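The blended Comb2 estimator can be sketched as a sum of three parts: predicted values for Big Data units outside the respondent set, observed values for respondents covered by Big Data, and weighted observed values for respondents outside Big Data. The code below is a minimal illustration assuming that reading of the expression; the `Unit` record, its field names and the toy data are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Unit:
    in_big_data: bool                    # unit belongs to the Big Data population U_1
    respondent: bool                     # unit belongs to the respondent set r
    y_observed: Optional[int] = None     # survey answer (respondents only)
    y_predicted: Optional[float] = None  # model prediction (Big Data units only)
    weight: float = 1.0                  # class-adjusted weight w_k


def comb2_estimate(units):
    """Blend predictions, observations and weighted observations into one total."""
    total = 0.0
    for u in units:
        if u.in_big_data and not u.respondent:
            total += u.y_predicted            # sum over U_1 - r_1
        elif u.in_big_data and u.respondent:
            total += u.y_observed             # sum over r_1
        elif u.respondent:
            total += u.y_observed * u.weight  # sum over r - r_1
        # Units outside Big Data that did not respond contribute nothing.
    return total


# Toy data (hypothetical).
units = [
    Unit(in_big_data=True, respondent=False, y_predicted=0.8),
    Unit(in_big_data=True, respondent=False, y_predicted=0.1),
    Unit(in_big_data=True, respondent=True, y_observed=1),
    Unit(in_big_data=False, respondent=True, y_observed=1, weight=12.0),
    Unit(in_big_data=False, respondent=True, y_observed=0, weight=12.0),
]
estimate = comb2_estimate(units)
print(f"Comb2 estimate: {estimate:.1f}")
```

Because the survey weights act only on the part of the population that Big Data does not cover, prediction errors affect only the Big Data part of the total.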


Simulation: Results – 1,000 Monte Carlo estimates

The anticipated variance as a measure for the accuracy of complex multisource statistics P. Righi, P.D. Falorsi– Itacosm 2017, June 14-16 Bologna


Conclusion

1. The simulation highlights how the quality of the estimates relates to the error properties of a Big Data source: selectivity and validity (weak and strong dependence)
2. The results show that estimators highly dependent on a Big Data source can fail, even when the source covers more than 60% of the population and carries powerful information

Statistical agencies need, above all, sources of data that cover a known population with error properties that are reasonably well understood and that are not likely to change under their feet (Citro, 2014).

Conclusion

3. Mod1 and Mod2 assume, incorrectly, that the distribution of the y variable within the Big Data population is the same as in the target population
4. Comb1 and Comb2 relax this strong assumption by moving to a blended approach, which gives better estimates

[…] very large amounts of data, […], will tend to produce estimates of apparently very high precision, essentially because of strong explicit or implicit assumptions of at most weak dependence underlying such methods (Cox, 2015)

Conclusion

5. The performance of the compared estimators varies across the domains
6. An exhaustive analysis should consider the complete set of quality dimensions, such as cost and timeliness

Quality is not an absolute. It must be evaluated relative to the stated aims of the survey and the purpose to which is put, and the investment (time and money) in obtaining the data (Couper, 2013)


What else

• We are currently studying the use of sampling weights for estimating the model parameters of the predictive estimators Mod1, Mod2, Comb1 and Comb2, to deal with the bias due to non-response (NR)

What else  Internet as data source could be useful for other purposes  In the parallel analysis we studied the incoherence between survey data and predicted values (applying non-parametric machine learning technique)  Interactive analysis of the websites without concordance shows a) 49% of miss-classifications depend on measurement errors (wrong answer in the questionnaire) b) 73% of wrong answer changes from 𝑦𝑘 = 1 to 𝑦𝑘 = 0  Interactive analysis of concordance between observed and predicted values shows c) 2% of cases hides measurement errors Quality issues when using Big Data in Official Statistics P. Righi, G. Barcaroli , N. Golini.

References

• Barcaroli G. (2016). Machine learning and statistical inference: the case of Istat survey on ICT. Proceedings of the 48th Scientific Meeting of the Italian Statistical Society.
• Barcaroli G., et al. (2015). Internet as Data Source in the Istat Survey on ICT in Enterprises. Austrian Journal of Statistics, 44, pp. 31-43.
• Citro C. F. (2014). From multiple modes for surveys to multiple data sources for estimates. Survey Methodology, 40, pp. 137-161.
• Couper M. P. (2013). Is the sky falling? New technology, changing media, and the future of surveys. Survey Research Methods, 7, pp. 145-156.
• Cox D. R. (2015). Big data and precision. Biometrika, 102, pp. 712-716.
• Valliant R., Dorfman A. H., Royall R. M. (2000). Finite Population Sampling and Inference: A Prediction Approach. Wiley, New York.