Data mining for official statistics: Challenges and opportunities
Bart Buelens, Piet Daas and Jan van den Brakel
Department of Methodology, Statistics Netherlands
Heerlen, Netherlands
[email protected]

Abstract: We present our vision on the use of data mining for official statistics, illustrate this with some examples, sketch a general framework, and provide directions for future research.

Keywords: sampling; estimation; modeling; prediction; big data
I. SUMMARY
We envisage data mining methods becoming increasingly important in the compilation of official statistics – quantitative information about the socio-economic situation. Countries must produce such statistics and generally have dedicated agencies for doing so; in the Netherlands this role is fulfilled by Statistics Netherlands (SN). Examples include the gross domestic product, the consumer price index, and key figures on demographics, the labor and housing markets, and businesses.

The use of data mining for official statistics is in its infancy. The primary reason is that survey sampling has traditionally been the method of choice for collecting the data of interest. While the field of survey statistics is well developed, research into alternative approaches is driven by several recent trends: reuse of data already available within government agencies, reduction of administrative burden, and budget cuts in the public sector. At the same time, there is an apparent abundance of data. With the digitization of society, many more agencies and organizations collect and process data than ever before. Some of these data may be useful for official statistics, replacing or complementing surveys.

Non-survey data sources are already being used for official statistics, insofar as they are complete. Examples are the population register, all tax returns submitted to the Tax Office, and the registers of all cars, all dwellings, etc. The use of such data for official statistics is studied in a subfield of official statistics known as register methodology.

However, many non-survey data sources are far from complete, or contain data that are only indirectly related to the statistics of interest. An example is a data set containing all calls made with mobile phones in a certain period of time and, for each call, the location of the antenna through which the call was made.
If the goal is to derive mobility statistics – where and how people move throughout the country – then the issues are: (1) the data set only contains information about people who own and use a mobile phone, not about others, and (2) it records calling behavior rather than travel behavior.

More generally, there is a need for methods that predict variables of interest from correlated variables, and that can also predict them for unobserved units. This is where data mining can play a role. Standard statistical modeling methods may be appropriate, but they often require (generalized) linearity, make certain distributional assumptions, and are not suited to the size of many contemporary data sets ("big" data).

At SN, preliminary research is underway, investigating data mining methods for big data. Besides the mobile phone example, data from inductive traffic loops are being used to estimate turnover in the transport industry, and social media messages are being used for opinion mining and, potentially, for estimating consumer confidence.

Topics of present and future research include: determining and linking suitable data sets to enhance predictions, establishing appropriate prediction and estimation methods, error budgeting, and model selection and validation. At the same time, it is imperative that official statistics produced in such novel ways comply with (inter)national standards and guidelines, and with privacy legislation. We present our vision on the use of data mining for official statistics, illustrate this with some examples, sketch a general framework, and provide directions for future research.
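
To make concrete the idea of predicting a variable of interest for unobserved units from a correlated variable, the following is a minimal sketch on synthetic data. It is not a method taken from the work at SN: the population, the coverage mechanism, the choice of a k-nearest-neighbour predictor, and all function names are assumptions made purely for illustration. An auxiliary variable x is known for every unit (as it would be from a register), while the target y is observed only for a selective subset. A naive estimate that scales up the observed mean is compared with an estimate that adds predictions for the unobserved units.

```python
import heapq
import math
import random


def simulate_population(n=2000, seed=1):
    """Synthetic population (assumption): x known for all units, y only if observed."""
    rng = random.Random(seed)
    units = []
    for _ in range(n):
        x = rng.uniform(18, 90)                       # auxiliary variable, e.g. age
        y = 60 * math.exp(-x / 40) + rng.gauss(0, 3)  # target, nonlinear in x
        p_obs = 0.9 if x < 50 else 0.2                # selective coverage of the source
        units.append((x, y, rng.random() < p_obs))
    return units


def knn_predict(x0, observed, k=10):
    """Mean y of the k observed units whose x is closest to x0."""
    nearest = heapq.nsmallest(k, observed, key=lambda u: abs(u[0] - x0))
    return sum(y for _, y in nearest) / len(nearest)


def estimate_totals(units, k=10):
    """Return (true total, naive estimate, prediction-based estimate) of sum(y)."""
    observed = [(x, y) for x, y, seen in units if seen]
    unobserved_x = [x for x, _, seen in units if not seen]
    observed_sum = sum(y for _, y in observed)

    true_total = sum(y for _, y, _ in units)
    # Naive: treat the observed units as if they were a random sample.
    naive = len(units) * observed_sum / len(observed)
    # Prediction-based: observed values plus predictions for unobserved units.
    model = observed_sum + sum(knn_predict(x, observed, k) for x in unobserved_x)
    return true_total, naive, model


if __name__ == "__main__":
    true_total, naive, model = estimate_totals(simulate_population())
    print(f"true total:       {true_total:10.1f}")
    print(f"naive estimate:   {naive:10.1f}")
    print(f"model estimate:   {model:10.1f}")
```

In this sketch the naive estimate is biased because coverage depends on x, which is also related to y. The prediction-based estimate corrects for this only because x is available for all units and coverage is assumed to depend on x alone. With real data sources, whether such an assumption holds is itself part of the model selection and validation question raised above.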