
Proceedings of Q2008 European Conference on Quality in Official Statistics

Quality Challenges in Processing Administrative Data to Produce Short-Term Labour Cost Statistics

Congia M.Carla, Pacini Silvia, Tuzi Donatella
Italian National Statistical Institute (Istat)¹

Abstract

The Italian Oros Survey on Wages and Labour Cost is an innovative case of short-term statistics produced with the help of administrative sources in order to cover enterprises of all sizes in the private sector. The use of administrative data in short-term statistics requires attention to unusual aspects of statistical quality. Statisticians cannot prevent or reduce non-sampling errors when the raw administrative data are captured, and some traditional ex-post editing techniques, such as questionnaire revision and enterprise recalling, are not applicable. The complexity of the production process is also caused by the huge number of records and the highly disaggregated level of the raw data. In fact, given the tight release schedule, the Italian NSO was obliged to capture data from the Italian Social Security Institute without any previous aggregation or checking. The retrieval and translation of the administrative data into statistical information is therefore one of the most critical aspects to be faced at the beginning of the process, and its effectiveness has a significant impact on the quality of the final indicators. For this purpose, a metadata database, used as a translation scheme, has to be produced and constantly updated by Istat according to laws, regulations and contribution rates, so as to follow the frequent changes in concepts and definitions. On the other hand, the availability of very disaggregated data allows a very rich information source to be exploited for different statistical aims, and gives more direct control over the whole translation phase. Once the statistical variables are available, a more traditional micro-level check procedure is applied. Editing of outliers and anomalous values, and imputation of unit non-responses, may consequently be needed, with particular attention to influential observations. Considerations linked to data quality suggest a micro-level integration between the administrative source and the Large Enterprises Survey data. This integration, however, involves record linkage and the computation of harmonised variables. At the end of the process, a macro data validation is carried out which includes, among other aspects, time series analysis and macro-level comparisons with other statistical sources. Finally, the standardisation and documentation of the whole check and editing process is a fundamental target of the Oros quality procedure.

Key words: Administrative data, Short-term statistics, Editing and imputation

1 Congia M.Carla. Istat. Short Term Employment and Income Statistics, Via Tuscolana, 1782 – 00173 Rome, Italy. E-mail: [email protected]. Pacini Silvia. Istat. Short Term Employment and Income Statistics, Via Tuscolana, 1782 – 00173 Rome, Italy. Email: [email protected]. Tuzi Donatella. Istat. Short Term Employment and Income Statistics, Via Tuscolana, 1782 – 00173 Rome, Italy. E-mail: [email protected].

1. Introduction

The Italian Oros Survey is an innovative example of a register-based survey making massive use of administrative data to produce short-term statistics. Besides the unusual quality aspects that commonly affect surveys based on administrative sources, the tight time constraints make the Oros quality strategy quite distinctive. The timeliness of the release, in fact, obliged the Italian Statistical Institute to acquire data without any previous checking or aggregation by the administrative system, with two opposite consequences. On the one hand, highly disaggregated raw data provide a rich set of exploitable information and allow direct control over most aspects of the processing. On the other hand, this large autonomy in the use of the data requires hard work to guarantee a high standard of quality. Quality is heavily influenced by the frequent changes in administrative regulation, the lack of standardised metadata for translating the administrative information, and the complex structure of the raw data to be processed.

In this paper we illustrate the main issues of the quality strategy implemented for the Oros Survey, a vast set of actions that cover the entire production process. Section 2 gives a short description of the Survey and of its sources, while the peculiarities of the quality strategy are presented in Section 3. The following sections go through the most interesting quality issues. Section 4 illustrates the problems to be faced in exploiting the structural information of the quarterly administrative register, while Section 5 explains the complex steps in the retrieval of the statistical variables from the administrative information, which require building and updating a metadata system on the administrative data. Once the statistical variables are available, a more traditional micro-level editing of outliers and anomalous values is applied, with particular attention to influential observations, as described in Section 6. Non-responses, which affect a particular sub-population of the Oros Survey, are corrected through imputation as reported in Section 7. Special attention is given to large enterprises, for which a micro-level integration with a statistical source is carried out, involving record-linkage and harmonisation issues as described in Section 8. Section 9 illustrates a key step in the Oros quality target: checks on macro data to identify possible residual errors through time series analysis and macro-level comparisons with other statistical sources.

2. The Italian Oros Survey

The Oros² Survey is a register-based survey that produces quarterly information on changes in gross wages, other labour costs and total labour cost for all Italian private non-agricultural firms (Baldi et al., 2004). The target sectors are sections C to K of the European Community classification (Nace Rev.1.1). Each quarter, provisional indicators on the variables of interest are released about 70 days after the reference period. A revision of the preliminary data follows with a delay of 15 months. Until 2002, the Italian National Statistical Institute had collected this information monthly, but only for firms with 500 or more employees, through the Survey on Labour Input Variables in Large Firms (hereafter Large Enterprises Survey - LES).

2 The acronym Oros stands for Occupazione (Employment), Retribuzioni (Wages), Oneri Sociali (Other Labour Costs). At the moment, figures on employment are produced to calculate the per capita values but are not released.

Given the structure of the Italian business population, mainly composed of small and medium enterprises (about 40% of total employees are concentrated in units with fewer than 20 persons employed), and the mandatory requirements of two European Community Regulations (the STS, Short Term Statistics Regulation, n. 1165/98, and the LCI, Labour Cost Index Regulation, n. 450/2003), the Oros Survey was planned to extend coverage to all business size classes. The huge number of small enterprises and the extremely dynamic nature of Italian firms (frequent births and cessations) would have required an excessively large and costly sample survey, with a considerable impact on the statistical burden on enterprises. Alternative solutions were therefore explored: the use of the administrative data from the employers' social contribution declarations to the Italian National Social Security Institute (INPS) was carefully evaluated and preferred.

The Oros Survey is now mainly based on two INPS sources, containing social security data and structural details on the administrative units. The first source is the archive of the monthly social contribution declarations. It consists of the electronic forms (i.e. DM10) that all firms with at least one employee have to transmit to INPS by the 30th day after the end of the reference month. Given the Oros release schedule, Istat asked INPS to transmit this electronic information as soon as it is uploaded to the central database, which means that the data are completely raw and have not yet undergone the ordinary administrative check procedures. This set of information, transmitted to Istat about 35 days after the end of the reference quarter, is used to produce the provisional estimates (figure 1). This "provisional population" is extremely large and covers about 95-98% of the entire population, which becomes available with a delay of about 12 months and is used to produce the final estimates.
In the current situation, about 1.3 million employers are considered each quarter, covering about 10 million employees.

Figure 1. Time schedule for the quarter t estimates: data acquisition, processing and release
[Figure not reproduced. Timeline labels: Quarter t; Administrative Register; Provisional Population of DM10; Provisional Estimates; Final Population of DM10; Final Estimates. Time marks: 28 days, 35 days, 70 days, 12 months, 15 months.]

The coverage of the preliminary data has always been very large, but it has grown considerably over time for administrative reasons: the number of firms sending the DM10 declarations electronically (those arriving faster at INPS) gradually increased as delivery via the internet became compulsory. Since spring 2004 the preliminary population has nearly reached the size of the full population (figure 2), allowing a significant simplification of the estimation procedures but implying a growing quantity of data to be processed. Before April 2004 the provisional set of information was treated as a non-random sample, which required a predictive estimation method based on calibration (Baldi et al., 2004). At present, the provisional estimates of the target variables are calculated, like the final estimates, by simply summing up all the available data.

Figure 2. Provisional and final population of DM10 forms. January 2000 – December 2007.
[Figure not reproduced. Series: Provisional Population, Final Population. Vertical axis from 200,000 to 1,400,000; horizontal axis by quarter (Jan, Apr, Jul, Oct), 2000 to 2007.]
Source: Oros Survey.

The second INPS source is the Administrative Register (AR), which contains structural information on each administrative unit. It is downloaded at the end of each quarter and needs some treatment to be suitable for statistical purposes. For large firms, the administrative data are combined with the Istat monthly survey (LES), mainly because these firms were initially not well represented in the non-random sample and, more generally, in order to obtain more accurate estimates. Particular attention is also paid to the estimation of temporary employment agencies, which have to be treated carefully because of their peculiarities. In brief, four sub-populations are singled out and subjected to specific treatments:
1. small and medium enterprises (SME);
2. large enterprises surveyed by LES (estimated with LES data);
3. large enterprises not surveyed by LES (estimated with INPS data);
4. temporary employment agencies.
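The split into four sub-populations can be sketched as a simple routing rule. This is only an illustration, not the Oros production logic: the `Unit` record, its two boolean flags and the order of the tests are assumptions, while the 500-employee threshold is the LES size limit quoted in the text.

```python
from dataclasses import dataclass

LARGE_THRESHOLD = 500  # LES covers firms with 500 or more employees


@dataclass
class Unit:
    """Hypothetical unit record; the field names are illustrative only."""
    fiscal_code: str
    employees: int
    is_temp_agency: bool  # assumed flag marking temporary employment agencies
    in_les: bool          # assumed flag: the unit is surveyed by LES


def sub_population(unit: Unit) -> str:
    """Return the treatment group of a unit.

    Assumption: agencies are tested first, so that they never fall
    into one of the large-enterprise groups.
    """
    if unit.is_temp_agency:
        return "temporary employment agency"
    if unit.employees >= LARGE_THRESHOLD:
        return "large, LES data" if unit.in_les else "large, INPS data"
    return "SME"
```

Each group then follows its own estimation path, so the routing has to be mutually exclusive and exhaustive, which is what the ordered tests are meant to guarantee.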

3. The quality strategy in a context of timely and extensive use of administrative data

The use of administrative data to produce business statistics raises statistical quality issues that partly differ from those emerging in more conventional sample surveys (Eurostat, 2003). In traditional surveys many non-sampling errors can be prevented and/or reduced during the design phase, whereas the general quality of an administrative source is entirely outside the statisticians' control (Wallgren and Wallgren, 2007). Moreover, some traditional ex-post check and editing techniques, such as questionnaire revision and enterprise recalling, are not applicable to register-based surveys. In addition, the Oros Survey has to produce short-term indicators with high timeliness. This time constraint obliged the Italian NSI to capture data from the Italian Social Security Institute at a very disaggregated level, without any previous aggregation or checking. This has required attention to unusual quality issues and the planning of a specific quality strategy along the whole process.

One of the most critical aspects has been the translation of the administrative information into statistical data. Its effectiveness has a significant impact on the quality of the final indicators; at the same time, the lack of a complete metadata system on the administrative data makes this step particularly hard. For this purpose, Istat has built an ad hoc metadata database, updated quarterly, which collects laws, regulations and other important information on social security contribution rates in order to take their frequent changes into account. This database is essential for the complex retrieval of the statistical variables and for the preliminary checks on the raw DM10 declaration data.

Because the extensive use of administrative data means capturing a huge quantity of records every quarter, selective editing has been necessary: checks are carried out only on the units with the strongest influence on the target variables. To take particular care of the quality of large enterprise data, an integration with the Istat Large Enterprises Survey data has been implemented. Direct control by the survey experts guarantees higher data quality but, at the same time, combining the two sources requires dealing with variable harmonisation and linkage issues. Mismatches due to problems with the key variables have to be investigated and manually corrected. Some other large firms, not included in the target population of the statistical survey, have to be estimated using administrative data. To improve the quality of the estimates for this sub-population, their data undergo specific treatment for measurement errors and non-response imputation.

Some structural information recorded in the administrative register captured quarterly from INPS is also subjected to specific check procedures. In particular, the focus is on information whose quality may be lower because it is not crucial for administrative purposes, even though it plays an important role in statistical use. For example, the fiscal code, which is the primary key for the linkage between administrative and statistical data, is checked and if necessary corrected, while the economic activity classification code is drawn from the Istat Statistical Business Register (ASIA).

The difficulties in using administrative data, which are subject to frequent changes in the laws and rules of the social security system, have required not only a systematic sequence of checks along the whole process, but also an important final step of macro data editing. Another significant consequence of the recurrent changes in administrative rules is the frequent modification of the procedures used. Every new version is saved and the whole production process is systematically documented. In particular, all checks are recorded in order to maintain useful time series of the errors detected and the corrections carried out. This documentation is fundamental for monitoring data quality, suggesting improvements to the process and guaranteeing its reproducibility and repeatability. A more user-oriented documentation of the quality features of the Oros indicators is periodically produced through standard quality reports and the updating of a set of quality indicators (Congia, Rapiti, 2008).

4. The treatment of business structural information

Given the dynamics of Italian firms in terms of births and cessations, an up-to-date representation of the current population is fundamental. Considering that the Italian Statistical Business Register (BR–ASIA) is available with a delay of about two years from the reference quarter, the quarterly updated administrative register (AR) can play a strategic role in representing firms' demography. It contains a wealth of structural information on each administrative unit (administrative identification code, fiscal code, name, legal form, dates of registration and cancellation, etc.); it is regularly updated whenever employers transmit a modification, and new units are registered as they enter. Nevertheless, several aspects related to the administrative nature of this rich set of information make the AR not immediately usable for the survey purposes. Some preliminary edits are needed to make the administrative data suitable for statistical aims. In particular, quality issues on some variables and over-coverage problems have to be addressed.

Special attention is given to verifying and improving the quality of the fiscal code, a business identity number (BIN) used by different administrative systems to identify enterprises as legal units. Given its wide diffusion, it is an important linkage variable with other data sources (administrative and statistical). Because it is neither the primary identification key in the INPS AR nor a relevant administrative variable, the fiscal code may be affected by formal errors or even be missing. To prevent mismatches, it is subjected to an automatic procedure that checks the BINs and reconstructs the incorrect ones through some subsidiary keys. The impact of these corrections has gradually decreased over time as the quality of this information has improved at the administrative level.

The fiscal code is used to link the single AR units with the BR–ASIA in order to draw the official economic activity classification. Although a NACE code is assigned by INPS operators and recorded in the AR, it can be of lower quality because it is not relevant for administrative purposes. By matching the AR with the most recent available BR, roughly 90% of the active Oros units obtain the official economic classification. For the unmatched units (mainly new-born firms or legal transformations), the administrative information on economic activity is kept. Tests have shown that these residual units are fairly evenly spread across the domains of interest and that this administrative variable is of good quality overall, but the most relevant units are regularly supervised.

Finally, the survey target population has to be delimited in the AR by excluding the out-of-scope units. These are several thousand units whose legal form lies outside the scope of the survey (public sector) or whose economic activity is not included in the target sections. It should be noted that the AR suffers from other over-coverage problems: while entries are correctly registered, temporary suspensions of activity and firm closures are under-recorded because firms have no incentive to comply with the obligation to communicate these events to the administrative system³. At present the activity state of the units is not predicted, because the large number of declarations available makes it unnecessary, for the time being, to identify unit non-responses for the estimation of the target variables. The only exception is the temporary employment agencies, whose peculiarities require an estimation procedure that corrects for non-responses; here a predictive approach has been adopted to neutralise the effect of the over-coverage problem.
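The assignment of the economic activity code described in this section (official code where the fiscal code matches the BR, administrative code otherwise) can be illustrated with a minimal sketch. The dictionary layout and the field names `fiscal_code`, `admin_nace`, `nace` and `nace_source` are assumptions made for the example; they are not the actual AR or BR–ASIA record structures.

```python
def assign_nace(ar_units, br_nace_by_fiscal_code):
    """Attach an activity code to each AR unit.

    ar_units: list of dicts with hypothetical keys 'fiscal_code' and 'admin_nace'.
    br_nace_by_fiscal_code: dict mapping fiscal code -> official NACE code.
    """
    result = []
    for unit in ar_units:
        official = br_nace_by_fiscal_code.get(unit["fiscal_code"])
        if official is not None:
            # Matched unit: the official classification replaces the INPS one.
            result.append({**unit, "nace": official, "nace_source": "BR"})
        else:
            # Unmatched unit (e.g. new-born firm): keep the administrative code.
            result.append({**unit, "nace": unit["admin_nace"], "nace_source": "AR"})
    return result


def match_rate(units):
    """Share of units that obtained the official classification."""
    matched = sum(1 for u in units if u["nace_source"] == "BR")
    return matched / len(units) if units else 0.0
```

Keeping the source of the code next to the code itself makes it straightforward to monitor the match rate each quarter and to list the unmatched units that deserve manual supervision.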

3 This over-coverage problem affects about 20% of the AR units.

5. The translation of the administrative data into statistical variables

The strategy of the Oros survey for exploiting administrative data has focused on acquiring the whole data source on the wage and contribution system. In fact, INPS could not aggregate the contribution declaration (DM10) data into the format required for Oros within the very tight schedule. Istat has therefore been obliged to capture the extremely disaggregated raw data as they are transmitted by firms to the Italian Social Security Institute. This constraint has become an opportunity, because it allows more direct control over the aggregation/translation process and over its impact on the quality of the Oros indicators. On the other hand, it implies a very complex preliminary phase of checks and computations within each DM10 to obtain the target variables at micro level.

The correct exploitation of this huge quantity of administrative data entails coping with frequent changes in the basic INPS metadata. Enterprises have to use the DM10 to take advantage of labour cost reduction policies, and the laws on contributions change continuously. Since only fragmented and insufficient INPS metadata are available, translating the administrative information into statistical data is one of the most critical aspects to deal with. To guarantee a stable retrieval of the target statistical variables, Istat has built an ad hoc metadata database collecting laws, regulations and other technical aspects regarding social security contributions (Banca Dati Normativa su retribuzione e contribuzione - BDN). To be effective the BDN has to be updated almost every quarter, which requires very hard and time-consuming work.

A specific treatment has been implemented to assure quality in the aggregation and translation of the monthly declarations. The DM10 form, used by employers to declare compulsory contributions, is an extremely detailed grid partitioned into sections that record information about the firm, the number of employees by type of employment, the wage bill, the paid days, the social contributions, credit terms and tax reliefs. Information is identified by four-digit codes, used to classify the employment relationships, the working time, the contributions due or rebates to be received, the wage peculiarities, etc. Some statistical information is also requested, but firms do not always fill it in because it is unnecessary for administrative purposes. An accurate translation of the administrative data requires a deep knowledge of these codes, so a list of codes has been set up and keeping it up to date is a fundamental task in the process: each quarter the legislation has to be examined to take note of newly introduced codes and of those that have been eliminated.

Before the statistical variables are retrieved, the DM10 forms go through a complex preliminary check procedure aimed at investigating and possibly correcting errors in codes, record duplications, inconsistencies with current legislation, etc. The raw data originally amount to about 10 million records per month, because each form is split into several records. In this step, the information referring to each DM10 is summarised in a single record, for a total of about 1.3 million records per month.

Finally, the retrieval of the target variables is carried out in two steps: the calculation of employment and wages, and the computation of social contributions. First, the appropriate codes identifying the number of employees and the related wage bill have to be selected unambiguously, avoiding possible duplications. Second, other labour costs have to be calculated. As the DM10 codes refer to total social contributions (employer + employee), the employee social contributions have to be removed from this total by applying the appropriate legal rates, because they are already included in the gross wages. In addition, other labour costs such as employer injury insurance premiums (INAIL) and the termination of employment relationship allowance (TFR), which are not recorded in the DM10, have to be estimated using other sources of information.
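The two retrieval steps can be sketched as follows, under loudly stated assumptions: the codes `E001`, `W001`, `C001` and `C002` are placeholders (the real four-digit DM10 codes are not listed in the paper), records are simplified to `(form_id, code, amount)` tuples, an exact repeat of a record is treated as a duplication, and a single flat `employee_rate` stands in for the legal rates held in the BDN.

```python
from collections import defaultdict

# Hypothetical code groups: the real four-digit DM10 codes are not given in the text.
EMPLOYEE_CODES = {"E001"}
WAGE_CODES = {"W001"}
CONTRIBUTION_CODES = {"C001", "C002"}  # totals (employer + employee); rebates are negative


def summarise_forms(records):
    """Collapse (form_id, code, amount) records into one summary record per form."""
    forms = defaultdict(lambda: {"employees": 0, "wages": 0.0, "total_contrib": 0.0})
    seen = set()
    for record in records:
        if record in seen:  # assumption: an exact repeat is a duplicated record
            continue
        seen.add(record)
        form_id, code, amount = record
        if code in EMPLOYEE_CODES:
            forms[form_id]["employees"] += int(amount)
        elif code in WAGE_CODES:
            forms[form_id]["wages"] += amount
        elif code in CONTRIBUTION_CODES:
            forms[form_id]["total_contrib"] += amount
    return dict(forms)


def employer_contributions(form, employee_rate):
    """Remove the employee share, which is already included in gross wages."""
    return form["total_contrib"] - employee_rate * form["wages"]
```

In practice the rate would depend on the type of employment and on the legislation in force in the reference month, which is why the code list and the rates have to be maintained together.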

6. The measurement errors treatment

After the complicated retrieval of the statistical variables, the following steps of the Oros process implement checks at different levels in order to detect possible anomalous values and correct them. In particular, a more traditional check procedure is carried out at a monthly micro level (Eurostat, 2007). Considering the high number of observations, this micro editing is based on very selective criteria. The selection relies on a weight, assigned to each of the 1.3 million units, representing the probability of an error in the target variables. Units are checked through some known functional relations among the analysed variables, aimed at evaluating both cross-sectional and longitudinal consistency using the information on the previous month. The main rules on which the editing procedure is based are:
• a positive wage bill must correspond to a positive number of employees, and often to a particular rate of social contributions;
• the number of employees recorded in the current month should not differ significantly from that of the previous month;
• the gross per capita wages, or the per capita paid days, should have similar and acceptable amounts over the analysed period;
• the rate of social contributions on gross wages should fall within an expected range, etc.

In the past, the largest values identified by the procedure as measurement errors were corrected automatically, but in recent years experience has suggested avoiding these automatic corrections because of the specific nature of the administrative data. The most anomalous values are therefore selected according to established cut-off thresholds, analysed interactively and, if necessary, corrected. The number of edits performed is not high overall, but sometimes even a single omitted correction can have a substantial impact on the final estimates. To give an idea, in the 3rd quarter of 2007 an erroneous value of a single firm's number of employees would have produced a year-on-year change of 0.8% in section G (wholesale and retail trade; repair of motor vehicles and motorcycles) instead of the correct change of 3.0%.
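The four editing rules listed above can be read as a set of flags raised on each unit and then reviewed interactively. The sketch below is an illustration under assumptions: the thresholds (`max_emp_change`, `wage_range`, `rate_range`) and the record keys are invented for the example and are not the cut-offs used in Oros, and the error-probability weight used for the selection is not modelled.

```python
def edit_flags(cur, prev=None, max_emp_change=0.5,
               wage_range=(200.0, 20000.0), rate_range=(0.0, 0.6)):
    """Return the list of editing rules violated by a unit in the current month.

    cur, prev: dicts with hypothetical keys 'employees', 'wages', 'contributions'.
    All thresholds are illustrative assumptions.
    """
    flags = []
    # Rule 1: a positive wage bill needs a positive number of employees.
    if cur["wages"] > 0 and cur["employees"] <= 0:
        flags.append("wages_without_employment")
    # Rule 2: employment should not jump with respect to the previous month.
    if prev is not None and prev["employees"] > 0:
        change = abs(cur["employees"] - prev["employees"]) / prev["employees"]
        if change > max_emp_change:
            flags.append("employment_jump")
    # Rule 3: per capita gross wages should lie in an acceptable range.
    if cur["employees"] > 0:
        per_capita = cur["wages"] / cur["employees"]
        if not wage_range[0] <= per_capita <= wage_range[1]:
            flags.append("per_capita_wage_out_of_range")
    # Rule 4: the contribution rate on gross wages should lie in an expected range.
    if cur["wages"] > 0:
        rate = cur["contributions"] / cur["wages"]
        if not rate_range[0] <= rate <= rate_range[1]:
            flags.append("contribution_rate_out_of_range")
    return flags
```

A flag only marks a unit for interactive analysis; consistent with the choice described in the text, nothing is corrected automatically.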

The peculiarities of the administrative data used also have a considerable impact on the distributions of gross wages and other labour costs, making the identification of the cut-off thresholds particularly problematic.
• The distribution of per capita gross wages, for example, shows that, besides the usual right tail, the INPS data also have a significant left tail where a high number of units with very low per capita wages are concentrated. Normally these observations would be considered erroneous, but in this case they correctly represent economic phenomena (for example, firms with very few employees who all receive only supplementary earnings from the employer). In this left tail it is very hard to identify wrong figures, and the resulting risk is an asymmetrical correction of errors.
• The distribution of other labour costs may show negative values, because of social contribution rebates.
These aspects must be taken into consideration both to calculate correct check indicators and to single out all possible wrong data.

7. The non-responses imputation

Unlike traditional surveys, where the list of non-respondents in each reference period is available as the difference between the sample units and the set of respondents, in the case of Oros no list of the units that should send the DM10 form is available. At the time scheduled for acquiring the provisional population, some DM10 forms may be missing because of delays attributable to the firms or flaws in the administrative system. These latecomer units are considered unit non-responses. Evidence shows that they represent 2-5% of units, have a missing-at-random (MAR) nature, and do not significantly affect the Oros estimates of changes in wages and other labour costs.

Temporary employment agencies are an exception because of their great relevance in terms of employment: about one hundred units cover 3% of total employees and 20% of section K (Real estate, renting and business activities), where INPS classifies them all. These firms are extremely large (about 1,500 employees on average) and subject to frequent changes. The absence of even a few of these units from the provisional population may affect the estimates of the target variables, so some kind of treatment is needed when missing data are suspected. Because no alternative sources are available for these units (they are not included in the target population of the LES), unit non-responses are adjusted through imputation.

The first and most critical task is to single out the non-responses among the absent units. Because of the AR over-coverage problem due to unreported cancellations, the list of non-respondents is known with certainty only when all the DM10 forms are available. To identify the units to be imputed, the absence of a monthly declaration due to inactivity (possibly seasonal) must be distinguished from a real non-response caused by a delay in delivering the declaration. Without any further administrative information qualifying the absence, the activity state must be predicted by analysing the units' behaviour in terms of presence/absence over a pre-determined span of time, with the help of some auxiliary information.

First, a list of reference units is built. It includes all the units that are active according to the AR information and that have delivered the DM10 in at least one of the months of the reference quarter or of the previous four quarters. A set of quarters before the reference one is used because of the tested hypothesis that the quarters closest to that of estimation are informative about the latter: evidence has shown that the probability that a position which is a latecomer in one quarter is also a latecomer in a neighbouring quarter is low. Given at least one presence in the considered pattern, a unit non-response is defined when the absence is not systematic over the considered period (a systematic absence is instead treated as suspected inactivity due to seasonality). Furthermore, considering the extreme dynamism of these units, before imputing it is necessary to check for absences due to firm changes, such as mergers or split-ups. To this end, it is advisable to follow the employment flows among all units.

A deterministic approach is used to impute, and different rules are adopted depending on the characteristics of the variables to be imputed. The imputation of the employment and wages variables is mainly based on the longitudinal information available on each missing unit. First, suitable values for the two variables (generically Y) are selected from the closest quarter (t-j) in which the currently missing unit (i) was a respondent. Second, these values are updated using panel information drawn from the current respondents (r). In particular, the average ($\dot{y}$) of the changes calculated on the respondents between t and the quarter from which the information has been selected (t-j) is used:

$\hat{Y}_{it} = Y_{i,t-j} \, (1 + \dot{y}_{t,t-j}), \quad j = 1, \ldots, 4$   [1]

where $\dot{y}_{t,t-j} = \mathrm{Me}(Y_{rt} / Y_{r,t-j} - 1)$.

The reconstruction of the other labour costs variable is based on the multiplication of the estimated wages ($\hat{R}$) by an average contribution rate ($\hat{o}$) (other labour costs on wages) calculated on the information available from the respondents in the current quarter:

$\hat{O}_{it} = \hat{R}_{it} \, \hat{o}_{t}$   [2]

where $\hat{o}_{t} = \mathrm{Me}(O_{rt} / R_{rt})$.

The adjustment for non-responses in the temporary employment agencies implies an increase of about 1-3% in the number of employees in this specific sub-population. The effect on per capita wages and total labour cost is smaller (up to 0.5% in some quarters), given the limited variability of these variables among the units. Overall, the revision error of the target variable estimates for these units has decreased significantly over time. This result is the consequence of an increasing knowledge of the behaviour of this particular group of firms, which helps to decide more clearly, in cases of uncertainty, whether or not to impute.
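The imputation rules [1] and [2] above can be illustrated with a minimal Python sketch. It is not the Oros production code: the function names, the dictionary layout (quarter index -> unit -> value) and the reading of Me(.) as a median are assumptions made only for this example.

```python
from statistics import median


def median_change(resp_now, resp_then):
    """Me(Y_rt / Y_r,t-j - 1) over units responding in both quarters.

    Assumption: Me(.) is read here as the median.
    """
    common = [u for u in resp_now if u in resp_then and resp_then[u] > 0]
    if not common:
        raise ValueError("no respondents observed in both quarters")
    return median(resp_now[u] / resp_then[u] - 1 for u in common)


def impute_level(panel, unit, t, max_lag=4):
    """Rule [1]: take the closest past value of the missing unit and update it.

    panel: {quarter index: {unit: value}} (hypothetical layout).
    Returns None when the unit never responded in the previous max_lag quarters.
    """
    for j in range(1, max_lag + 1):
        past = panel.get(t - j, {})
        if unit in past:
            change = median_change(panel[t], past)
            return past[unit] * (1 + change)
    return None


def impute_other_costs(wages_hat, other_costs_now, wages_now):
    """Rule [2]: estimated wages times the contribution rate of current respondents."""
    ratios = [
        other_costs_now[u] / wages_now[u]
        for u in wages_now
        if u in other_costs_now and wages_now[u] > 0
    ]
    return wages_hat * median(ratios)


# Illustrative figures only: unit "C" responded in quarter 0 but not in quarter 1.
wages = {0: {"A": 100.0, "B": 200.0, "C": 50.0}, 1: {"A": 110.0, "B": 210.0}}
other_costs_q1 = {"A": 33.0, "B": 84.0}

wages_c = impute_level(wages, "C", t=1)
other_c = impute_other_costs(wages_c, other_costs_q1, wages[1])
print(round(wages_c, 4), round(other_c, 4))
```

The loop over j stops at the closest quarter in which the unit responded, so the change applied always refers to the same pair of quarters as the value being updated.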

8. The large enterprises and the integration with survey data

The quality of the Oros indicators is strictly related to the treatment of large enterprises. These firms, with 500 or more employees, have a considerable influence on the estimates: in the Italian non-agricultural sector they amount to about one thousand units employing 2 million workers, and so represent more than 20% of total employees. When the Oros releases began, the preliminary estimates were based on an administrative non-random sample that was incomplete, particularly because of the undercoverage of this sub-population of large firms. Moreover, some of the Italian large firms were not registered in the INPS data because they paid social contributions to other institutes. The final choice to guarantee the quality of the statistics produced was to integrate the administrative data with the data of the Monthly Large Enterprises Survey (LES) produced by Istat. After 2004, when the non-random sample became an extremely large provisional population, the INPS administrative sources could guarantee the coverage of almost all firms in the private sectors, but for the estimation of large firms the use of LES data was nonetheless considered preferable. This statistical source provides higher data quality, mainly thanks to the direct contact with these large firms. Enterprise recalling is frequent in case of non-response or suspected measurement errors, and it also allows a more rapid and efficient management of the changes these firms are frequently subject to (e.g. mergers, split-ups, acquisitions). Besides, for a few of them, administrative data still have coverage problems because different institutes exist for the payment of social contributions.

The combination of INPS and LES data, carried out at micro level, involves two main aspects and a specific procedure for checks. First, variables harmonised with those produced from administrative data must be derived. Specific operations are necessary to select the wage and other labour cost components to be included in the Oros target variables. Since the LES data are monthly, the quarterly values must also be computed.
Second, a check and editing procedure is needed to correctly single out the LES enterprises among the INPS data, so that their economic values can be replaced with those drawn from LES. Starting from the list of firms belonging to the survey, a complementary list of INPS firms must be built, avoiding omissions and duplications. The only key variable common to the two sources is the fiscal code, but mismatches arise because it may be affected by formal errors or updated at different times. Moreover, the changes these firms are subject to, such as mergers, take-overs, hive-offs and split-ups, have to be taken into consideration. These events imply that a population does not consist of exactly the same units in different reference periods, and problems arise when the changes are recorded at different times and according to different rules in the two sources (Istat, 2006). Apart from mismatches, a matching fiscal code is not always enough to guarantee that the records in the two registers represent the same enterprise. In this case a significantly different number of employees is used as a signal of possible problems. On average, units accounting for 12% of the total employees surveyed by the LES have to be manually checked and joined to the corresponding INPS firms using auxiliary information. One such item is the firm's name, whose use is complicated because it is not standardized in the two sources. Other information to check the quality of the two lists is drawn from external sources such as the Italian Business Register historical database, which contains chronological information on the enterprises' activities and changes over time.
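The matching step described above can be sketched as follows. This is only an illustration under stated assumptions: the normalisation of the fiscal code (removing blanks, upper-casing), the 20% relative tolerance on employees and all names are hypothetical, since the paper does not specify the rules actually used.

```python
def normalise(code):
    """Hypothetical clean-up of formal errors: drop blanks, upper-case."""
    return "".join(code.split()).upper()


def link_by_fiscal_code(les, inps, rel_tol=0.2):
    """Match LES firms to INPS firms on the fiscal code.

    les, inps: {fiscal code: number of employees} (assumed layout).
    rel_tol:   assumed threshold above which a difference in employees
               sends the pair to manual checking.
    Returns three lists of normalised codes: matched, to_check, unmatched.
    """
    inps_index = {normalise(code): emp for code, emp in inps.items()}
    matched, to_check, unmatched = [], [], []
    for code, les_emp in les.items():
        key = normalise(code)
        if key not in inps_index:
            unmatched.append(key)  # candidate for linkage by name or register history
            continue
        inps_emp = inps_index[key]
        gap = abs(les_emp - inps_emp) / max(les_emp, inps_emp, 1)
        if gap > rel_tol:
            to_check.append(key)  # same code, but possibly not the same enterprise
        else:
            matched.append(key)
    return matched, to_check, unmatched


# Illustrative codes and employee counts only.
les_firms = {"01234567890": 1000, " 0987654321a": 600, "11111111111": 800}
inps_firms = {"01234567890": 990, "0987654321A": 250, "22222222222": 700}
print(link_by_fiscal_code(les_firms, inps_firms))
```

Pairs in the second list share a fiscal code but differ markedly in size, which is the signal mentioned in the text; codes in the third list would need auxiliary information such as the firm's name.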

Some large enterprises, not included in the LES panel because they are persistent non-respondents, are estimated using INPS data (footnote 4). Since these firms are influential because of their average size, it must first be ensured that the changes they undergo do not involve any LES enterprise. Then, it is important to evaluate the potential impact on the indicators of a change in one or more of these firms from one quarter to another (such as a modification of the economic activity). These checks are mainly supported by the BR-ASIA through a record linkage procedure based on the fiscal code as matching variable.

9. Macro data validation

Once the indicators have been produced, macro data are submitted to further quality controls to identify possible anomalous values that may significantly affect the released series. This is a key step in the Oros quality strategy because the difficulties in using and translating administrative data make residual errors possible, in spite of the several previous checks. Since changes in contribution legislation with an impact on macro data are frequent, irregular but acceptable trends due to economic or legal factors must be distinguished as far as possible from anomalies due, for example, to an erroneous update of the metadata database or to outliers and errors not identified and corrected in the micro data editing step. These controls, mainly based on the analytic inspection of the time series at sub-population detail, are carried out through statistical measures that have to respect pre-defined acceptance thresholds. In order to extend the checks to a more disaggregated level and to fully consider the time series information, which can be affected by seasonal patterns, noise or special events, an automatic detection of outliers is also performed. This analysis is based on TERROR, an application of the software TRAMO-SEATS (Caporello and Maravall, 2002), which detects suspected errors in the last observations by comparing them with their forecasts estimated through REG-ARIMA models. This procedure is very rapid and can handle a very large number of series in a few seconds.

The final indicators are also evaluated using figures drawn from other Istat statistical sources. Taking into account the differences in definitions, Oros indicators are compared to the LES and quarterly National Accounts estimates on wages and total labour cost. Some evaluations of wage bargaining effects are also possible by comparing Oros estimates to the indices of wages according to collective agreements (contractual wages). Furthermore, certain known relationships between variables whose coherence must always be guaranteed, such as the ratio of other labour costs to wages, are examined in depth. If the anomalous values that emerged in the macro data checks hide errors, a drill-down to micro data is required despite very tight time constraints (2-3 days). A set of further ad hoc checks on micro data helps to understand the origin of the problem. Finally, if necessary, the errors are corrected at micro level in order to guarantee the coherence between macro and micro data.
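The idea of comparing the last observation of a series with its forecast can be sketched in a few lines of Python. This is not TERROR and does not fit a REG-ARIMA model: as a deliberately simple stand-in, the forecast is the value of the same quarter one year earlier plus the median year-on-year change, and the threshold k is a hypothetical acceptance threshold.

```python
from statistics import median


def flag_last_observation(series, season=4, k=3.0):
    """Compare the last value of a quarterly series with a seasonal-naive forecast.

    Stand-in for a model-based check (assumption: NOT a REG-ARIMA forecast).
    Returns (forecast, score, flagged).
    """
    if len(series) < 2 * season + 1:
        raise ValueError("series too short for the check")
    # Year-on-year differences remove a stable seasonal pattern.
    diffs = [series[i] - series[i - season] for i in range(season, len(series))]
    past = diffs[:-1]
    centre = median(past)
    # Mean absolute deviation around the median as a simple spread measure.
    spread = sum(abs(d - centre) for d in past) / len(past) or 1e-9
    forecast = series[-1 - season] + centre
    score = (series[-1] - forecast) / spread
    return forecast, score, abs(score) > k


# Illustrative index values only (four years of quarterly data).
regular = [100, 104, 98, 110, 102, 107, 99, 112, 104, 109, 102, 115, 106, 112, 104, 117]
shocked = regular[:-1] + [130]
print(flag_last_observation(regular))
print(flag_last_observation(shocked))
```

A flagged series would then be inspected, and only if the anomaly hides an error would the drill-down to micro data be started.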

10. Final remarks

The use of administrative data characterized by frequent changes in social security regulations, in a survey with very timely releases, implies a greater dependence on the data supplier and a higher risk of data quality problems than in a traditional survey. For this reason a quality strategy covering the whole production process has been implemented. Given the administrative nature of the input data, quality issues have been addressed only ex-post, through a systematic sequence of checks and editing steps designed to intercept possible errors. These quality challenges have been faced without any previous similar experience in the use of very disaggregated administrative data for the production of short-term indicators. Ad hoc and innovative solutions have been developed to deal with the most critical quality aspects. The potential quality risks related to frequent changes in administrative regulations have been managed through the quarterly updating of documentation (metadata database) and procedures. Quality considerations have also motivated the integration with survey micro data on large enterprises, which are used instead of administrative information. For these influential firms, survey data have been judged preferable, mainly thanks to their direct control by survey experts. Overall, the Oros production process turns out to be reliable in terms of both effectiveness (quality of the entire process) and efficiency (relatively limited time and low use of human and economic resources), without adding any further burden on enterprises.

Footnote 4: These firms provide work for about 1% of total employment.

References

Baldi, C., Ceccato, F., Cimino, E., Congia, M.C., Pacini, S., Rapiti, F., and Tuzi, D. (2004), “Use of Administrative Data to produce Short Term Statistics on Employment, Wages and Labour Cost”, Essays, 15, Istat, Rome.

Caporello, G., and Maravall, A. (2002), “A tool for quality control of time series data. Program TERROR”, Bank of Spain.

Congia, M.C., and Rapiti, F. (2008), “Quality reporting in a short-term business survey based on administrative data”, paper presented at the European Conference on Quality in Official Statistics, 8-11 July 2008, Rome, Italy.

Eurostat (2003), “Quality assessment of administrative data for statistical purposes”, Doc. Eurostat/A4/Quality/03/item6, available on the web site: http://epp.eurostat.ec.europa.eu/pls/portal/docs/PAGE/PGP_DS_QUALITY/TAB47141301/DEFINITION_2.PDF.

Eurostat (2007), “Recommended Practices for Editing and Imputation in Cross-Sectional Business Surveys”, available on the web site: http://edimbus.istat.it/dokeos/document/document.php?openDir=%2FRPM_EDIMBUS.

Istat (2006), “Rilevazione mensile sull’occupazione, gli orari di lavoro e le retribuzioni nelle grandi imprese”, Metodi e Norme, 29, Rome.

Wallgren, A., and Wallgren, B. (2007), Register-based Statistics. Administrative Data for Statistical Purposes, West Sussex: Wiley.