An alternative procedure for imputing missing data based on principal components analysis Giovanni Di Franco

Quality & Quantity International Journal of Methodology ISSN 0033-5177 Qual Quant DOI 10.1007/s11135-013-9826-4



© Springer Science+Business Media Dordrecht 2013

Abstract This work entailed tackling the significant problem of missing data, which was solved by identifying a new substitution procedure following an empirical approach based on the analysis of the information contained in the entire set of data collected. This procedure offers a number of advantages over the other techniques commonly mentioned in the statistical–methodological literature.

Keywords Missing data substitution · Principal components analysis · Random missing data · Systematic missing data

1 Defining the problem

Recently, I took part in a research project conducted by the College of Europe, Bruges, and funded by the European Commission (Guerrieri and Bentivegna 2011; Di Franco 2011b). Among other things, our task was to build a European Digital Inclusion Index capable of capturing the development of digital inclusion in the 27 countries of the European Union (EU) between 2004 and 2009. This work entailed tackling the significant problem of missing data, which was solved by identifying a new substitution procedure following an empirical approach based on the analysis of the information contained in the entire set of data collected. This procedure offers a number of advantages over the other techniques commonly mentioned in the statistical–methodological literature. Before presenting the new procedure in detail, we believe it is useful to devote the following section to the problem of missing data in social research and the procedures generally adopted to address it.

G. Di Franco (B) Department of Social Sciences, University of Rome “La Sapienza”, Via Salaria 113, Rome, Italy e-mail: [email protected]


2 Missing data and their treatment

Missing data are a typical problem in social research, both when the unit of analysis is the individual and the data are collected through surveys, and when the unit is territorial and the data are drawn from national and/or international statistical sources. In addition to the atomistic assumption (Marradi 1993; Di Franco 2001), data matrices also require completeness. To meet this assumption, information must be available for all the cases in the matrix on all the properties/variables for which information was originally collected.

Missing data may depend on a variety of factors, which are either random or systematic in nature. Among missing data due to random factors, we can distinguish two instances: (a) completely random, and (b) partially random. In the former instance, the missing data depend neither on the variable investigated nor on any other variable in the data matrix. For example, data regarding the variable ‘income’ would be missing for reasons of pure chance if one could demonstrate that: (a) the average income of individuals who do not declare their income is the same as that of those who do; (b) each of the other variables in the matrix presents the same descriptive statistics for the persons who do not declare their incomes as for those who do. In other words, if the missing data were due entirely to random factors, the set of individuals with missing data would be a random sub-sample of the overall research sample.

Partially random missing data likewise do not depend on the variable investigated, but are influenced by one or more other variables in the data matrix. For example, missing data for the variable ‘income’ might be regarded as partially random if one could demonstrate that the probability of their occurrence depended, for instance, on the category assumed by the variable ‘occupational status’. Thus, the ‘self-employed’ would be less inclined to declare their incomes than ‘employees’, and so on.
When missing data are due to systematic factors, their occurrence is related to the values of the variable examined. For instance, individuals with a high income are less likely to declare it than those in the medium to medium-low bracket. Several empirical studies have shown that in social research, especially when the unit of analysis is the individual, missing data are often due to systematic factors. This means that the set of cases with missing values cannot be considered a random sub-sample of the overall research sample. Clearly, in this case the presence of missing data significantly distorts the results. For this reason, it is necessary to adopt, as far as possible, strategies that reduce their impact.

In the absence of in-depth analyses identifying the factors behind the presence of missing data, it is impossible to establish whether they are due to random, partially random or systematic factors or, as may also occur, to a combination of these. This is an important problem because the majority of imputation techniques have been designed on the assumption that missing data are due to purely random factors. Given that this assumption is nearly always unrealistic, one needs to be aware that violating it inevitably leads to distortions, of unknown size but nonetheless relevant, in the results of the imputation and, consequently, in the overall outcome of the research.

In general, when missing data represent a small proportion of cases (between 1 and 3–4 % of the overall sample), it is advisable simply to exclude these cases from the data analysis procedures.
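The distinction between completely random, partially random and systematic missing data can be made concrete with a small simulation. The following Python sketch is purely illustrative: the population, the income figures and the non-response probabilities are hypothetical assumptions, not data from the research. It compares the mean income of those who respond with the true mean under each of the three mechanisms.

```python
import random
import statistics


def simulate(n=20000, seed=0):
    """Build a hypothetical population of (income, self_employed) pairs."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        self_employed = rng.random() < 0.3
        # Assumption: the self-employed earn more on average.
        income = rng.gauss(40 if self_employed else 30, 8)
        rows.append((income, self_employed))
    return rows, rng


def respondent_mean(rows, rng, mechanism):
    """Mean income of the cases that remain observed under a mechanism."""
    kept = []
    for income, self_employed in rows:
        if mechanism == "completely_random":
            p_missing = 0.3                            # same for everybody
        elif mechanism == "partially_random":
            p_missing = 0.6 if self_employed else 0.1  # depends on another variable
        else:                                          # "systematic"
            p_missing = 0.6 if income > 35 else 0.1    # depends on income itself
        if rng.random() >= p_missing:
            kept.append(income)
    return statistics.fmean(kept)


rows, rng = simulate()
true_mean = statistics.fmean([income for income, _ in rows])
mechanisms = ("completely_random", "partially_random", "systematic")
results = {m: respondent_mean(rows, rng, m) for m in mechanisms}

for name, value in results.items():
    print(f"{name:18s} respondents' mean = {value:6.2f}  (true mean = {true_mean:.2f})")
```

Under these assumptions, only the completely random mechanism leaves the respondents' mean close to the true mean; in the other two cases those who declare their income are no longer a random sub-sample of the whole.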
In other cases, when missing data concern categorical and/or ordinal variables of particular interest to the research, or when their number is greater than usual, the codes assigned to the missing data may be treated as labels denoting a new category, the ‘no response’ one. Some analytical techniques, such as frequency tables, contingency tables or multiple-correspondence analysis, can present missing data in dedicated rows or columns. In this way it is possible to explore the socio-demographic characteristics (age, gender, educational level, etc.)1 of those individuals who did not answer questions of particular interest to the research. In some cases the proportion of missing data is so large as to render those variables totally unserviceable;2 this often happens when investigating areas related to people’s private sphere (for example, income, political preferences, voting intentions or socially reprehensible behaviours). To avoid having to exclude these variables from the analyses, one needs to choose a procedure to substitute the missing data.

Three techniques are generally proposed for the treatment of missing data: (1) the elimination of cases for which data are missing; (2) single imputation; (3) multiple imputation. The first technique, called ‘complete case analysis’, simply omits from the analysis all the cases in the data matrix for which data are missing. As already illustrated, because it ignores the systematic differences between complete and incomplete cases, this technique produces unbiased estimates only if the cases deleted are a random sub-sample of the original sample. Furthermore, standard errors will generally be larger because they are computed on a sample smaller than the original one. To minimize such problems, a general agreement has been established whereby if a variable presents more than 5 % missing data, it is not advisable to delete these cases (Little and Rubin 2002).

The other two techniques replace the missing values by applying either single or multiple imputation3 procedures. To impute means to assign a value to the cases for which data are missing. These values may be produced using both collected data and models with implicit or explicit assumptions. In the implicit models, the computations produce valid results only if certain constraints (assumptions) are met.
The risk with this technique is that it may erroneously lead the researcher to consider the final data matrix complete, forgetting that an imputation procedure for the missing data has been carried out. Implicit models include:

– the ‘hot-deck’ technique: missing data for a case are replaced with those from other cases considered ‘similar’; for example, missing data regarding the variable ‘income’ would be replaced with data from another case with similar characteristics with regard to age, gender, residence, working condition, etc.;
– the ‘cold-deck’ technique: the missing data are replaced with values from an external source, for example from previous research on the same topic, from administrative documents, etc.;
– the ‘substitution’ technique: cases for which data are missing are replaced with other cases originally not included in the sample. For example, if a person cannot be reached or refuses to cooperate, one may decide to select another from the same block. This technique is used essentially for ‘complete non-response’ cases (that is, when all the data for a case are missing).4

1 In general, it is always advisable to examine the socio-demographic profiles of cases with missing data very carefully. Furthermore, an anomalous incidence of missing information may indicate the emergence of problems during the data collection (difficulty encountered by the interviewees in understanding the questions, inadequate administration of the interviews by the interviewers, etc.).
2 In Italy this situation is typical of all surveys investigating voting intentions: nearly always the non-responses reach, and at times even surpass, 50 % of the sample.
3 The literature dealing with missing data substitution is ample and rapidly evolving. For further details see, among others, Enders (2010), Molenberghs and Kenward (2007), Chantala and Suchindran (2003), Akritas et al. (2002), Little and Rubin (2002), Allison (2001), Huisman et al. (1998), Little (1997) and Little and Schenker (1994).

In explicit models the substitution of missing values is based instead on statistical criteria making explicit assumptions, as in the following cases:

– substitution of missing data with the mean, median or mode, depending on whether the variable is cardinal, ordinal or categorical. These statistics are computed on the set of valid data (that is, considering only the cases which present data on that variable). The use of central tendency values is legitimate only when the distribution of the variable is symmetrical since, only in this case, they do not alter the distribution of the variable. In all other cases (empirically more frequent), where the distribution of the variable is asymmetrical, recourse to central tendency values creates serious distortions. Moreover, even though this procedure minimizes the error on the single case due to the difference between its actual (unknown) state and the central tendency value, it also reduces that variable’s variance. In fact, all the missing data for the same variable (presumably with different values) are artificially concentrated on a sole central value. For this reason it has been suggested that missing data should be replaced not with one but with a number of different central values;5
– imputation through regression analysis: the missing data are replaced with values obtained through regression analysis. The dependent variable is the single variable presenting missing data, while the independent variable/s is/are other variables showing strong associations with the dependent variable;6
– ‘expectation maximisation’ imputation: the values replacing the missing data are obtained by means of an iterative procedure.
First, preliminary estimates for the missing data are obtained on the basis of initial estimates of the parameters of a model considering only the valid data (that is, excluding the missing data). Second, these estimated values are fed into a new model, and the procedure is repeated until convergence on the maximum-likelihood criterion is reached. The time required to reach convergence depends on the proportion of missing data and the maximum-likelihood function chosen. Therefore, if missing data are numerous, a considerable length of time may be required to reach convergence. Another problem with this procedure concerns the possibility that the maximum estimated may be not a global maximum but a local one. Usually, to avoid the local maximum problem, a number of tests are carried out, changing the initial estimates for the model.

Single imputation techniques are rather simple to apply, but they have important limitations because they systematically under-estimate the variance and the standard errors. In an attempt to overcome these limitations, multiple-imputation techniques have been developed. Multiple imputation is carried out using a random process which reflects the uncertainty of the estimates. Essentially, instead of a single value for each missing datum, N different values are computed (reflecting the uncertainty of the imputation procedure) and inserted into N alternative data matrices. The parameters of interest and their respective standard errors are estimated on each of these N alternative matrices. The estimates thus obtained are then combined across the N sets, making it possible to compute the variance of the imputation as well.7 In multiple imputation any of the aforementioned techniques may be applied. For example, one may repeatedly use imputation by regression, computing N values for the regression parameters. However, in such cases more sophisticated models, such as Markov chain Monte Carlo algorithms, are normally used.

4 When, due to limited economic and/or time resources, it is not possible to apply substitution, the so-called ‘mortality’ of the sample is compensated for by recourse to appropriate weighting techniques used to restore the original sample size. As these operations lead to serious problems, the researcher must apply them with the utmost caution (for a critique of weighting techniques see Di Franco 2010).
5 Accordingly, the entire sample is divided into a number of sub-samples on the basis of socio-demographic variables (usually age and educational level) used as stratifying criteria, and each missing value is replaced by the central value of the class it belongs to. Through this procedure, one seeks to take into account the potential differences between the cases showing missing data across the different classes. For instance, it is possible that the central tendency measures of variables constructed on the basis of some opinion questions vary according to the age or educational level of the respondents. Through this expedient, it should be possible to improve the quality of each of the substituted data and, at the same time, to limit the shrinkage of variance. To avoid the reduction of variance, a random procedure has also been suggested. This may be used with any kind of variable and involves substituting each missing value with a different random value. These random values should be taken from the same socio-demographic class as the case, and should be chosen in such a way that the values do not all have the same probability, but probabilities proportional to the frequencies of the cases which present data for that variable. This procedure leaves unaltered the original central value, the variance, and the distribution of cases both for the complete sample and for the sub-types identified. Clearly, it is not claimed that the value substituted for the missing datum of a single case comes close to its actual state on that variable.
6 In the case of categorical missing data, imputation through regression may be carried out through logistic regression analysis (Di Franco 2011a). For nominal variables, substitution through hot-deck or cold-deck imputation is more appropriate.
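The reduction of variance produced by substitution with a central value, discussed above among the explicit models, can be verified on a toy example. The values below are hypothetical; the sketch assumes only that every missing case receives the mean of the valid cases.

```python
import statistics

# Hypothetical valid values of an asymmetrical variable.
observed = [12.0, 15.0, 19.0, 22.0, 30.0, 41.0, 58.0]
n_missing = 4  # hypothetical number of cases with missing data

# Every missing case is given the mean of the valid cases.
mean_value = statistics.fmean(observed)
imputed = observed + [mean_value] * n_missing

var_observed = statistics.pvariance(observed)
var_imputed = statistics.pvariance(imputed)

# The mean is unchanged and the imputed cases add no squared deviation,
# so the variance shrinks by the factor len(observed) / len(imputed).
shrink = var_imputed / var_observed
print(f"variance before: {var_observed:.1f}, after: {var_imputed:.1f}, ratio: {shrink:.3f}")
```

The shrinkage factor depends only on the proportion of valid cases, which is why the distortion grows with the share of missing data.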
In sum, there is no definitive solution to the problem of missing data substitution, but only a certain number of statistical techniques from which to choose according to: the type of data to be imputed; the proportion of data missing relative to the overall size of the data matrix (a few data missing from a large matrix do not require sophisticated imputation techniques); the significance for the study of the variable or variables presenting missing values (marital status is normally declared by all, while voting intentions usually score a non-response rate of about 50 %); the type and size of the sample; the data-collection technique used; and the data-analysis techniques to be applied. None of the aforementioned imputation techniques is devoid of assumptions, and the results of the imputation should be subjected to numerous controls, considering, on the one hand, the statistical properties of the distribution of the variable/s and, on the other and as a priority, the semantic value of the variable/s with respect to the aims of the research. In any case, substitution of missing data always produces distortions of the variables (reduction of variance), optimistic evaluations of the trustworthiness of the data in the matrix (when they are used to substitute or estimate missing data), and further distortions when relations between variables are analyzed (bivariate and, above all, multivariate analyses). One must conclude, therefore, that the techniques presented need to be applied critically, carefully evaluating the type of variable to be treated and the costs and benefits implied.
In our opinion, when missing data relate to few cases, it is preferable to exclude them from the analysis, as already stated above; when their number is such as to limit the use of particular analytical techniques, it is advisable to investigate the possible causes of non-response, and then to choose a way of substituting them that minimises the distortion which will necessarily be introduced. Whenever possible, it is undoubtedly preferable to eliminate the variable/s featuring a considerable number of missing data rather than introduce too many artificially generated values.

7 The N versions of the complete data matrix are analysed using standard statistical techniques and the results combined using simple rules in order to reach single joint estimates, which formally incorporate the intrinsic uncertainty of the missing data. Therefore, the results of the estimates are averages computed on the N matrices of complete data.
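One widely used set of combination rules of this kind is due to Rubin; the text does not specify which rules are meant, so the following sketch is an assumption. The pooled estimate is the average of the N estimates, while its variance adds the average within-imputation variance to the between-imputation variance inflated by (1 + 1/N). The function name pool and the input figures are hypothetical.

```python
import statistics


def pool(estimates, variances):
    """Combine N estimates and their squared standard errors (Rubin's rules)."""
    n = len(estimates)
    pooled_estimate = statistics.fmean(estimates)
    within = statistics.fmean(variances)       # average sampling variance
    between = statistics.variance(estimates)   # variance across imputations (divisor N - 1)
    total_variance = within + (1 + 1 / n) * between
    return pooled_estimate, total_variance


# Hypothetical results obtained on N = 3 completed data matrices.
estimate, variance = pool([10.0, 12.0, 11.0], [4.0, 4.0, 4.0])
print(f"pooled estimate = {estimate:.2f}, total variance = {variance:.3f}")
```

The between-imputation term is what a single imputation cannot provide, which is why single imputation understates standard errors.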


When the aim of the research is to develop a national or international index capable of providing an evaluation of the positions of various countries on a certain issue, the existence of missing data cannot be tolerated because it would thwart the achievement of the objective. Since the research project in which I took part was intended to construct an index of digital development for the twenty-seven countries of the EU, we designed an alternative procedure for missing data substitution based on principal components analysis (PCA), as we shall illustrate in the next section.

3 An alternative missing data substitution procedure based on PCA

In the aforementioned project (Guerrieri and Bentivegna 2011; Di Franco 2011b) six data matrices were built8 for the twenty-seven countries of the European Union, one for each year from 2004 to 2009. The concept of ‘Digital Inclusion’ was defined on the basis of three fundamental dimensions: (1) Access; (2) Usage; (3) Impact. These three dimensions were in turn broken down into the following sub-dimensions:

– for the first: (a) Network; (b) Affordability; (c) Availability and quality;
– for the second: (a) Autonomy; (b) Intensity; (c) Skills required;
– for the third: (a) Educational area; (b) Employment and labour area; (c) Health and wellness area; (d) Government interaction; (e) Economic area; (f) Cultural, communicative and recreational area.

Subsequently, a number of indicators were selected to represent each of these sub-dimensions. This choice was made considering both the results of numerous pre-analyses and the guidelines published by the OECD (2008).9 See Tables 1, 2 and 3.

In this respect, it is appropriate to quantify the amount of missing data in the various matrices constructed for the years from 2004 to 2009. Tables 1, 2 and 3 show, in detail, the number of missing data for each indicator in each year considered. The situation is as follows: in 2004, the complete data set for the 23 indicators selected would consist of 621 values (27 × 23 = 621); missing data for the 27 European countries amount to 112, that is, 18 % of the total. In 2005, for the 23 indicators selected, the complete data set would again amount to 621 values; missing data are 83, or 13 % of cases. In 2006 the complete data set for the 24 indicators would amount to 648 values; missing data number 32, or 5 % of the whole. The situation improves greatly from 2007, when the complete data set for the 21 indicators selected would amount to 567 values (27 × 21); missing data are 16, or 2.8 % of cases. In 2008 the complete data set for the 20 indicators would amount to 540 values; the 19 missing data represent 3.5 %. Finally, in 2009 the complete data set for the 20 indicators would again amount to 540 values; missing data are only 7 (1 %).

8 The data used in the research project were drawn from European and international data banks (Eurostat, The World Bank, ITU, etc.). Some referred to structural characteristics of the countries; others were drawn from surveys regarding the use of computers as well as access to and use of the Internet (Di Franco 2011b).
9 For further details concerning the method used to construct the index see Di Franco (2011b).

Table 1 Number of missing data for the indicators of the three sub-dimensions of the dimension Access, by year

| Sub-dimension | Indicator | 04 | 05 | 06 | 07 | 08 | 09 |
|---|---|---|---|---|---|---|---|
| Network | Broadband penetration rate (%) [a] | 2 | 2 | 0 | 0 | 0 | 0 |
| Network | International internet bandwidth per inhabitant (bit/s) [b] | 2 | 2 | 3 | 1 | – | – |
| Network | Secure Internet servers (for 1 million people) [b] | 0 | 0 | 0 | 0 | 0 | 0 |
| Affordability | Information and communication technology expenditure per capita (US$) [b] | 6 | 6 | 6 | 6 | 6 | 6 |
| Availability and quality | Internet subscribers (total fixed broadband) per 100 inhabitants [c] | 1 | 0 | 0 | 0 | 0 | 0 |
| Availability and quality | Internet subscribers (total fixed) per 100 inhabitants [c] | 0 | 1 | 3 | 4 | 4 | 0 |
| Availability and quality | Level of internet access of households (%) [a] | 4 | 3 | 0 | 0 | 3 | 0 |
| Availability and quality | Percentage of households using a broadband connection [a] | 7 | 3 | 1 | 0 | 3 | 0 |

Sources: [a] Eurostat; [b] WDI; [c] ITU

Table 2 Number of missing data for the indicators of the three sub-dimensions of the dimension Usage, by year

| Sub-dimension | Indicator | 04 | 05 | 06 | 07 | 08 | 09 |
|---|---|---|---|---|---|---|---|
| Autonomy | Percentage of individuals who accessed internet at home in the last 3 months [a] | 8 | 4 | 0 | 0 | 0 | 0 |
| Intensity | Percentage of individuals who accessed internet, on average, every day or almost every day in the last 3 months [a] | 4 | 3 | 0 | 0 | 0 | 0 |
| Skills | Percentage of individuals who have copied or moved a file or folder [a] | 6 | 1 | 0 | 0 | 0 | 0 |
| Skills | Percentage of individuals who have used basic arithmetic formulae [a] | 5 | 1 | 0 | 0 | 0 | 0 |
| Skills | Percentage of individuals who have connected and installed new devices [a] | – | – | 0 | 0 | 0 | 0 |

Sources: [a] Eurostat

Table 3 Number of missing data for the indicators of the six sub-dimensions of the dimension Impact, by year

| Sub-dimension | Indicator | 04 | 05 | 06 | 07 | 08 | 09 |
|---|---|---|---|---|---|---|---|
| Educational | Percentage of individuals who used internet, in the last 3 months, for formalised educational activities (school, university) [a] | 5 | 4 | 1 | | | |
| Educational | Percentage of individuals who used internet, in the last 3 months, for other educational courses related specifically to employment opportunity [a] | 7 | 4 | 1 | | | |
| Educational | Percentage of individuals who used internet, in the last 3 months, for post educational courses [a] | 8 | 5 | 1 | | | |
| Educational | Percentage of individuals who used internet, in the last 3 months, for training and education [a] | | | | 0 | 0 | 0 |
| Employment and labour | Percentage of persons employed using computers connected to the internet in their normal routine at least once a week [a] | 3 | 6 | 5 | 2 | 3 | 1 |
| Employment and labour | Percentage of individuals who used internet for looking for a job or sending a job application [a] | 6 | 5 | 1 | 0 | 0 | 0 |
| Employment and labour | Percentage of persons employed working part of their time away from enterprise premises and accessing enterprise’s IT system from there [a] | 5 | 7 | 6 | | | |
| Health and wellness | Percentage of individuals who used internet for seeking health information on injury, disease or nutrition [a] | 11 | 4 | 0 | 0 | 0 | 0 |
| Government interaction | Percentage of individuals who used internet, in the last 3 months, for interaction with public authorities [a] | 5 | 6 | 3 | 0 | 0 | 0 |
| Economic area (e-commerce, e-banking) | Percentage of individuals who used internet, in the last 3 months, for internet banking [a] | 5 | 4 | 0 | 0 | 0 | 0 |
| Economic area (e-commerce, e-banking) | Percentage of individuals who used internet, in the last 3 months, for selling goods and services (e.g. via auctions) [a] | 6 | 7 | 1 | 3 | 0 | 0 |
| Economic area (e-commerce, e-banking) | Percentage of individuals who used internet, in the last 3 months, for using services related to travel and accommodation [a] | 6 | 5 | 0 | 0 | 0 | 0 |
| Cultural, communicative and recreational area | Percentage of individuals who used internet, in the last 3 months, for sending/receiving e-mails [a] | 5 | 4 | 0 | 0 | 0 | 0 |
| Cultural, communicative and recreational area | Percentage of individuals who used internet, in the last 3 months, for playing/downloading games and music [a] | 5 | 4 | 0 | 0 | 0 | 0 |
| Cultural, communicative and recreational area | Percentage of individuals who used internet, in the last 3 months, for reading/downloading online newspapers/news magazines [a] | 5 | 5 | 1 | 0 | 0 | 0 |
| Cultural, communicative and recreational area | Percentage of individuals who used internet, in the last 3 months, for listening to web radios/for watching web television [a] | 6 | 5 | 1 | 0 | 0 | 0 |
| Cultural, communicative and recreational area | Percentage of individuals who used internet, in the last 3 months, for downloading software* | | 6 | 1 | 0 | 0 | 0 |
| Cultural, communicative and recreational area | Percentage of individuals who used internet, in the last 3 months, for other communication uses (chat sites, etc.) [a] | 6 | 5 | 0 | 0 | 0 | |

Sources: [a] Eurostat

As stated before, to build the Digital Development Index, it was necessary to produce a complete matrix for each European country for each year considered. We conducted many tests replacing the missing data using some of the imputation techniques presented in section two. The results were deemed unsatisfactory for a number of reasons:

– given the small number (27) of cases considered, the impact of the substitution on the index is considerable. For example, on one indicator, 11 out of the 27 cases presented missing data for the year 2004. Substituting these missing values with the average value for the remaining 16 cases introduced an important distortion, since 41 % of cases would have been concentrated on the average, with an intolerable reduction of the variable’s variance;
– none of the variables with missing data had a symmetrical distribution; on the contrary, because of the considerable differences between the Northern and Southern European countries, they were all strongly asymmetrical;
– the missing data were strongly concentrated in the countries that entered the EU most recently (Romania, Bulgaria, Czech Republic, Lithuania, Latvia, Estonia, Slovenia, Cyprus, Malta, Slovakia). Before their entry, these countries were indeed not required to adapt their statistical surveys to the standard methodology elaborated by Eurostat (the agency which coordinates the national statistics agencies of the EU countries);
– given the considerable differences between the European countries, it was not possible to use the hot-deck technique. For example, even the three Baltic countries, which for geographical and historical reasons might have been considered very similar, presented very different values on many variables selected for this research;
– for all the reasons mentioned above, the random generation of values for the missing data did not represent a satisfactory solution; therefore, it was not advisable to resort to a statistical substitution technique.
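The shares of missing data quoted in this section follow from simple arithmetic: the number of cells in a complete matrix is the number of countries multiplied by the number of indicators. The short sketch below recomputes them from the counts of indicators and missing values reported in the text; the names reported and missing_share are illustrative.

```python
N_COUNTRIES = 27

# year: (indicators selected, missing values), as reported in the text
reported = {
    2004: (23, 112),
    2005: (23, 83),
    2006: (24, 32),
    2007: (21, 16),
    2008: (20, 19),
    2009: (20, 7),
}


def missing_share(n_indicators, n_missing, n_cases=N_COUNTRIES):
    """Return (cells in the complete matrix, percentage of cells missing)."""
    total_cells = n_cases * n_indicators
    return total_cells, 100 * n_missing / total_cells


shares = {year: missing_share(*counts) for year, counts in reported.items()}
for year, (cells, pct) in shares.items():
    print(f"{year}: {cells} cells, {pct:.1f} % missing")

# Share of cases that would sit on the mean if 11 of 27 values were imputed.
concentrated = 100 * 11 / N_COUNTRIES
print(f"cases concentrated on the mean: {concentrated:.0f} %")
```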

Because of this, we designed a procedure which is not subject to the constraints associated with the random generation of missing data, and which does not require a symmetrical distribution of the variables treated. The substitution procedure is based on PCA.10 For each year, analyses were carried out taking into account all the variables with complete data for all the European countries which, as stated above, represented indicators of the various dimensions of digital inclusion. For each year, we extracted the first principal component and saved the component scores of the 27 European countries. The component scores are standardised values obtained through a linear combination of all the variables included in the analysis and represent the best possible synthesis of the data. In other words, we can consider the first principal component for each year as a preliminary index of ‘Digital Inclusion’.

Our procedure was designed so as to respect the following methodological characteristics:

a. the analysis considers only the cardinal variables collected for a limited number of cases (the 27 countries of the EU);
b. the missing data could not be attributed to random causes, but most likely to systematic factors;
c. all the variables considered, even those that were not selected for the construction of the index, were positively and highly correlated, in many cases very highly (greater than .85);
d. to minimize the effects of the distortion that is inevitable with any kind of imputation technique, we proceeded to substitute the missing data only when these did not represent the absolute majority of the data for a variable (in our case the ceiling was 13 out of 27; in practice only one variable had missing data for 11 cases out of 27, while for all the others the number varied between 1 and 8; see Tables 1, 2 and 3);
e. the missing-data imputation technique should not, as far as possible, compress the variance of the variables analyzed;
f. a validation of the technique was carried out before the actual missing data substitution, through empirical tests which yielded very encouraging results (see below).

10 For an in-depth discussion of PCA see Di Franco and Marradi (2003).
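As an illustration of the step described above (extracting the first principal component from the variables with complete data and saving standardised component scores), the following sketch uses plain Python and power iteration on the correlation matrix. It is not the software used in the research, which is not specified here; the three indicators and six cases are hypothetical, and the power-iteration shortcut assumes, as in point c, that all the variables are positively correlated.

```python
import math


def standardize(columns):
    """Turn each variable (a list of values) into z-scores (population SD)."""
    result = []
    for col in columns:
        n = len(col)
        mean = sum(col) / n
        sd = math.sqrt(sum((x - mean) ** 2 for x in col) / n)
        result.append([(x - mean) / sd for x in col])
    return result


def first_component(columns, iterations=500):
    """Return (weights, eigenvalue, standardised scores) of the first component."""
    z = standardize(columns)
    n, p = len(z[0]), len(z)
    # Correlation matrix of the standardised variables.
    r = [[sum(a * b for a, b in zip(zi, zj)) / n for zj in z] for zi in z]
    # Power iteration: repeated multiplication converges to the dominant eigenvector.
    v = [1.0 / math.sqrt(p)] * p
    for _ in range(iterations):
        w = [sum(r[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(v[i] * sum(r[i][j] * v[j] for j in range(p)) for i in range(p))
    # Linear combination of the z-scores, rescaled to unit variance.
    raw = [sum(v[k] * z[k][case] for k in range(p)) for case in range(n)]
    scores = [s / math.sqrt(eigenvalue) for s in raw]
    return v, eigenvalue, scores


# Hypothetical complete indicators for six cases (one list per variable).
indicators = [
    [10, 20, 30, 40, 50, 60],
    [12, 18, 33, 41, 48, 65],
    [5, 9, 14, 22, 24, 31],
]
weights, eigenvalue, scores = first_component(indicators)
share = eigenvalue / len(indicators)  # proportion of total variance reproduced
print([round(s, 2) for s in scores], round(share, 3))
```

The scores have mean zero and unit variance, so each case is positioned in standard deviation units relative to the average of all the cases.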

Table 4 Value of the componential index by country and year

| Country | 2004 | 2005 | 2006 | 2007 | 2008 | 2009 |
|---|---|---|---|---|---|---|
| Austria | 0.56 | 0.56 | 0.47 | 0.41 | 0.28 | 0.60 |
| Belgium | 0.50 | 0.32 | 0.26 | 0.29 | 0.15 | 0.21 |
| Bulgaria | −1.23 | −1.34 | −1.22 | −1.50 | −1.51 | −1.55 |
| Cyprus | −0.45 | −0.45 | −0.48 | −0.85 | −0.88 | −0.99 |
| Czech Republic | −0.47 | −0.45 | −0.52 | −0.56 | −0.68 | −0.69 |
| Germany | 0.89 | 0.99 | 1.14 | 1.00 | 0.88 | 0.84 |
| Denmark | 2.04 | 1.66 | 1.59 | 1.90 | 1.75 | 1.72 |
| Estonia | −0.11 | −0.03 | −0.03 | 0.43 | 0.35 | 0.45 |
| Spain | −0.20 | −0.23 | −0.20 | −0.11 | −0.02 | 0.03 |
| Finland | 0.93 | 0.95 | 0.67 | 1.62 | 1.42 | 1.24 |
| France | 0.28 | 0.37 | 0.45 | −0.18 | 0.42 | 0.57 |
| Greece | −0.84 | −0.91 | −1.00 | −1.21 | −1.34 | −1.25 |
| Hungary | −0.89 | −0.91 | −0.94 | −0.35 | −0.26 | −0.22 |
| Ireland | 0.42 | 0.27 | 0.24 | −0.22 | −0.33 | −0.40 |
| Italy | 0.20 | 0.41 | 0.36 | −0.74 | −0.77 | −0.66 |
| Lithuania | −1.19 | −1.08 | −0.93 | −0.65 | −0.55 | −0.57 |
| Luxembourg | 1.04 | 1.11 | 1.07 | 1.27 | 1.69 | 1.75 |
| Latvia | −1.19 | −1.18 | −1.11 | −0.53 | −0.54 | −0.54 |
| Malta | 0.06 | 0.17 | 0.14 | −0.40 | −0.38 | −0.45 |
| Netherlands | 1.15 | 1.41 | 1.53 | 1.64 | 1.74 | 1.62 |
| Poland | −1.19 | −1.24 | −1.24 | −0.71 | −0.78 | −0.88 |
| Portugal | −0.52 | −0.49 | −0.59 | −0.78 | −0.68 | −0.74 |
| Romania | −1.67 | −1.79 | −1.75 | −1.77 | −1.76 | −1.88 |
| Sweden | 1.98 | 1.76 | 1.83 | 1.60 | 1.42 | 1.31 |
| Slovenia | −0.32 | −0.22 | −0.22 | 0.03 | 0.05 | 0.06 |
| Slovakia | −0.91 | −0.97 | −1.00 | −0.42 | −0.46 | −0.35 |
| United Kingdom | 1.13 | 1.31 | 1.49 | 0.79 | 0.80 | 0.78 |

Having carried out a PCA on each of the six matrices, we had a set of component scores for each country, which reflected that country’s position in comparison to the other 26 for each year between 2004 and 2009. We labelled this new variable ‘Digital Inclusion Score’. If a country was very advanced in its digital development, it scored well above the EU average of zero, measured in standard deviation units. For example, Sweden in 2004 had a score of almost two standard deviations (1.98) above the EU average. In the following years Sweden, although remaining above the EU average, had lower scores: 1.76 in 2005, 1.83 in 2006, 1.60 in 2007, 1.42 in 2008 and 1.31 in 2009 (see Table 4). This was not due to a deterioration of the Swedish situation, but rather reflected a positive development in the condition of the other European countries. Conversely, those countries which were less developed in terms of digital inclusion had negative scores in standard deviation units. This was, for instance, the case of the new member states of the EU (see Table 4).

The number of variables used in the PCA are: ten for 2004; eight for 2005 and 2006; 46 for 2007; 67 for 2008; 32 for 2008 and 2009. The first principal component derived from the analysis for each year accounts for the following shares of total variance: 56.5 % for 2004; 64.6 % for 2005; 62.1 % for 2006; 56 % for 2007; 61.4 % for 2008; 61.7 % for 2009. It is interesting to note that even for the years with the highest number of variables (2007, 2008 and 2009), the variance accounted for by the first component remains decidedly high. This confirms what we have already stated about the existence of strong positive correlations between all the variables in each data matrix.
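As a worked illustration of the variance shares quoted above, and assuming the PCA was carried out on standardised variables (an assumption, since the text does not specify it), the share of total variance accounted for by the first component is its eigenvalue divided by the number of variables. The symbols \lambda_1 (first eigenvalue) and p (number of variables) are introduced here only for this illustration. For 2004, with ten variables and a share of 56.5 %, this gives:

```latex
\frac{\lambda_1}{\sum_{j=1}^{p} \lambda_j} = \frac{\lambda_1}{p} = 0.565
\quad\Longrightarrow\quad
\lambda_1 = 0.565 \times 10 = 5.65 \qquad (2004,\ p = 10)
```

Under the same assumption, a first component with an eigenvalue of this size summarises as much variance as between five and six of the original standardised variables.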

Table 5 Correlations between actual data and replaced data for the 18 variables

| Year | Actual data | Replaced data | r |
|---|---|---|---|
| 2004 | ni99 | m99 | 0.92 |
| 2004 | ni992 | m992 | 0.82 |
| 2004 | ni993 | m993 | 0.91 |
| 2005 | ni99 | m99 | 0.92 |
| 2005 | ni992 | m992 | 0.86 |
| 2005 | ni993 | m993 | 0.94 |
| 2006 | ni99 | m99 | 0.88 |
| 2006 | ni992 | m992 | 0.90 |
| 2006 | ni993 | m993 | 0.93 |
| 2007 | ni99 | m99 | 0.98 |
| 2007 | ni992 | m992 | 0.90 |
| 2007 | nhdsl | mhdsl | 0.86 |
| 2008 | ni99 | m99 | 0.95 |
| 2008 | ni992 | m992 | 0.91 |
| 2008 | niacc | miacc | 0.95 |
| 2009 | ni99 | m99 | 0.95 |
| 2009 | ni992 | m992 | 0.90 |
| 2009 | niacc | miacc | 0.96 |

At this point we have standardised values for each country for each year (see Table 4), providing information on each country’s digital development compared to the other EU countries. The substitution of the missing data for each country for a given year on a given variable was carried out as follows:

$$X_{m_{cy}} = \bar{X} + \sigma_x \times m_{cy}$$

where $X_{m_{cy}}$ = missing datum on variable X for country c in year y; $\bar{X}$ = average of variable X computed only on the valid data; $\sigma_x$ = standard deviation of variable X computed only on the valid data; $m_{cy}$ = component score for country c in year y on the component index.

In other words, this formula calculates the missing value for a country on a given variable for a certain year by introducing a correction factor specific to that country, which reflects its overall state of digital development compared to the other European countries. The other two elements are the variable’s average and standard deviation, which are computed using only those countries that presented actual data on it. For example, we could calculate the value for Bulgaria in 2009 on the variable ‘broadband penetration rate’, for which we have data for all twenty-seven countries, as follows. The average and standard deviation for this variable, eliminating the datum for Bulgaria, are respectively 21 and 8 %. The score for Bulgaria on the component index for 2009 is −1.55 (see Table 4):

$$21 + (8 \times -1.55) = 21 - 12.4 = 8.6 \approx 9$$

The estimated value (9 %) is very close to Bulgaria’s actual score, which is 10 %. To assess the solidity and trustworthiness of our procedure we can generalise this first example. We considered three variables for each year with no missing data. We eliminated the data for these three variables for all twenty-seven countries and replaced them with those computed through our procedure.
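The substitution formula and the Bulgarian example can be sketched in a few lines of Python. This is a minimal illustration, not the code used in the project: the function names are hypothetical, missing values are represented by `None`, and the sample standard deviation (`statistics.stdev`) is an assumption, since the text does not say whether the sample or the population formula was used.

```python
from statistics import mean, stdev

def impute_value(avg, sd, component_score):
    """Estimated value = average + standard deviation * component score."""
    return avg + sd * component_score

def impute_column(values, scores):
    """Fill the None entries of one variable for one year.

    values: one entry per country, None where the datum is missing.
    scores: component scores of the same countries for the same year.
    """
    valid = [v for v in values if v is not None]
    avg, sd = mean(valid), stdev(valid)  # computed on the valid data only
    return [impute_value(avg, sd, m) if v is None else v
            for v, m in zip(values, scores)]

# Worked example from the text: Bulgaria 2009, average 21, sd 8, score -1.55.
bulgaria_2009 = impute_value(21.0, 8.0, -1.55)

# Hypothetical four-country column with one missing datum.
filled = impute_column([10.0, None, 30.0, 20.0], [-1.0, 0.5, 1.0, 0.0])
```

Here `bulgaria_2009` reproduces the figure of about 8.6 obtained above, and in the toy column the missing entry is replaced by 20 + 10 × 0.5 = 25.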
A first appraisal of the results of our technique may be carried out by computing the correlation coefficients between the values estimated through this procedure and the actual data (see Table 5). The three variables selected for 2004, 2005 and 2006 are: (1) ni99 (internet users per 100 inhabitants), (2) ni992 (% of internet broadband subscribers per 100 inhabitants), (3) ni993 (% of internet subscribers through landline per 100 inhabitants). For 2007: (1) ni99, (2) ni992, (3) nhdsl (% of households using a DSL connection). Finally, for 2008 and 2009, the three variables selected are: (1) ni99, (2) ni992, (3) niacc (level of access to internet of households, %).

Table 5 shows the correlation coefficients between the actual data and those estimated through our procedure for the eighteen variables. The results are in general very positive. The average correlation coefficient is 0.92. The highest correlation (r = 0.98, just slightly below 1) was registered for variable ni99 in 2007. Four of the coefficients are between 0.95 and 0.96 (ni99 2008; niacc 2008; ni99 2009; niacc 2009). Nine are between 0.90 and 0.94 (ni99 2004; ni993 2004; ni99 2005; ni993 2005; ni992 2006; ni993 2006; ni992 2007; ni992 2008; ni992 2009). Only four coefficients were below 0.90 (ni992 2004: r = 0.82; ni992 2005: r = 0.86; ni99 2006: r = 0.88; nhdsl 2007: r = 0.86).

A detailed examination was also carried out by comparing all the substituted and actual values for each country and year. For reasons of space, it is not possible to show all these tables here; we present only the one concerning the variable ‘% of Internet broadband subscribers’ between 2004 and 2009 (code: ni992; see Table 6). In all, 486 values were replaced (three variables for 27 countries for 6 years). The average difference between the actual and substitute data is ±3.75.

Table 6 The actual data (Act.) and the replaced data (Repl.) of the variable ‘internet subscribers (total fixed broadband) for 100 inhabitants’ (label: ni992) for countries and for year

| Country | 2004 Act. | 2004 Repl. | 2005 Act. | 2005 Repl. | 2006 Act. | 2006 Repl. | 2007 Act. | 2007 Repl. | 2008 Act. | 2008 Repl. | 2009 Act. | 2009 Repl. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Austria | 7 | 7 | 11 | 11 | 14 | 15 | 18 | 18 | 19 | 22 | 19 | 24 |
| Belgium | 12 | 7 | 16 | 9 | 19 | 14 | 23 | 18 | 26 | 20 | 26 | 21 |
| Bulgaria | 0 | 0 | 0 | 0 | 2 | 2 | 5 | 3 | 8 | 6 | 8 | 6 |
| Cyprus | 1 | 3 | 2 | 5 | 4 | 8 | 7 | 9 | 12 | 11 | 12 | 10 |
| Czech Republic | 0 | 3 | 2 | 5 | 7 | 8 | 11 | 11 | 13 | 13 | 13 | 13 |
| Germany | 5 | 8 | 8 | 13 | 13 | 20 | 18 | 23 | 24 | 27 | 24 | 26 |
| Denmark | 13 | 13 | 19 | 17 | 25 | 24 | 32 | 30 | 36 | 34 | 36 | 34 |
| Estonia | 7 | 4 | 10 | 7 | 13 | 11 | 19 | 19 | 21 | 22 | 21 | 23 |
| Spain | 5 | 4 | 8 | 6 | 12 | 10 | 15 | 14 | 18 | 19 | 18 | 19 |
| Finland | 9 | 9 | 15 | 13 | 22 | 17 | 27 | 28 | 31 | 32 | 31 | 30 |
| France | 6 | 6 | 11 | 10 | 16 | 15 | 21 | 14 | 25 | 23 | 25 | 24 |
| Greece | 0 | 1 | 0 | 2 | 1 | 4 | 4 | 6 | 9 | 7 | 9 | 8 |
| Hungary | 3 | 1 | 4 | 2 | 6 | 4 | 12 | 12 | 14 | 17 | 14 | 17 |
| Ireland | 1 | 6 | 4 | 9 | 8 | 13 | 14 | 13 | 19 | 16 | 19 | 16 |
| Italy | 4 | 5 | 8 | 10 | 12 | 14 | 15 | 9 | 18 | 12 | 18 | 13 |
| Lithuania | 2 | 0 | 4 | 1 | 7 | 4 | 11 | 10 | 15 | 14 | 15 | 14 |
| Luxembourg | 3 | 9 | 8 | 14 | 15 | 20 | 21 | 25 | 28 | 34 | 28 | 34 |
| Latvia | 1 | 0 | 2 | 1 | 3 | 3 | 5 | 11 | 6 | 14 | 6 | 14 |
| Malta | 6 | 5 | 9 | 9 | 13 | 13 | 13 | 12 | 20 | 16 | 20 | 15 |
| Netherlands | 12 | 9 | 20 | 16 | 25 | 23 | 32 | 28 | 34 | 34 | 34 | 33 |
| Poland | 1 | 0 | 2 | 0 | 2 | 2 | 8 | 10 | 9 | 12 | 9 | 11 |
| Portugal | 5 | 2 | 8 | 5 | 11 | 7 | 14 | 9 | 14 | 13 | 14 | 13 |
| Romania | 1 | 0 | 0 | 0 | 2 | 0 | 5 | 1 | 9 | 4 | 9 | 3 |
| Sweden | 12 | 13 | 16 | 18 | 28 | 26 | 21 | 28 | 36 | 32 | 36 | 31 |
| Slovenia | 3 | 3 | 6 | 6 | 10 | 10 | 14 | 15 | 17 | 20 | 17 | 20 |
| Slovakia | 0 | 1 | 1 | 2 | 3 | 4 | 6 | 12 | 9 | 15 | 9 | 16 |
| United Kingdom | 5 | 9 | 10 | 15 | 17 | 23 | 22 | 22 | 26 | 26 | 26 | 26 |
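The validation check described above (correlate the replaced values with the actual ones and average their absolute differences) can be sketched as follows, using the 2004 values of ni992 as printed in Table 6. This is only an illustration: the helper names are hypothetical, and because the table reports rounded integers, the coefficient obtained here is close to, but not identical with, the corresponding entry of Table 5.

```python
from math import sqrt

# ni992, 2004, in the country order of Table 6 (Austria ... United Kingdom).
actual_2004 = [7, 12, 0, 1, 0, 5, 13, 7, 5, 9, 6, 0, 3, 1,
               4, 2, 3, 1, 6, 12, 1, 5, 1, 12, 3, 0, 5]
replaced_2004 = [7, 7, 0, 3, 3, 8, 13, 4, 4, 9, 6, 1, 1, 6,
                 5, 0, 9, 0, 5, 9, 0, 2, 0, 13, 3, 1, 9]

def pearson_r(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def mean_abs_diff(x, y):
    """Average absolute difference between actual and replaced values."""
    return sum(abs(a - b) for a, b in zip(x, y)) / len(x)

r_2004 = pearson_r(actual_2004, replaced_2004)
mad_2004 = mean_abs_diff(actual_2004, replaced_2004)
```

Since each replaced value is a linear function of the country's component score, this coefficient is also the correlation between the actual variable and the component index for that year.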

Table 7 Averages of the differences between the actual data and the replaced data, by variable and year

| Year | v1 | v2 | v3 |
|---|---|---|---|
| 2004 | 5.54 | 2.09 | 3.33 |
| 2005 | 5.19 | 2.39 | 2.80 |
| 2006 | 6.24 | 2.69 | 2.65 |
| 2007 | 3.07 | 3.13 | 5.72 |
| 2008 | 4.37 | 2.81 | 4.91 |
| 2009 | 3.80 | 2.96 | 3.85 |

Table 8 Differences between the 486 actual data and the 486 replaced data in four classes of value, by variable (absolute values and percentage of total)

| Variable | 0–3 | 4–7 | 8–10 | >10 | Total |
|---|---|---|---|---|---|
| v1 | 74 | 55 | 20 | 13 | 162 |
| v2 | 116 | 44 | 2 | 0 | 162 |
| v3 | 91 | 49 | 14 | 8 | 162 |
| Total | 281 | 148 | 36 | 21 | 486 |
| % | 57.82 | 30.45 | 7.41 | 4.32 | 100.00 |

Variables legend (Tables 7 and 8): v1 = ‘Internet users per 100 inhabitants’ (label: ni99) from 2004 to 2009; v2 = ‘Internet subscribers total fixed broadband per 100 inhabitants’ (label: ni992); v3 = ‘Internet subscribers total fixed per 100 inhabitants’ (label: ni993) from 2004 to 2006; v3 = ‘Percentage of households using a DSL connection’ (label: nhdsl) for 2007; v3 = ‘Level of Internet access of households %’ (label: niacc) from 2008 to 2009

Furthermore, we computed the average difference for each variable in each year. These are shown in Table 7. Examining these results, we can observe that the best results across all the years are concentrated on the variable ‘% of Internet broadband subscribers’, with an average difference between 2 and 3 %. The least satisfactory result (an average of 6.24 %) is on the variable ‘Internet users per 100 inhabitants’ in 2006; for all the other years the deviations on this variable oscillated between 3 and 5.5 %. Finally, for the third variable the average difference varies between a minimum of 2.7 % and a maximum of 5.7 %. Table 8 groups all 486 differences between the actual and estimated values into four classes:

(1) differences between 0 and 3 %;
(2) differences between 4 and 7 %;
(3) differences between 8 and 10 %;
(4) differences greater than 10 %.

The examination of Table 8 provides the following results: 57.8 % of the substitute data (281 values out of 486) diverge from the actual data by between 0 and 3 %; 30.5 % by between 4 and 7 %. These two classes represent 88.3 % of the cases; of the remaining 11.7 %, 7.4 % show a difference of between 8 and 10 %, while a mere 4.3 % show a difference of more than 10 % compared to the actual data. The highest difference is 15 % (only one case).
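The four-class breakdown can be sketched as below. It is a hedged illustration only: the function names and the toy list of differences are hypothetical, and rounding each absolute difference to a whole point before classing it is an assumption suggested by the integer class limits. The second call simply recomputes the percentage row from the totals printed in Table 8.

```python
def classify_differences(differences):
    """Count absolute differences, rounded to whole points, in four classes."""
    counts = {"0-3": 0, "4-7": 0, "8-10": 0, ">10": 0}
    for d in differences:
        size = round(abs(d))
        if size <= 3:
            counts["0-3"] += 1
        elif size <= 7:
            counts["4-7"] += 1
        elif size <= 10:
            counts["8-10"] += 1
        else:
            counts[">10"] += 1
    return counts

def shares(counts):
    """Percentage of the total falling in each class, to two decimals."""
    total = sum(counts.values())
    return {label: round(100 * n / total, 2) for label, n in counts.items()}

# Hypothetical differences (actual minus replaced), not taken from the paper.
toy_counts = classify_differences([0, 2.4, -3, 4, 7, -8, 10, 11, 15])

# Column totals of Table 8 (486 replaced values in all).
table8_shares = shares({"0-3": 281, "4-7": 148, "8-10": 36, ">10": 21})
```

The recomputed shares agree with the percentage row of Table 8, and the first two classes together account for roughly 88 % of the replaced values.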

On the basis of the results presented in Tables 5, 6, 7 and 8, we believe we have provided our readers with sufficient elements to assess the goodness of our missing data substitution procedure.

4 Conclusions

The main advantage of the procedure for missing data substitution presented in this paper is that it is founded on empirical data and not on abstract statistical models with unrealistic assumptions and constraints. Furthermore, it is easy to apply, provided that a complete set of variables strongly associated with those presenting missing data is available. In this regard, one might wonder why we did not resort directly to those variables to construct the Digital Development Index, thus avoiding the trouble of having to impute the missing data. The reason was essentially semantic: the variables identified for the construction of the index were those which, in our opinion, best represented the dimensions and subdimensions identifying digital development. The choice of other variables would have implied a different operational definition of the concept that, again in our opinion, would have proved less satisfactory.

While in no way claiming to have found the definitive solution to the issue of missing data substitution, we believe that our contribution may provide a useful addition to the social researchers’ toolbox. Further empirical tests, and possible comparisons with other missing data substitution techniques, may in the future help to consolidate our procedure. Although to date it has been applied only to data related to territorial units of analysis (and only to cardinal variables), we believe it can also be applied to individual data (and therefore to categorical variables), by using multiple correspondence analysis and similar techniques in place of PCA, as long as a sufficient number of variables with complete data and a strong association with the missing data variables are available.

References

Akritas, M.G., Kuha, J., Osgood, D.W.: A nonparametric approach to matched pairs with missing data. Sociol. Methods Res. 30(3), 425–454 (2002)
Allison, P.D.: Missing Data. Quantitative Applications in the Social Sciences. Sage, Thousand Oaks (2001)
Chantala, K., Suchindran, C.: Multiple Imputation for Missing Data. SAS OnlineDoc™, Version 8 (2003)
Di Franco, G.: Tecniche e modelli di analisi multivariata. FrancoAngeli, Milan (2011a)
Di Franco, G.: Appendix: EDDI European digital development index: definition of methodology. In: Guerrieri and Bentivegna (eds.), 220–259 (2011b)
Di Franco, G.: Il campionamento nelle scienze umane. Teoria e pratica. FrancoAngeli, Milan (2010)
Di Franco, G.: EDS: esplorare, descrivere e sintetizzare i dati. Guida pratica all’analisi dei dati nella ricerca sociale. FrancoAngeli, Milan (2001)
Di Franco, G., Marradi, A.: Analisi fattoriale e analisi in componenti principali. Bonanno, Rome/Catania (2003)
Enders, C.K.: Applied Missing Data Analysis. Guilford, London/New York (2010)
Guerrieri, P., Bentivegna, S. (eds.): The Economic Impact of Digital Technologies. Measuring Inclusion and Diffusion in Europe. Edward Elgar, Cheltenham/Northampton (2011)
Molenberghs, G., Kenward, M.G.: Missing Data in Clinical Studies. Wiley, London (2007)
Huisman, M., Van Sondersen, E.: Handling missing data by re-approaching non-respondents. Qual. Quant. 32, 77–91 (1998)
Little, R.J.A.: Biostatistical analysis with missing data. In: Armitage, P., Colton, T. (eds.) Encyclopaedia of Biostatistics. Wiley, London (1997)
Little, R.J.A., Rubin, D.B.: Statistical Analysis with Missing Data. Wiley, Hoboken (2002)
Little, R.J.A., Schenker, N.: Missing data. In: Arminger, G., Clogg, C.C., Sobel, M.E. (eds.) Handbook for Statistical Modeling in the Social and Behavioral Sciences, pp. 39–75. Plenum, New York (1994)
Marradi, A.: Analisi monovariata. FrancoAngeli, Milan (1993)
OECD: Handbook on Constructing Composite Indicators: Methodology and User Guide, ISBN 978-92-6404345-9, © OECD JRC European Commission (2008)
