Dissimilarities between categorical variables

Universidade de Brasília Departamento de Economia

Série Textos para Discussão

Dissimilarities between categorical variables Rodrigo Peñaloza

Texto No 351 Brasília, Janeiro de 2011

Department of Economics Working Paper 329 University of Brasilia, November 2006

Dissimilarities between categorical variables Rodrigo Peñaloza Department of Economics, University of Brasilia Center of Research in Economics and Finance (CIEF), University of Brasilia

January 17, 2011

Abstract

When we deal with two categorical variables, Gini's index of distributional transvariation is a most useful tool to measure the distributional difference between them. By means of a modified transvariation, which we call the Euclidean transvariation, we show that our measure of transvariation can be decomposed into the difference of two terms: a measure of categorical separation and the average variability. This decomposition allows us to view the dissimilarities between two categorical variables through three different lenses: distribution, modality, and variability. Finally, by defining a simpler measure of statistical dependence based on Pearson's $\chi^2$, we prove a relationship between statistical dependence and the transvariational impact of one variable onto another.

JEL classification: C49. AMS classification: 62H17, 62H20, 91C05.
Key words: nominal variables, transvariation, degree of dependence.

1 Introduction

In order to compare categorical variables, we have to rely on analytical tools based on distributions or vectors of relative frequencies. Our intention is to build an analytical technique that is simple enough to be applied to ordinary data for categorical variables, yet rich enough to give us a deeper understanding of how to measure the dissimilarities between them.

Correspondence to: Rodrigo Peñaloza. University of Brasilia (UnB), Department of Economics & Center of Research in Economics and Finance (CIEF), FACE. Brasilia, DF 70910-900, BRAZIL. Phone: +(55)(61)3307-3947. Fax: +(55)(61)3307-3438. E-mail: [email protected].


One of the standard analytical tools is Gini's concept of transvariation [Gini (1916)] and its multivariate version developed by Dagum (1968). We do not want to improve upon what has already been done; rather, based on simple ideas worked out by Gini (1916), Georgescu-Roegen (1967), and Souza (1977), we build a theory of dissimilarities somewhat from scratch. Our ideas in essence have some intersection with others, but our approach differs from them [see, for instance, Calò (2006)] in that we employ no tool heavier than basic arithmetic, so the techniques we offer can be used by anyone interested not only in economic issues, but in sociometric and demographic issues as well. Montanari and Monari (sine anno) give a good survey of Gini's contributions to the issue of distributional change, so we refer to them for a review; a full literature review makes little sense here, for we base ourselves only on the simplest definition of transvariation and on the notion of nominal variance as it appears in Souza (1977), Georgescu-Roegen (1968), and Breiman et alii (1984).

We start from a modified version of Gini's transvariation. Our version is a normalized Euclidean norm for distributions. Though it seems just another metric, its "squareness" allows us to proceed to a decomposition. Perhaps unsurprisingly, we obtain a decomposition whose components have interesting statistical meaning. One of them is interpreted as a measure of categorical separation and the other as a measure of average variability. We then define a simple measure of statistical dependence between categorical variables that resembles Pearson's $\chi^2$, but one which is more geometrically appealing. In the end we show that both modified definitions are intimately linked.

In section 2 we present the two sides of the dissimilarity concept: distributional and categorical. In section 3 we present what we have called the geometric $\chi^2$ measure of dependence, and relate it directly to a kind of conditional transvariation, which we have called the transvariational impact of one variable onto another. Section 4 concludes the paper.

2 Distributional and categorical dissimilarities

In this section we present the preliminaries of our method. We first define the Euclidean transvariation, which is a squared version of Gini's transvariation, and then we define a concept of categorical separation. We define all the measures in a normalized way, so they are dimensionless numbers in the unit interval.

Let $X$ be a nominal variable comprised of $m$ categories or modalities $x_1, \ldots, x_m$. We can always think of a nominal variable $X$ as a question and of its categories $x_1, \ldots, x_m$ as the possible answers. Let $f = (f_1, \ldots, f_m)$ be the vector of relative frequencies of answers. The pair $\langle X, f \rangle_m$ is called a categorical model of size $m$.

2.1 Euclidean transvariation

Let $\langle X, f \rangle_m$ and $\langle Y, g \rangle_m$ be two categorical models of the same size $m$. Gini's transvariation index between $\langle X, f \rangle_m$ and $\langle Y, g \rangle_m$ is given by $\delta_1(X,Y) = \frac{1}{2}\sum_{i=1}^{m} |f_i - g_i|$. It is straightforward to see that $0 \leq \delta_1 \leq 1$. If $\delta_1(X,Y) = 0$, then both distributions are equal, that is, $f_i = g_i$ for every $i = 1, \ldots, m$, and dissimilarity is nil. If $\delta_1(X,Y) = 1$, then the distributions $f$ and $g$ are concentrated on two disjoint categorical subsets, so dissimilarity is maximal.

A most common case occurs when $\delta_1$ is applied to two groups of individuals and each group is asked the same question $X$. In this case, we still consider two categorical models $\langle X, f \rangle_m$ and $\langle Y, g \rangle_m$, where $Y$ now represents the fact that the question is asked to a different group. In addition, $f = (f_1, \ldots, f_m)$ and $g = (g_1, \ldots, g_m)$ are the corresponding vectors of relative frequencies. In this case, $\delta_1(X,Y)$ measures the distributional difference between the two groups regarding question $X$.

Gini's transvariation index is just a normalized metric defined on the $(m-1)$-dimensional simplex. Based on this obvious and well-known remark, we propose a slight change in the metric, but one which will link us to a modified version of Pearson's $\chi^2$ measure of statistical dependence between two categorical variables in a joint distribution table.

Definition: Let $\langle X, f \rangle_m$ and $\langle Y, g \rangle_m$ be two categorical models of size $m$. The Euclidean transvariation between $X$ and $Y$ is given by $\delta_2(X,Y) = \sqrt{\frac{1}{2}\sum_{i=1}^{m} (f_i - g_i)^2}$.

Notice that $\delta_2$ is also a normalized metric on the $(m-1)$-dimensional simplex $\Delta^{m-1} = \{\varphi \in \mathbb{R}^m_+ : \sum_{i=1}^{m} \varphi_i = 1\}$. In addition, $\delta_2(X,Y) \leq \delta_1(X,Y)$ for any $\langle X, f \rangle_m$ and $\langle Y, g \rangle_m$.
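As a quick numerical illustration (ours, not part of the original paper), both indices can be computed directly from the vectors of relative frequencies; the function names below are our own:

```python
# Sketch of Gini's transvariation (delta_1) and the Euclidean
# transvariation (delta_2) for two relative-frequency vectors
# defined over the same m categories. Function names are ours.
import math

def gini_transvariation(f, g):
    # delta_1(X, Y) = (1/2) * sum_i |f_i - g_i|
    return 0.5 * sum(abs(fi - gi) for fi, gi in zip(f, g))

def euclidean_transvariation(f, g):
    # delta_2(X, Y) = sqrt((1/2) * sum_i (f_i - g_i)^2)
    return math.sqrt(0.5 * sum((fi - gi) ** 2 for fi, gi in zip(f, g)))

f = (0.5, 0.5, 0.0, 0.0)
g = (0.0, 0.0, 0.5, 0.5)
print(gini_transvariation(f, g))       # 1.0: disjoint supports, maximal dissimilarity
print(euclidean_transvariation(f, g))  # ~0.7071, i.e. sqrt(1/2)
```

Note that $\delta_2 \leq \delta_1$ holds on this example, as the text asserts in general.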

2.2 Categorical separation

One of the advantages of our modified transvariation is its decomposition into two terms, one related to the average variability of the distributions, the other to an index that measures the degree of categorical separation between $\langle X, f \rangle_m$ and $\langle Y, g \rangle_m$. But first we have to define the variability measures we want to work with.

Definition: Let $\langle X, f \rangle_m$ be a categorical model. The nominal variance of $X$ is given by $\sigma^2(X) = \frac{m}{m-1}\left(1 - \sum_{i=1}^{m} f_i^2\right)$.

When the distribution $f$ is concentrated on a single category, variability is nil, so $\sigma^2(X) = 0$. If $f$ is uniformly distributed among the categories, then variability is maximal, so $\sigma^2(X) = 1$. The coefficient $\frac{m}{m-1}$ normalizes the variability into the unit interval. This is Souza's measure of variability for distributions [Souza (1977)] and it is related to what Breiman et alii (1984) called the impurity function.
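The two extreme cases described above can be checked in a few lines (our own sketch; the function name is ours):

```python
# Checks the two extreme cases of the nominal variance
# sigma^2(X) = (m / (m - 1)) * (1 - sum_i f_i^2):
# a distribution concentrated on one category gives 0,
# the uniform distribution gives 1. Function name is ours.
def nominal_variance(f):
    m = len(f)
    return (m / (m - 1)) * (1.0 - sum(p * p for p in f))

concentrated = (1.0, 0.0, 0.0, 0.0)
uniform = (0.25, 0.25, 0.25, 0.25)
print(nominal_variance(concentrated))        # 0.0
print(round(nominal_variance(uniform), 12))  # 1.0
```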


Definition: Let $\langle X, f \rangle_m$ and $\langle Y, g \rangle_m$ be two categorical models of size $m$. We define the index of categorical separation between $\langle X, f \rangle_m$ and $\langle Y, g \rangle_m$ as $s(X,Y) = 1 - \sum_{i=1}^{m} f_i g_i$.

Let us give some hint as to why it is so called. Consider a categorical variable with modalities $x_1, \ldots, x_m$. Suppose there are two groups of individuals, say groups $X$ and $Y$, such that $X$ has $k$ individuals and $Y$ has $\ell$ individuals. In each group, each individual falls into one, and only one, category. Suppose that in group $X$, $k_1$ individuals fall into category $x_1$, $k_2$ individuals fall into category $x_2$, and so on up to category $x_m$, with $k_m$ individuals. Obviously, $k_1 + k_2 + \cdots + k_m = k$. So the relative frequency $f_i = \frac{k_i}{k}$ is (a consistent estimator of) the probability that a randomly chosen member of $X$ falls into category $x_i$. Analogously, let $\ell_i$ be the number of individuals of group $Y$ that fall into category $x_i$; again, $\ell_1 + \ell_2 + \cdots + \ell_m = \ell$. Thus, the relative frequency $g_i = \frac{\ell_i}{\ell}$ is the probability that a member of $Y$ falls into category $x_i$. Therefore, $1 - g_i$ is the probability that a member of group $Y$ will not fall into category $x_i$.

Suppose that an individual from group $X$ is randomly chosen. Then $f_i(1 - g_i)$ is the probability that she falls into category $x_i$ while a randomly chosen member of group $Y$ does not fall into that same category $x_i$. In other words, it is the probability that the member of $Y$ differs, in the categorical sense, from the member of $X$ who chose $x_i$. Therefore, $\sum_{i=1}^{m} f_i(1 - g_i)$ is the probability that group $X$ is characterized by some category different from the category that characterizes the member of $Y$. Now, since $\sum_{i=1}^{m} f_i = 1$, it is easy to see that $s(X,Y) = \sum_{i=1}^{m} f_i(1 - g_i)$. An analogous reasoning can be applied to $\sum_{i=1}^{m} g_i(1 - f_i)$.

Definition: Let $\langle X, f \rangle_m$ and $\langle Y, g \rangle_m$ be two categorical models of size $m$ and let $\sigma^2(X)$ and $\sigma^2(Y)$ be their respective nominal variances. The mean variability of $X$ and $Y$ is the average $\bar{\sigma}^2(X,Y) = \frac{\sigma^2(X) + \sigma^2(Y)}{2}$.

Theorem: $\delta_2^2(X,Y) = s(X,Y) - \dfrac{m-1}{m}\,\bar{\sigma}^2(X,Y)$.

Proof: We have that $\sum_{i=1}^{m}(f_i - g_i)^2 = \sum_{i=1}^{m} f_i^2 + \sum_{i=1}^{m} g_i^2 - 2\sum_{i=1}^{m} f_i g_i$. Since $\sigma^2(X) = \frac{m}{m-1}\left(1 - \sum_{i=1}^{m} f_i^2\right)$, we have $\sum_{i=1}^{m} f_i^2 = 1 - \left(\frac{m-1}{m}\right)\sigma^2(X)$. Similarly, $\sum_{i=1}^{m} g_i^2 = 1 - \left(\frac{m-1}{m}\right)\sigma^2(Y)$. Then, after a simple calculation, we get $\sum_{i=1}^{m}(f_i - g_i)^2 = 2\left(1 - \sum_{i=1}^{m} f_i g_i\right) - \left(\frac{m-1}{m}\right)\left(\sigma^2(X) + \sigma^2(Y)\right)$. Since $\delta_2^2(X,Y) = \frac{1}{2}\sum_{i=1}^{m}(f_i - g_i)^2$, we then have:

$$\delta_2^2(X,Y) = \left(1 - \sum_{i=1}^{m} f_i g_i\right) - \left(\frac{m-1}{m}\right)\left(\frac{\sigma^2(X) + \sigma^2(Y)}{2}\right) = s(X,Y) - \left(\frac{m-1}{m}\right)\bar{\sigma}^2(X,Y),$$

as was to be shown.
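The decomposition can also be verified numerically. The following is a minimal sketch of our own (arbitrary example distributions; function names are ours):

```python
# Numerical check of the decomposition
# delta_2^2(X, Y) = s(X, Y) - ((m - 1)/m) * mean_variability(X, Y),
# where s is the index of categorical separation and the mean
# variability averages the two nominal variances. Names are ours.

def nominal_variance(f):
    m = len(f)
    return (m / (m - 1)) * (1.0 - sum(p * p for p in f))

def separation(f, g):
    # s(X, Y) = 1 - sum_i f_i * g_i
    return 1.0 - sum(fi * gi for fi, gi in zip(f, g))

def delta2_squared(f, g):
    # delta_2^2(X, Y) = (1/2) * sum_i (f_i - g_i)^2
    return 0.5 * sum((fi - gi) ** 2 for fi, gi in zip(f, g))

f = (0.4, 0.3, 0.2, 0.1)
g = (0.1, 0.2, 0.3, 0.4)
m = len(f)
lhs = delta2_squared(f, g)
mean_var = 0.5 * (nominal_variance(f) + nominal_variance(g))
rhs = separation(f, g) - ((m - 1) / m) * mean_var
print(abs(lhs - rhs) < 1e-12)  # True: both sides agree
```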

Our measure of transvariation tells us something about the differences between two categorical variables. We can think of the average variability of $X$ and $Y$ as an internal variability of the pair $(X,Y)$ around which the categorical variances of $X$ and $Y$ position themselves. If we consider $(X,Y)$ as a categorical bidimensional variable, then:

$$\underbrace{s(X,Y)}_{\text{categorical differentiation}} \;=\; \underbrace{\delta_2^2(X,Y)}_{\text{distributional difference}} \;+\; \underbrace{\left(\tfrac{m-1}{m}\right)\bar{\sigma}^2(X,Y)}_{\text{internal variability}}$$

To illustrate the additional information about $(X,Y)$ provided by the measure of categorical separation, consider the pairs of categorical models $\{\langle X, f \rangle_4, \langle Y, g \rangle_4\}$ and $\{\langle X_o, f_o \rangle_4, \langle Y_o, g_o \rangle_4\}$, where:

$$f = \left(\tfrac{1}{2}, \tfrac{1}{2}, 0, 0\right), \quad g = \left(0, 0, \tfrac{1}{2}, \tfrac{1}{2}\right), \quad f_o = \left(\tfrac{1}{2}, \tfrac{1}{2}, 0, 0\right), \quad g_o = \left(0, \tfrac{3-\sqrt{5}}{4}, \tfrac{1+\sqrt{5}}{4}, 0\right).$$

Then $\delta_2(X,Y) = \delta_2(X_o,Y_o) = \frac{\sqrt{2}}{2} = 0.7071$. Both pairs of categorical variables thus present the same Euclidean transvariation, that is, they present the same distributional change. In this sense, the Euclidean transvariation is not enough to significantly evaluate the differences between them. However, a quick look at the distributions internal to the pairs shows that, in the first pair, the variables are concentrated on disjoint sets of categories, so they are categorically fully separated, while in the second pair the variables mingle a little in the categorical sense. This perceptible difference is captured by the measure of categorical separation. Indeed, $s(X,Y) = 1$ and $s(X_o,Y_o) = \frac{5+\sqrt{5}}{8} = 0.9045$. Notice that Gini's transvariation indexes are $\delta_1(X,Y) = 1$ and $\delta_1(X_o,Y_o) = 0.8090$, hence it could be argued that the usual transvariation captures the relevant differences. In a sense it does, but it does not actually distinguish between distributional change and categorical idiosyncratic characteristics. In addition, it is not possible to relate it to a measure of statistical dependence based on Pearson's $\chi^2$, as it is for the Euclidean transvariation.

To make things practical, suppose, for instance, that the system $\{\langle X, f \rangle_4, \langle Y, g \rangle_4\}$ describes the type of residence in two neighbouring cities at some point in time.¹ The system $\{\langle X_o, f_o \rangle_4, \langle Y_o, g_o \rangle_4\}$ refers to the same thing, but at some other point in time. Then $\delta_2$ alone would not capture the fact that something changed between the two periods, so we need the index of categorical separation as well.

¹ Since $m = 4$, there are four types of residence, described by the categories $x_1$, $x_2$, $x_3$, and $x_4$. The labels $X$ and $Y$ refer to the cities, not to the question asked.
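The figures in this illustration can be reproduced directly; a minimal sketch of our own (function names are ours, values as in the text):

```python
# Reproduces the illustration: both pairs have the same Euclidean
# transvariation (0.7071) but different categorical separation
# (1 versus 0.9045). Function names are ours.
import math

def euclidean_transvariation(f, g):
    return math.sqrt(0.5 * sum((fi - gi) ** 2 for fi, gi in zip(f, g)))

def separation(f, g):
    return 1.0 - sum(fi * gi for fi, gi in zip(f, g))

s5 = math.sqrt(5)
f  = (0.5, 0.5, 0.0, 0.0)
g  = (0.0, 0.0, 0.5, 0.5)
fo = (0.5, 0.5, 0.0, 0.0)
go = (0.0, (3 - s5) / 4, (1 + s5) / 4, 0.0)

print(round(euclidean_transvariation(f, g), 4))    # 0.7071
print(round(euclidean_transvariation(fo, go), 4))  # 0.7071
print(separation(f, g))                            # 1.0
print(round(separation(fo, go), 4))                # 0.9045
```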

3 Geometric $\chi^2$

Here we take a simpler measure of statistical dependence, similar to Pearson's $\chi^2$. We define it as the geometric $\chi^2$ and denote it by $\chi^2_{geom}$. We use the concept of Euclidean transvariation to create a measure of expected conditional distributional impact of each modality of one categorical variable onto another categorical variable. Then we use the length of the vector comprised of all these modal conditional distributional impacts to define an overall measure of impact, called the transvariational impact of one variable onto another. Finally, we show that our geometric $\chi^2$ measure of dependence is, up to a constant, equal to the transvariational impact, no matter which direction of impact we take, be it from $X$ to $Y$ or from $Y$ to $X$.

Consider two categorical variables $X$ and $Y$ with respective categories $\{x_1, \ldots, x_n\}$ and $\{y_1, \ldots, y_m\}$ and let $\Phi = [f_{ij}]_{n \times m}$ be their table of joint distribution. For each $i = 1, \ldots, n$, let $f_{i\cdot} = \sum_{j=1}^{m} f_{ij}$ be the marginal frequency of the event $[X = x_i]$. Then $f_X = (f_{1\cdot}, \ldots, f_{n\cdot})$ is the marginal distribution of $X$. In the same fashion, define, for each $j = 1, \ldots, m$, $f_{\cdot j} = \sum_{i=1}^{n} f_{ij}$, which is the marginal frequency of the event $[Y = y_j]$, and let $f_Y = (f_{\cdot 1}, \ldots, f_{\cdot m})$ be the marginal distribution of $Y$. Now consider the categorical models $\langle X, f_X \rangle_n$ and $\langle Y, f_Y \rangle_m$.

Pearson's $\chi^2$, defined by $\chi^2 = N \sum_{i=1}^{n}\sum_{j=1}^{m} f_{i\cdot} f_{\cdot j} \left(\frac{f_{i\cdot} f_{\cdot j} - f_{ij}}{f_{i\cdot} f_{\cdot j}}\right)^2$, measures the degree of statistical dependence between $X$ and $Y$. Here $N$ is the number of sampled individuals to whom the questions $X$ and $Y$ were asked.² If the variables are independent, then $\chi^2 = 0$. This happens whenever the joint frequency $f_{ij}$ of $[X = x_i, Y = y_j]$ equals the product of the marginal frequencies, that is, $f_{ij} = f_{i\cdot} f_{\cdot j}$. The greater the $\chi^2$, the greater the degree of dependence between them. Pearson's $\chi^2$ is a weighted average of the quadratic percentage deviations from independence. We propose a simpler measure, based not on percentage deviations, but just on the Euclidean metric.

Let us symbolize the two categorical models $\langle X, f_X \rangle_n$ and $\langle Y, f_Y \rangle_m$ linked by the table $\Phi = [f_{ij}]_{n \times m}$ of joint distribution by the triple $\langle X, Y, \Phi \rangle_{n \times m}$ and call it a categorical 2-dimensional $(n \times m)$-model.

Definition: Let $\langle X, Y, \Phi \rangle_{n \times m}$ be a categorical 2-dimensional $(n \times m)$-model. We define the geometric $\chi^2$ measure of statistical dependence by $\chi^2_{geom} = \sum_{i=1}^{n}\sum_{j=1}^{m} \left(f_{i\cdot} f_{\cdot j} - f_{ij}\right)^2$.

Our geometric $\chi^2_{geom}$ measure has indeed a more geometric flavour³ and is easier to handle than Pearson's.

Let $f_Y = (f_{\cdot 1}, \ldots, f_{\cdot m})$ be the marginal distribution of $Y$. Then we can define the conditional probability of $[Y = y_j]$ given $[X = x_i]$ by $\Pr[Y = y_j \mid X = x_i] = \frac{f_{ij}}{f_{i\cdot}}$, where we obviously identified the true probability with a consistent estimator given by the ratio of suitable relative frequencies. Denote the conditional distribution of $Y$ given $X = x_i$ by $f_{Y|x_i} = \left(\frac{f_{i1}}{f_{i\cdot}}, \frac{f_{i2}}{f_{i\cdot}}, \ldots, \frac{f_{im}}{f_{i\cdot}}\right)$. We will measure the distributional impact of the occurrence of $[X = x_i]$ over $Y$ by the transvariation $\delta_2(Y, Y|x_i)$. Since $[X = x_i]$ occurs with probability $\Pr[X = x_i] = f_{i\cdot}$, we have that $\varepsilon_i(Y, Y|x_i) = f_{i\cdot}\,\delta_2(Y, Y|x_i)$ is the expected impact that the occurrence of the event $[X = x_i]$ has on the distribution of the categorical variable $Y$. The overall impact of $X$ over $Y$ can be measured by the conditional transvariation $\delta_2(Y, Y|X) = \sum_{i=1}^{n} f_{i\cdot}\,\delta_2(Y, Y|x_i)$. Another way to do it is by considering the vector of local impacts, $\varepsilon(Y, Y|X) = (\varepsilon_1(Y, Y|x_1), \ldots, \varepsilon_n(Y, Y|x_n))$, and by taking its squared length $\|\varepsilon(Y, Y|X)\|^2_{n;2}$, where $\|z\|_{n;2} = \sqrt{\sum_{i=1}^{n} z_i^2}$, for any $z = (z_1, \ldots, z_n) \in \mathbb{R}^n$, is just the Euclidean norm on $\mathbb{R}^n$.

Definition: The transvariational squared impact (or simply the transvariational impact) of $X$ on $Y$ is defined by the squared length of the vector of expected conditional perturbations, that is, $T_2(X \to Y) = \|\varepsilon(Y, Y|X)\|^2_{n;2}$.

Theorem: Let $\langle X, Y, \Phi \rangle_{n \times m}$ be a categorical 2-dimensional $(n \times m)$-model. Then $\frac{\chi^2_{geom}}{2} = T_2(X \to Y) = T_2(Y \to X)$.

² Suppose a total of $N$ individuals were sampled. Let $N_{ij}$ be the number of individuals who fell into category $x_i$ of $X$ and $y_j$ of $Y$. Then $f_{ij} = \frac{N_{ij}}{N}$.

³ Given any $(n \times m)$-matrix $A = [a_{ij}]_{n \times m}$, let $\|A\|_{(n \times m);2} = \left(\sum_{i=1}^{n}\sum_{j=1}^{m} a_{ij}^2\right)^{1/2}$ be its norm. If $g_{ij} = f_{i\cdot} f_{\cdot j} - f_{ij}$, set $\Gamma = [g_{ij}]_{n \times m}$. Then $\chi^2_{geom} = \|\Gamma\|^2_{(n \times m);2} = \mathrm{tr}(\Gamma'\Gamma)$, that is, the trace of $\Gamma'\Gamma$. Let $\mathrm{Sp}(\Gamma'\Gamma) = \{\lambda_1, \lambda_2, \ldots, \lambda_m\}$ be the spectrum of $\Gamma'\Gamma$, the set of its eigenvalues, ordered downwards. Then $\chi^2_{geom} = \sum_{j=1}^{m} \lambda_j$, so the whole range of multivariate statistical techniques opens up, such as factor analysis, principal component analysis, etc. We have not gone this way, but it is a promising path though.

Proof: By our decomposition, $\delta_2^2(Y, Y|x_i) = s(Y, Y|x_i) - \frac{m-1}{m}\,\bar{\sigma}^2(Y, Y|x_i)$. Notice that $\bar{\sigma}^2(Y, Y|x_i)$ satisfies $2\bar{\sigma}^2(Y, Y|x_i) = \sigma^2(Y) + \sigma^2(Y|x_i)$, that is, $2\bar{\sigma}^2(Y, Y|x_i) = \frac{m}{m-1}\left(1 - \sum_{j=1}^{m} f_{\cdot j}^2\right) + \frac{m}{m-1}\left[1 - \sum_{j=1}^{m} \left(\frac{f_{ij}}{f_{i\cdot}}\right)^2\right]$. Then:

$$\bar{\sigma}^2(Y, Y|x_i) = \frac{m}{m-1}\cdot\frac{1}{2}\left[2 - \sum_{j=1}^{m} f_{\cdot j}^2 - \sum_{j=1}^{m} \left(\frac{f_{ij}}{f_{i\cdot}}\right)^2\right]$$

On the other hand, recall that the index of categorical separation between $Y$ and $Y|x_i$ is:

$$s(Y, Y|x_i) = 1 - \sum_{j=1}^{m} f_{\cdot j}\,\frac{f_{ij}}{f_{i\cdot}}$$

Therefore the squared transvariation between $Y$ and the conditional variable $Y|x_i$ is given by $\delta_2^2(Y, Y|x_i) = s(Y, Y|x_i) - \frac{m-1}{m}\,\bar{\sigma}^2(Y, Y|x_i)$, that is:

$$\delta_2^2(Y, Y|x_i) = 1 - \sum_{j=1}^{m} f_{\cdot j}\,\frac{f_{ij}}{f_{i\cdot}} - \frac{1}{2}\left[2 - \sum_{j=1}^{m} f_{\cdot j}^2 - \sum_{j=1}^{m} \left(\frac{f_{ij}}{f_{i\cdot}}\right)^2\right]$$

Simple calculations imply that $\delta_2^2(Y, Y|x_i) = \frac{1}{2}\sum_{j=1}^{m} f_{\cdot j}^2 + \frac{1}{2}\sum_{j=1}^{m} \left(\frac{f_{ij}}{f_{i\cdot}}\right)^2 - \sum_{j=1}^{m} f_{\cdot j}\,\frac{f_{ij}}{f_{i\cdot}}$. By factoring out $\frac{1}{2f_{i\cdot}^2}$, we have $\delta_2^2(Y, Y|x_i) = \frac{1}{2f_{i\cdot}^2}\left[\sum_{j=1}^{m} f_{i\cdot}^2 f_{\cdot j}^2 + \sum_{j=1}^{m} f_{ij}^2 - 2\sum_{j=1}^{m} f_{i\cdot} f_{\cdot j} f_{ij}\right]$. Therefore, $2 f_{i\cdot}^2\,\delta_2^2(Y, Y|x_i) = \sum_{j=1}^{m} \left(f_{i\cdot} f_{\cdot j} - f_{ij}\right)^2$. Recall that $\chi^2_{geom} = \sum_{i=1}^{n}\sum_{j=1}^{m} \left(f_{i\cdot} f_{\cdot j} - f_{ij}\right)^2$, hence $\chi^2_{geom} = 2\sum_{i=1}^{n} f_{i\cdot}^2\,\delta_2^2(Y, Y|x_i)$. Therefore $\frac{\chi^2_{geom}}{2} = \sum_{i=1}^{n} f_{i\cdot}^2\,\delta_2^2(Y, Y|x_i) = \sum_{i=1}^{n} \varepsilon_i^2(Y, Y|x_i)$, so that $\frac{\chi^2_{geom}}{2} = \|\varepsilon(Y, Y|X)\|^2_{n;2}$, which means that:

$$\frac{\chi^2_{geom}}{2} = T_2(X \to Y)$$

By a symmetric argument (interchange the roles of $i$ and $j$, of $X$ and $Y$, and of $n$ and $m$, and follow the same steps), it is easy to see that $\sum_{i=1}^{n} f_{i\cdot}^2\,\delta_2^2(f_Y, f_{Y|x_i}) = \sum_{j=1}^{m} f_{\cdot j}^2\,\delta_2^2(X, X|y_j)$, that is, $\frac{\chi^2_{geom}}{2} = T_2(Y \to X)$. This proves the theorem.

If the occurrences of the events $[X = x_i]$, for all $i = 1, 2, \ldots, n$, change the distributional profile of the occurrences of the events $[Y = y_j]$, for all $j = 1, 2, \ldots, m$, only a little or not at all, then the dependence of $Y$ on $X$ is nil or negligible. For the very same reason, the closer the vectors $f_Y$ and $f_{Y|x_i}$ are, for all $i = 1, 2, \ldots, n$, the smaller the geometric $\chi^2_{geom}$ measure.
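The theorem can be verified numerically on a small joint-frequency table. The following is our own sketch (the table values are an arbitrary example; function names are ours):

```python
# Sketch of the geometric chi-square and the transvariational impact
# for a joint-frequency table Phi = [f_ij]; verifies the identity
# chi2_geom / 2 = T2(X -> Y) = T2(Y -> X). Names are ours.

table = [
    [0.20, 0.10, 0.10],
    [0.05, 0.25, 0.30],
]
n, m = len(table), len(table[0])
fX = [sum(row) for row in table]                             # marginal of X
fY = [sum(table[i][j] for i in range(n)) for j in range(m)]  # marginal of Y

# chi2_geom = sum_ij (f_i. * f_.j - f_ij)^2
chi2_geom = sum((fX[i] * fY[j] - table[i][j]) ** 2
                for i in range(n) for j in range(m))

def delta2_squared(f, g):
    # squared Euclidean transvariation
    return 0.5 * sum((fi - gi) ** 2 for fi, gi in zip(f, g))

# T2(X -> Y): squared length of the vector of expected conditional
# impacts eps_i = f_i. * delta_2(Y, Y | x_i)
T2_XY = sum(fX[i] ** 2 * delta2_squared(fY, [table[i][j] / fX[i] for j in range(m)])
            for i in range(n))
T2_YX = sum(fY[j] ** 2 * delta2_squared(fX, [table[i][j] / fY[j] for i in range(n)])
            for j in range(m))

print(abs(chi2_geom / 2 - T2_XY) < 1e-12)  # True
print(abs(chi2_geom / 2 - T2_YX) < 1e-12)  # True
```

The identity holds for any table, since $2 f_{i\cdot}^2\,\delta_2^2(Y, Y|x_i) = \sum_j (f_{i\cdot} f_{\cdot j} - f_{ij})^2$ term by term.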

4 Conclusion

Categorical variables, such as nominal and ordinal variables, appear frequently in Economics, Sociology, Political Science, Demography, and a variety of other areas. In Regional Economics, for instance, we might want to compare the locational distribution of industry and commerce between two different regions or between two different periods for the same region. In Political Science, we might want to compare the geographical performance of political parties or the public opinion regarding some public policy across different regions. In cases such as these, we do not want to use inferential techniques, but rather exploratory tools. The distribution of modalities of a categorical variable is practically the only thing we have to work with. In this sense, Gini's index of distributional transvariation has become one of the most useful methods to measure distributional differences. Guided by this obvious fact, we defined a sibling measure of Gini's, the measure of Euclidean transvariation, and decomposed it into the difference of two terms: a measure of categorical separation and the average variability. We were thus able to study the dissimilarities between two categorical variables through three different perspectives: distribution, modality, and variability. We defined the concept of the geometric $\chi^2$, which is a measure of statistical dependence, and related it to what we have called the transvariational impact of one variable onto the other.

References

1. Breiman, L., J. Friedman, R. Olshen, and C. Stone (1984): Classification and Regression Trees. Wadsworth International Group, Belmont, California.
2. Calò, D. (2006): "On a transvariation based measure of group separability". Journal of Classification 23: 143-167.
3. Dagum, C. (1968): "Multivariate transvariation theory among several distributions and its economic applications". Econometric Research Program, Princeton University, Research Memorandum 100.
4. Georgescu-Roegen, N. (1971): The Entropy Law and the Economic Process. Harvard University Press, Cambridge, MA.
5. Gini, C. (1916): "Il concetto di transvariazione e le sue prime applicazioni". Giornale degli Economisti e Rivista di Statistica, in C. Gini (1959): 1-55.
6. Gini, C. (1959): Transvariazione. Libreria Goliardica, Roma.
7. Montanari, A. and P. Monari (sine anno): Gini's contribution to multivariate statistical analysis. Working paper, Dipartimento di Scienze Statistiche, Università di Bologna.
8. Souza, J. (1974): "Dualidade e concentração". Annals of the II Meeting ANPEC, Cedeplar-UFMG.
9. Souza, J. (1977): Estatística Econômica e Social. Editora Campus, Rio de Janeiro.
10. Souza, J. (1988): Teorias de Correlação e Associação Estatística, in Métodos Estatísticos nas Ciências Psicossociais, volume VII, Thesaurus, Brasília.
