JEAN-JACQUES DENIMAL (*) – SERGIO CAMIZ (**)

Exact conditional tests for a reciprocal interpretation of hierarchical classifications built on a two-way contingency table

Contents: 1. Introduction. — 2. General framework. - 2.1. Exploratory classification of large data tables. - 2.2. Building hierarchies. - 2.3. Block classifications and hierarchies cut-points. - 2.4. Interpretation of groups and identification of structure. - 2.5. Association measures for categorical data analysis. — 3. A review of interaction indexes. - 3.1. Notations. - 3.2. Bases associated to hierarchies. - 3.3. Interaction indexes. — 4. Statistical interpretations of the indexes. - 4.1. Interpretation of interaction indexes. - 4.2. The multiple hypergeometric model. — 5. Exact conditional Tests. - 5.1. Interpretation of nodes. - 5.2. Cut-points identification. 5.3. Interpretation of the restricted hierarchies. — 6. An example: the Nobel Awards. - 6.1. Choice of cut-points. - 6.2. Interpretation of nodes of H I SUP and H J SUP. — 7. Conclusions. Acknowledgments. References. Summary. Riassunto. Key words.

(*) UFR de Mathématiques Pures et Appliquées, Université des Sciences et Technologies de Lille, F-59655 Villeneuve d'Ascq Cedex, France.
(**) Dipartimento di Matematica "Guido Castelnuovo", Università di Roma "La Sapienza", Piazzale Aldo Moro, 2, I-00185 Roma, Italy.

1. Introduction

In this paper we introduce some indexes, together with the corresponding exact conditional tests of significance, that may be used to study the interactions between pairs of hierarchies built on the rows and columns of a contingency table. In exploratory analyses focusing on a large contingency table, it is usual to classify both rows and columns, in order to reduce the table size and to bring out the main dependence structure among the values of the crossed characters. For this reason, it is common practice to build two hierarchies through exploratory cluster analysis techniques (Gordon, 1999), to cut them at suitable cut-points to get two partitions, and then to look for an interpretation of each group of both partitions. This interpretation is currently based on the values taken by the other character in each group, through the so-called interpretation aids (Lebart et al., 1995), but it could rather be performed by considering the structure of the other character as a whole. Eventually, the rearrangement of the groups of both partitions in the spirit of Bertin (1977) leads to a structured table, where the relations among the characters are made graphically evident by the pattern obtained through the rearrangement.

The resulting block clustering may be synthesized in a contingency table of smaller dimensions, rearranged according to the pattern found and structured by the nodes of the hierarchies above the cut-points. To assess the importance of the structure found in the table, the relations between the nodes of either hierarchy and the nodes and partition groups of the other should be quantified through suitable indexes of agreement, provided that tests of their significance are available. In this way, interpretation aids similar to those available for the study of a single partition become available for this purpose.

In this paper we focus on this issue, introducing exact conditional tests for Denimal's (1997) interaction indexes, which fill the lack of interpretation aids in this frame. In section 2 the problem is set in its general framework, concerning both data classification and the study of contingency tables. In section 3 the indexes are recalled and in section 4 their statistical interpretation is given, so that exact conditional tests can be introduced in section 5.
Then, in section 6, an application to Nobel Awards data is briefly presented as an example.

2. General framework

2.1. Exploratory classification of large data tables

In exploratory studies of large contingency tables crossing the categories of two characters, the usual procedure involves four steps: i) two hierarchies are built, one on the rows and one on the columns; ii) two partitions are created by cutting both hierarchies at suitable cut-points, giving a block clustering of the table; iii) each group of both partitions is interpreted, i.e. the behaviour of the other character's categories in the group is investigated, aiming at revealing what makes the group different from the others; and iv) a dependence structure between the groups of the two partitions is sought, for a complete explanation of the table. Our proposed indexes and exact conditional tests may be used in most of these steps. They contribute to the coherence of the choices and ease a synthetic interpretation of the results, as will be clarified in the following.

2.2. Building hierarchies

We shall not deal with the first step here, since our aim mainly concerns the others. Suffice it to say that we suppose that two hierarchies have already been built on rows and columns through any suitable method (see, among others, Gordon, 1999), since the proposed indexes do not take into account the way the hierarchies were built. Nevertheless, it is worth pointing out that the common concurrent use of correspondence analysis (Benzécri et al., 1982) on the same table may suggest using, also for the classification, a metric analogous to the chi-square one. Our indexes fit naturally in this frame, and we shall show a special relation between them and Ward's (1963) method when it is used with the chi-square distance.

2.3. Block classifications and hierarchies cut-points

The partition of a contingency table may be obtained either directly (Hartigan, 1975; Govaert, 1984) or by crossing two partitions, one for each crossed character. In the latter case, each partition may be obtained either directly (MacQueen, 1967; Diday, 1971) or by cutting a hierarchy. When a hierarchy is cut and no better method is available, the identification of suitable cut-points is usually based on the observation of the sequence of values taken at each step by the objective function optimised during the construction of the hierarchy.

To perform this task more adequately, several methods may be found in the literature (for a review, see Milligan and Cooper, 1985).

The exact conditional tests introduced here may be used instead to identify indirectly suitable cut-points of both hierarchies, based on the identification of the mutual relations among their items.

2.4. Interpretation of groups and identification of structure

Once the first two steps have been achieved, the associations between the clusters of the two partitions must be investigated. For this aim, different statistical tools have been proposed in the literature. Two methods focus on the explanation of either partition through all categories of the other character. Benzécri et al. (1980) suggest a purely exploratory method that decomposes the squared distance between the grand centroid and the centroid of each cluster of rows in the canonical base composed by the columns. Then they derive indexes of the relative importance of the columns in the explanation of the clusters. Lebart et al. (1995) developed tests able to identify typical categories, based on the hypergeometric law. A third exploratory method deals with both partitions at the same time: Feoli and Orlóci (1979) propose the analysis of concentration, a weighted correspondence analysis of the table obtained by crossing the two partitions, further developed by Podani and Feoli (1991). The use of correspondence analysis for the purpose of comparing different partitions was also explored by Hubert and Arabie (1992).

Two drawbacks are common to the first two methods. First, they do not take into account the interactions between the items of the two hierarchies, in particular the nodes. Second, albeit very accurate in the identification of typical categories, they become difficult to use on large data tables. Although used successfully by Camiz (1994) for the investigation of the best block clustering and structuring of a vegetation table, the third method lacks tests of significance of the identified associations.

These difficulties were the starting point for the development of the Denimal (1997) indexes, successfully tested by Camiz and Denimal (1998a; 1998b) on socio-economic data, with some graphical tools for the display of results. Completed here with exact conditional tests, the indexes overcome these drawbacks. Indeed, considering both hierarchies cut at a suitable level, attention is restricted to the groups of both partitions and to the upper nodes of the hierarchies. Thus, the indexes measure the influence of the nodes of each hierarchy on both nodes and groups of the other, and statistical tests ensure the significance of the associations.

2.5. Association measures for categorical data analysis

Different overall measures of association between rows and columns of a contingency table have been proposed in the literature. Besides the usual chi-square and likelihood ratio chi-square statistics, Altham (1970a; 1970b) proposed some measures based on odds ratios and an exact Bayesian test. Aiming at comparing two hierarchies, Fowlkes and Mallows (1983) introduced a measure $B_k$, derived from the matching matrix formed by cutting the two hierarchical trees at the same level $k$, that is at the same number of classes, and counting the matching entries. Fowlkes and Mallows show that $B_k$ may be interpreted as a version of Daniels' (1944) generalized correlation coefficient. Hubert (1985) depicts a more general framework, including Daniels' coefficient, by proposing different association indexes between measures of proximity. Similar ideas were developed by Hubert and Arabie (1985), who include within their frame Rand's (1971) measure of partition and other usual association indexes, like the chi-square and the Goodman and Kruskal (1954) statistics.

Compared with Altham's measures, our indexes recall the logarithm of odds ratios, at least in their formulation, although they come from a different approach, as will be shown further. Unlike the $B_k$ coefficient, which assesses the association between two partitions, they compare the pairs of two-component groups corresponding to a pair of nodes, one from each hierarchy. However, a similarity with $B_k$ can be noticed, as both can be seen as generalised correlation coefficients. Moreover, our indexes derive from a geometric representation of the hierarchies in special bases, which give orthogonal decompositions of chi-square distances, in agreement with the metric of correspondence analysis, so that their geometric interpretation is straightforward. Indeed, for every node, they represent the contribution of each node of the other hierarchy to the deviation between the merging groups.

In addition, our indexes may be seen as squares of correlation coefficients and, when a hypergeometric model is assumed, as squared standardized variables. It must be pointed out that the sign of the correlation is helpful for identifying the direction of the interaction.

3. A review of interaction indexes

In this section, we sum up Denimal's (1997) results with some adaptations. In particular, we shall use the Agresti (1990) notation.

3.1. Notations

We are given a contingency table $k_{IJ}$ crossing two sets of categories $I$ and $J$, with two hierarchies, $H_I$ built on the rows and $H_J$ built on the columns. The generic entry of the table is $k_{ij}$, and the row margins, the column margins and the grand total are written respectively:

\[ k_{i+} = \sum_{j \in J} k_{ij}, \qquad k_{+j} = \sum_{i \in I} k_{ij}, \qquad k = \sum_{i \in I} k_{i+} = \sum_{j \in J} k_{+j} . \]

Considering two subsets of $I$ and $J$, written $p$ and $q$ respectively, we write the cumulated occurrences:

\[ k_{pj} = \sum_{i \in p} k_{ij}, \quad k_{iq} = \sum_{j \in q} k_{ij}, \quad k_{p+} = \sum_{i \in p} k_{i+}, \quad k_{+q} = \sum_{j \in q} k_{+j}, \quad k_{pq} = \sum_{i \in p} \sum_{j \in q} k_{ij} . \]

We deal with profiles, so that both $I$ and $J$ may be seen as clouds of points in $R^{|J|}$ and $R^{|I|}$, respectively. In the space $R^{|J|}$, the elements $i \in I$ are given coordinates $(k_{ij}/k_{i+})_{j \in J}$ and mass $k_{i+}/k$; in $R^{|I|}$ the elements $j \in J$ are given coordinates $(k_{ij}/k_{+j})_{i \in I}$ and mass $k_{+j}/k$. As a consequence, each subset may be synthesized by its centroid in either $R^{|J|}$ or $R^{|I|}$, with respective coordinates:

\[ c_J(p) = \left( \frac{k_{pj}}{k_{p+}} \right)_{j \in J}, \qquad c_I(q) = \left( \frac{k_{iq}}{k_{+q}} \right)_{i \in I} . \]

When $p = I$ or $q = J$, the grand centroids $c_J = c_J(I)$ of $I$ and $c_I = c_I(J)$ of $J$ result. Both spaces $R^{|I|}$ and $R^{|J|}$ are provided with the chi-square metric, given by the diagonal matrices

\[ M_I = \mathrm{diag}\left( \frac{k}{k_{i+}} \right)_{i \in I}, \qquad M_J = \mathrm{diag}\left( \frac{k}{k_{+j}} \right)_{j \in J}, \]

whose associated norms are written $\| \cdot \|_{M_I}$ and $\| \cdot \|_{M_J}$.

A node $n$ of $H_I$ merging two groups $p_1$ and $p_2$ is written $n = (p_1, p_2)$, and analogously $m = (q_1, q_2) \in H_J$. We shall also consider the ranks of the nodes, written respectively $\alpha(n)$, with $1 \le \alpha(n) \le |I| - 1$, and $\beta(m)$, with $1 \le \beta(m) \le |J| - 1$. Nodes are also considered as groups merging their component groups. In this case, $n = (p_1, p_2)$ is written $p_1 \cup p_2$. Finally, $\bar{p}$ indicates the complement of $p$ in $I$, and the analogous notation $\bar{q}$ is used for the complement of $q$ in $J$.
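The notation above can be made concrete with a small sketch. The table `K` below is hypothetical (illustrative values only), and the helper names `margins`, `centroid_J` and `chi2_sqdist_J` are assumptions introduced for this example; exact rational arithmetic is used so that the quantities can be compared without rounding.

```python
from fractions import Fraction

# Hypothetical 3 x 3 contingency table k_IJ (rows I, columns J).
K = [[4, 1, 0],
     [2, 3, 1],
     [0, 2, 5]]

def margins(table):
    """Return (row margins k_{i+}, column margins k_{+j}, grand total k)."""
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    return rows, cols, sum(rows)

def centroid_J(table, p):
    """Centroid c_J(p) of a row subset p: coordinates k_{pj} / k_{p+}, j in J."""
    k_pj = [sum(table[i][j] for i in p) for j in range(len(table[0]))]
    k_p = sum(k_pj)
    return [Fraction(x, k_p) for x in k_pj]

def chi2_sqdist_J(table, p1, p2):
    """Squared chi-square distance between c_J(p1) and c_J(p2),
    using the metric M_J = diag(k / k_{+j})."""
    _, cols, k = margins(table)
    c1, c2 = centroid_J(table, p1), centroid_J(table, p2)
    return sum(Fraction(k, cols[j]) * (c1[j] - c2[j]) ** 2
               for j in range(len(cols)))

# The centroid of the whole set I is the grand centroid (k_{+j} / k)_j.
print(centroid_J(K, [0, 1, 2]))
print(chi2_sqdist_J(K, [0], [2]))
```

Taking `p` equal to all rows returns the profile of the column margins, which matches the definition of the grand centroid $c_J(I)$.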

3.2. Bases associated to hierarchies

It is possible to use the nodes of both hierarchies $H_I$ and $H_J$ to define orthogonal bases on the spaces $R^{|I|}$ and $R^{|J|}$, respectively. Cazes (1984) and Weiss (1978) already used these bases for different aims.

Considering the space $R^{|J|}$, the base $B_J$ associated to the hierarchy $H_J$ is the set of $|J|$ vectors $\{ e_J^{\beta} \mid 0 \le \beta \le |J| - 1 \}$ such that for every $j \in J$:

– for $\beta = 0$, $e_J^{0}(j) = \dfrac{k_{+j}}{k}$;

– for every $m = (q_1, q_2) \in H_J$, with $\beta = \beta(m)$,

\[ e_J^{\beta}(j) = \begin{cases} \dfrac{k_{+j}}{k_{+q_1}} & \text{if } j \in q_1 \\[2mm] -\dfrac{k_{+j}}{k_{+q_2}} & \text{if } j \in q_2 \\[2mm] 0 & \text{otherwise.} \end{cases} \]

In this way, all vectors but one correspond to nodes of $H_J$ and the following property easily results:

Property 1: If $R^{|J|}$ is provided with the metric $M_J$, then:

a) the base $B_J$ is orthogonal;

b) for every $\beta \ge 1$, $\| e_J^{\beta} \|^2_{M_J} = k \cdot \dfrac{k_{+q_1} + k_{+q_2}}{k_{+q_1} \cdot k_{+q_2}}$.
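Property 1 can be checked numerically on a toy case. The sketch below builds the vectors of the base from hypothetical column margins and a hypothetical hierarchy on four columns; the names `basis_J` and `dot_MJ` are assumptions made for this example and do not come from the paper.

```python
from fractions import Fraction

# Hypothetical column margins k_{+j} for J = {0, 1, 2, 3} and a hypothetical
# hierarchy H_J, given as nodes (q1, q2) from the first merge up to the root.
col_margins = [6, 3, 5, 4]
H_J = [((0,), (1,)), ((2,), (3,)), ((0, 1), (2, 3))]

def basis_J(col_margins, nodes):
    """Vectors e^0, e^1, ... of the base associated to the hierarchy."""
    k = sum(col_margins)
    basis = [[Fraction(c, k) for c in col_margins]]      # e^0(j) = k_{+j} / k
    for q1, q2 in nodes:
        k_q1 = sum(col_margins[j] for j in q1)
        k_q2 = sum(col_margins[j] for j in q2)
        vec = []
        for j, c in enumerate(col_margins):
            if j in q1:
                vec.append(Fraction(c, k_q1))
            elif j in q2:
                vec.append(-Fraction(c, k_q2))
            else:
                vec.append(Fraction(0))
        basis.append(vec)
    return basis

def dot_MJ(x, y, col_margins):
    """Scalar product for the chi-square metric M_J = diag(k / k_{+j})."""
    k = sum(col_margins)
    return sum(Fraction(k, c) * a * b for a, b, c in zip(x, y, col_margins))

B = basis_J(col_margins, H_J)
# a) every pair of distinct vectors is orthogonal
print(all(dot_MJ(B[a], B[b], col_margins) == 0
          for a in range(len(B)) for b in range(a)))
# b) squared norm of the vector of node ((0,), (1,)): k (6 + 3) / (6 * 3)
print(dot_MJ(B[1], B[1], col_margins))
```

With these margins the first node has $k = 18$, $k_{+q_1} = 6$ and $k_{+q_2} = 3$, so part b) gives a squared norm of 9.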

Similarly, the orthogonal base $B_I = \{ e_I^{\alpha} \mid 0 \le \alpha \le |I| - 1 \}$ of $R^{|I|}$ associated to the hierarchy $H_I$ is defined, with analogous properties.

3.3. Interaction indexes

3.3.1 – Definitions

We define by

\[ A_{pq} = \frac{k \cdot k_{pq}}{k_{p+} \cdot k_{+q}} \]

the association between any two groups $p$ and $q$, one from each hierarchy. This expression is greater or smaller than 1 depending on whether the observed cell frequency $k_{pq}$ is greater or smaller than $k_{p+} \cdot k_{+q} / k$, the expected number of observations in the cell defined by $p$ and $q$ under the hypergeometric model (see the hypothesis $H_0$ in §4). On this basis, we define as measure of interaction between two nodes $n = (p_1, p_2) \in H_I$ and $m = (q_1, q_2) \in H_J$ the quantity (1)

\[ \mathrm{Int}((p_1, p_2); (q_1, q_2)) = (A_{p_2 q_2} - A_{p_1 q_2}) - (A_{p_2 q_1} - A_{p_1 q_1}) . \]

(1) Indeed, since Altham's (1970a; 1970b) odds ratio is $\theta = \dfrac{k_{p_1 q_1} \cdot k_{p_2 q_2}}{k_{p_1 q_2} \cdot k_{p_2 q_1}}$, the expression of $\mathrm{Int}((p_1, p_2); (q_1, q_2))$ is similar to the logarithm of the odds ratio $\ln(\theta) = (\ln(A_{p_2 q_2}) - \ln(A_{p_1 q_2})) - (\ln(A_{p_2 q_1}) - \ln(A_{p_1 q_1}))$.

3.3.2 – Decomposition of distances

In the following, we consider the hierarchy $H_I$, but the reciprocal may be stated for $H_J$. Each cluster $p$ of $H_I$ can be represented in $R^{|J|}$ by a sub-cloud of points and synthetically by its centroid $c_J(p)$. For each node $n = (p_1, p_2) \in H_I$, the influence of the nodes $m = (q_1, q_2) \in H_J$ on the squared chi-square distance $\| c_J(p_1) - c_J(p_2) \|^2_{M_J}$ is measured by decomposing this distance in the orthogonal base $B_J$ associated with $H_J$. As a consequence, the use of the orthogonal bases associated with $H_I$ and $H_J$ easily leads to:

Property 2: If $n = (p_1, p_2) \in H_I$ and $m = (q_1, q_2) \in H_J$, the following equalities hold:

\[ \| c_J(p_1) - c_J(p_2) \|^2_{M_J} = \sum_{\substack{m \in H_J \\ m = (q_1, q_2)}} \frac{k_{+q_1} \cdot k_{+q_2}}{k_{+q_1} + k_{+q_2}} \cdot \frac{1}{k} \cdot \left( \mathrm{Int}((p_1, p_2); (q_1, q_2)) \right)^2 , \]

\[ \| c_I(q_1) - c_I(q_2) \|^2_{M_I} = \sum_{\substack{n \in H_I \\ n = (p_1, p_2)}} \frac{k_{p_1+} \cdot k_{p_2+}}{k_{p_1+} + k_{p_2+}} \cdot \frac{1}{k} \cdot \left( \mathrm{Int}((p_1, p_2); (q_1, q_2)) \right)^2 . \]

3.3.3 – Indexes of interaction between two nodes

Since we decomposed each distance between the components of a node according to the nodes of the other hierarchy, we can define a series of interaction indexes that measure the impact of the nodes of $H_J$ on each node of $H_I$ and vice versa. Let us define the inertia of the dipole $n = (p_1, p_2)$ as

\[ \nu(n) = \frac{k_{p_1+}}{k} \cdot \| c_J(p_1) - c_J(p_1 \cup p_2) \|^2_{M_J} + \frac{k_{p_2+}}{k} \cdot \| c_J(p_2) - c_J(p_1 \cup p_2) \|^2_{M_J} . \]

This inertia is also the difference between the within-groups variances of the two partitions defined before and after the formation of the node $n = (p_1, p_2)$. Simple calculations show that:

\[ \nu(n) = \frac{k_{p_1+} \cdot k_{p_2+}}{k_{p_1+} + k_{p_2+}} \cdot \frac{1}{k} \cdot \| c_J(p_1) - c_J(p_2) \|^2_{M_J} , \]

so that $\nu(n)$ measures the deviation between the component groups. In addition, using Property 2 we deduce:

\[ \nu(n) = \sum_{\substack{m \in H_J \\ m = (q_1, q_2)}} \frac{k_{+q_1} \cdot k_{+q_2}}{k_{+q_1} + k_{+q_2}} \cdot \frac{k_{p_1+} \cdot k_{p_2+}}{k_{p_1+} + k_{p_2+}} \cdot \frac{1}{k^2} \cdot \left( \mathrm{Int}((p_1, p_2); (q_1, q_2)) \right)^2 . \]

Reciprocally, for each node $m = (q_1, q_2) \in H_J$, the analogous decomposition holds:

\[ \nu(m) = \frac{k_{+q_1} \cdot k_{+q_2}}{k_{+q_1} + k_{+q_2}} \cdot \frac{1}{k} \cdot \| c_I(q_1) - c_I(q_2) \|^2_{M_I} , \]

\[ \nu(m) = \sum_{\substack{n \in H_I \\ n = (p_1, p_2)}} \frac{k_{p_1+} \cdot k_{p_2+}}{k_{p_1+} + k_{p_2+}} \cdot \frac{k_{+q_1} \cdot k_{+q_2}}{k_{+q_1} + k_{+q_2}} \cdot \frac{1}{k^2} \cdot \left( \mathrm{Int}((p_1, p_2); (q_1, q_2)) \right)^2 , \]

where the terms of both decompositions may be used as a first series of interaction indexes between the nodes of the two hierarchies (2):

\[ \Phi((p_1, p_2); (q_1, q_2)) = \frac{k_{p_1+} \cdot k_{p_2+}}{k_{p_1+} + k_{p_2+}} \cdot \frac{k_{+q_1} \cdot k_{+q_2}}{k_{+q_1} + k_{+q_2}} \cdot \frac{1}{k^2} \cdot \left( \mathrm{Int}((p_1, p_2); (q_1, q_2)) \right)^2 . \]

(2) If Ward's (1963) method is used with the chi-square metric, its fusion index is exactly $\nu(n)$. Therefore, it may be seen as the sum of all the $\Phi((p_1, p_2); (q_1, q_2))$ between the considered node and all the nodes of the other hierarchy.
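As an illustration of §3.3.1 to §3.3.3, the following sketch computes the association, the interaction and the node-to-node index (named `phi` in the code), then checks that the indexes of one row node sum to its inertia over all the nodes of the column hierarchy. The table `K`, the hierarchy `H_J` and all function names are hypothetical choices made for this example.

```python
from fractions import Fraction

# Hypothetical 4 x 4 table and column hierarchy H_J (illustrative values only).
K = [[4, 1, 0, 2],
     [2, 3, 1, 0],
     [0, 2, 5, 1],
     [1, 0, 2, 3]]
H_J = [((0,), (1,)), ((2,), (3,)), ((0, 1), (2, 3))]  # nodes (q1, q2)
ROWS, COLS = tuple(range(len(K))), tuple(range(len(K[0])))
k = sum(map(sum, K))

def count(p, q):
    """Cumulated occurrences k_{pq} over row subset p and column subset q."""
    return sum(K[i][j] for i in p for j in q)

def A(p, q):
    """Association A_pq = k * k_pq / (k_{p+} * k_{+q})."""
    return Fraction(k * count(p, q), count(p, COLS) * count(ROWS, q))

def interaction(p1, p2, q1, q2):
    """Int((p1, p2); (q1, q2)) as defined in 3.3.1."""
    return (A(p2, q2) - A(p1, q2)) - (A(p2, q1) - A(p1, q1))

def weight(a, b):
    """Factor a * b / (a + b) attached to a node whose components have masses a, b."""
    return Fraction(a * b, a + b)

def phi(p1, p2, q1, q2):
    """Node-to-node interaction index of 3.3.3."""
    w_rows = weight(count(p1, COLS), count(p2, COLS))
    w_cols = weight(count(ROWS, q1), count(ROWS, q2))
    return w_rows * w_cols * interaction(p1, p2, q1, q2) ** 2 / k ** 2

def nu(p1, p2):
    """nu(n) computed directly from the chi-square distance between centroids."""
    d2 = sum(
        Fraction(k, count(ROWS, (j,)))
        * (Fraction(count(p1, (j,)), count(p1, COLS))
           - Fraction(count(p2, (j,)), count(p2, COLS))) ** 2
        for j in COLS
    )
    return weight(count(p1, COLS), count(p2, COLS)) * d2 / k

n = ((0, 1), (2, 3))  # a hypothetical node of H_I
print(nu(*n) == sum(phi(*n, q1, q2) for q1, q2 in H_J))
```

The equality relies on `H_J` being a complete binary hierarchy on the columns, so that its nodes together with the constant vector span the whole space.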

3.3.4 – Indexes of interaction between a group and a node

Let us consider a group $p$ of $H_I$ (i.e., the node $n = (p_1, p_2)$ seen as the union $p = p_1 \cup p_2$ of its components). The influence of the nodes of $H_J$ on $p$ may be studied through the interactions between the pair $(p, \bar{p})$ and the nodes $(q_1, q_2)$ of $H_J$, where $\bar{p}$ stands for the complement of $p$. Applying the previous results to this particular case and denoting by $\nu(p, \bar{p})$ the inertia of the dipole $(p, \bar{p})$, we easily deduce:

\[ \nu(p, \bar{p}) = \sum_{\substack{m \in H_J \\ m = (q_1, q_2)}} \frac{k_{+q_1} \cdot k_{+q_2}}{k_{+q_1} + k_{+q_2}} \cdot \frac{k_{p+} \cdot (k - k_{p+})}{k^3} \cdot \left( \mathrm{Int}((p, \bar{p}); (q_1, q_2)) \right)^2 . \]

For any group $q$ of $H_J$ we have $A_{\bar{p} q} - A_{pq} = \dfrac{k}{k - k_{p+}} \cdot (1 - A_{pq})$, so that

\[ \mathrm{Int}((p, \bar{p}); (q_1, q_2)) = -\frac{k}{k - k_{p+}} \cdot (A_{p q_2} - A_{p q_1}) , \]

and it follows that

\[ \nu(p, \bar{p}) = \sum_{\substack{m \in H_J \\ m = (q_1, q_2)}} \frac{k_{+q_1} \cdot k_{+q_2}}{k_{+q_1} + k_{+q_2}} \cdot \frac{k_{p+}}{k - k_{p+}} \cdot \frac{1}{k} \cdot (A_{p q_2} - A_{p q_1})^2 . \]

A second series of interaction indexes can then be introduced, measuring the impact of the nodes $m = (q_1, q_2)$ of $H_J$ on the cluster $p$ of $H_I$:

\[ \Phi(p; (q_1, q_2)) = \frac{k_{+q_1} \cdot k_{+q_2}}{k_{+q_1} + k_{+q_2}} \cdot \frac{k_{p+}}{k - k_{p+}} \cdot \frac{1}{k} \cdot (A_{p q_2} - A_{p q_1})^2 . \]

Reciprocally, for $q \in H_J$ and $(p_1, p_2) \in H_I$, a third series of interaction indexes results:

\[ \Phi(q; (p_1, p_2)) = \frac{k_{p_1+} \cdot k_{p_2+}}{k_{p_1+} + k_{p_2+}} \cdot \frac{k_{+q}}{k - k_{+q}} \cdot \frac{1}{k} \cdot (A_{p_2 q} - A_{p_1 q})^2 . \]
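The group-to-node index can be cross-checked against the node-to-node formula applied to the dipole made of a group and its complement. The sketch below does so on a hypothetical table; `phi_group_node`, `int_dipole` and `phi_dipole` are names introduced here for illustration only.

```python
from fractions import Fraction

# Hypothetical 4 x 4 table (illustrative values only).
K = [[4, 1, 0, 2],
     [2, 3, 1, 0],
     [0, 2, 5, 1],
     [1, 0, 2, 3]]
ROWS, COLS = tuple(range(len(K))), tuple(range(len(K[0])))
k = sum(map(sum, K))

def count(p, q):
    """Cumulated occurrences k_{pq}."""
    return sum(K[i][j] for i in p for j in q)

def A(p, q):
    """Association A_pq = k * k_pq / (k_{p+} * k_{+q})."""
    return Fraction(k * count(p, q), count(p, COLS) * count(ROWS, q))

def phi_group_node(p, q1, q2):
    """Closed form of the index between the group p and the node (q1, q2)."""
    kq1, kq2, kp = count(ROWS, q1), count(ROWS, q2), count(p, COLS)
    return (Fraction(kq1 * kq2, kq1 + kq2) * Fraction(kp, k - kp)
            * (A(p, q2) - A(p, q1)) ** 2 / k)

def int_dipole(p, q1, q2):
    """Int for the dipole (p, complement of p), from the definition of Int."""
    pc = tuple(i for i in ROWS if i not in p)
    return (A(pc, q2) - A(p, q2)) - (A(pc, q1) - A(p, q1))

def phi_dipole(p, q1, q2):
    """Node-to-node index applied to the dipole (p, complement of p)."""
    kq1, kq2, kp = count(ROWS, q1), count(ROWS, q2), count(p, COLS)
    return (Fraction(kp * (k - kp), k) * Fraction(kq1 * kq2, kq1 + kq2)
            * int_dipole(p, q1, q2) ** 2 / k ** 2)

p, q1, q2 = (0, 1), (0,), (1,)
print(phi_group_node(p, q1, q2) == phi_dipole(p, q1, q2))
```

Only the square of the interaction enters the index, so the two routes agree whatever sign convention is used for the dipole.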

4. Statistical interpretations of the indexes

We shall show that the interaction indexes $\Phi$ are squared correlation coefficients. For this purpose, we must take into account a 3 × 3 contingency table associated with the two nodes $n = (p_1, p_2) \in H_I$ and $m = (q_1, q_2) \in H_J$. This table is built by crossing the two partitions of the whole set $L$ of all $k$ units, $(p_1, p_2, \overline{p_1 \cup p_2})$ and $(q_1, q_2, \overline{q_1 \cup q_2})$; its entries are represented in Figure 1.

\[
\begin{array}{l|ccc|c}
 & q_1 & q_2 & \overline{q_1 \cup q_2} & \text{margin} \\ \hline
p_1 & k_{p_1 q_1} = w & k_{p_1 q_2} = u & k_{p_1+} - w - u & k_{p_1+} \\
p_2 & k_{p_2 q_1} = t & k_{p_2 q_2} = v & k_{p_2+} - t - v & k_{p_2+} \\
\overline{p_1 \cup p_2} & k_{+q_1} - w - t & k_{+q_2} - u - v & k - k_{p_1+} - k_{p_2+} - k_{+q_1} - k_{+q_2} + u + v + w + t & k - k_{p_1+} - k_{p_2+} \\ \hline
\text{margin} & k_{+q_1} & k_{+q_2} & k - k_{+q_1} - k_{+q_2} & k
\end{array}
\]

Fig. 1. The contingency table associated with two nodes $(p_1, p_2)$ and $(q_1, q_2)$.

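The interpretation of our interaction indexes is based on this table, which summarises the agreement between the two partitions.

4.1. Interpretation of interaction indexes

Based on $L$, we associate to the nodes $n = (p_1, p_2)$ and $m = (q_1, q_2)$ the following two variables $U$ and $V$, defined for every $\ell \in L$ by

\[ U(\ell) = \begin{cases} \dfrac{1}{k_{p_1+}} & \text{if } \ell \in p_1 \\[2mm] 0 & \text{if } \ell \in \overline{p_1 \cup p_2} \\[2mm] -\dfrac{1}{k_{p_2+}} & \text{if } \ell \in p_2 \end{cases} \qquad V(\ell) = \begin{cases} \dfrac{1}{k_{+q_1}} & \text{if } \ell \in q_1 \\[2mm] 0 & \text{if } \ell \in \overline{q_1 \cup q_2} \\[2mm] -\dfrac{1}{k_{+q_2}} & \text{if } \ell \in q_2 . \end{cases} \]

Easy computations show that both $U$ and $V$ are centered and that their variances and covariance are:

\[ \mathrm{var}(U) = \frac{1}{k} \cdot \left( \frac{1}{k_{p_1+}} + \frac{1}{k_{p_2+}} \right), \qquad \mathrm{var}(V) = \frac{1}{k} \cdot \left( \frac{1}{k_{+q_1}} + \frac{1}{k_{+q_2}} \right), \]

\[ \mathrm{cov}(U, V) = \frac{1}{k^2} \cdot \mathrm{Int}((p_1, p_2); (q_1, q_2)) . \]

Therefore, the square of the correlation between $U$ and $V$ is equal to

\[ (\mathrm{cor}(U, V))^2 = \Phi((p_1, p_2); (q_1, q_2)) . \]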
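The identity between the squared correlation and the node-to-node index can be verified on any 3 × 3 table laid out as in Figure 1. The sketch below uses a hypothetical table `T` (rows $p_1$, $p_2$, remainder; columns $q_1$, $q_2$, remainder) and exact fractions; the variable names are assumptions made for this example.

```python
from fractions import Fraction

# Hypothetical 3 x 3 table in the layout of Figure 1.
T = [[8, 2, 3],
     [1, 9, 4],
     [5, 6, 7]]
k = sum(map(sum, T))
r = [sum(row) for row in T]          # k_{p1+}, k_{p2+}, remainder
c = [sum(col) for col in zip(*T)]    # k_{+q1}, k_{+q2}, remainder

# Values taken by U on the three row classes and by V on the three column classes.
u_val = [Fraction(1, r[0]), Fraction(-1, r[1]), Fraction(0)]
v_val = [Fraction(1, c[0]), Fraction(-1, c[1]), Fraction(0)]

def mean(f):
    """Average of f over the k units, grouped by cell of the table."""
    return sum(T[a][b] * f(a, b) for a in range(3) for b in range(3)) / Fraction(k)

mean_U = mean(lambda a, b: u_val[a])
mean_V = mean(lambda a, b: v_val[b])
var_U = mean(lambda a, b: u_val[a] ** 2) - mean_U ** 2
var_V = mean(lambda a, b: v_val[b] ** 2) - mean_V ** 2
cov_UV = mean(lambda a, b: u_val[a] * v_val[b]) - mean_U * mean_V

def A(a, b):
    """Association between row class a and column class b."""
    return Fraction(k * T[a][b], r[a] * c[b])

Int = (A(1, 1) - A(0, 1)) - (A(1, 0) - A(0, 0))
Phi = (Fraction(r[0] * r[1], r[0] + r[1]) * Fraction(c[0] * c[1], c[0] + c[1])
       * Int ** 2 / k ** 2)

print(mean_U == 0 and mean_V == 0)          # U and V are centered
print(cov_UV == Int / k ** 2)               # covariance formula
print(cov_UV ** 2 / (var_U * var_V) == Phi) # squared correlation equals the index
```

The sign of `cov_UV` carries the direction of the interaction, which the squared index discards.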

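Furthermore, we can derive the following properties, which are special cases of the table in Figure 1:

– $\mathrm{cor}(U, V) = 1$ if and only if the population is concentrated in a single upper-left to lower-right diagonal of the cross-classification table, as shown in Figure 2a).

– $\mathrm{cor}(U, V) = -1$ if and only if the population is concentrated in the lower-left to upper-right diagonal of the sub-table crossing $\{p_1, p_2\}$ and $\{q_1, q_2\}$, and in the cell $(\overline{p_1 \cup p_2}, \overline{q_1 \cup q_2})$, as shown in Figure 2b).

\[
\text{a)} \quad
\begin{array}{l|ccc}
 & q_1 & q_2 & \overline{q_1 \cup q_2} \\ \hline
p_1 & k_{p_1 q_1} & 0 & 0 \\
p_2 & 0 & k_{p_2 q_2} & 0 \\
\overline{p_1 \cup p_2} & 0 & 0 & k_{\overline{p_1 \cup p_2}, \overline{q_1 \cup q_2}}
\end{array}
\qquad
\text{b)} \quad
\begin{array}{l|ccc}
 & q_1 & q_2 & \overline{q_1 \cup q_2} \\ \hline
p_1 & 0 & k_{p_1 q_2} & 0 \\
p_2 & k_{p_2 q_1} & 0 & 0 \\
\overline{p_1 \cup p_2} & 0 & 0 & k_{\overline{p_1 \cup p_2}, \overline{q_1 \cup q_2}}
\end{array}
\]

Fig. 2. a) complete association, and b) complete counter-association between $(p_1, p_2)$ and $(q_1, q_2)$.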

– $\mathrm{cor}(U, V) = 0$ when $\mathrm{Int}((p_1, p_2); (q_1, q_2)) = 0$. Since this can be written as

\[ \frac{k_{p_1 q_1}}{k_{p_1+} \cdot k_{+q_1}} + \frac{k_{p_2 q_2}}{k_{p_2+} \cdot k_{+q_2}} = \frac{k_{p_1 q_2}}{k_{p_1+} \cdot k_{+q_2}} + \frac{k_{p_2 q_1}}{k_{p_2+} \cdot k_{+q_1}} , \]

we can say that in this case the diagonals of the 2 × 2 subtable have "equal weights". Notice that this happens in the usual case of independence.

4.2. The multiple hypergeometric model

The statistic $\mathrm{cor}(U, V)$ only depends on the entries of the 3 × 3 table crossing the two partitions $(p_1, p_2, \overline{p_1 \cup p_2})$ and $(q_1, q_2, \overline{q_1 \cup q_2})$ shown in Figure 1. In order to define a null hypothesis $H_0$ expressing the absence of relations between $(p_1, p_2)$ and $(q_1, q_2)$, we may state that the $k$ objects are randomly allocated to the cells, on condition that the table margins, namely $k_{p_1+}$, $k_{p_2+}$, $k_{+q_1}$, $k_{+q_2}$ and $k$, are fixed. This null hypothesis is very common in non-parametric analysis, and is the same used by Fowlkes and Mallows (1983) and Lebart et al. (1995). Under this assumption, the elements of the matrix have a generalized hypergeometric distribution (Agresti, 1992), defined by the probabilities

\[ P(u, v, w, t) = P(k_{p_1 q_2} = u, \; k_{p_2 q_2} = v, \; k_{p_1 q_1} = w, \; k_{p_2 q_1} = t) . \]

There are cases where the conditioning on both sets of margins may seem artificial, so that other approaches, such as multinomial sampling, may be preferred (Agresti, 1992). This latter model, conditioned on just one margin distribution, is not convenient for our context, where the two hierarchies play identical roles. Besides, an advantage of conditioning on both margin distributions is the elimination of the nuisance parameters present in the multinomial model. As a drawback, the hypergeometric distribution is highly discrete, so that the corresponding Fisher-type tests may be considered too conservative at fixed significance levels (Wallace, 1983; Agresti, 1992). This problem becomes less acute when the dimensions of the table increase (here, 3 × 3), since the number of possible values in exact conditional sampling distributions also tends to increase (Agresti, 1990).

The values and moments of the hypergeometric distribution are well known (Plackett, 1981). Besides, the marginal laws are usual hypergeometric laws, which allows us to consider the quantity $A_{pq}$ (see §3.3.1) as the quotient between $k_{pq}$ and its expectation. As a consequence, the expectation and the variance of $\mathrm{Int}((p_1, p_2); (q_1, q_2))$ within the framework of the hypergeometric model can easily be deduced from these results:

\[ E[\mathrm{Int}((p_1, p_2); (q_1, q_2))] = 0 , \]

\[ \mathrm{Var}[\mathrm{Int}((p_1, p_2); (q_1, q_2))] = \frac{k^2}{k - 1} \cdot \frac{k_{p_1+} + k_{p_2+}}{k_{p_1+} \cdot k_{p_2+}} \cdot \frac{k_{+q_1} + k_{+q_2}}{k_{+q_1} \cdot k_{+q_2}} . \]

Therefore, we obtain:

\[ \frac{\left( \mathrm{Int}((p_1, p_2); (q_1, q_2)) \right)^2}{\mathrm{Var}[\mathrm{Int}((p_1, p_2); (q_1, q_2))]} = (k - 1) \cdot \Phi((p_1, p_2); (q_1, q_2)) . \]

Then, up to the coefficient $k - 1$, $\Phi((p_1, p_2); (q_1, q_2))$ is the square of a standardized variable.
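One possible way to recover these two moments, given here only as a sketch and not necessarily the route taken through Plackett (1981), is to read the fixed-margins model as a uniformly random permutation $\pi$ of the $k$ units that matches column labels to row labels, and to apply the classical permutation moments to the variables $U$ and $V$ of §4.1. The permutation reading and the symbols $\pi$, $\bar{U}$ and $\bar{V}$ are assumptions of this sketch.

```latex
% Under H_0, with U and V as in section 4.1 (both centered: \bar U = \bar V = 0)
\begin{align*}
\mathrm{Int} &= k^2 \, \mathrm{cov}(U, V)
              = k \sum_{\ell \in L} U(\ell) \, V(\pi(\ell)), \\
E\Big[ \sum_{\ell \in L} U(\ell) V(\pi(\ell)) \Big]
             &= k \, \bar U \, \bar V = 0, \\
\mathrm{Var}\Big[ \sum_{\ell \in L} U(\ell) V(\pi(\ell)) \Big]
             &= \frac{1}{k-1} \sum_{\ell \in L} \big( U(\ell) - \bar U \big)^2
                \sum_{\ell \in L} \big( V(\ell) - \bar V \big)^2
              = \frac{k^2}{k-1} \, \mathrm{var}(U) \, \mathrm{var}(V), \\
\mathrm{Var}[\mathrm{Int}]
             &= \frac{k^4}{k-1} \, \mathrm{var}(U) \, \mathrm{var}(V)
              = \frac{k^2}{k-1} \cdot
                \frac{k_{p_1+} + k_{p_2+}}{k_{p_1+} \, k_{p_2+}} \cdot
                \frac{k_{+q_1} + k_{+q_2}}{k_{+q_1} \, k_{+q_2}} .
\end{align*}
```

The last equality only substitutes the expressions of $\mathrm{var}(U)$ and $\mathrm{var}(V)$ given in §4.1.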

5. Exact conditional tests

5.1. Interpretation of nodes

Let us consider a node $n = (p_1, p_2) \in H_I$. The following test aims at finding the nodes $m = (q_1, q_2) \in H_J$ that have a significant interaction with $n$.

5.1.1 – Building the test

As described in the previous section, the absence of relations between $n$ and $m$ is expressed by the null hypothesis $H_0$ based on the multiple hypergeometric law. Therefore, exact conditional tests can be carried out by computing, under $H_0$, the probabilities $P[\Phi((p_1, p_2); (q_1, q_2)) \ge \Phi_{obs}]$, where $\Phi_{obs}$ is the observed value of $\Phi$. Once a risk level has been fixed (e.g. 5%), this test enables the identification of the nodes $m = (q_1, q_2)$ having a significant interaction with $n = (p_1, p_2)$.

The evaluation of these probabilities (called p-values in the following) involves a large number of computations and is therefore very time consuming. In order to speed it up, we implemented an algorithm derived from the ideas of Mehta and Patel's (1983) network algorithm. In our case, their algorithm may be stated as follows: consider all the 3 × 3 tables with the fixed margins defined in Figure 1 as paths in a network, joining a starting node to an ending one. Each path is composed of three edges, corresponding to the three columns of the table. Actually, the third edge is determined by the others, because the margins are fixed. The tables sharing the same first column are represented as paths having the same first edge. For these tables, the visit of the second edge leads to a quick computation of the maximum of $\Phi$, which is compared to $\Phi_{obs}$: if the maximum is smaller, none of the considered paths contributes to the p-value, and they may all be ignored. Through this algorithm, the number of tables to take into account in the computation of the hypergeometric law is drastically reduced, resulting in a much faster computation than dealing with all the tables having the same fixed margins.
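For small tables the p-value can also be obtained by brute force, which is useful as a reference when testing a faster implementation. The sketch below is not the network algorithm described above: it simply enumerates every 3 × 3 table with the observed margins and sums the multiple hypergeometric probabilities of those whose index is at least the observed one. The function name `exact_p_value` and the input layout (rows $p_1$, $p_2$, remainder; columns $q_1$, $q_2$, remainder, as in Figure 1) are assumptions of this example, and the margins of $p_1$, $p_2$, $q_1$ and $q_2$ are assumed to be positive.

```python
from fractions import Fraction
from math import factorial

def exact_p_value(T):
    """P[index >= observed index] under fixed margins, by full enumeration."""
    r = [sum(row) for row in T]
    c = [sum(col) for col in zip(*T)]
    k = sum(r)

    def int_sq(w, u, t, v):
        # Squared Int of the 2 x 2 sub-table; with fixed margins the index
        # is this value times a constant, so comparing Int^2 is equivalent.
        a = lambda x, i, j: Fraction(k * x, r[i] * c[j])
        return ((a(v, 1, 1) - a(u, 0, 1)) - (a(t, 1, 0) - a(w, 0, 0))) ** 2

    def prob(cells):
        # Multiple hypergeometric probability of a table with these cells.
        num = 1
        for m in r + c:
            num *= factorial(m)
        den = factorial(k)
        for x in cells:
            den *= factorial(x)
        return Fraction(num, den)

    obs = int_sq(T[0][0], T[0][1], T[1][0], T[1][1])
    p_value = total = Fraction(0)
    for w in range(min(r[0], c[0]) + 1):
        for u in range(min(r[0] - w, c[1]) + 1):
            for t in range(min(r[1], c[0] - w) + 1):
                for v in range(min(r[1] - t, c[1] - u) + 1):
                    last = r[2] - (c[0] - w - t) - (c[1] - u - v)
                    if last < 0:
                        continue  # margins cannot be met with these cells
                    cells = [w, u, r[0] - w - u,
                             t, v, r[1] - t - v,
                             c[0] - w - t, c[1] - u - v, last]
                    pr = prob(cells)
                    total += pr
                    if int_sq(w, u, t, v) >= obs:
                        p_value += pr
    assert total == 1  # the probabilities of all admissible tables sum to one
    return p_value

# Complete association on a tiny hypothetical table.
print(exact_p_value([[3, 0, 0], [0, 3, 0], [0, 0, 3]]))
```

The cost grows quickly with the margins, which is precisely what motivates a network-type pruning for real data.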

5.1.2 – Interpreting a significant interaction

When a significant interaction between two nodes is obtained, it is enough to calculate

\[ \sqrt{\frac{k_{p_1+} \cdot k_{p_2+}}{k_{p_1+} + k_{p_2+}}} \cdot \sqrt{\frac{k_{+q_1} \cdot k_{+q_2}}{k_{+q_1} + k_{+q_2}}} \cdot \frac{1}{k} \cdot \mathrm{Int}((p_1, p_2); (q_1, q_2)) , \]

whose square equals $\Phi$. As this quantity has been interpreted as a correlation coefficient, a value close to 1 or to −1 shows, respectively, a strong association or a strong counter-association between $(p_1, p_2)$ and $(q_1, q_2)$ (Figure 2).

5.2. Cut-points identification

The previous non-parametric tests may also be used for the identification of the cut-points of $H_I$ and $H_J$. More precisely, considering $H_I$, the difference $\nu(n)$ between the two successive within-groups variances obtained for each node $n = (p_1, p_2)$ can be decomposed (§3.3.3) into the sum:

\[ \nu(n) = \sum_{\substack{m \in H_J \\ m = (q_1, q_2)}} \Phi((p_1, p_2); (q_1, q_2)) . \]

A node $n$ of $H_I$ is considered significant if there is at least one node $m$ of the other hierarchy $H_J$ having a significant interaction $\Phi$ with it. As a rule, given a risk level, one may decide to cut the hierarchy $H_I$ either at the highest level such that all the nodes located under the cut-point are not significant, or at the lowest level such that all the upper nodes are significant. In the given example (§6), the two rules provide equivalent results.

5.3. Interpretation of the restricted hierarchies

Once both cut-points have been fixed, one can limit the attention to the subsets of the hierarchies composed of the nodes above the cut-points, denoted by $H_I^{SUP}$ and $H_J^{SUP}$. Thus, one deals with both terminal groups, i.e. those of the obtained partitions, and non-terminal ones, corresponding to the retained nodes seen as groups. Both kinds of groups of either restricted hierarchy may be explained by the nodes of the other.

So, considering a group $p$ of $H_I^{SUP}$, the exact conditional test identifying the nodes $m = (q_1, q_2)$ of $H_J^{SUP}$ that explain $p$ is carried out by using the statistic $\Phi((p, \bar{p}); (q_1, q_2))$. Reciprocally, $\Phi((p_1, p_2); (q, \bar{q}))$ is used to identify the nodes $n = (p_1, p_2)$ explaining a group $q$ of $H_J^{SUP}$.
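The two cut-point rules of §5.2 can be sketched as follows, assuming that the smallest p-value of each node is already available and that the nodes are listed from the root downwards. The function name `cut_points`, the input layout and the convention that a node is significant when its smallest p-value is at most the risk level are all assumptions of this example.

```python
def cut_points(min_p_top_down, alpha=0.01):
    """Number of upper nodes kept by each rule of section 5.2.

    min_p_top_down: smallest p-value of each node, ordered from the root
    of the hierarchy down to the first merge (assumed ordering).
    Returns (kept_rule_a, kept_rule_b) where
      rule a: highest cut such that no node under it is significant,
      rule b: lowest cut such that every node above it is significant.
    """
    significant = [p <= alpha for p in min_p_top_down]

    # Rule b: longest run of significant nodes starting at the root.
    kept_b = 0
    for s in significant:
        if not s:
            break
        kept_b += 1

    # Rule a: keep every node down to the lowest significant one.
    kept_a = max((i + 1 for i, s in enumerate(significant) if s), default=0)
    return kept_a, kept_b

# Hypothetical p-values: both rules keep the first four nodes.
print(cut_points([0.000, 0.000, 0.001, 0.003, 0.2, 0.5]))
# Hypothetical p-values where the rules disagree.
print(cut_points([0.0, 0.2, 0.001, 0.4]))
```

When the significant nodes form an unbroken run from the root, the two rules coincide.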

6. An example: the Nobel Awards

As an application, let us consider the set of Nobel Awards winners from the beginning in 1901 to 1998 (3). A contingency table $k_{IJ}$ results by classifying them according to both citizenship and award subject. Thus, the set $I$ is composed of 44 countries and the set of subjects is $J$ = {Medicine, Economics, Physics, Chemistry, Peace, Literature}, so that each entry corresponds to the number of winners of each country for each specific award. The table is reported in Figure 3 together with both hierarchies $H_I$ and $H_J$. They were built using Ward's method applied to the Euclidean distance among the items on the plane spanned by the first two factors of correspondence analysis. For clarity purposes, the representation of $H_I$ is limited to the 6 nodes above the chosen cut-point, i.e. to $H_I^{SUP}$. In the table, the nodes of $H_I$ (resp. $H_J$) are labelled following the items labelling, say from 46 to 89 (resp. from 7 to 11).

6.1. Choice of cut-points

Let us consider first the hierarchy $H_I$ on countries. As described in §5.2, we calculated for each node of $H_I$ the smallest p-value obtained from the tests carried out on $\Phi$. Having fixed a risk level of 1%, a cut-point of $H_I$ results: it keeps six upper nodes, labelled 89, 88, 87, 86, 85, and 84, summarizing 79.8% of the total variance. In Figure 4 the minimum p-values and the cumulative percentages of variance are displayed for all the nodes of $H_I$. The seven terminal groups of $H_I^{SUP}$ defined from this cut-point are represented in Figure 3, labelled 81, 82, 71, 74, 68, 78, and 83.

(3) Data were taken from the Nobel Awards Web site: http://www.nobel.se.

Group  Country          Medicine  Physics  Economics  Chemistry  Peace  Literature
82     Australia               3        0          0          1      0           1
82     Austria                 5        3          0          1      2           0
82     Hungary                 1        0          0          1      0           0
82     Netherlands             2        6          1          3      1           0
82     Pakistan                0        1          0          0      0           0
82     USA                    77       66         26         46     23          11
81     Canada                  1        2          1          3      2           0
81     Germany                15       20          1         26      4           6
81     United Kingdom         13       20          7         24     14           8
71     Argentina               1        0          0          1      2           0
71     Belgium                 4        0          0          1      4           1
71     Switzerland             6        4          0          5     12           2
74     Burma                   0        0          0          0      1           0
74     Costa Rica              0        0          0          0      1           0
74     Israel                  0        0          0          0      3           1
74     Palestine               0        0          0          0      1           0
74     South Africa            1        0          0          0      4           1
74     Tibet                   0        0          0          0      1           0
74     Timor                   0        0          0          0      2           0
74     Viet Nam                0        0          0          0      1           0
78     Denmark                 5        3          0          1      1           3
78     France                  8       11          1          7      9          12
78     Italy                   3        3          0          1      1           6
78     Japan                   1        3          0          1      1           2
78     Portugal                1        0          0          0      0           1
78     Soviet Union            2        7          1          1      2           4
78     Sweden                  7        4          2          4      5           7
83     Czechoslovakia          0        0          0          1      0           1
83     Egypt                   0        0          0          0      1           1
83     Finland                 0        0          0          1      0           1
83     Guatemala               0        0          0          0      1           1
83     India                   0        1          1          0      1           1
83     Ireland                 0        1          0          0      1           3
83     Mexico                  0        0          0          0      1           1
83     Norway                  0        0          2          1      2           3
83     Poland                  0        0          0          0      1           3
68     Chile                   0        0          0          0      0           2
68     Colombia                0        0          0          0      0           1
68     Greece                  0        0          0          0      0           2
68     Ireland                 0        0          0          0      0           1
68     Jugoslavia              0        0          0          0      0           1
68     Nigeria                 0        0          0          0      0           1
68     Saint Lucia             0        0          0          0      0           1
68     Spain                   0        0          0          0      0           5

(Dendrograms of $H_I^{SUP}$ and $H_J$ not reproduced.)

Fig. 3. The data table and the hierarchies $H_I$ and $H_J$.


Fig. 4. Minimum p-value (%, squares) and cumulative % of variance (circles).

Likewise, four upper nodes labelled 11, 10, 9, and 8, have been kept for H J . The five terminal groups of H J SUP are thus Medicine, 7 (gathering Economics and Physics), Chemistry, Peace, and Literature.

6.2. Interpretation of nodes of H_I^SUP and H_J^SUP

Here, we briefly describe the significant interactions between nodes of the restricted hierarchies. For each couple of nodes, we give the p-value (abbreviated p) of the interaction test based on . If necessary, in the node description, the couples are permuted in order to keep the correlation positive. In this way, we could also arrange the table in Figure 3 so as to reveal its structure.

Strong interactions are noticed first between the highest node 89 = (86, 88) of H_I and the two nodes 11 = (10, Literature) (p = 0.000) and 10 = (9, Peace) (p = 0.000). They are explained by the fact that the countries of group 86 gather most of the scientific awards. The next node, 88 = (84, 87), also shows strong interactions with 11 = (10, Literature) (p = 0.000) and 10 = (9, Peace) (p = 0.000). This is explained by the opposition between Literature awards, more present in group 87, and Peace awards, more concentrated in group 84. As regards node 87 = (85, 68), a significant interaction with 11 = (10, Literature) (p = 0.000) is noticed: in fact, the countries in group 68 received only Literature awards.

For node 86 = (82, 81), significant interactions are found with both nodes 9 = (Medicine, 8) (p = 0.003) and 8 = (7, Chemistry) (p = 0.001). More precisely, Medicine awards are more present in the countries of group 82 than in those of group 81, where the other sciences taken together are more present. Among these other sciences, the proportion of Chemistry awards is higher in group 81. This can explain the rather surprising allocation of the USA, by far the most important winner, to a group where all the other countries have very few awards (we are dealing with profiles). In fact, the USA proportion of Chemistry awards is lower than in group 81, in particular compared to Medicine and Physics. Node 11 = (10, Literature) (p = 0.000) in turn explains the differences between the components of group 85 = (78, 83), since in group 83 the presence of scientific awards is very small. Finally, node 84 = (71, 74) (p = 0.000) is highly influenced by node 10 = (9, Peace), since the countries of group 67 were given nearly only Peace awards.
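The p-values quoted in this section come from exact conditional tests, that is, tests whose reference distribution is obtained by holding the table margins fixed. As a minimal sketch of that idea only, and not of the interaction index used in this paper, the following Python code computes a two-sided exact p-value for a 2 x 2 table, such as one might obtain by collapsing the contingency table onto the two children of a row node and the two children of a column node. The function names and the counts are hypothetical and are not taken from the Nobel data.

```python
from math import comb


def hypergeom_pmf(a, row1, col1, n):
    """Probability that the top-left cell equals a, given fixed margins.

    row1 and col1 are the first row and first column totals, n the grand total.
    """
    return comb(col1, a) * comb(n - col1, row1 - a) / comb(n, row1)


def exact_conditional_p(table):
    """Two-sided exact conditional p-value for a 2 x 2 table [[a, b], [c, d]].

    With the margins fixed, the p-value is the total probability of all
    tables that are no more probable than the observed one.
    """
    (a, b), (c, d) = table
    row1, col1, n = a + b, a + c, a + b + c + d
    lo = max(0, row1 + col1 - n)   # smallest admissible top-left count
    hi = min(row1, col1)           # largest admissible top-left count
    p_obs = hypergeom_pmf(a, row1, col1, n)
    probs = [hypergeom_pmf(x, row1, col1, n) for x in range(lo, hi + 1)]
    # Small relative tolerance so that ties with the observed table are kept.
    return sum(p for p in probs if p <= p_obs * (1 + 1e-9))


# Hypothetical collapsed counts (assumption, for illustration only):
# rows = two child groups of a row node, columns = two children of a column node.
collapsed = [[1, 14], [12, 3]]
print(f"exact conditional p-value: {exact_conditional_p(collapsed):.6f}")
```

With fixed margins the top-left cell follows a hypergeometric law, so a single count determines the whole 2 x 2 table. For larger r x c tables, the network algorithm of Mehta and Patel (1983), listed in the references, addresses the same kind of computation.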

7. Conclusions

Considering a contingency table and two hierarchies, built in any way on its rows and columns, indexes were introduced to quantify the influence, and thus the explanatory ability, of the nodes of either hierarchy on both the nodes and the groups of the other. The proposed exact conditional tests then provide the user with tools for identifying the significant relations. The method finds its primary use in the exploratory framework and can be completed by a subsequent application of the Lebart et al. (1995) indexes, limited to the most interesting relations. In this way, the information given by our approach can be fine-tuned through the identification of the typical characters of a group, especially in opposition to the others. The relations of our method with analysis of concentration (Feoli and Orlóci, 1979), non-symmetrical correspondence analysis (Lauro and D’Ambra, 1984) and log-linear models (Van Der Heijden et al., 1989) may be further investigated, as well as generalizations to multiway contingency tables (Denimal, in press).

Acknowledgments

We gratefully thank the anonymous referee for his fruitful advice, which greatly contributed to improving our work. This paper was developed in the frame of C.N.R. grants 97.05087.CT12, 97.03844.CT15, 98.00337.CT12, and 99.01534.CT10 and of the Faculty of Architecture of University of Roma grant for “Tecniche d’Analisi Esplorative di Dati Socioeconomici Territoriali”. The first author was also supported as a visiting Professor at Rome University La Sapienza during September 2000.

REFERENCES

Agresti, A. (1990) Categorical Data Analysis, John Wiley and Sons, New York.
Agresti, A. (1992) A Survey of Exact Inference for Contingency Tables, Statistical Science, 7, n. 1, 131-177.
Altham, P.M.E. (1970a) The Measurement of Association of Rows and Columns for an r x s Contingency Table, Journal of the Royal Statistical Society, Series B, 32, 63-73.
Altham, P.M.E. (1970b) The Measurement of Association in a Contingency Table: three Extensions of the Cross-ratios and Metric Methods, Journal of the Royal Statistical Society, Series B, 32, 395-407.
Benzécri, J.P. et al. (1982) L’analyse des Données, Vol. 2, Dunod, Paris.
Benzécri, J.P., Lebeaux, M.O., and Jambu, M. (1980) Aides à l’interprétation en classification automatique, Cahiers de l’Analyse des Données, 5, n. 1, 101-123.
Bertin, J. (1977) La graphique et le traitement graphique de l’information, Flammarion, Paris.
Camiz, S. (1994) A Procedure for Structuring Vegetation Tables, Abstracta Botanica, 18, n. 2, 57-70.
Camiz, S. and Denimal, J.J. (1998a) Interpretation of a Cross Classification: a New Method and an Application, in: Rizzi, A., Vichi, M., and Bock, H.H. (eds.) Advances in Data Science and Classification (Proceedings of the Meeting IFCS 98, Roma), Springer, Berlin, Studies in Classification, Data Analysis and Knowledge Organization, 555-560.
Camiz, S. and Denimal, J.J. (1998b) A New Method for Cross-Classification Analysis of Contingency Data Tables, in: Payne, R. and Green, P. (eds.) Compstat 98 Proceedings in Computational Statistics, Physica-Verlag, Heidelberg, 209-214.
Cazes, P. (1984) Correspondances hiérarchiques et ensembles associés, Les Cahiers du Bureau Universitaire de Recherche Opérationnelle, 43-44, 43-142.
Daniels, H.E. (1944) The Relation between Measures of Correlation in the Universe of Sample Permutations, Biometrika, 33, n. 2, 129-135.
Denimal, J.J. (1997) Aides à l’interprétation mutuelle de deux hiérarchies construites sur les lignes et les colonnes d’un tableau de contingence, Revue de Statistique Appliquée, 45, n. 4, 93-110.
Denimal, J.J. (in press) Mutual Interpretative Aids for Hierarchical Clusterings Built on a Multiway Contingency Table: Definitions and Properties, Publications internes IRMA, Université des Sciences et Technologies de Lille.
Diday, E. (1971) La méthode des nuées dynamiques, Revue de Statistique Appliquée, 19, n. 2, 19-34.
Feoli, E. and Orlóci, L. (1979) Analysis of Concentration and Detection of Underlying Factors in Structured Tables, Vegetatio, 40, 49-54.
Fowlkes, E.B. and Mallows, C.L. (1983) A Method for Comparing two Hierarchical Clusterings, Journal of the American Statistical Association, 78, 553-569.
Goodman, L.A. (1986) Some Useful Extensions of the usual Correspondence Analysis Approach and the Log-Linear Models Approach in the Analysis of Contingency Tables, International Statistical Review, 54, n. 3, 243-309.
Goodman, L.A. and Kruskal, W.H. (1954) Measures of Association for Cross Classifications, Journal of the American Statistical Association, 49, 732-764.
Gordon, A.D. (1999) Classification, Chapman and Hall, London.
Govaert, G. (1984) Classification simultanée de tableaux binaires, in: Diday, E. et al. (eds.), Data Analysis and Informatics, North Holland, Amsterdam, 4, 223-236.
Hartigan, J.A. (1975) Clustering Algorithms, J. Wiley and Sons, New York.
Hirotsu, C. (1983) Defining the Pattern of Association in Two-Way Contingency Tables, Biometrika, 70, 579-589.
Hubert, L. (1985) Combinatorial Data Analysis: Association and Partial Association, Psychometrika, 50, n. 4, 449-467.
Hubert, L. and Arabie, P. (1985) Comparing Partitions, Journal of Classification, 2, 193-218.
Hubert, L. and Arabie, P. (1992) Correspondence Analysis and Optimal Structural Representations, Psychometrika, 56, 119-140.
Lauro, N. and D’Ambra, L. (1984) L’analyse non symétrique des correspondances, in: Diday, E. et al. (eds.), Data Analysis and Informatics, North Holland, Amsterdam, 3, 433-446.
Lebart, L., Morineau, A., and Piron, M. (1995) Statistique exploratoire multidimensionnelle, Dunod, Paris.
Mehta, C.R. and Patel, N.R. (1983) A Network Algorithm for Performing Fisher’s Exact Test in r x c Contingency Tables, Journal of the American Statistical Association, 78, n. 382, 427-434.
MacQueen, J. (1967) Some Methods for Classification and Analysis of Multivariate Observations, Proceedings of V Berkeley Symposium 1965, 281-297.
Milligan, G.W. and Cooper, M.C. (1985) An Examination of Procedures for Determining the Number of Clusters in a Data Set, Psychometrika, 50, 159-179.
Pitman, E.J.G. (1937) Significance Tests that May be Applied to Samples from any Populations. II: the Correlation Coefficient Test, Journal of the Royal Statistical Society, Series B, 4, 225-232.
Plackett, R.L. (1981) The Analysis of Categorical Data, second edition, Griffin, London.
Podani, J. and Feoli, E. (1991) A General Strategy for the Simultaneous Classification of Variables and Objects in Ecological Data Tables, Journal of Vegetation Science, 2, 435-444.
Rand, W.M. (1971) Objective Criteria for Evaluation of Clustering Methods, Journal of the American Statistical Association, 66, 846-850.
Van Der Heijden, G.M., De Falguerolles, A., and De Leeuw, J. (1989) A Combined Approach to Contingency Table Analysis using Correspondence Analysis and Log-Linear Analysis, Applied Statistics, 38, n. 2, 249-292.
Wallace, D.L. (1983) Comment to Fowlkes and Mallows’ A Method for Comparing two Hierarchical Clusterings, Journal of the American Statistical Association, 78, 569-576.
Ward, J.H. (1963) Hierarchical Grouping to Optimize an Objective Function, Journal of the American Statistical Association, 58, 236-244.
Weiss, M.C. (1978) Décomposition hiérarchique du chi-deux associée à un tableau de contingence à plusieurs entrées, Revue de Statistique Appliquée, 26, n. 1, 23-33.

Exact conditional tests for a reciprocal interpretation of hierarchical classifications built on a two-way contingency table

Summary

Given a two-way contingency table and two hierarchical classifications, built on its rows and columns respectively, indexes are proposed to identify suitable cut-points and the mutual influences between the nodes of either hierarchy and both the nodes and the groups of the other. The distribution of the indexes is obtained and exact conditional tests are given to check the significance of the values found. An application to Nobel Awards is given as an example.

Test condizionali esatti per l’interpretazione reciproca di classificazioni gerarchiche d’una tabella di contingenza a doppia entrata

Riassunto

Data una tabella di contingenza a doppia entrata sulla quale son costruite due classificazioni gerarchiche, si propongono indici capaci d’identificare i punti di taglio dei dendrogrammi e le influenze reciproche fra i nodi d’una gerarchia e nodi e gruppi dell’altra. La conoscenza della distribuzione degli indici permette d’aggiungere test condizionali esatti per stabilire la significatività dei valori trovati. Come esempio è inclusa un’applicazione ai Premi Nobel.

Key words

Contingency tables; Hierarchical clustering; Hypergeometric law; Exact conditional tests; Chi-square metric.

[Manuscript received January 2000; final version received October 2001.]