Cluster Inference by Using Transitivity Indices in Empirical Graphs
Ove Frank; Frank Harary
Journal of the American Statistical Association, Vol. 77, No. 380 (Dec., 1982), pp. 835-840.
Stable URL: http://links.jstor.org/sici?sici=0162-1459%28198212%2977%3A380%3C835%3ACIBUTI%3E2.0.CO%3B2-E
Journal of the American Statistical Association is currently published by American Statistical Association.


Cluster Inference by Using Transitivity Indices in Empirical Graphs

OVE FRANK and FRANK HARARY*

A random graph model is introduced for similarities observed between the objects sampled from an unknown cluster structure. We investigate this model and show how some common transitivity indices in empirical graphs can be used for making statistical inferences about cluster structures.

KEY WORDS: Random graph; Cluster; Transitivity; Statistical estimation.

1. INTRODUCTION

Empirical graphs provide data on binary relationships between the elements in a set. The set might consist of different varieties of a food product that are pairwise compared in order to decide which one has a better taste. Paired-comparison experiments like this one yield asymmetric relationships that can be represented by directed graphs. In a sociometric choice investigation the elements might be the children in a school class who are asked to name their best friends in the class. Friendship can be represented by a directed graph and mutual friendship by an undirected graph. In cluster analysis it is desired to partition a set of objects into subsets of similar objects (clusters) by using information about pairwise similarities or dissimilarities between the objects. A book on cluster analysis with many applications is Hartigan (1975). We also refer to Hubert (1974) for a review of the use of graph concepts in cluster analysis.

Empirical graphs, like all kinds of empirical data, are subject to observation errors, sampling variation, and other kinds of uncertainty, and they can be described and analyzed by using random graph models. Random graphs generated from an underlying unknown similarity relation are of importance for making statistical inference in cluster analysis. Various aspects of this were discussed by Ling (1973), Baker (1974), Ling and Killough (1976), Hartigan (1977), Hubert and Baker (1977), and Frank (1981).

If no measurement errors or other observational uncertainties were present in the data, the clusters could be represented by the components of a transitive graph. (Transitive graphs and other graph concepts needed will be defined in Section 2.) When uncertainty is present, the cluster analysis can be based on a random model of the kind of data available. Frank (1978a, 1979) proposed and investigated such a model for similarity measurements with random errors within and between the clusters. Sampling variation in transitive graphs was considered by Frank (1978b). In Section 3 we introduce a random graph model that combines sampling variation with measurement errors.

In order to test a specified random graph model by using a transitivity index of an empirical graph, it is important to know when the index value obtained is significantly different from that expected under the random model. Simple random graph models were used by Frank (1980) for investigating the properties of various transitivity indices when no other transitivity is present but that caused by pure randomness. Our present purpose is to investigate some common transitivity indices when transitivity is present because of a specified clustering model. Section 3 specifies such a model, Section 4 gives two examples, and Section 5 gives the main results. This model is a generalization of the pure randomness model, and some results given earlier by Frank (1978a, 1980) are derived as special cases.

Section 6 discusses briefly various statistical inference problems for the model and shows how the transitivity indices can be used for estimation and testing. Several estimators and tests are proposed that seem to be practically convenient but unfortunately not very easy to evaluate by available theoretical methods. Computer simulation experiments and further theoretical investigations are needed to evaluate and improve the methods. Therefore, Section 6 should be considered mainly as an expository demonstration of possibilities for drawing statistical inference in clustering models, and we have no intention of pursuing any of the inference problems in detail here. This will be the subject of further research in the field of statistical graph theory.

* Ove Frank is Professor, Statistics Department, University of Lund, S-22007 Lund, Sweden, and Frank Harary is Professor, Mathematics Department, University of Michigan, Ann Arbor, MI 48109.

© Journal of the American Statistical Association, December 1982, Volume 77, Number 380, Theory and Methods Section

2. TRANSITIVITY INDICES

In general, we follow the graph-theoretic terminology of Harary (1969). Thus a graph is a finite irreflexive symmetric relation. The order of a graph G is the number of vertices (points), and its size is the number of edges (lines). In a complete graph every pair of vertices is joined by an edge. In an induced subgraph H of G, every edge of G joining two vertices of H is present.


Transitivity is one of the most fundamental properties of binary relations. A relation is transitive if, whenever (u, v) and (v, w) are two related pairs of elements, then also (u, w) is a related pair. Transitivity is essential for several different structural hypotheses, and various indices have been proposed for measuring transitivity; see, for instance, Harary and Kommel (1979), Holland and Leinhardt (1971, 1975), and Frank (1980). For graphs, transitivity simply means that all connected components are complete, and transitive graphs are idealized similarity relations (Harary 1964).

Let G be a graph on vertex set $V = \{1, \ldots, n\}$. The order of G is n, and the size of G is denoted by $|G| = R$. The number of induced subgraphs of G of order 2 and size r is denoted by $D_r$ for r = 0, 1. Induced subgraphs of order 2 are called dyads of G, and $D_0$, $D_1$ are called the dyad counts of G. Obviously $D_1 = R$ and $D_0 + D_1 = \binom{n}{2}$. The number of induced subgraphs of G of order 3 and size r is denoted by $T_r$ for r = 0, 1, 2, 3. Induced subgraphs of order 3 are called triads of G, and $T_0, T_1, T_2, T_3$ are called the triad counts of G. In particular, $T_2 = 0$ for transitive graphs. The total number of triads is equal to

$$T_0 + T_1 + T_2 + T_3 = \binom{n}{3}. \tag{1}$$

Since each edge in G is contained in exactly n - 2 triads, it follows that

$$T_1 + 2T_2 + 3T_3 = (n - 2)R. \tag{2}$$

Various transitivity indices can be determined from the triad counts. We consider the following four different transitivity indices in graphs of order $n \ge 3$: the proportion of transitive triads is

$$\tau = (T_0 + T_1 + T_3) \Big/ \binom{n}{3} = 1 - T_2 \Big/ \binom{n}{3}, \tag{3}$$

the proportion of 3-cycles among the triads is

$$\tau' = T_3 \Big/ \binom{n}{3}, \tag{4}$$

the proportion of 3-cycles among the connected triads is

$$\tau'' = T_3/(T_2 + T_3), \quad \text{if } T_2 + T_3 > 0, \tag{5}$$

and the proportion of 3-cycles induced by the 2-paths is

$$\tau''' = 3T_3/(T_2 + 3T_3), \quad \text{if } T_2 + T_3 > 0. \tag{6}$$

Obviously a graph is transitive iff $\tau = 1$. A connected graph is transitive iff $\tau' = 1$. A graph with at least one component of order 3 or more is transitive iff $\tau'' = 1$ (which occurs iff $\tau''' = 1$).

If $T_2 + T_3 > 0$, then

$$0 \le \tau' \le \tau'' \le \tau''' \le 1 \tag{7}$$

with $\tau' = 0$ iff $T_3 = 0$, $\tau' = \tau''$ iff $T_3 = 0$ or $T_2 + T_3 = \binom{n}{3}$, $\tau'' = \tau'''$ iff $T_2 = 0$ or $T_3 = 0$, and $\tau''' = 1$ iff $T_2 = 0$. Moreover, for $T_2 + T_3 > 0$ we have

$$0 \le \tau' \le \tau'' \le \tau \le 1 \tag{8}$$

with $\tau'' = \tau$ iff $T_2 = 0$ or $T_2 + T_3 = \binom{n}{3}$, $\tau'' \le \tau \le \tau'''$ iff $T_2 = 0$ or $T_2 + 3T_3 \ge \binom{n}{3}$, and $\tau''' \le \tau \le 1$ iff $T_2 = 0$ or $T_2 + 3T_3 \le \binom{n}{3}$. If $T_2 + T_3 = 0$, then $\tau = 1$, $\tau' = 0$, and $\tau''$ and $\tau'''$ are conveniently defined as 1, which is consistent with (7) and (8).

The minimum value of $\tau$ is 0 only if n = 3 or n = 4. For general $n \ge 3$, the minimum value of $\tau$ can be obtained from Frank and Harary (1980, Theorem 1). They give the maximum value of $T_2$, which implies the following results.

Theorem 1. The proportion of transitive triads $\tau$ in any graph of odd order n = 2m - 1 or even order n = 2m satisfies

$$\tau \ge \frac{1}{4} - \frac{3}{4(2m - 1)} \tag{9}$$

for all $m \ge 2$. The lower bound on $\tau$ is obtained for a complete bipartite graph with parts of order m and n - m.

We note that if $\tau < 1$ or $\tau' > 0$, then $\tau''$ and $\tau'''$ can be determined from $\tau$ and $\tau'$ according to

$$\tau'' = \tau'/(1 - \tau + \tau'), \qquad \tau''' = 3\tau'/(1 - \tau + 3\tau').$$

3. A CLUSTERING MODEL

In order to define a random graph G on a vertex set $V = \{1, \ldots, n\}$, we first introduce n independent identically distributed random variables $X_1, \ldots, X_n$ on a finite set $\{1, \ldots, c\}$. The elements of this set are called colors. Color i has a positive probability $p_i$ and $p_1 + \cdots + p_c = 1$. Two distinct vertices u and v have an edge in G with probability $P_{ij}$ if $X_u = i$ and $X_v = j$. Here $P_{ij}$ depends on u and v through i and j only, and $P_{ij} = P_{ji} \ge 0$ for all $i, j \in \{1, \ldots, c\}$. Conditional on the vertex colors, all edges are stochastically independent.

We can think of the colors as unknown clusters of relative sizes $p_1, \ldots, p_c$ in some population. Also, the number c is generally unknown and should be estimated from data; we comment on that in Section 6. The vertices are the objects to be clustered. These objects are selected by simple random sampling with replacement from the population. The graph G represents the uncertain similarity measurements of all unordered pairs of sampled objects. The probability $P_{ij}$ is the probability of observing a similarity between two objects from clusters i and j, that is, the probability of observing a true similarity if i = j, and a false similarity if $i \ne j$. Thus, $1 - P_{ii}$ is the error probability between objects from the same cluster i, while for $i \ne j$, $P_{ij}$ is the error probability between objects from distinct clusters i and j.

It may be noted that many objective functions for clustering that have been suggested in the literature can be defined by using $p_i$ and $P_{ij}$. In fact, in a finite population of N objects, let $p_i$ be the proportion of objects in cluster i, and let $P_{ij}$ be the proportion of objects in cluster i that are classified as belonging to cluster j. Then several measures of association for cross-classifications based on the frequencies $N_{ij} = N p_i P_{ij}$ appear somewhat similar to the quantities encountered in the formulas (12), (18), and (22)-(27). Goodman and Kruskal (1979) give a recent compilation of four famous papers on association measures. Goodness-of-fit measures for clusterings are discussed by Hubert (1972, 1973, 1974), Hartigan (1975), and others.

The general clustering model involves c(c + 3)/2 free parameters, namely the positive integer c, the positive probabilities $p_1, \ldots, p_c$ constrained by $p_1 + \cdots + p_c = 1$, and the nonnegative probabilities $P_{ij}$ for $i \le j$ with $i, j \in \{1, \ldots, c\}$. If $P_{ii} = 1 - \alpha$ and $P_{ij} = \beta$ for $i \ne j$, then there are c + 2 parameters. Without restriction we can assume that the clusters have been labeled so that the relative cluster sizes are nonincreasing: $p_1 \ge \cdots \ge p_c$. The number of parameters can be further reduced by assuming a truncated geometric distribution $p_i = ab^i$ for i = 1, ..., c or a truncated Poisson distribution $p_i = ab^i/i!$ for i = 1, ..., c. If we assume a discrete uniform distribution $p_i = 1/c$ for i = 1, ..., c, then a simple model involves only three parameters c, $\alpha$, $\beta$. If we allow $c = \infty$, then the geometric and Poisson distributions yield other simple models with three parameters.
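As an illustration of Sections 2 and 3, the following Python sketch samples a graph from the clustering model and computes its triad counts and the four transitivity indices. The function names (`sample_cluster_graph`, `triad_counts`, `transitivity_indices`) and the representation of a graph as a set of ordered pairs are assumptions made for this example only; they do not come from the paper.

```python
import random
from itertools import combinations
from math import comb


def sample_cluster_graph(n, p, P, rng):
    """Draw vertex colors i.i.d. from p, then edges independently with probability P[i][j]."""
    colors = rng.choices(range(len(p)), weights=p, k=n)
    edges = {(u, v) for u, v in combinations(range(n), 2)
             if rng.random() < P[colors[u]][colors[v]]}
    return colors, edges


def triad_counts(n, edges):
    """Return [T0, T1, T2, T3]: numbers of induced 3-vertex subgraphs with 0..3 edges."""
    T = [0, 0, 0, 0]
    for u, v, w in combinations(range(n), 3):
        size = ((u, v) in edges) + ((u, w) in edges) + ((v, w) in edges)
        T[size] += 1
    return T


def transitivity_indices(n, edges):
    """Return (tau, tau', tau'', tau''') as defined in (3)-(6)."""
    T = triad_counts(n, edges)
    total = comb(n, 3)
    tau = 1 - T[2] / total
    tau1 = T[3] / total
    if T[2] + T[3] > 0:
        tau2 = T[3] / (T[2] + T[3])
        tau3 = 3 * T[3] / (T[2] + 3 * T[3])
    else:
        # Convention stated in Section 2 when there are no connected triads.
        tau2 = tau3 = 1.0
    return tau, tau1, tau2, tau3


if __name__ == "__main__":
    rng = random.Random(0)
    p = [0.5, 0.3, 0.2]                       # relative cluster sizes
    P = [[0.9, 0.1, 0.1],
         [0.1, 0.9, 0.1],
         [0.1, 0.1, 0.9]]                     # alpha = beta = 0.1
    colors, edges = sample_cluster_graph(20, p, P, rng)
    print(triad_counts(20, edges), transitivity_indices(20, edges))
```

Enumerating all 3-subsets costs on the order of n^3 operations, which is adequate for small empirical graphs; identity (2) gives a cheap consistency check on the counts.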

4. TWO EXAMPLES

Let us comment briefly on the usefulness of some of these models for two specific applications in order to provide some substantial motivation for considering them.

In applied linguistics it might be meaningful to investigate an individual's vocabulary by examining his or her speech or writings. Without going into details, we assume that we can decide when two inflected forms shall be considered as identical words, and that we can treat names, numerals, homonyms, and compound words properly. The assumption that n sampled words shall be independent can be made approximately correct by choosing, for instance, each fifth word in a text. The probabilities $p_1 \ge p_2 \ge \cdots \ge p_c$ are the nonincreasing frequencies of the words in the vocabulary. A common assumption is that the ith most frequent words have frequencies obeying Zipf's law ($p_i = a/i^b$ for i = 1, ..., c and $b \ge 1$; see, e.g., McNeil 1973) or some other parameterized model. It is also common to characterize word frequency distributions by their entropy, $-\sum p_i \log p_i$, or by their first few moments $s_2 = \sum p_i^2$, $s_3 = \sum p_i^3$, .... Assume that each sampled word is successively compared to each word sampled earlier and is claimed to be similar or not. The misclassifications can be due to coding errors, punching errors, or any other kind of error occurring in a computerized processing of a large number of words. A simple error model is the $\alpha$, $\beta$ model, and a more elaborate model might, for instance, admit $P_{ij}$ depending on $p_i$ and $p_j$. In an actual application it might also be possible to specify error mechanisms involving stochastic dependence. It should be clear here that $n \to \infty$ and $c \to \infty$ are quite natural assumptions. The statistical inference about the vocabulary that should be of interest is typically the estimation of such vocabulary parameters as the moments $s_2$, $s_3$ or the number of words having frequencies above a specified value.

In ecology, c can be the number of species or taxonomic groups in an animal or plant population, and $p_1, \ldots, p_c$ their relative abundances. By making paired comparisons between n sampled units (animals or plants) we obtain uncertain information about which ones are similar (belong to the same taxonomic group). The $\alpha$, $\beta$ error model can be tried as a first approximation to an actual error mechanism, or it might be possible to specify a better model by using some knowledge about the factors causing the errors.

We should point out that even in the case of no errors ($\alpha = \beta = 0$ and a transitive observed graph G) it is generally very complicated to find the exact probability distribution of G and apply maximum likelihood estimation methods. The probability distribution of the numbers of components of G of various orders was given by Engen (1978, p. 18) for the case with no errors, but this distribution is not of much help in the search for inference methods because of its intractable form.
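The vocabulary parameters mentioned in the linguistic example are simple functions of the frequencies. The Python sketch below computes the entropy and the power sums for a truncated Zipf distribution; the helper names (`zipf_probabilities`, `power_sum`, `entropy`) and the parameter values are hypothetical and chosen only for illustration.

```python
from math import log


def zipf_probabilities(c, b):
    """Truncated Zipf law: p_i proportional to 1 / i**b for i = 1, ..., c."""
    weights = [i ** -b for i in range(1, c + 1)]
    total = sum(weights)          # 1/a in the notation p_i = a / i**b
    return [w / total for w in weights]


def power_sum(p, r):
    """s_r = sum of p_i**r."""
    return sum(x ** r for x in p)


def entropy(p):
    """Entropy -sum p_i log p_i (natural logarithm)."""
    return -sum(x * log(x) for x in p if x > 0)


if __name__ == "__main__":
    p = zipf_probabilities(1000, 1.0)   # assumed vocabulary size and exponent
    print(power_sum(p, 2), power_sum(p, 3), entropy(p))
```

For a uniform distribution (b = 0) the functions reduce to s_2 = 1/c, s_3 = 1/c^2 and entropy log c, which gives a quick sanity check.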

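5. SOME PROPERTIES OF THE RANDOM GRAPH G

In order to discuss statistical inference for the clustering model we need the following results for the random graph G defined in Section 3.

Theorem 2. The size R of the random graph G has expected value

$$ER = \binom{n}{2} \sum_{i=1}^{c} \sum_{j=1}^{c} p_i p_j P_{ij}$$

and variance

$$\operatorname{var} R = \binom{n}{2} \sum_{i=1}^{c} \sum_{j=1}^{c} p_i p_j P_{ij} \Bigl(1 - \sum_{i=1}^{c} \sum_{j=1}^{c} p_i p_j P_{ij}\Bigr) + 2(n - 2)\binom{n}{2} \Bigl[\sum_{i=1}^{c} p_i \Bigl(\sum_{j=1}^{c} p_j P_{ij}\Bigr)^2 - \Bigl(\sum_{i=1}^{c} \sum_{j=1}^{c} p_i p_j P_{ij}\Bigr)^2\Bigr]. \tag{12}$$

Proof. Let $G_{uv}$ denote the subgraph of G induced by a 2-subset {u, v} of vertices. Thus $G_{uv}$ is a dyad, that is, an induced subgraph of G of order 2 and of size 0 or 1. We can represent R as the sum of $|G_{uv}| = R_{uv}$ over all 2-subsets of vertices. Conditional on the vertex colors $X_u = i$ and $X_v = j$, the adjacency indicators $R_{uv}$ are independent Bernoulli variables with expectation $P_{ij}$ for all 2-subsets of vertices {u, v}. Hence the expression for ER follows. In order to obtain the variance we need the covariance $\operatorname{cov}(R_{st}, R_{uv})$ for vertices s < t and u < v. Now, this covariance is equal to

$$E \operatorname{cov}(R_{st}, R_{uv} \mid X_s, X_t, X_u, X_v) + \operatorname{cov}[E(R_{st} \mid X_s, X_t), E(R_{uv} \mid X_u, X_v)], \tag{13}$$

where the first term is 0 unless s = u, t = v, in which case it is equal to

$$\sum_{i=1}^{c} \sum_{j=1}^{c} p_i p_j P_{ij} (1 - P_{ij}), \tag{14}$$

and the second term is equal to

$$\begin{cases} \sum_i \sum_j p_i p_j P_{ij}^2 - \bigl(\sum_i \sum_j p_i p_j P_{ij}\bigr)^2 & \text{for } s = u,\ t = v, \\ 0 & \text{for } \{s, t\} \cap \{u, v\} = \emptyset, \\ \sum_i p_i \bigl(\sum_j p_j P_{ij}\bigr)^2 - \bigl(\sum_i \sum_j p_i p_j P_{ij}\bigr)^2 & \text{otherwise.} \end{cases} \tag{15}$$

By noting that the $\binom{n}{2}^2$ possibilities for choosing s < t and u < v divide into $\binom{n}{2}$, $\binom{n-2}{2}\binom{n}{2}$, and $2(n - 2)\binom{n}{2}$ possibilities, respectively, for the three cases in (15), the formula for var R follows.

The following corollary is readily obtained by specifying $P_{ij} = (1 - \alpha)\delta_{ij} + \beta(1 - \delta_{ij})$, where $\delta_{ij}$ is 1 for i = j and 0 otherwise. This choice of $P_{ij}$ means that the error probabilities are equal to $\alpha$ within the clusters and equal to $\beta$ between all pairs of distinct clusters. For convenience we put $\gamma = 1 - \alpha - \beta$.

Corollary 3. If $P_{ij} = \beta + \gamma\delta_{ij}$, then

$$ER = \binom{n}{2} (\beta + \gamma s_2) \tag{16}$$

and

$$\operatorname{var} R = \binom{n}{2} (\beta + \gamma s_2) + 6\binom{n}{3} (\beta^2 + 2\beta\gamma s_2 + \gamma^2 s_3) - (2n - 3)\binom{n}{2} (\beta + \gamma s_2)^2, \tag{17}$$

where $s_r = \sum_{i=1}^{c} p_i^r$ for r = 2, 3.

Theorem 4. The triad counts $T_0, T_1, T_2, T_3$ of the random graph G have the expected values $ET_i = \binom{n}{3} a_i$ and the covariances

$$\operatorname{cov}(T_i, T_j) = \binom{n}{3} a_i (\delta_{ij} - a_j) + 12\binom{n}{4} (b_{ij} - a_i a_j) + 30\binom{n}{5} (c_{ij} - a_i a_j), \tag{18}$$

where $a_i$ is the probability that a triad of G has size i, $b_{ij}$ is the probability that two triads of G having two common vertices have sizes i and j, and $c_{ij}$ is the probability that two triads of G having one common vertex have sizes i and j, for $i, j \in \{0, 1, 2, 3\}$.

Proof. Let $G_{uvw}$ denote the subgraph of G induced by a 3-subset {u, v, w} of vertices. We can represent $T_i$ as the number of 3-subsets of vertices such that $|G_{uvw}| = i$. It follows that $ET_i = \binom{n}{3} a_i$, where

$$a_i = P(|G_{uvw}| = i). \tag{19}$$

The covariance is equal to

$$\operatorname{cov}(T_i, T_j) = \sum \sum \bigl[P(|G_{rst}| = i, |G_{uvw}| = j) - a_i a_j\bigr], \tag{20}$$

where the sum is over 3-subsets {r, s, t} and {u, v, w} of vertices. We have to separate the cases where there are 0, 1, 2 or 3 common vertices in the two 3-subsets, and then we obtain

$$P(|G_{rst}| = i, |G_{uvw}| = j) = \begin{cases} a_i a_j & \text{for } k = 0, \\ c_{ij} & \text{for } k = 1, \\ b_{ij} & \text{for } k = 2, \\ a_i \delta_{ij} & \text{for } k = 3, \end{cases} \tag{21}$$

where k is the number of common vertices in {r, s, t} and {u, v, w}. By an easy combinatorial argument, it can be verified that these four cases correspond to $\binom{n-3}{3}\binom{n}{3}$, $3\binom{n-3}{2}\binom{n}{3}$, $3(n - 3)\binom{n}{3}$, and $\binom{n}{3}$ choices of two 3-subsets, respectively. Therefore, substitution of (21) into (20) leads after simplification to (18).

We note that the probabilities $a_i$, $b_{ij}$, and $c_{ij}$ can be determined by combinatorial arguments; for instance,

$$a_2 = 3 \sum_{i=1}^{c} \sum_{j=1}^{c} \sum_{k=1}^{c} p_i p_j p_k P_{ij} P_{ik} (1 - P_{jk}), \tag{22}$$

$$a_3 = \sum_{i=1}^{c} \sum_{j=1}^{c} \sum_{k=1}^{c} p_i p_j p_k P_{ij} P_{ik} P_{jk}, \tag{23}$$

$$b_{22} = \sum_{i=1}^{c} \sum_{j=1}^{c} p_i p_j \Bigl\{ P_{ij} \Bigl[\sum_{k=1}^{c} p_k (P_{ik} + P_{jk} - 2P_{ik}P_{jk})\Bigr]^2 + (1 - P_{ij}) \Bigl(\sum_{k=1}^{c} p_k P_{ik} P_{jk}\Bigr)^2 \Bigr\}, \tag{24}$$

$$b_{33} = \sum_{i=1}^{c} \sum_{j=1}^{c} p_i p_j P_{ij} \Bigl(\sum_{k=1}^{c} p_k P_{ik} P_{jk}\Bigr)^2, \tag{25}$$

$$c_{22} = \sum_{i=1}^{c} p_i \Bigl\{ \sum_{j=1}^{c} \sum_{k=1}^{c} p_j p_k \bigl[P_{ij} P_{ik} (1 - P_{jk}) + 2 P_{ij} P_{jk} (1 - P_{ik})\bigr] \Bigr\}^2, \tag{26}$$

and

$$c_{33} = \sum_{i=1}^{c} p_i \Bigl(\sum_{j=1}^{c} \sum_{k=1}^{c} p_j p_k P_{ij} P_{ik} P_{jk}\Bigr)^2. \tag{27}$$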
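For very small n, the moments in Theorems 2 and 4 can be checked by brute-force enumeration of every coloring and every edge pattern. The Python sketch below does this; the function names are hypothetical, and the variance expression coded in `theorem2_moments` is the one that follows from the covariance decomposition (13)-(15), so it should be read as a check under that assumption rather than as the authors' own computation.

```python
from itertools import combinations, product
from math import comb


def theorem2_moments(n, p, P):
    """ER and var R from the covariance decomposition in the proof of Theorem 2."""
    c = range(len(p))
    a = sum(p[i] * p[j] * P[i][j] for i in c for j in c)
    h = sum(p[i] * sum(p[j] * P[i][j] for j in c) ** 2 for i in c)
    mean = comb(n, 2) * a
    var = comb(n, 2) * a * (1 - a) + 2 * (n - 2) * comb(n, 2) * (h - a * a)
    return mean, var


def triad_probabilities(p, P):
    """a_2 and a_3 as in (22) and (23)."""
    c = range(len(p))
    a2 = 3 * sum(p[i] * p[j] * p[k] * P[i][j] * P[i][k] * (1 - P[j][k])
                 for i in c for j in c for k in c)
    a3 = sum(p[i] * p[j] * p[k] * P[i][j] * P[i][k] * P[j][k]
             for i in c for j in c for k in c)
    return a2, a3


def exact_moments(n, p, P):
    """Enumerate all colorings and edge patterns; return ER, var R, ET_2, ET_3."""
    pairs = list(combinations(range(n), 2))
    index = {pair: t for t, pair in enumerate(pairs)}
    triples = list(combinations(range(n), 3))
    m1 = m2 = t2 = t3 = 0.0
    for colors in product(range(len(p)), repeat=n):
        w_col = 1.0
        for x in colors:
            w_col *= p[x]                      # probability of this coloring
        for edges in product((0, 1), repeat=len(pairs)):
            w = w_col
            for (u, v), e in zip(pairs, edges):
                q = P[colors[u]][colors[v]]
                w *= q if e else 1 - q         # probability of this edge pattern
            r = sum(edges)
            m1 += w * r
            m2 += w * r * r
            for u, v, x in triples:
                size = (edges[index[(u, v)]] + edges[index[(u, x)]]
                        + edges[index[(v, x)]])
                if size == 2:
                    t2 += w
                elif size == 3:
                    t3 += w
    return m1, m2 - m1 * m1, t2, t3


if __name__ == "__main__":
    p = [0.6, 0.4]
    P = [[0.9, 0.2], [0.2, 0.7]]
    print(theorem2_moments(4, p, P))
    print(exact_moments(4, p, P))
```

The enumeration grows as $c^n 2^{n(n-1)/2}$, so it is only practical for n of about 4 or 5; for larger n a Monte Carlo comparison would be the natural substitute.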

j=1 k=l

Corollary 5. The proportion of transitive triads $\tau$ of the random graph G has the expected value $E\tau = 1 - a_2$ and variance

$$\operatorname{var} \tau = \frac{9(n - 3)(n - 4)(c_{22} - a_2^2) + 18(n - 3)(b_{22} - a_2^2) + 6a_2(1 - a_2)}{n(n - 1)(n - 2)} = 9(c_{22} - a_2^2)/n + O(1/n^2) \tag{28}$$

for $n \to \infty$. In particular, $P_{ij} = \beta + \gamma\delta_{ij}$ implies that

$$E\tau = 1 - 3\beta^2(1 - \beta) - 3\beta\gamma(2 - 3\beta)s_2 - 3\gamma^2(1 - 3\beta - \gamma)s_3 \tag{29}$$

and

$$\operatorname{var} \tau = 9[B^2(s_3 - s_2^2) + 2BC(s_4 - s_2 s_3) + C^2(s_5 - s_3^2)]/n + O(1/n^2), \tag{30}$$

where $B = 2\beta\gamma(2 - 3\beta)$, $C = 3\gamma^2(1 - 3\beta - \gamma)$, and

$$s_r = \sum_{i=1}^{c} p_i^r \quad \text{for } r = 2, \ldots, 5. \tag{31}$$

Proof. The first part of the corollary follows from (3) and (18) with i = j = 2. The particular case involves some tedious algebra using (22) and (26).

Corollary 6. The proportion of 3-cycles $\tau'$ of the random graph G has the expected value $E\tau' = a_3$ and variance

$$\operatorname{var} \tau' = \frac{9(n - 3)(n - 4)(c_{33} - a_3^2) + 18(n - 3)(b_{33} - a_3^2) + 6a_3(1 - a_3)}{n(n - 1)(n - 2)} = 9(c_{33} - a_3^2)/n + O(1/n^2) \tag{32}$$

for $n \to \infty$. In particular, $P_{ij} = \beta + \gamma\delta_{ij}$ implies that

$$E\tau' = \beta^3 + 3\beta^2\gamma s_2 + \gamma^2(3\beta + \gamma)s_3 \tag{33}$$

and

$$\operatorname{var} \tau' = 9[B^2(s_3 - s_2^2) + 2BC(s_4 - s_2 s_3) + C^2(s_5 - s_3^2)]/n + O(1/n^2), \tag{34}$$

where $B = 2\beta^2\gamma$, $C = \gamma^2(3\beta + \gamma)$, and

$$s_r = \sum_{i=1}^{c} p_i^r \quad \text{for } r = 2, \ldots, 5. \tag{35}$$

Proof. The first part of the corollary follows from (4) and (18) with i = j = 3. The particular case follows from (23) and (27).

6. STATISTICAL INFERENCE

In our discussion of statistical inference we find it convenient to use the first power sums $s_2$, $s_3$ as parameters of the cluster sizes and $\alpha$ and $\beta$ as parameters of the edge observations. We note that for a uniform distribution, $s_2$ denotes the common relative cluster size; that is, $s_2 = 1/c$. For a truncated geometric distribution we obtain

$$s_2 = \frac{(1 - b)(1 + b^c)}{(1 + b)(1 - b^c)}, \qquad s_3 = \frac{(1 - b)^2(1 + b^c + b^{2c})}{(1 + b + b^2)(1 - b^c)^2}. \tag{36}$$

Generally, we have the following inequality between c and the power sums.

Theorem 7. Let c be an integer greater than 1, and $p_1, \ldots, p_c$ positive numbers with sum 1. Set $s_2 = p_1^2 + \cdots + p_c^2$ and $s_3 = p_1^3 + \cdots + p_c^3$. Then

$$c > \max\bigl[1/s_2,\ (1 - s_2 - 2s_3 + 2s_2^2)/(s_2 - 2s_3 + s_2^2)\bigr] \tag{37}$$

unless the $p_i$ are all equal.

Proof. Consider

$$m_r = \sum_{i_1 < \cdots < i_r} p_{i_1} \cdots p_{i_r} \Big/ \binom{c}{r} \tag{38}$$

for r = 1, ..., c and set $m_0 = 1$. According to a classical result by Newton (cf. Hardy, Littlewood, and Pólya 1952, p. 52),

$$m_{r-1} m_{r+1} < m_r^2 \tag{39}$$

for r = 1, ..., c - 1 unless all the $p_i$ are equal. For r = 1, 2 we obtain $m_2 < m_1^2$ and $m_1 m_3 < m_2^2$. Now,

$$m_1 = 1/c, \qquad m_2 = (1 - s_2)/c(c - 1), \qquad m_3 = (1 - 3s_2 + 2s_3)/c(c - 1)(c - 2), \tag{40}$$

and (37) follows by substitution and simplification.

This theorem enables us to draw inferences about c from estimators of $s_2$ and $s_3$.

We now briefly discuss estimation and testing of the parameters of the random graph G. If $\alpha = \beta = 0$ and $p_1, \ldots, p_c$ are unspecified, then G is a transitive graph, and we find immediately from (16) and (33) with $\beta = 0$ and $\gamma = 1$ that $s_2$ and $s_3$ have the unbiased estimators $\hat{s}_2 = R/\binom{n}{2}$ and $\hat{s}_3 = \tau'$. The variances of the estimators are obtained from (17) and (34) with $\beta = 0$ and $\gamma = 1$.

If $\alpha$ and $\beta$ are specified and $p_1, \ldots, p_c$ unspecified, then it follows from (16), (29), and (33) that $s_2$ and $s_3$ have three alternative unbiased estimators that are linear functions of R and $\tau$, of R and $\tau'$, and of $\tau$ and $\tau'$. For instance, from (16) and (33) we obtain the estimators

$$\hat{s}_2 = \Bigl[R \Big/ \binom{n}{2} - \beta\Bigr] \Big/ (1 - \alpha - \beta), \qquad \hat{s}_3 = \Bigl[\tau' + 2\beta^3 - 3\beta^2 R \Big/ \binom{n}{2}\Bigr] \Big/ (1 - \alpha - \beta)^2 (1 - \alpha + 2\beta), \tag{41}$$

and their variances and covariances can be determined from (2), (3), (4), and (18) by some rather tedious algebra.

If $\alpha = \beta$ is unspecified and $p_1, \ldots, p_c$ unspecified, then estimators $\hat{\beta}$, $\hat{s}_2$, $\hat{s}_3$ of $\beta$, $s_2$, $s_3$ can be found from three equations obtained from (16), (29), and (33) with $\gamma = 1 - 2\beta$. If $\hat{s}_2$ and $\hat{s}_3$ are first eliminated, then the resulting equation for $\hat{\beta}$ is a polynomial of the third degree (the displayed equation and the definitions of its coefficients are illegible in this copy). If $\beta$ is small, the approximation $\hat{\beta} = C/B$ might provide a reasonable estimator of $\beta$, and $\hat{s}_2$ and $\hat{s}_3$ are then readily obtained from the equations.

If $\alpha$ and $\beta$ are unspecified and $p_1, \ldots, p_c$ is a distribution with specified $s_2$ and $s_3$, then it follows from (16), (29), and (33) with $\gamma = 1 - \alpha - \beta$ that $\alpha$ and $\beta$ can be estimated from R and $\tau$, from R and $\tau'$, or from $\tau$ and $\tau'$. By applying a least squares method to the three equations we find estimators $\hat{\alpha}$ and $\hat{\beta}$ that are functions of R, $\tau$, and $\tau'$.

If $\alpha$ and $\beta$ are unspecified and $p_1 = \cdots = p_c = 1/c$ with c unspecified, then (16), (29), and (33) with $s_2 = 1/c$ and $s_3 = 1/c^2$ yield estimators of $\alpha$, $\beta$, and c that are functions of R, $\tau$, and $\tau'$.

If $\alpha = \beta$ is unspecified and $p_i = ab^i$ for i = 1, ..., c with b and c unspecified, then estimators of b, c, and $\beta$ can be obtained from (16), (29), and (33) with $s_2$ and $s_3$ given by (36). The estimators are functions of R, $\tau$, and $\tau'$.

If $\alpha$ and $\beta$ are unspecified and $p_1, \ldots, p_c$ is an unspecified distribution with $c \le 2$, then we can use the fact that $c \le 2$ is equivalent to $1 - 3s_2 + 2s_3 = 0$ and find estimators of $\alpha$, $\beta$, and $s_2$ from (16), (29), and (33) with $s_3 = (3s_2 - 1)/2$.

If $\alpha$ and $\beta$ are unspecified and c = 1 (that is, the hypothesis of pure randomness in Frank 1980), then (16), (29), and (33) with $\gamma = 1 - \alpha - \beta$ and $s_2 = s_3 = 1$ can be used to find estimators of $\alpha$ and $\beta$ that are functions of R and $\tau$, of R and $\tau'$, or of $\tau$ and $\tau'$. By applying a least squares method we can find estimators that are functions of R, $\tau$, and $\tau'$.

In all the cases just considered, the hypothesis can be tested by comparing the values of R, $\tau$, and $\tau'$ with their estimated expected values and variances. Computer simulation experiments are needed to evaluate these estimators and tests. Many unsolved problems remain even for very simple particular cases of the present model.

[Received February 1981. Revised March 1982.]
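The estimators (41) and the bound of Theorem 7 are simple enough to sketch in Python. The function names (`estimate_power_sums`, `cluster_count_lower_bound`) are hypothetical, the error probabilities are assumed known as in the case leading to (41), and the fallback used when the denominator in (37) is not positive is an added safeguard for noisy estimates, not part of the paper.

```python
from math import comb


def estimate_power_sums(n, R, tau1, alpha, beta):
    """Estimators (41) of s_2 and s_3 from the size R and the 3-cycle proportion tau'.

    alpha and beta are the specified within- and between-cluster error probabilities.
    """
    gamma = 1 - alpha - beta
    density = R / comb(n, 2)
    s2_hat = (density - beta) / gamma
    s3_hat = (tau1 + 2 * beta ** 3 - 3 * beta ** 2 * density) / (
        gamma ** 2 * (1 - alpha + 2 * beta))
    return s2_hat, s3_hat


def cluster_count_lower_bound(s2, s3):
    """Right-hand side of (37): a lower bound on the number of clusters c."""
    d = s2 - 2 * s3 + s2 ** 2
    if d <= 0:
        # Assumption: with estimated (noisy) power sums the second bound may be
        # undefined, so only the first bound 1/s2 is used in that case.
        return 1 / s2
    return max(1 / s2, (1 - s2 - 2 * s3 + 2 * s2 ** 2) / d)


if __name__ == "__main__":
    # Hypothetical observed values for a graph of order n = 40.
    s2_hat, s3_hat = estimate_power_sums(40, R=330, tau1=0.12, alpha=0.1, beta=0.05)
    print(s2_hat, s3_hat, cluster_count_lower_bound(s2_hat, s3_hat))
```

When exact power sums of a uniform distribution are supplied, both terms in the maximum equal c, which matches the equality case excluded in Theorem 7.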

REFERENCES

BAKER, F.B. (1974), "Stability of Two Hierarchical Grouping Techniques; Case I: Sensitivity to Data Errors," Journal of the American Statistical Association, 69, 440-445.
ENGEN, S. (1978), Stochastic Abundance Models, London: Chapman and Hall.
FRANK, O. (1978a), "Inferences Concerning Cluster Structure," in Compstat 1978. Proceedings of the 3rd Symposium on Computational Statistics, eds. L.C.A. Corsten and J. Hermans, Vienna: Physica-Verlag, 259-265.
FRANK, O. (1978b), "Estimation of the Number of Connected Components in a Graph by Using a Sampled Subgraph," Scandinavian Journal of Statistics, 5, 177-188.
FRANK, O. (1979), "Estimating a Graph From Triad Counts," Journal of Statistical Computation and Simulation, 9, 31-46.
FRANK, O. (1980), "Transitivity in Stochastic Graphs and Digraphs," Journal of Mathematical Sociology, 7, 199-213.
FRANK, O. (1981), "A Survey of Statistical Methods for Graph Analysis," in Sociological Methodology 1981, ed. S. Leinhardt, San Francisco: Jossey-Bass, 110-155.
FRANK, O., and HARARY, F. (1980), "Maximum Triad Counts in Graphs and Digraphs," Journal of Combinatorics, Information and System Sciences, 5, 1-9.
GOODMAN, L.A., and KRUSKAL, W.H. (1979), Measures of Association for Cross Classifications, New York: Springer-Verlag.
HARARY, F. (1964), "A Graph Theoretic Approach to Similarity Relations," Psychometrika, 29, 143-151.
HARARY, F. (1969), Graph Theory, Reading, Mass.: Addison-Wesley.
HARARY, F., and KOMMEL, H. (1979), "Matrix Measures for Transitivity and Balance," Journal of Mathematical Sociology, 6, 199-210.
HARDY, G.H., LITTLEWOOD, J.E., and PÓLYA, G. (1952), Inequalities, Cambridge: Cambridge University Press.
HARTIGAN, J.A. (1975), Clustering Algorithms, New York: John Wiley.
HARTIGAN, J.A. (1977), "Distribution Problems in Clustering," in Classification and Clustering, ed. J. Van Ryzin, New York: Academic Press, 45-71.
HOLLAND, P., and LEINHARDT, S. (1971), "Transitivity in Structural Models of Small Groups," Comparative Group Studies, 2, 107-124.
HOLLAND, P., and LEINHARDT, S. (1975), "Local Structure in Social Networks," in Sociological Methodology 1976, ed. D. Heise, San Francisco: Jossey-Bass.
HUBERT, L. (1972), "Some Extensions of Johnson's Hierarchical Clustering Algorithms," Psychometrika, 37, 261-274.
HUBERT, L. (1973), "Monotone Invariant Clustering Procedures," Psychometrika, 38, 47-62.
HUBERT, L. (1974), "Some Applications of Graph Theory to Clustering," Psychometrika, 39, 283-309.
HUBERT, L., and BAKER, F. (1977), "An Empirical Comparison of Baseline Models for Goodness-of-Fit in r-Diameter Hierarchical Clustering," in Classification and Clustering, ed. J. Van Ryzin, New York: Academic Press, 131-153.
LING, R.F. (1973), "A Probability Theory of Cluster Analysis," Journal of the American Statistical Association, 68, 159-164.
LING, R.F., and KILLOUGH, G.G. (1976), "Probability Tables for Cluster Analysis Based on a Theory of Random Graphs," Journal of the American Statistical Association, 71, 293-300.
McNEIL, D.R. (1973), "Estimating an Author's Vocabulary," Journal of the American Statistical Association, 68, 92-96.
