Automatic Detection of Collocation

Jiangsheng Yu∗, Zhihui Jin, Zhenshan Wen
Institute of Computational Linguistics
Peking University, China, 100871

Abstract

Collocation is an important relation between words, with wide applications in semantic parsing (e.g., word sense disambiguation), machine translation (e.g., automatic alignment of bilingual corpora), computational lexicography, etc. We first summarize the methods of likelihood interval, likelihood ratio test, u test and χ² test for collocation theoretically, and then use them to extract collocations from a large-scale corpus automatically. Through experiments (some results are listed in the Appendix), the relationships between the statistical models are explored and analyzed; further research directions are discussed in the conclusion. The corpus we used is a half-year collection of People's Daily with segmentation and POS tagging, containing at least 1,103,455 Chinese sentences.

Keywords: collocation, independence, hypothesis testing, likelihood interval, likelihood ratio, χ² test, normal distribution

1 Introduction

The study of collocation includes the identification of proper nouns, phrasal verbs, terms, etc., which can be widely applied to:

1. Computational lexicography, e.g., bilingual semantic lexicons ([17]), term banks, etc.

2. Word sense disambiguation ([20]), machine translation ([18]), information retrieval, information extraction ([5]), etc.

For linguists, a collocation is an expression consisting of two or more words that corresponds to some conventional way of saying something. Unfortunately, there is no precise definition of collocation yet. For instance, some regard a collocation as a sequence of two or more consecutive words that has the characteristics of a syntactic and semantic unit, and whose exact and unambiguous meaning or connotation cannot be derived directly from the meanings or connotations of its components ([14]). The criteria for confirming a collocation include, at least:

non-substitutability: strong tea (not powerful tea), powerful computer, white wine, ...

translation unit: make a decision ↔ prendre une décision (not faire une décision), ...

non-modifiability: kick the bucket (not kick the old bucket), get a frog in one's throat, ...

∗ This research is supported by Beijing Natural Science Foundation, No. 4032013 and National Project 973, No. G1998030507-4. The e-mail address and homepage of the first author: [email protected], http://icl.pku.edu.cn/yujs.

It is fairly difficult to compute the syntactic and semantic properties of a collocation. In our opinion, the first step is to find all the possible collocations in a large-scale corpus; the further work is left to other filters. Our understanding of collocation is based on the concept of co-occurrence, defined as follows:

Definition 1.1 Given a set of words Σ, if w, w′ ∈ Σ∗ and P(w, w′) ≠ P(w)P(w′) in a large-scale corpus, then the ordered pair (w, w′) is called a co-occurrence. Without loss of generality, we suppose w appears in front of w′ in a sentence.

Definition 1.2 A co-occurrence (w, w′) means that w and w′ often appear in a common context, dependent on each other. If they are consecutive nouns, then (w, w′) is called a collocation. Additionally, a co-occurrence of verb + noun, adj + noun, verb + adv, etc., whether consecutive or at a distance, is also a collocation.

In this paper we focus on collocations of the form noun + noun and verb + noun. Because the sample can only provide the frequency of w, an approximation of its probability, whether P(w, w′) = P(w)P(w′) or not must be decided by statistical tests. In the following sections, we utilize the methods of likelihood interval, likelihood ratio test, u test and χ² test to extract collocations from a large-scale corpus automatically, illustrated by examples. The first two approaches to collocation extraction, proposed by the authors, have not been tried before.

Cochran's method, slightly different from the χ² test for collocations, is preferred especially when dealing with relatively small samples. Besides these techniques, pointwise mutual information is also available, but it does not capture the intuitive notion of an interesting collocation very well ([7], [14]), so it is omitted in this paper. After describing the main methods, we show some results of the collocation experiment and give the corresponding explanations. Finally, further work is discussed in the conclusion. The necessary background in mathematical statistics can be found in [1] and [16].
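To make the estimation in the next section concrete, here is a minimal counting sketch in Python (a hypothetical helper, not part of the original experiments). It assumes the corpus is already segmented into token lists, and it follows the paper's convention of counting a word, or an ordered pair, at most once per sentence. Note that the paper restricts noun + noun candidates to consecutive words, whereas this sketch counts ordered pairs at any distance.

```python
from collections import Counter
from itertools import combinations

def sentence_counts(sentences):
    """Count each word and each ordered pair at most once per sentence.

    sentences: iterable of token lists (one list per sentence, the corpus
    being split on punctuation as in the paper).  Returns (N, C_word,
    C_pair), where C_pair[(w, w2)] is the number of sentences in which
    w appears somewhere in front of w2.
    """
    N = 0
    C_word, C_pair = Counter(), Counter()
    for tokens in sentences:
        N += 1
        C_word.update(set(tokens))          # a word counts once per sentence
        pairs = {(tokens[i], tokens[j])
                 for i, j in combinations(range(len(tokens)), 2)
                 if tokens[i] != tokens[j]}
        C_pair.update(pairs)                # so does an ordered pair
    return N, C_word, C_pair

N, C_word, C_pair = sentence_counts([
    ["new", "companies", "emerge"],
    ["new", "ideas", "help", "companies"],
])
print(N, C_word["new"], C_pair[("new", "companies")])   # 2 2 2
```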

2 Likelihood Interval Method

P(w, w′), the probability of the ordered pair (w, w′), can be estimated by θ2 = C(w, w′)/N, where N is the number of sentences (separated by punctuation) in a given corpus and C(w, w′) is the count of sentences containing the string ww′. If w is independent of w′, then P(w, w′) = P(w)P(w′), and P(w, w′) can also be estimated by θ1 = C(w)C(w′)/N². If C(w, w′)/N ≠ C(w)C(w′)/N² in some statistical sense, then we can affirm that (w, w′) is a collocation.

Definition 2.1 Suppose that the density function of X is p(x, θ), in which θ ∈ Θ = {θ1, θ2} is the unknown parameter and θ1 ≠ θ2. We call

    L(θ|x) = P(x, θ) = ∏_{i=1}^{N} p(x_i, θ)    (1)

the likelihood function of the sample x = (x1, ..., xN). The maximum likelihood estimate (MLE) of θ is

    θ̂ = argmax_θ L(θ|x) = argmax_θ ln L(θ|x)    (2)

For convenience, L(θ|x) and ln L(θ|x) are sometimes denoted by L(θ) and l(θ), respectively.

Definition 2.2 If θ̂ exists and is unique, the relative likelihood function (RLF) of θ is defined by

    R(θ) = P(x, θ) / max_θ P(x, θ) = L(θ)/L(θ̂)    (3)

Obviously, 0 ≤ R(θ) ≤ 1. The RLF ranks all possible parameters according to their plausibility in the light of the data.

Definition 2.3 Given 0 ≤ δ ≤ 1, the set of θ-values for which R(θ) ≥ δ is called a δ likelihood interval (LI) for θ. Alternatively, the endpoints of the δ-LI can be found as roots of the equation

    l(θ) − l(θ̂) − ln δ = 0    (4)

l(θ) has a Taylor series expansion at θ = θ̂:

    l(θ) = ∑_{i=0}^{∞} l^{(i)}(θ̂)(θ − θ̂)^i / i!    (5)

If |θ − θ̂| is small, the cubic and higher terms can be omitted. Since l′(θ̂) = 0, we have

    l(θ) − l(θ̂) ≈ l″(θ̂)(θ − θ̂)²/2    (6)

Substituting (6) into (4), we obtain

Property 2.1 Let θ = P(w, w′); then l(θ) = C(w, w′) ln θ + (N − C(w, w′)) ln(1 − θ), and the MLE is θ̂ = C(w, w′)/N. The δ-LI is

    I = [θ̂ − sqrt(2 ln δ / l″(θ̂)), θ̂ + sqrt(2 ln δ / l″(θ̂))]    (7)

where

    l″(θ̂) = −[C(w, w′)/θ̂² + (N − C(w, w′))/(1 − θ̂)²]

Substituting θ̂ = C(w, w′)/N, the endpoints become

    C(w, w′)/N ± sqrt(−2 C(w, w′)(N − C(w, w′)) ln δ) / N^{3/2}

If the independence estimate C(w)C(w′)/N² lies in the interval (7), then w and w′ are regarded as independent; otherwise, (w, w′) is a possible collocation.

Example 2.1 In N = 679376 sentences, suppose that C(new) = 15828, C(companies) = 4675, and C(new companies) = 8. Let δ = 0.9; by the formula of Property 2.1, the likelihood interval is

    I = [1.0 × 10⁻⁵, 1.4 × 10⁻⁵]

But C(w)C(w′)/N² = 15828 × 4675/679376² ≈ 1.6 × 10⁻⁴ ∉ I. Thus, w and w′ are not independent, i.e., (new, companies) is a collocation.

The random experiments we admit are the sentences in the given corpus, which are assumed to be independent although, strictly speaking, they are not. Consequently, P(w) can be estimated by C(w)/N, where C(w) is the count of sentences that contain w. That is, even if w occurs in a sentence several times, we still count it once. Note that N cannot be the number of words in the corpus, since the independence of the sample points would then conflict with the collocations themselves. For instance, [14] took N to be the number of tokens and reached the wrong conclusion that (new, companies) is not a collocation. The rigorous endpoints of the δ-LI can be calculated by Newton's method (see [9] and [19]).

[Figure 1: δ-LI of Example 2.1]
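The endpoint formula of Property 2.1 is easy to check numerically. The following sketch (hypothetical code, standard library only) reproduces Example 2.1:

```python
import math

def likelihood_interval(c12, N, delta=0.9):
    """delta-LI of Property 2.1:
    C(w,w')/N +- sqrt(-2 C(w,w') (N - C(w,w')) ln(delta)) / N**1.5"""
    theta_hat = c12 / N
    half = math.sqrt(-2.0 * c12 * (N - c12) * math.log(delta)) / N ** 1.5
    return theta_hat - half, theta_hat + half

# Example 2.1: N sentences, C(new), C(companies), C(new companies)
N, c1, c2, c12 = 679376, 15828, 4675, 8
lo, hi = likelihood_interval(c12, N, delta=0.9)
indep = c1 * c2 / N ** 2        # estimate of P(w)P(w') under independence
print(f"0.9-LI = [{lo:.2e}, {hi:.2e}]; independence estimate = {indep:.2e}")
# 0.9-LI = [9.86e-06, 1.37e-05]; independence estimate = 1.60e-04 -> collocation
```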

3 Likelihood Ratio Test

The likelihood ratio test for simple hypotheses can be found in [9]; we omit the details except for the basic concepts. It is easy to see that the random variable C(w, w′) ∼ B(N, θ) (the number of successes in N Bernoulli trials with success probability θ), where θ is the only unknown parameter in the probability model.

Definition 3.1 Let W be the critical region, and define the power function as follows:

    ρ_W(θ) = P(reject H0 | θ) = P((X1, X2, ..., Xn) ∈ W | θ)    (8)

α = sup_{θ∈Θ} ρ_W(θ) is the significance level, which indicates the type 1 error of rejecting the true state.

Definition 3.2 The likelihood ratio (LR) is defined by

    λ(x) = L(θ2|x)/L(θ1|x)    (9)

It is easy to see that λ(x) > λ0 ⇔ l(θ2) − l(θ1) > ln λ0.

Lemma 3.1 (Neyman-Pearson) Given a real number α ∈ (0, 1), let W0 = {x | λ(x) > λ0} satisfy λ0 ≥ 0 and

    ∫···∫_{W0} L(θ1|x) dx = α

Then for any critical region W ⊆ Rⁿ, ρ_W(θ1) ≤ α implies that ρ_{W0}(θ2) ≥ ρ_W(θ2).

Corollary 3.1 W0 in the N-P Lemma is the uniformly most powerful (UMP) critical region.

The procedure of the likelihood ratio test is described as follows; by the N-P Lemma, it should theoretically yield the best results. Moreover, [6] argued that methods based on the LR test are widely applicable and work well even with relatively small samples.

• Θ = {θ1 = C(w)C(w′)/N², θ2 = C(w, w′)/N}

• H0: P(w, w′) = θ1 ↔ H1: P(w, w′) = θ2

• Because C(w, w′) ∼ B(N, θ), we have l(θ) = C(w, w′) ln θ + (N − C(w, w′)) ln(1 − θ), and the MLE is θ2. If N is very large, then the likelihood ratio statistic for H0 is

    χ² = 2[l(θ2) − l(θ1)]
       = 2 C(w, w′) ln [N C(w, w′) / (C(w)C(w′))] + 2(N − C(w, w′)) ln [(1 − C(w, w′)/N) / (1 − C(w)C(w′)/N²)]
       ∼ χ²(1)

• Given a significance level α, the critical region is W = [χ²_α(1), +∞), described by Figure 2.

[Figure 2: Collocation domain of LR test]

Example 3.1 For α = 0.01, χ²_α(1) = 6.635. By the formula for the statistic χ² and the data in Example 2.1, we get χ² ≈ 160 ≫ χ²_α(1). That is, H0 should be rejected.

There are two ways to obtain confidence regions from the likelihood function: one is to take a likelihood interval with the desired coverage probability; the other is to obtain a significance region from the LR test of H0: θ = θ1. If the distribution of χ² depends on θ1, significance regions obtained from the LR test need not be likelihood intervals, and the two methods will usually give slightly different results. In either case, their prominent advantage is the avoidance of asymptotic normality assumptions that are often invoked unjustifiably.
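A minimal sketch of the LR statistic above, run on the counts of Example 2.1 (the helper name is ours):

```python
import math

def lr_statistic(c12, c1, c2, N):
    """chi^2 = 2[l(theta2) - l(theta1)] with
    l(theta) = C(w,w') ln(theta) + (N - C(w,w')) ln(1 - theta)."""
    theta1 = c1 * c2 / N ** 2          # parameter under H0 (independence)
    theta2 = c12 / N                   # MLE
    def l(theta):
        return c12 * math.log(theta) + (N - c12) * math.log(1.0 - theta)
    return 2.0 * (l(theta2) - l(theta1))

stat = lr_statistic(8, 15828, 4675, 679376)
print(stat > 6.635)    # True -> reject H0 at alpha = 0.01: a collocation
```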

4 u Test

One of the traditional methods for collocation is the t test, which assumes that the population of C(w, w′) is asymptotically normally distributed ([14]). This assumption is unjustifiable without a goodness-of-fit test (such as the Kolmogorov test, Pearson's χ² test, Shapiro-Wilk test, etc.) and usually leads to flawed results. As indicated before, the detection of (w, w′) is a Bernoulli trial. Supposing that w is independent of w′, the frequency of (w, w′) can be approximated by p = C(w)C(w′)/N². Then, by the De Moivre-Laplace Theorem, we construct a statistic u with the standardized normal distribution, denoted u ∼ N(0, 1). The procedure of the u test for collocation is as follows:

• H0: P(w, w′) = P(w)P(w′), or more precisely, P(w, w′) = C(w)C(w′)/N² = p ↔ H1: P(w, w′) ≠ p (a two-tailed test is required)

• By the De Moivre-Laplace Theorem, we have

    u = (C(w, w′)/N − p) / sqrt(p(1 − p)/N) ∼ N(0, 1)    (10)

• Given a significance level α, if |u| ≤ u_{α/2}, then accept H0; otherwise reject H0 (i.e., accept H1: (w, w′) is a collocation).

Property 4.1 Let α = 0.05; then u_{α/2} = 1.96. It is concluded that (w, w′) is a collocation if

    |u| = |C(w, w′)/N − p| / sqrt(p(1 − p)/N) > 1.96    (11)

Supposing that N is very large, from (11) we get the approximate criterion of collocation:

    |u| ≈ |N C(w, w′) − C(w)C(w′)| / sqrt(N C(w)C(w′)) > 1.96    (12)

Example 4.1 With the same data as in Example 2.1, by (12) we have |u| ≈ 10 ≫ 1.96, which also indicates that (new, companies) is a collocation. The domain of (11) is described by Figure 3.

[Figure 3: Collocation domain of u test]
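The exact statistic (11) and its approximation (12) agree closely on the data of Example 2.1, as this hypothetical sketch shows:

```python
import math

def u_statistic(c12, c1, c2, N):
    """Exact u from (11) and the large-N approximation (12)."""
    p = c1 * c2 / N ** 2
    exact = (c12 / N - p) / math.sqrt(p * (1.0 - p) / N)
    approx = (N * c12 - c1 * c2) / math.sqrt(N * c1 * c2)
    return exact, approx

exact, approx = u_statistic(8, 15828, 4675, 679376)
print(f"u = {exact:.2f}, approx = {approx:.2f}")   # u = -9.67, approx = -9.67
```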

5 χ² Test

Theoretically and practically, Pearson's χ² test based on the contingency table is efficient for checking whether two random variables X and Y are independent, given large samples ([11], [1]). But this method is often misapplied to small samples, and the poor results are then imputed, unfairly, to the model. At the end of this section, we introduce Cochran's refinement of the traditional χ² method using Fisher's exact test. Let nij denote the number of times (X, Y) = (xi, yj) is observed, where i, j = 1, 2; we then have a special case of Pearson's χ² test based on the following table, whose row sums give the marginal distribution of Y and whose column sums give the marginal distribution of X:

    Y \ X   w      ¬w     sum
    w′      n11    n12    n1.
    ¬w′     n21    n22    n2.
    sum     n.1    n.2    N

Figure 4: Contingency table

Theorem 5.1 (Pearson) Under the null hypothesis H0: P(X|Y) = P(X), we have

    χ² = N (n11 n22 − n12 n21)² / (n1. n2. n.1 n.2) ∼ χ²(1)    (13)

Example 5.1 From the data of Example 2.1, we get the contingency table:

            companies   ¬companies   sum
    new     8           15820        15828
    ¬new    4667        658881       663548
    sum     4675        674701       679376

    χ² = 679376 · (8 · 658881 − 4667 · 15820)² / (4675 · 674701 · 15828 · 663548) ≈ 96.40

Given α = 0.01, χ²_α(1) = 6.635. Since χ² ≫ χ²_α(1), (new, companies) is a collocation.

On small samples, however, the performance of Pearson's χ² test is not perfect. For the χ² test, it is necessary that mij = ni. n.j / N, the expectation of nij, be no less than 5; without this condition, Pearson's method becomes unreliable. Cochran suggested in [4]:

1. If N > 40, formula (13) is modified by the continuity correction

    χ² = N (|n11 n22 − n12 n21| − N/2)² / (n1. n2. n.1 n.2) ∼ χ²(1)

2. If 20 ≤ N ≤ 40, there are two cases:

   (a) if each mij ≥ 5, then the modified χ² value is still applicable;

   (b) if the minimum of the mij is less than 5, then the χ² test is replaced by Fisher's exact test.

3. If N < 20, use Fisher's exact test.

Generally, Cochran's method works well whatever the sample size. For instance, the modified χ² value in Example 5.1 is 95.45 ≫ χ²_{0.01}(1).
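A sketch of Cochran's decision rule as summarized above; it assumes SciPy's fisher_exact for the small-sample branches, and the function name and the simplified expected-cell check are ours:

```python
from scipy.stats import fisher_exact

def cochran_chi2(n11, n12, n21, n22):
    """2x2 independence test following Cochran's rules above.
    Returns (value, method): a chi-square statistic with continuity
    correction, or the p-value of Fisher's exact test."""
    N = n11 + n12 + n21 + n22
    r1, r2 = n11 + n12, n21 + n22          # row sums
    c1, c2 = n11 + n21, n12 + n22          # column sums
    m_min = min(r1 * c1, r1 * c2, r2 * c1, r2 * c2) / N   # smallest m_ij
    if N > 40 or (N >= 20 and m_min >= 5):
        chi2 = N * (abs(n11 * n22 - n12 * n21) - N / 2) ** 2 / (r1 * r2 * c1 * c2)
        return chi2, "corrected chi-square"
    _, pvalue = fisher_exact([[n11, n12], [n21, n22]])
    return pvalue, "Fisher's exact test"

# the contingency table of Example 5.1
value, method = cochran_chi2(8, 4667, 15820, 658881)
print(f"{value:.2f} ({method})")    # ~95.45 (corrected chi-square)
```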

6 Experiment and Analysis

The corpus we worked with is a half-year collection of People's Daily, with more than 1,103,455 Chinese sentences. An ordered pair (w, w′) is accepted as a collocation only if it is confirmed by all three hypothesis tests and the LI method. For the three statistics u, LR and χ², we calculated their pairwise Spearman correlation coefficients based on ranks.¹ Some results of noun + noun collocations and verb + noun collocations, sorted by χ² value, are listed in the Appendix. We also prove in the Appendix that the u test is equivalent to the χ² test provided N is large enough. By the experiments on Spearman's rank correlation, the LR statistic is shown to differ from χ² (or u); moreover, the absence of an obvious mathematical relationship between the LR statistic and the χ² statistic confirms the rationality of combining the LR test and the χ² test.

6.1 noun + noun Collocation

As [6] claimed, the χ² test is usually biased towards low-frequency items, whereas the LR test is better at identifying dependencies among frequent words. If (w, w′) is not a collocation, each hypothesis test identifies it as a collocation with probability α. Moreover, the intersection of the results of the four methods, sorted in descending order by some value (for instance, the χ² value), looks good if we focus on the first half of the candidate collocations. Still, an empirical approach to the order statistic is needed to provide a more reasonable standard; without doubt, the sample size should be an important parameter. The Spearman correlation coefficient rs, based on ranks, is calculated:

                u value     LR value    χ² value
    u value     1.00000     0.8392099   0.9999975
                            <.0001      <.0001
    LR value    0.8392099   1.00000     0.8398665
                <.0001                  <.0001
    χ² value    0.9999975   0.8398665   1.00000
                <.0001      <.0001

Figure 5: Spearman rs correlation coefficients, N = 15734; Prob > |r| under H0: ρ = 0

As derived from (10) and (13), if N is large enough, then u²/χ² ≈ 1 (proved in the Appendix). Therefore, the u test is strongly correlated with the χ² test, which can also be seen in the following scatter matrix of u, LR and χ² ranks.

[Figure 6: Scatter matrix of u, LR and χ² ranks]

In the results sorted in descending order by χ² value, the earlier a candidate appears, the better it is as a collocation. Dunning claimed in [6] that in the cases where the traditional contingency-table method works well, the LR test performs nearly identically. We surveyed the Spearman rank correlation between the χ² rank and the LR rank in the top n "best" collocations sorted by χ² rank. The phenomenon described by Dunning was not observed in the case of noun + noun collocations, since the correlation varies sharply; but roughly speaking, the more results, the more correlated they are.

[Figure 7: Spearman correlation coefficients between χ² and LR ranks in the top χ² results]

¹ Kendall's τ statistic is also available for the correlation coefficients, and it gives similar results. Note that the raw values of the statistics u, LR and χ² are not meaningful in themselves, since the corpus is not randomly generated; in contrast, the order statistic preserves the relationship between observations robustly. Hence nonparametric approaches to NLP deserve to be highlighted.
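Rank correlations such as those of Figure 5 can be computed with scipy.stats.spearmanr; the scores below are made-up placeholders standing in for the per-candidate u, LR and χ² values:

```python
from scipy.stats import spearmanr

# made-up placeholder scores: one (u, LR, chi2) triple per candidate pair
scores = [(9.67, 160.1, 96.4), (4.21, 30.5, 17.9),
          (2.13, 6.8, 4.5), (1.55, 2.2, 2.3)]
u_vals, lr_vals, chi2_vals = zip(*scores)

rho, p = spearmanr(u_vals, chi2_vals)      # rank correlation, as in Figure 5
print(f"rs(u, chi2) = {rho:.4f}, p = {p:.4f}")
rho, p = spearmanr(lr_vals, chi2_vals)
print(f"rs(LR, chi2) = {rho:.4f}, p = {p:.4f}")
```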

6.2 verb + noun Collocation

To detect the verb + noun collocations, we examined 56 randomly selected verbs and the nouns following them in sentences (including those at a distance). Checked by hand, the precision and recall of these verb + noun collocations are:

    Precision    92.76%
    Recall       90.15%

Figure 8: Precision and recall

                u value     LR value    χ² value
    u value     1.00000     0.6904948   0.9999365
                            <.0001      <.0001
    LR value    0.6904948   1.00000     0.6964686
                <.0001                  <.0001
    χ² value    0.9999365   0.6964686   1.00000
                <.0001      <.0001

Figure 9: Spearman rs correlation coefficients, N = 2643; Prob > |r| under H0: ρ = 0

The scatter matrix of u, LR and χ² ranks is shown in Figure 10.

[Figure 10: Scatter matrix of u, LR and χ² ranks]

Judging by the isolated points among the Spearman correlation coefficients between the χ² rank and the LR rank in the top χ² results, there is no Dunning phenomenon ([6]) here either.

[Figure 11: Spearman correlation coefficients between χ² and LR ranks in the top χ² results]

Conclusion

Suppose that we have two statistical tests A and B with the same probability of type 1 error; A is better as long as the probability of its type 2 error is less than that of B. This depends on the way of sampling and the properties of the population. We are comparing the three models by experiment design and analysis, a theory proposed by R. A. Fisher in the 1920s (see [15]). One caveat must be made: hypothesis testing presupposes a very large sample, yet sometimes the corpus is too small for the efficient identification of collocations. In that case, syntactic and semantic rule-based methods, for instance those based on a treebank, usually take the place of the statistical approaches. We have also tried to eliminate the "bad" candidates by means of the Chinese Concept Dictionary, a WordNet-like semantic lexicon, which will be reported in a subsequent paper.

In conclusion, there are two methods for the automatic extraction of collocations containing more than two words:

1. Consider the independence of all words in a sentence simultaneously and find the longest dependent subsequence, such as make a decision. The computation is burdensome unless an approximation algorithm of lower complexity can be designed.

2. Use the methods in this paper recursively: if (w, w′) is tested to be a collocation, then w and w′ are combined into a single "word" throughout the corpus, and the procedure is repeated until no further collocation can be extracted (a sketch is given at the end of this section). Unfortunately, this method yields a lower confidence level if the collocated phrase contains many words. Moreover, it sometimes fails to extract a collocation, because two consecutive words within a collocation do not always form a collocation by themselves.

The methods for testing independence summarized in this paper can be implemented efficiently and are not limited to the study of collocation; the same holds for the further research mentioned in this section, for instance, approaches to text clustering, data mining of coincidences, and detection of domain-specific terms.
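A minimal sketch of the recursive procedure of method 2; is_collocation is a stand-in for the combined tests of Sections 2-5, and the merging convention (joining tokens with an underscore) is ours:

```python
def extract_recursively(sentences, is_collocation, max_rounds=5):
    """Method 2 as a sketch: merge each adjacent pair that passes the
    collocation tests into one token, then re-run on the rewritten corpus.
    is_collocation(w, w2) stands in for the combined tests of Sections 2-5,
    recomputed on the current corpus at each round."""
    for _ in range(max_rounds):
        merged = False
        rewritten = []
        for tokens in sentences:
            out, i = [], 0
            while i < len(tokens):
                if i + 1 < len(tokens) and is_collocation(tokens[i], tokens[i + 1]):
                    out.append(tokens[i] + "_" + tokens[i + 1])  # one "word" now
                    i += 2
                    merged = True
                else:
                    out.append(tokens[i])
                    i += 1
            rewritten.append(out)
        sentences = rewritten
        if not merged:          # no pair merged in this round: stop
            break
    return sentences
```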

Acknowledgement


Many thanks to Dr. Baobao Chang for his helpful discussions with the authors; he has done a great deal of work on the automatic alignment of bilingual corpora. We also appreciate the help of Prof. Shiwen Yu and Prof. Huiming Duan, who provided us with the training corpus. Dr. Weidong Zhan and Dr. Bin Sun supplied us with mathematics books free of charge, for which we are indebted to them. Finally, our thanks go to the participants in the Machine Learning seminar, who gave us many good suggestions on the statistical models.

References

[1] P. J. Bickel and K. A. Doksum (2001), Mathematical Statistics: Basic Ideas and Selected Topics (Second Edition). Prentice-Hall, Inc.

[2] I. A. Bolshakov and A. Gelbukh (2002), Heuristics-based replenishment of collocation databases. In: E. Ranchhod and N. J. Mamede (Eds.), Advances in Natural Language Processing, Proc. PorTAL-2002: Portugal for Natural Language Processing, Lecture Notes in Computer Science, No. 2389, Springer-Verlag, pp. 25-32.

[3] K. W. Church and P. Hanks (1989), Word Association Norms, Mutual Information and Lexicography. In ACL 27, pp. 76-83.

[4] W. G. Cochran (1954), Some Methods for Strengthening the Common χ² Tests. Biometrics, 10, pp. 417-451.

[5] D. Lin (1999), Using Collocation Statistics in Information Extraction. In Proceedings of the Seventh Message Understanding Conference (MUC-7).

[6] T. E. Dunning (1993), Accurate Methods for the Statistics of Surprise and Coincidence. Computational Linguistics, 19 (1), pp. 61-74.

[7] T. Fontenelle et al. (1994), DECIDE, MLAP-Project 93-19, Deliverable D-1a: Survey of Collocation Extraction Tools. Technical Report, University of Liège, Belgium.

[8] T. P. Hettmansperger (1998), Robust Nonparametric Statistical Methods. John Wiley & Sons, Inc.

[9] J. G. Kalbfleisch (1985), Probability and Statistical Inference, Volume 2: Statistical Inference (Second Edition). Springer-Verlag New York Inc.

[10] S. Kaufmann (1999), Cohesion and Collocation: Using Context Vectors in Text Segmentation. In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics (Student Session), pp. 591-595.

[11] E. L. Lehmann (1957), Testing Statistical Hypotheses. John Wiley & Sons, New York.

[12] E. L. Lehmann (1975), Nonparametrics: Statistical Methods Based on Ranks. Holden-Day, San Francisco.

[13] E. L. Lehmann (1999), Elements of Large-Sample Theory. Springer-Verlag New York, Inc.

[14] C. D. Manning and H. Schütze (1999), Foundations of Statistical Natural Language Processing. The MIT Press.

[15] D. Montgomery (1991), Design and Analysis of Experiments. John Wiley & Sons, Inc.

[16] V. K. Rohatgi (1976), An Introduction to Probability Theory and Mathematical Statistics. John Wiley & Sons, Inc.

[17] F. A. Smadja and K. R. McKeown (1990), Automatically Extracting and Representing Collocations for Language Generation. In Proc. of the 28th Annual Meeting of the ACL, pp. 252-259.

[18] F. A. Smadja, K. R. McKeown and V. Hatzivassiloglou (1996), Translating Collocations for Bilingual Lexicons: A Statistical Approach. Computational Linguistics, 22 (1), pp. 1-38.

[19] M. A. Tanner (1996), Tools for Statistical Inference: Methods for the Exploration of Posterior Distributions and Likelihood Functions (Third Edition). Springer-Verlag New York, Inc.

[20] D. Yarowsky (2000), Word Sense Disambiguation. In: R. Dale, H. Moisl and H. Somers (Eds.), The Handbook of Natural Language Processing, New York: Marcel Dekker, pp. 629-654.

Appendix

[Figure 12: noun + noun collocations sorted by χ² value]

[Figure 13: verb + noun collocations sorted by χ² value]

Proof. We show that the u test is equivalent to the χ² test when N, the size of the corpus, is large enough. Pearson's contingency table for testing the independence of w and w′ is:

    Y \ X   w                    ¬w                             sum
    w′      C(w, w′)             C(w′) − C(w, w′)               C(w′)
    ¬w′     C(w) − C(w, w′)      N + C(w, w′) − C(w) − C(w′)    N − C(w′)
    sum     C(w)                 N − C(w)                       N

Figure 14: Specified contingency table

By the definitions of the statistics u and χ², we have

    u² = [C(w, w′) − C(w)C(w′)/N]² / [(C(w)C(w′)/N)(1 − C(w)C(w′)/N²)]

and, since n11 n22 − n12 n21 = N C(w, w′) − C(w)C(w′),

    χ² = N [N C(w, w′) − C(w)C(w′)]² / (C(w)C(w′)[N − C(w)][N − C(w′)])

Because N C(w, w′) − C(w)C(w′) = N [C(w, w′) − C(w)C(w′)/N], the ratio simplifies to

    u²/χ² = [N − C(w)][N − C(w′)] / (N² [1 − C(w)C(w′)/N²])
          = [1 − (C(w) + C(w′))/N + C(w)C(w′)/N²] / (1 − C(w)C(w′)/N²)
          ≈ 1

if N is large enough. ∎
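On the data of Example 2.1 the ratio is already close to 1, as this hypothetical check confirms:

```python
import math

def u_sq_over_chi_sq(c12, c1, c2, N):
    """Ratio u^2 / chi^2; by the proof above it tends to 1 as N grows."""
    p = c1 * c2 / N ** 2
    u = (c12 / N - p) / math.sqrt(p * (1.0 - p) / N)
    n11, n12 = c12, c2 - c12                 # table of Figure 14
    n21, n22 = c1 - c12, N - c1 - c2 + c12
    chi2 = (N * (n11 * n22 - n12 * n21) ** 2
            / ((n11 + n12) * (n21 + n22) * (n11 + n21) * (n12 + n22)))
    return u * u / chi2

print(f"{u_sq_over_chi_sq(8, 15828, 4675, 679376):.4f}")   # ~0.9701
```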
