Comparison of Multinomial Classification Rules

Ana M. Pires and João A. Branco
Department of Mathematics, Technical University of Lisbon (IST)
Av. Rovisco Pais, 1096 Lisboa Codex, Portugal

Abstract: In this paper two modifications of the full multinomial classification rule for the discrete feature discrimination problem are proposed. One is derived from a likelihood ratio test and the other from the Bayesian predictive approach. A simulation experiment was conducted to compare the performance of the new rules with two other existing rules for the same problem: the usual estimative or plug-in multinomial rule and the Dillon-Goldstein rule, which is based on a distance principle. For equal group sample sizes all rules are equivalent; however, for different sample sizes they behave differently. The simulation study has shown that the three alternative rules lead to similar results in terms of expected actual error rates, which are better than those of the plug-in rule.

Keywords: Classification rules; Discrete data; Discriminant analysis; Multinomial model.

1. Introduction

Suppose that we want to discriminate between two distinct populations or groups, $\Pi_1$ and $\Pi_2$, on the basis of a vector of discrete random variables $X = (X_1, \ldots, X_k)$, each assuming a finite number of distinct values, $d_1, \ldots, d_k$. Assume that, conditional on $\Pi_i$, $X$ follows the multinomial distribution, denoted by $f_i(x)$, with $s = \prod_{j=1}^{k} d_j$ possible states and characterized by the ($s$-dimensional) vector of probabilities $p_i$ ($i = 1, 2$). The optimal rule for the classification of an entity of unknown origin with feature vector $x$ is, assuming equal costs of misclassification and a priori probabilities of the populations,


classify in $\Pi_1$ if $f_1(x) > f_2(x)$,
classify in $\Pi_2$ if $f_1(x) < f_2(x)$,
classify randomly if $f_1(x) = f_2(x)$.

The usual (estimative) sample rule derived from the optimal rule is obtained by replacing the unknown probabilities by their maximum likelihood estimates, that is:

classify in $\Pi_1$ if $\dfrac{n_1(x)}{n_1} > \dfrac{n_2(x)}{n_2}$,
classify in $\Pi_2$ if $\dfrac{n_1(x)}{n_1} < \dfrac{n_2(x)}{n_2}$,
classify randomly if $\dfrac{n_1(x)}{n_1} = \dfrac{n_2(x)}{n_2}$,

where $n_i$ is the number of observations of group $i$ in the training sample and $n_i(x)$ is the corresponding observed absolute frequency of the state defined by $x$. We will hereafter call this rule the estimative rule or the M rule. As pointed out by Dillon and Goldstein (1978), one of the undesirable properties of the M rule is the way it treats zero frequencies. If $n_1(x) = 0$ and $n_2(x) \neq 0$, a new observation with feature vector $x$ will be allocated to $\Pi_2$ irrespective of the sample sizes $n_1$ and $n_2$. In other words, the rule does not take into account the fact that the occurrence of a zero frequency in only one group is more probable when the total number of observations is not large and $n_1$ and $n_2$ are of different order. Those authors proposed a rule based on Matusita's distributional distance which is more sensitive to this issue; it is given below, after the following illustration of the M rule.
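To make the behaviour of the M rule concrete, here is a minimal sketch in Python (not code from the paper; the function name, the NumPy representation of the training counts, and the random tie-breaking are illustrative assumptions):

```python
import numpy as np

def m_rule(counts1, counts2, state):
    """Illustrative sketch (not from the paper) of the plug-in (M) rule.

    counts1, counts2 : length-s arrays with the training-sample counts of each
                       state in groups 1 and 2 (so n_i = counts_i.sum()).
    state            : index of the state defined by the feature vector x.
    Returns 1 or 2, the group to which the new entity is allocated.
    """
    counts1, counts2 = np.asarray(counts1), np.asarray(counts2)
    n1, n2 = counts1.sum(), counts2.sum()
    f1_hat = counts1[state] / n1      # maximum likelihood estimate of f1(x)
    f2_hat = counts2[state] / n2      # maximum likelihood estimate of f2(x)
    if f1_hat > f2_hat:
        return 1
    if f1_hat < f2_hat:
        return 2
    return np.random.choice([1, 2])   # classify randomly on ties
```

With counts1[state] = 0 and counts2[state] > 0 this function returns 2 whatever the values of $n_1$ and $n_2$, which is exactly the zero-frequency behaviour criticised above.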

The rule, subsequently called the Dillon-Goldstein rule or the D rule, is (using the notation $n_i(x) = n_{ij}$ if $x$ belongs to state $j$):

classify in $\Pi_1$ if
\[
\frac{\left[n_{2j}(n_{1j}+1)\right]^{1/2} + \sum_{k \neq j} (n_{1k} n_{2k})^{1/2}}{\left[n_2(n_1+1)\right]^{1/2}}
\;<\;
\frac{\left[n_{1j}(n_{2j}+1)\right]^{1/2} + \sum_{k \neq j} (n_{1k} n_{2k})^{1/2}}{\left[n_1(n_2+1)\right]^{1/2}},
\]
in $\Pi_2$ if the reverse inequality holds, and at random otherwise. Note that if $n_1 = n_2$ the D rule reduces to the M rule, and also that Krzanowski (1987) showed their asymptotic equivalence. However, if $n_1 < n_2$ and $n_{1j} = 0$ but $n_{2j} > 0$, the D rule will classify $x$ in $\Pi_1$ only if $\sqrt{n_{2j}}$
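For comparison, a sketch of the D rule under the same assumed count representation; it adds the new observation to each group in turn and allocates it to the group that leaves the two estimated distributions farther apart in Matusita's distance, i.e. with the smaller affinity (again an illustrative implementation, not taken from the paper):

```python
import numpy as np

def d_rule(counts1, counts2, j):
    """Illustrative sketch (not from the paper) of the Dillon-Goldstein (D) rule
    for a new entity observed in state j."""
    counts1, counts2 = np.asarray(counts1), np.asarray(counts2)
    n1, n2 = counts1.sum(), counts2.sum()
    # affinity contribution of all states except j
    s_rest = np.sqrt(counts1 * counts2).sum() - np.sqrt(counts1[j] * counts2[j])
    # affinity between the two estimated distributions if x is added to group 1 ...
    rho1 = (np.sqrt((counts1[j] + 1) * counts2[j]) + s_rest) / np.sqrt((n1 + 1) * n2)
    # ... and if x is added to group 2
    rho2 = (np.sqrt(counts1[j] * (counts2[j] + 1)) + s_rest) / np.sqrt(n1 * (n2 + 1))
    if rho1 < rho2:                   # larger distance when x joins group 1
        return 1
    if rho1 > rho2:
        return 2
    return np.random.choice([1, 2])   # classify randomly on ties
```

For $n_1 = n_2$ this reduces to comparing $n_{1j}$ with $n_{2j}$, i.e. to the M rule, whereas for $n_1 < n_2$ and counts1[j] = 0 it can still allocate the entity to $\Pi_1$ provided $n_{2j}$ is small enough relative to the remaining affinity term.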