Psychology Science, Volume 46, 2004 (1), p. 52-64
Bayesian classification methods 1
EDUARDO GUTIÉRREZ-PEÑA
Abstract

Consider the problem of assigning a class label to a set of unclassified cases. If the set of possible classes is known in advance, this is a problem of supervised classification; if, on the other hand, the set of possible classes is not known, then it becomes a problem of unsupervised classification. In this paper we review some Bayesian approaches to classification, both supervised (e.g. discrimination) and unsupervised (e.g. clustering). For the former case, we also derive a fully Bayesian theoretical rule that can be used as the basis for specific model-based classification rules.

Key words: Bayesian Decision Theory, Cluster Analysis, Configural Frequency Analysis, Discriminant Analysis, Logistic Regression, Statistical Learning
1 Eduardo Gutiérrez-Peña, IIMAS-UNAM, Apartado Postal 20-726, 01000 México D.F., Mexico, E-mail: [email protected]
1. Introduction

The term 'classification' has traditionally been used in the statistical literature with at least two meanings: one, general, describes classification as a learning process; the other, specific, sees classification as relating to a particular class of methods such as discriminant analysis (e.g. Hand, 1981) or clustering (e.g. Hartigan, 1982). Recent literature takes a broader view and refers to classification – in the general sense – as Statistical Learning (see, for example, Hastie, Tibshirani and Friedman, 2001). Statistical learning has attracted the attention of many researchers not only in statistics, but also in computing science, engineering and other related fields. Each of these areas has its own terminology and, to some extent, its own methods. Hastie et al. (2001) provide a comprehensive account as well as a unified overview of the most popular techniques currently in use.

Classification is concerned with the grouping of similar objects; for us to be able to classify we first need to make similarity judgments. There are basically two types of use of a classification: predictive, e.g. discriminant analysis, and descriptive, e.g. clustering. In both cases, the objective is to learn from data. To fix ideas, we start by briefly describing these two common instances of classification tasks.

The basic problem of discriminant analysis is to assign an observation, on the basis of its value, to one of two or more groups. Discriminant analysis is concerned with the process of deriving classification rules from samples of previously classified objects (the training data). This implies that some kind of classification scheme is already in place, but it may be subjective, or unusable for some reason, so that the essential aspects of the scheme must be extracted and transformed into a practical classification rule. On the other hand, the main task of cluster analysis is to group objects into classes according to their 'similarity'. In this case, no prior division of the objects into categories is available.

In both of these cases, the aim in mathematical terms is to find a function mapping objects to an index set consisting of class identifiers (class labels). Each object is represented by a vector of measurements, and each measurement provides one variable in a multidimensional space. Of course, exactly what is measured will depend on the particular problem at hand.

Learning problems such as those described above can be roughly categorised as either supervised or unsupervised. In areas such as computing science or engineering, the first of these two learning tasks is known as supervised classification, supervised pattern recognition or class prediction, while the second is usually referred to as unsupervised classification, unsupervised pattern recognition or class discovery.

Let Y denote a response variable and X = (X_1, ..., X_p) a set of predictor variables. Predictions are based on a training sample (y_1, x_1), ..., (y_n, x_n) of previously 'resolved' cases, where the joint values of all the variables are known. In supervised learning, the 'teacher' is presented with an answer ŷ_i for each x_i in the training sample. She then provides either the correct answer or an error associated with the answer from the 'student'. Such an error is usually characterised by some loss function L(ŷ, y). In contrast, in unsupervised learning we must learn 'without a teacher'.
Assuming that (Y, X) are random variables with some joint probability density p(y, x) = p(y | x) p(x), supervised learning can be formally characterised as a density estimation problem in which we are concerned with determining properties of the conditional density p(y | x) based on the training sample (y_1, x_1), ..., (y_n, x_n).
Usually the property of interest is the 'location' parameter μ(x) that minimises the expected loss at each value x,

\mu(x) = \arg\min_{\hat{y}} E_{Y \mid X}\{ L(\hat{y}, Y) \}.

On the other hand, in unsupervised learning we only have a set of n observations x_1, ..., x_n of the p-dimensional vector X having joint density p(x). Here, the goal is to infer the properties of this density directly, without the help of a 'teacher'. In low-dimensional problems there exist several effective nonparametric methods for directly estimating the density p(x). In high dimensions, we must often settle for estimating a rather crude mixture model, where each component represents a distinct class of observations; see Section 3.

Due to their nature, Bayesian methods are necessarily model-based. Fortunately, recent numerical and computational advances such as the development of Markov Chain Monte Carlo techniques have made it feasible to fit and analyse increasingly complex models (see, for example, Gilks, Richardson and Spiegelhalter, 1996).

The aim of this paper is to provide an overview of Bayesian classification methods, both supervised and unsupervised, emphasising general ideas rather than specific problems or techniques. However, in order to motivate and illustrate the main ideas, we shall sometimes find it convenient to focus on discriminant analysis (Section 2) and cluster analysis (Section 3). Also, no attempt will be made to discuss variable and model selection, or performance assessment criteria such as misclassification rates. These and related issues are thoroughly discussed, for example, in Hand (1997).

The paper is organised as follows. In the next section we derive a general, fully Bayesian supervised classification rule; several popular classification rules follow from this general rule depending on specific modelling considerations. In Section 3 we discuss Bayesian methods for unsupervised classification and, in particular, Bayesian cluster analysis. Section 4 explores the use of cluster analysis to identify types and antitypes in the context of standard configural frequency analysis. Finally, Section 5 contains some concluding remarks.
2. Supervised classification

In supervised learning, the goal is to predict the value of an outcome measure based on a (possibly large) number of input measures. These measures can be either quantitative or qualitative. In the statistical literature, measures are usually referred to as variables. Qualitative (categorical) variables are also called factors. This distinction in output type has led to naming conventions for the prediction tasks: regression when we predict quantitative outputs, and classification when we predict qualitative outputs. Output measures are usually called responses (or dependent variables) and input measures are called predictors (or independent variables). In other fields, the terms attributes or features for variables, and targets for a categorical response, are also common.
2.1 Classification as a decision problem

The problem of supervised classification is best stated as a statistical decision problem. To this end, we first describe and solve a prototypical decision problem with the following elements:

1. Space of decisions: D, contains all potential decisions under consideration.
2. Space of states of nature: Ω, describes uncertain events that may affect our decisions.
3. Prior distribution: p(w), defined on Ω, describes our knowledge (or lack thereof) concerning the unknown states of nature.
4. Loss function: L(d, w), describes the cost incurred by taking decision d ∈ D when the true state of nature is w ∈ Ω.
Given an observation x from p(x | w), the prior distribution is updated and the combined information is described by the posterior distribution

p(w \mid x) = \frac{p(w)\, p(x \mid w)}{\sum_{w \in \Omega} p(w)\, p(x \mid w)},

which is obtained via Bayes' theorem. According to Bayesian decision theory (e.g. Bernardo and Smith, 1994), the best decision is the one that minimises the posterior expected loss, i.e.

d^*(x) = \arg\min_{d \in D} L_x(d), \quad \text{where} \quad L_x(d) = \sum_{w \in \Omega} L(d, w)\, p(w \mid x).

If w is a continuous parameter, the above expressions still hold provided that sums are replaced by integrals.
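As a concrete illustration of the discrete case, the following is a minimal Python sketch of the posterior update and the minimisation of the posterior expected loss over a finite space of states of nature; the prior, likelihood values and loss matrix are purely hypothetical and are not taken from the paper.

import numpy as np

# Hypothetical setup: three states of nature and three possible decisions.
prior = np.array([0.5, 0.3, 0.2])          # p(w), defined on Omega = {0, 1, 2}
likelihood = np.array([0.1, 0.6, 0.3])     # p(x | w) at the observed x, one value per state
loss = np.array([[0.0, 2.0, 5.0],          # L(d, w): rows index decisions d,
                 [1.0, 0.0, 3.0],          # columns index states w
                 [4.0, 1.0, 0.0]])

# Posterior p(w | x) via Bayes' theorem.
posterior = prior * likelihood
posterior /= posterior.sum()

# Posterior expected loss L_x(d) for each decision, and the Bayes decision d*(x).
expected_loss = loss @ posterior
d_star = int(np.argmin(expected_loss))

print("posterior:", posterior)
print("posterior expected losses:", expected_loss)
print("optimal decision d*(x):", d_star)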
2.1.1 Regression

In this case the response is a quantitative (continuous) variable, and we have D ≡ Ω ≡ ℝ, d ≡ ŷ, w ≡ y, and L(ŷ, y) = (ŷ − y)². Hence

L_x(\hat{y}) = \int (\hat{y} - y)^2\, p(y \mid x)\, dy,

which is minimised when

\hat{y} = \hat{y}^*(x) \equiv E(Y \mid x).
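For completeness, the minimiser follows by differentiating the posterior expected loss with respect to ŷ and setting the derivative to zero:

\frac{\partial}{\partial \hat{y}} \int (\hat{y} - y)^2\, p(y \mid x)\, dy
  = 2 \int (\hat{y} - y)\, p(y \mid x)\, dy
  = 2\bigl(\hat{y} - E(Y \mid x)\bigr) = 0
  \quad \Longrightarrow \quad \hat{y} = E(Y \mid x),

and the second derivative equals 2 > 0, so this stationary point is indeed a minimum.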
We shall now leave regression problems aside, and devote the remainder of this section to the construction of Bayesian classification rules. Bayesian regression methods are treated in detail in Denison, Holmes, Mallick and Smith (2002).
2.1.2 Classification

Here the response is a qualitative (categorical) variable, and so D ≡ {1, 2, ..., C}, Ω ≡ {1, 2, ..., C}, d ≡ ŷ and w ≡ y. In other words, we assume that the input variable x comes from one of C distinct classes, labelled 1, 2, ..., C. It is common in classification problems to use loss functions of the form L(ŷ, y) = l_{ŷy}, where l_{ŷy} ≥ 0 denotes the loss incurred by choosing the class ŷ when the true class is y. Suppose, for example, that l_{ŷy} = 0 if ŷ = y, and l_{ŷy} = 1 if ŷ ≠ y. Then L_x(ŷ) = 1 − p(ŷ | x), which is minimised when

\hat{y} = \hat{y}^*(x) \equiv \arg\max_{\hat{y}} p(\hat{y} \mid x),    (1)

where

p(y \mid x) = \frac{p(y)\, p(x \mid y)}{\sum_{j=1}^{C} p(j)\, p(x \mid j)}.    (2)

In other words, for the 0-1 loss the solution to the supervised classification problem is given by the mode of the posterior distribution for the true class.
2.2 Discriminant analysis

So far we have assumed that the class conditional densities, p(x | y), are completely specified. Under such (ideal) conditions, the optimal classification rule (1) attains the minimal misclassification rate and is known as the Bayesian classifier. In applications, however, we never really know the functional form of p(x | y). Nevertheless, in supervised classification problems we do have a training sample (y_1, x_1), ..., (y_n, x_n), from which p(x | y) can in principle be estimated. To this end, a common simplifying assumption is that the predictor variables are independent within each class, so that p(x | y) = ∏_{l=1}^{p} p(x_l | y). Of course this assumption is seldom warranted in practice. Nevertheless, the class conditional densities are often approximated by the product of their marginal densities on each predictor variable. The resulting procedures are known as naïve Bayesian classifiers.
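As an illustration of this idea, the following is a minimal Python sketch of a naïve Bayesian classifier with Gaussian marginals estimated from the training data; the variable names, the Gaussian marginals and the simulated data are illustrative choices, not prescriptions from the paper.

import numpy as np

def fit_naive_bayes(X, y):
    """Estimate class proportions and per-class marginal means/variances."""
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        params[c] = {
            "prior": len(Xc) / len(X),          # p(y) estimated by n_y / n
            "mean": Xc.mean(axis=0),            # marginal means, one per predictor
            "var": Xc.var(axis=0) + 1e-9,       # marginal variances (jittered for stability)
        }
    return params

def predict_naive_bayes(params, x):
    """Classify x by the mode of the (approximate) posterior p(y | x), as in (1)-(2)."""
    log_post = {}
    for c, p in params.items():
        # log p(y) + sum_l log p(x_l | y), with Gaussian marginals
        log_lik = -0.5 * np.sum(np.log(2 * np.pi * p["var"]) + (x - p["mean"]) ** 2 / p["var"])
        log_post[c] = np.log(p["prior"]) + log_lik
    return max(log_post, key=log_post.get)

# Hypothetical training data: two classes, two predictors.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y = np.array([1] * 50 + [2] * 50)
params = fit_naive_bayes(X, y)
print(predict_naive_bayes(params, np.array([2.5, 2.8])))   # expected to favour class 2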
2.2.1 Parametric models

The simplest approach to the estimation of p(x | y), y = 1, 2, ..., C, is to assume that p(x | y) = f_y(x | θ_y), where the density f_y(x | θ_y) is known except for a finite-dimensional parameter θ_y. From the classical point of view, each θ_y is then estimated based on the observations corresponding to the y-th class, namely x^(y) = {x_i : y_i = y; i = 1, ..., n}. Denoting such estimates by θ̂_y, the resulting classification rule is obtained from (1) simply by replacing p(x | y) by its plug-in estimate p̂(x | y) = f_y(x | θ̂_y) in (2). The 'prior weights' p(y), y = 1, 2, ..., C, are usually 'estimated' by the sample proportions p̂(y) = n_y / n, where n_y is the number of observations in the set x^(y). This pseudo-Bayesian procedure does not account for the uncertainty in the estimation of θ_y and can therefore lead to over-optimistic error rate estimates.

As an illustration, suppose that C = 2, θ_y = (μ_y, Σ_y) and f_y(x | θ_y) = N(x | μ_y, Σ_y), where N(· | μ, Σ) denotes a p-variate normal density with mean μ and variance-covariance matrix Σ. Then the classical linear and quadratic discriminant functions can be derived in this way if we assume that Σ_1 = Σ_2 and Σ_1 ≠ Σ_2, respectively.

From a Bayesian perspective, the additional uncertainty concerning the unknown value of each parameter θ_y must be described in terms of a prior distribution π_y(θ_y) defined on the parameter space Θ_y. Let θ = {θ_1, ..., θ_C} and X = {x^(1), ..., x^(C)}. We can then reformulate the original decision problem, discussed in Section 2.1, as follows:

1. Space of decisions: D = {1, 2, ..., C}.
2. Space of states of nature: Ω = ∪_{y=1}^{C} ({y} × Θ_y).
3. Prior distribution: p(y, θ) = p(y) p(θ | y), with p(θ | y) = π_y(θ_y). When no prior information about the true class is available, it is common to take p(y) = 1/C for all y = 1, 2, ..., C.
4. Loss function: L(ŷ, y) = l_{ŷy}, as before.
Let x be a new observation (to be classified). Then the posterior distribution is given by

p(y, θ | x, X) = p(y | x, X) p(θ | y, x, X).

Note that, given the form of the loss function, all we need to find in order to compute the posterior expected loss is p(y | x, X), which can be found to be

p(y \mid x, X) = \frac{p(y \mid X)\, p(x \mid y, X)}{\sum_{j=1}^{C} p(j \mid X)\, p(x \mid j, X)},    (3)

where
p(y \mid X) = \frac{p(y)\, p(X \mid y)}{\sum_{j=1}^{C} p(j)\, p(X \mid j)},    (4)

and

p(x \mid y, X) = \int f_y(x \mid \theta_y)\, \pi_y(\theta_y \mid X)\, d\theta_y

is the Bayesian posterior predictive distribution for samples from the y-th class. Here,

p(X \mid y) = p(x^{(y)} \mid y) = \int f_y(x^{(y)} \mid \theta_y)\, \pi_y(\theta_y)\, d\theta_y

and

\pi_y(\theta_y \mid X) = \pi_y(\theta_y \mid x^{(y)}) = \frac{\pi_y(\theta_y)\, f_y(x^{(y)} \mid \theta_y)}{\int \pi_y(\theta_y)\, f_y(x^{(y)} \mid \theta_y)\, d\theta_y}    (5)

is the posterior distribution of θ_y given the training data. In the case of the 0-1 loss introduced in Section 2.1.2, the resulting classification rule takes the form

\hat{y} = \hat{y}^*(x) \equiv \arg\max_{\hat{y}} p(\hat{y} \mid x, X).    (6)
The above derivation can be interpreted as follows. First, the training sample is used in order to update the probability of each class via (4), and to learn about the unknown values of the parameters θ_1, ..., θ_C by means of (5). Then, the new observation x is classified according to (6). Note that the posterior probabilities (3) are similar to those in (2), which are used for classification in the ideal case where the class conditional densities are fully specified. Here, however, instead of the classical plug-in estimate f_y(x | θ̂_y), the Bayesian approach 'estimates' the unknown class conditional density through the posterior predictive density p(x | y, X). Similarly, the 'prior weights' are updated using (4) rather than the sample proportions p̂(y) = n_y / n. This provides a fully Bayesian classification rule.
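To make rule (6) concrete, the following is a minimal Python sketch for a single continuous predictor, assuming a normal class conditional f_y(x | θ_y) = N(x | θ_y, σ²) with known σ² and a conjugate normal prior θ_y ~ N(m_0, v_0), so that the posterior predictive p(x | y, X) appearing in (3) is available in closed form. For simplicity the class weights p(y) are kept fixed at 1/C rather than updated via (4); all numerical values are hypothetical.

import numpy as np
from scipy.stats import norm

def posterior_predictive(x_new, x_class, sigma2=1.0, m0=0.0, v0=10.0):
    """p(x_new | y, X) for a N(theta_y, sigma2) model with conjugate N(m0, v0) prior on theta_y."""
    n = len(x_class)
    v_n = 1.0 / (1.0 / v0 + n / sigma2)               # posterior variance of theta_y
    m_n = v_n * (m0 / v0 + x_class.sum() / sigma2)    # posterior mean of theta_y
    # Predictive density: N(x_new | m_n, sigma2 + v_n)
    return norm.pdf(x_new, loc=m_n, scale=np.sqrt(sigma2 + v_n))

# Hypothetical training data for C = 2 classes.
rng = np.random.default_rng(1)
x1 = rng.normal(-1.0, 1.0, size=30)   # class 1
x2 = rng.normal(2.0, 1.0, size=30)    # class 2

x_new = 0.8
pred = np.array([posterior_predictive(x_new, x1), posterior_predictive(x_new, x2)])
weights = np.array([0.5, 0.5])        # fixed p(y) = 1/C, for illustration only
post = weights * pred
post /= post.sum()                    # posterior class probabilities, cf. (3)
print("p(y | x, X):", post, "-> classify as class", int(np.argmax(post)) + 1)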
2.2.2 Nonparametric models

Parametric models are often too restrictive to capture some of the most important features of data sets encountered in applications. In an attempt to get round this problem, nonparametric methods consider p(x | y), the class conditional density itself, as the unknown 'parameter'. Classical procedures for density estimation, including kernel methods, are discussed, for example, in Silverman (1986). Once a (nonparametric) estimate p̂(x | y) has been obtained, we can proceed in essentially the same way as in Section 2.2.1, now using p̂(x | y) instead of the (parametric) plug-in estimate f_y(x | θ̂_y).
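A classical (non-Bayesian) version of this plug-in idea can be sketched with kernel density estimates of the class conditional densities; the choice of scipy's Gaussian kernel estimator and the simulated data below are illustrative assumptions only.

import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(2)
x1 = rng.normal(-1.0, 1.0, size=100)   # training data, class 1
x2 = rng.normal(2.0, 1.5, size=100)    # training data, class 2

# Kernel density estimates of p(x | y) for each class.
kde1, kde2 = gaussian_kde(x1), gaussian_kde(x2)
prior = np.array([0.5, 0.5])           # p(y), here taken as equal

def classify(x_new):
    # Plug the estimated densities into rule (1)-(2).
    dens = np.array([kde1(x_new)[0], kde2(x_new)[0]])
    post = prior * dens
    return int(np.argmax(post)) + 1

print(classify(0.3))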
Another popular nonparametric approach, known as the k-nearest-neighbours method, directly estimates p(y | x) in the following way.

1. Let δ_i = ||x − x_i|| be the distance (according to some predefined metric) between the i-th predictor vector and the location where we wish to predict the class label.
2. Let δ_(i) represent the i-th ordered distance, so that δ_(1) is the minimum observed distance and δ_(n) is the maximum. Also, let y_(1), ..., y_(n) be the corresponding class labels, i.e. ordered in the same way as the δ_(i).
3. Choose a value for k (k = 1, 2, ..., n), and predict y using the most frequently occurring class in y_(1), ..., y_(k).
In other words, the k-nearest-neighbours method finds the k data points which are closest to x and predicts y by choosing the mode of the class labels relating to those k nearest points. This effectively estimates p(y | x) by means of a sample mode; cf. the example in Section 2.1.2. From a Bayesian perspective, things are not all that different. Bayesian density estimation procedures are well developed (see, for example, Escobar and West, 1995; and Fraley and Raftery, 2002), and can be used to produce nonparametric versions of (3), from which a Bayesian nonparametric classification rule – analogous to (6) – can be derived. In addition, there exists a probabilistic version of the k-nearest-neighbours method that can be used to estimate the posterior probabilities p(y | x) directly. Denison et al. (2002, Ch. 8) describe the Bayesian implementation of this method via Markov Chain Monte Carlo techniques.
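The following is a minimal Python sketch of the (non-probabilistic) k-nearest-neighbours rule just described; the Euclidean metric, the value of k and the simulated data are illustrative assumptions.

import numpy as np
from collections import Counter

def knn_predict(x_new, X_train, y_train, k=5):
    """Predict the class of x_new by majority vote among its k nearest training points."""
    dists = np.linalg.norm(X_train - x_new, axis=1)      # delta_i under the Euclidean metric
    nearest = np.argsort(dists)[:k]                      # indices of the k smallest distances
    votes = Counter(y_train[nearest])                    # class labels y_(1), ..., y_(k)
    return votes.most_common(1)[0][0]                    # sample mode of the neighbour labels

# Hypothetical two-class training data in two dimensions.
rng = np.random.default_rng(3)
X_train = np.vstack([rng.normal(0, 1, (40, 2)), rng.normal(3, 1, (40, 2))])
y_train = np.array([1] * 40 + [2] * 40)
print(knn_predict(np.array([2.5, 2.5]), X_train, y_train, k=7))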
2.3 Logistic regression

The logistic model arose originally in a parametric context from the desire to model the posterior probabilities p(y | x) directly via linear functions in x. To fix ideas, suppose for the moment that there are C = 2 classes only. Then p(y | x) is typically assumed to be such that

\log \frac{p(1 \mid x)}{1 - p(1 \mid x)} \equiv \log \frac{\Pr(Y = 1 \mid x)}{\Pr(Y = 2 \mid x)} = \alpha + \beta^{T} x,    (7)

where α ∈ ℝ and β ∈ ℝ^p are unknown parameters. Thus, the log-odds are supposed to be a linear function of x. Dellaportas and Smith (1993) discuss the Bayesian analysis of generalised linear models, including logistic models of the form (7). Nonparametric logistic models of the form

\log \frac{p(1 \mid x)}{1 - p(1 \mid x)} = g(x),

where g(·) is an unknown, smooth function, are discussed for example in Gutiérrez-Peña and Smith (1998) and Denison et al. (2002). Once an estimate p̂(y | x) based on (7) has been obtained, a classification rule can be derived from (1) in basically the same way as in Section 2.1.2.
Extensions to more than two classes are straightforward and are based on a set of C − 1 log-odds, such as

\log \frac{p(y \mid x)}{p(C \mid x)} \equiv \log \frac{\Pr(Y = y \mid x)}{\Pr(Y = C \mid x)} = \alpha_y + \beta_y^{T} x, \qquad y = 1, 2, \ldots, C - 1.
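To illustrate how the C − 1 log-odds determine the class probabilities, here is a minimal Python sketch that converts (hypothetical, not estimated) parameters α_y, β_y into posterior probabilities p(y | x), with class C as the baseline, and then classifies by rule (1).

import numpy as np

def class_probabilities(x, alpha, beta):
    """p(y | x) for y = 1, ..., C from C-1 linear log-odds against baseline class C."""
    # Linear predictors eta_y = alpha_y + beta_y^T x for y = 1, ..., C-1; class C has eta_C = 0.
    eta = np.append(alpha + beta @ x, 0.0)
    eta -= eta.max()                    # for numerical stability
    p = np.exp(eta)
    return p / p.sum()

# Hypothetical parameters for C = 3 classes and p = 2 predictors.
alpha = np.array([0.5, -1.0])                  # alpha_1, alpha_2
beta = np.array([[1.0, -0.5], [0.2, 0.8]])     # rows: beta_1^T, beta_2^T
x = np.array([0.4, 1.2])

probs = class_probabilities(x, alpha, beta)
print("p(y | x):", probs, "-> classify as class", int(np.argmax(probs)) + 1)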
3. Unsupervised classification

As pointed out in Section 2, in supervised learning the goal is to predict the value of an outcome measure based on a number of input measures; in contrast, in unsupervised learning there is no outcome measure, and the goal is to describe the patterns and associations among a set of 'input' measures.

Amongst the most useful methods of unsupervised learning are those of cluster analysis, also called data segmentation. Cluster analysis seeks to partition a set of n observations x_1, ..., x_n into a set of K mutually exclusive groups such that objects within each group are more closely related to one another than objects assigned to different groups. Here K may or may not be specified. When the data consist of continuous observations, the most common model-based approaches assume an underlying mixture (usually of multivariate normal distributions), with each mode of the estimated mixture corresponding to a cluster in the data. When the variables are assumed uncorrelated within each distribution in the mixture, the resulting method is known as latent profile analysis (cf. Converse and Oswald, 2004).

In the case of multivariate categorical data, this approach is known as latent class analysis. Here, similarly to the continuous case, the so-called manifest variables are assumed independent within each latent class. This assumption is often unrealistic, but extensions of this basic model exist (Zhang, 2003). In latent class analysis, the latent variable that determines the structure of the data is categorical. A related class of methods, known as latent trait analysis, obtains when such a latent variable is continuous (cf. Wolfe, 2004).

Latent class analysis is often useful in the evaluation of diagnostic tests in the absence of a benchmark or 'gold standard'. Suppose, for example, that we have several tests for detecting the presence or absence of a given disease, but no 'gold standard' test indicating disease status without uncertainty. Then latent class analysis can be used to provide estimates of the diagnostic accuracy of the different available tests (cf. Baillargeon, 2004). Traditional latent class analysis methods are mostly based on the fit of a suitable log-linear model; since the Bayesian analysis of such models is now routine (see, for example, Congdon, 2001), similar Bayesian latent class methods are possible.
3.1 Bayesian cluster analysis

Binder (1978) constructs the following general formulation, in which a prior density is specified jointly for the number of clusters, the grouping of the data, and the parameters of the component densities of an assumed mixture model,
p(x \mid \theta) = \sum_{j=1}^{K} w_j\, p_j(x \mid \theta_j).
Let z_i = (z_{i1}, ..., z_{iK}) denote an indicator vector such that z_{ij} = 1 if the observation x_i belongs to cluster j, and z_{ij} = 0 otherwise; i = 1, ..., n; j = 1, ..., K. The set z = (z_1, ..., z_n) of indicator vectors then allows us to represent any particular grouping of the data. Thus, the prior can be written as
\pi(k, z, \theta) = \pi(k)\, \pi(z \mid k)\, \pi(\theta \mid k, z).

This combines with the likelihood

\prod_{i=1}^{n} \prod_{j=1}^{K} \left\{ p_j(x_i \mid \theta_j) \right\}^{z_{ij}}

to give the posterior density π(k, z, θ | x) (note that this will be zero unless z corresponds to an actual grouping into k clusters). The marginal posterior probabilities of particular groupings,

\pi(z \mid x) = \sum_{k} \int \pi(k, z, \theta \mid x)\, d\theta,

then provide the basis for choosing the 'best' clustering of the data. Binder (1978) describes the solution to this decision problem for various loss functions based on different measures of 'closeness' between the chosen grouping and the true grouping. In some special cases, this procedure reproduces familiar clustering criteria. In most cases, however, the above procedure will be analytically intractable. Binder (1981) discusses approximate Bayesian clustering rules. Recent computational advances have made the Bayesian analysis of mixtures possible (see, for example, Escobar and West, 1995; and Richardson and Green, 1997), and have motivated the development of sophisticated Bayesian unsupervised classification procedures, including model-based hierarchical clustering (Fraley and Raftery, 2002).
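The following is a minimal Python sketch of a Gibbs sampler for such a mixture model under strong simplifying assumptions (fixed number of clusters K, univariate normal components with known common variance, conjugate normal priors on the component means and a Dirichlet prior on the weights); it illustrates the indicator-variable formulation above rather than Binder's full decision-theoretic procedure, and the data and hyperparameters are hypothetical.

import numpy as np

rng = np.random.default_rng(4)

# Hypothetical data from two well-separated groups.
x = np.concatenate([rng.normal(-2.0, 1.0, 60), rng.normal(3.0, 1.0, 40)])
n, K, sigma2 = len(x), 2, 1.0          # fixed K and known component variance
m0, v0, alpha = 0.0, 25.0, 1.0         # prior hyperparameters (normal means, Dirichlet weights)

theta = rng.normal(0.0, 1.0, K)        # initial component means theta_j
w = np.full(K, 1.0 / K)                # initial mixture weights w_j

for it in range(2000):
    # 1. Sample the indicators z_i given (w, theta): log p(z_i = j | ...) up to a constant.
    logp = np.log(w) - 0.5 * (x[:, None] - theta[None, :]) ** 2 / sigma2
    p = np.exp(logp - logp.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    z = np.array([rng.choice(K, p=p[i]) for i in range(n)])

    # 2. Sample the component means theta_j given z (conjugate normal update).
    for j in range(K):
        xj = x[z == j]
        vn = 1.0 / (1.0 / v0 + len(xj) / sigma2)
        mn = vn * (m0 / v0 + xj.sum() / sigma2)
        theta[j] = rng.normal(mn, np.sqrt(vn))

    # 3. Sample the weights w given z (conjugate Dirichlet update).
    counts = np.bincount(z, minlength=K)
    w = rng.dirichlet(alpha + counts)

print("final posterior draw of component means:", np.sort(theta))
print("cluster sizes in final draw:", np.bincount(z, minlength=K))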
3.2 Unsupervised as supervised learning

Hastie et al. (2001, Chap. 14) discuss an interesting technique for transforming the problem of estimating p(x) into one of supervised function approximation. This approach is promising, since supervised learning methods are well developed in the literature compared with traditional unsupervised learning techniques. We now briefly describe this approach.

Let p(x) be the unknown density to be estimated based on the sample x_1, ..., x_n, and let p_0(x) be a specified density function used for reference. For example, p_0(x) might be the uniform distribution over the range of the variables. A simulated sample of size n_0 can be drawn from p_0(x). Pooling these two data sets, and assigning weight 1/2 to the samples from p(x) and weight 1/2 to the simulated samples drawn from p_0(x), we obtain a random sample from the mixture density p_m(x) = {p(x) + p_0(x)}/2. If we assign the value Y = 1 to each sample point drawn from p(x) and Y = 0 to those drawn from p_0(x), then
\mu(x) = E(Y \mid x) = \frac{p(x)}{p(x) + p_0(x)}

can be estimated by supervised learning using the pooled sample (y_1, x_1), ..., (y_{n+n_0}, x_{n+n_0}) as training data. The resulting estimate μ̂(x) can then be inverted to provide an estimate for p(x):

\hat{p}(x) = p_0(x)\, \frac{\hat{\mu}(x)}{1 - \hat{\mu}(x)}.

Generalised versions of logistic regression (see Section 2.3) are especially well suited for this application, since the log-odds λ(x) = log{p(x)/p_0(x)} are estimated directly, and then

\hat{p}(x) = p_0(x)\, \exp\{\hat{\lambda}(x)\}.
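A rough Python sketch of this trick is given below, using a uniform reference density and an off-the-shelf maximum-likelihood logistic fit on polynomial features from scikit-learn as a stand-in for the more flexible estimators the text has in mind; the data-generating density is, of course, only used here to simulate the sample.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(5)

# Sample x_1, ..., x_n from the unknown density p(x) (here a normal, used only to generate data).
x = rng.normal(1.0, 0.7, size=500)
lo, hi = x.min(), x.max()

# Reference sample from p_0(x): uniform over the observed range.
x0 = rng.uniform(lo, hi, size=500)
p0 = 1.0 / (hi - lo)                       # uniform reference density value

# Pooled sample with Y = 1 for the real data and Y = 0 for the reference data.
X = np.concatenate([x, x0]).reshape(-1, 1)
y = np.concatenate([np.ones_like(x), np.zeros_like(x0)])

# Logistic regression on polynomial features as a crude estimate of mu(x) = E(Y | x).
feats = PolynomialFeatures(degree=3, include_bias=False)
clf = LogisticRegression(max_iter=1000).fit(feats.fit_transform(X), y)

# Invert mu_hat(x) to obtain the density estimate p_hat(x) = p_0(x) * mu_hat / (1 - mu_hat).
grid = np.linspace(lo, hi, 5).reshape(-1, 1)
mu_hat = clf.predict_proba(feats.transform(grid))[:, 1]
p_hat = p0 * mu_hat / (1.0 - mu_hat)
print(np.column_stack([grid.ravel(), p_hat]))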
4. An application to configural frequency analysis

Configural frequency analysis is a descriptive method for cell-wise inspection of cross-classifications (von Eye, 1990 and 2004). It screens a cross-classification in search of types and antitypes, that is, cells that contain, respectively, more and fewer cases than expected from a given base model. This is achieved by inspecting the discrepancies between m_i, the observed frequency in cell i, and m̂_i, the estimated expected frequency for the same cell under the base model. Typical discrepancy measures, such as

z_i = \frac{\sqrt{n}\,(m_i - \hat{m}_i)}{\sqrt{\hat{m}_i\,(n - \hat{m}_i)}}

or

X_i^2 = \frac{(m_i - \hat{m}_i)^2}{\hat{m}_i},

where n is the total number of individuals, are based on standard statistical tests.

Gutiérrez-Peña and von Eye (2000) introduced a Bayesian approach to configural frequency analysis. In that paper, not only is each cell assigned a probability of being a type, antitype, or neither, but also a joint probability distribution on the overall patterns of types and antitypes is constructed. Their method makes direct use of the observed frequencies in the cross-classification. However, sometimes the original full data set is not available, and the search for types and antitypes must be carried out on the basis of the discrepancy measures alone. In such cases, unsupervised classification methods may prove useful. Specifically, Bayesian cluster analysis can be applied to a set of discrepancy measures, such as {z_i} or {X_i^2}, in order to identify groups of cells that could be regarded as types or antitypes (or neither). In the following example, previously analysed by von Eye and Brandtstädter (1997) and Gutiérrez-Peña and von Eye (2000), we illustrate the use of the Bayesian clustering method discussed in Section 3.1.
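For reference, a small Python sketch computing both discrepancy measures from observed and expected cell frequencies is shown below; the frequencies are made up and do not reproduce the example that follows.

import numpy as np

def cfa_discrepancies(m, m_hat, n):
    """z and X^2 discrepancy measures for each cell of a cross-classification."""
    m, m_hat = np.asarray(m, float), np.asarray(m_hat, float)
    z = np.sqrt(n) * (m - m_hat) / np.sqrt(m_hat * (n - m_hat))
    x2 = (m - m_hat) ** 2 / m_hat
    return z, x2

# Hypothetical observed and expected frequencies for a small table.
m = np.array([30, 10, 25, 35])
m_hat = np.array([22.0, 18.0, 28.0, 32.0])
z, x2 = cfa_discrepancies(m, m_hat, n=int(m.sum()))
print("z:", np.round(z, 2))
print("X^2:", np.round(x2, 2))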
Example. The data for this example stem from a study on sleep behaviour and comprise responses from n = 273 individuals (Görtelmeyer, 1988). Tables 4 and 6 of Gutiérrez-Peña and von Eye (2000) present the results of the classical and Bayesian configural frequency analyses (respectively) of these data for the resulting 7 × 2 cross-classification. The classical analysis is based on the z-test, and the corresponding discrepancies are 3.31, -3.31, 3.18, -3.18, 2.83, -2.83, -0.01, 0.01, -1.38, 1.38, -0.49, 0.49, -4.41, 4.41. In this case, we set K = 3 and assume that the mixture component densities are univariate normal with the same variance. Vague priors are assigned to the parameters of the mixture components as well as to the set of data groupings. The resulting clusters are given by
{(1, 1), (2, 1), (3, 1), (7, 2)} , {(1, 2), (2, 2), (3, 2), (7, 1)} and {(4, 1), (4, 2), (5, 1), (5, 2), (6, 1), (6, 2)} , which correspond, respectively, to types, antitypes, and cells that are neither types nor antitypes. These results are in perfect agreement with those in Tables 4 and 6 of Gutiérrez-Peña and von Eye (2000).
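The flavour of this analysis can be reproduced quickly with an EM-fitted mixture from scikit-learn as a non-Bayesian stand-in for the clustering procedure of Section 3.1; the component labels below are arbitrary and the fit is only indicative, not a substitute for the Bayesian analysis reported above.

import numpy as np
from sklearn.mixture import GaussianMixture

# The 14 z-discrepancies reported for the 7 x 2 cross-classification, in the order listed above.
z = np.array([3.31, -3.31, 3.18, -3.18, 2.83, -2.83, -0.01,
              0.01, -1.38, 1.38, -0.49, 0.49, -4.41, 4.41]).reshape(-1, 1)

# Three univariate normal components with a common ('tied') variance.
gm = GaussianMixture(n_components=3, covariance_type="tied", random_state=0).fit(z)
labels = gm.predict(z)
for mean, comp in sorted(zip(gm.means_.ravel(), range(3))):
    cells = np.where(labels == comp)[0] + 1          # positions in the list above (1-based)
    print(f"component mean {mean:+.2f}: cells {cells.tolist()}")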
5. Concluding remarks

In this paper we have attempted to provide a general overview of Bayesian classification methods, both supervised and unsupervised. We have argued that such methods: (i) can be given a solid justification on the basis of Bayesian decision theory, and (ii) are necessarily model-based. We have also discussed some issues relating to statistical modelling in this context, both from a parametric and a nonparametric perspective.

In Section 4 we explored the use of cluster analysis to identify types and antitypes in the context of standard configural frequency analysis. The exploratory Bayesian approach proposed there should prove most useful in situations where the number of cells in a cross-classification is very large, so that the fully Bayesian method may become infeasible.

Acknowledgements: this work was partially supported by the Sistema Nacional de Investigadores, Mexico.
References

1. Baillargeon, R. (2004). Modelling intra-individual change over time in the absence of a 'gold standard'. Psychology Science (this volume).
2. Bernardo, J.M. and Smith, A.F.M. (1994). Bayesian Theory. Chichester, Wiley.
3. Binder, D.A. (1978). Bayesian cluster analysis. Biometrika 65, 31–38.
4. Binder, D.A. (1981). Approximations to Bayesian clustering rules. Biometrika 68, 275–285.
5. Breiman, L., Friedman, J., Olshen, R. and Stone, C. (1984). Classification and Regression Trees. Belmont, Wadsworth.
6. Congdon, P. (2001). Bayesian Statistical Modelling. Chichester, Wiley.
7. Converse, P.D. and Oswald, F.L. (2004). Profile analysis or typological analysis: asking and answering the right questions. Psychology Science (this volume).
8. Dellaportas, P. and Smith, A.F.M. (1993). Bayesian inference for generalised linear and proportional hazards models via Gibbs sampling. Applied Statistics 42, 443–459.
9. Denison, D.G.T., Holmes, C.C., Mallick, B.K. and Smith, A.F.M. (2002). Bayesian Methods for Nonlinear Classification and Regression. New York, Wiley.
10. Escobar, M.D. and West, M. (1995). Bayesian density estimation and inference using mixtures. Journal of the American Statistical Association 90, 577–588.
11. Fraley, C. and Raftery, A.E. (2002). Model-based clustering, discriminant analysis and density estimation. Journal of the American Statistical Association 97, 611–631.
12. Gilks, W.R., Richardson, S. and Spiegelhalter, D.J. (1996). Markov Chain Monte Carlo in Practice. London, Chapman & Hall.
13. Görtelmeyer, R. (1988). Typologie des Schlafverhaltens (Typology of sleep behaviour). Regensburg, Roderer.
14. Gutiérrez-Peña, E. and Smith, A.F.M. (1998). Aspects of smoothing and model adequacy in generalised regression. Journal of Statistical Planning and Inference 67, 273–286.
15. Gutiérrez-Peña, E. and von Eye, A. (2000). A Bayesian approach to Configural Frequency Analysis. Journal of Mathematical Sociology 24, 151–174.
16. Hand, D.J. (1981). Discrimination and Classification. Chichester, Wiley.
17. Hand, D.J. (1997). Construction and Assessment of Classification Rules. Chichester, Wiley.
18. Hartigan, J.A. (1982). Classification. In Encyclopedia of Statistical Sciences (eds. S. Kotz and N. Johnson). New York, Wiley.
19. Hastie, T., Tibshirani, R. and Friedman, J. (2001). The Elements of Statistical Learning - Data Mining, Inference and Prediction. New York, Springer-Verlag.
20. Richardson, S. and Green, P.J. (1997). On Bayesian analysis of mixtures with an unknown number of components (with discussion). Journal of the Royal Statistical Society B 59, 731–792.
21. Silverman, B. (1986). Density Estimation for Statistics and Data Analysis. London, Chapman & Hall.
22. von Eye, A. (1990). Introduction to Configural Frequency Analysis: The Search for Types and Antitypes in Cross-Classifications. Cambridge, Cambridge University Press.
23. von Eye, A. (2004). Configural frequency analysis - a method for the analysis of existing types. Psychology Science (this volume).
24. von Eye, A. and Brandtstädter, J. (1997). Configural frequency analysis as a searching device for possible causal relationships. Methods of Psychological Research - Online 2, 1–23.
25. Wolfe, E.W. (2004). Identifying rater effects using latent trait models. Psychology Science (this volume).
26. Zhang, N.L. (2003). Hierarchical latent class models for cluster analysis. To appear in Journal of Machine Learning Research.