Receiver Operating Characteristic Curves and Fusion of Multiple Classifiers∗

Justin M. Hill a, Mark E. Oxley b and Kenneth W. Bauer a

a Department of Operational Sciences, b Department of Mathematics and Statistics, Air Force Institute of Technology, Wright-Patterson AFB, OH, U.S.A.
Abstract – A classifier typically has parameters that can vary; thus a classifier is, in fact, a family of classifiers. Varying the parameters and graphing the probability of false positive versus the probability of true positive yields the Receiver Operating Characteristic (ROC) curve. Given a collection of multiple classifiers, assume the ROC curve is known for each classifier. Can the ROC curve for the fused classifier be determined? We answer this question in this paper. This paper also presents a unique representation of the fusion of multiple classifiers where the fusion is based on Boolean rules.
ROC curves have their foundation in statistical decision theory [1] and were originally developed as tools for electronic signal detection [2]. ROC analysis has been extensively applied to human perception and decision-making problems [3, 4] and is also commonly used in biomedical research [5, 6]. Alsing et al. [7] provide a comprehensive review of the use of ROC curves in ATR and medical research. For an in-depth technical discussion of ROC curves, consult Egan [8] and Swets & Pickett [9].
Keywords: Receiver Operating Characteristic (ROC) curve, Classifier, Boolean algebra, Kronecker tensor product, Hadamard product.
Problem Statement: Consider a collection of classifiers that are to be fused. Given that one knows the ROC curves of the individual classifiers and the fusion rule, can the ROC curve of the fused classifier be determined without having to perform tests to determine its performance?
1 Introduction

2 Background
∗ The views expressed in this paper are those of the authors and do not reflect the official policy or position of the United States Air Force, Department of Defense, or the US Government.

Let (E, E) be a measurable space. The nonempty set E is a sample set of outcomes e (that is, e ∈ E), and E is a σ-algebra of measurable events E (that is, E ∈ E). These outcomes and events usually are too difficult to quantify; therefore, one introduces a feature map X defined on E that produces an object x called a feature (typically a vector of real numbers). Let X denote the set of possible features (that is, x ∈ X). Assume that (X, X) is a measurable space, where X is a σ-algebra of sets of features, and the mapping X is a measurable mapping, that is, for each set of features X ∈ X the inverse image X⁻¹[X] ≡ {e ∈ E : X(e) ∈ X} ∈ E. Consider the two-class problem. Suppose there exist two sets E1, E2 ⊂ E that form a partition of E, that is, E1 ∪ E2 = E and E1 ∩ E2 = ∅. The feature map X produces the sets of features X[E1] and X[E2] (the images of E1 and E2 under the map X). The feature map X, in the general case, is not one-to-one; therefore, X[E1] and X[E2] may not form a partition of X. Thus, in general, X[E1] ∩ X[E2] ≠ ∅ and X[E1] ∪ X[E2] = X, so outcomes in different classes might be mapped to the same feature. That is, there might exist
Receiver operating characteristic (ROC) curves are commonly used for summarizing the performance of imperfect diagnostic systems, especially in automatic target recognition (ATR) and in biomedical research, when classification accuracy alone is not sufficient. A ROC curve is the graph of a relation which summarizes the possible performances of a signal detection system faced with the task of detecting a signal (target) in the presence of background noise (clutter). This relation is usually used to relate the detection or "hit" rate (probability of detection, i.e., probability of true positive) to the false alarm rate (probability of false alarm, i.e., probability of false positive) as an internal decision threshold is varied. For a typical ROC curve, the decision threshold is varied from a very conservative value, i.e., a value that results in zero detection rate and zero false alarm rate, to a very aggressive value, i.e., a value that results in 100% detection rate and 100% false alarm rate.
e1 ∈ E1, e2 ∈ E2 such that X(e1) = X(e2) = x, so that x ∈ X[E1] and x ∈ X[E2]; therefore, X[E1] ∩ X[E2] ≠ ∅. One builds a (two-class) classifier that maps a feature into the label set L = {ℓ1, ℓ2}. The label ℓ1 may correspond to a target and the label ℓ2 to a non-target, or clutter. Let A : X → {ℓ1, ℓ2} be a measurable mapping; then the inverse images A⁻¹[{ℓ1}] and A⁻¹[{ℓ2}] are measurable sets in X. Let X1 = A⁻¹[{ℓ1}] and X2 = A⁻¹[{ℓ2}] denote these inverse images (not to be confused with the images X[E1] and X[E2] above); then X1 and X2 form a partition of X, so that X2 is the set complement of X1 with respect to X. Assume the set X2 corresponds to clutter or a non-target and X1 corresponds to a target.
In order to quantify how well the classifier A performs (with respect to the event set E and the feature map X) one introduces a measure by letting Pr be a (probability) measure defined on (X, X) to get a (probability) measure space (X, X, Pr) of features. Therefore, the measurable mapping X is called a random mapping. We consider four special quantifiers that try to measure the performance of the classifier A. Let PTP(A) denote the probability of true positive detection by the classifier A, given by the conditional probability

PTP(A) = Pr[A(x) = ℓ1 | x ∈ X[E1]] = Pr({x ∈ A⁻¹[{ℓ1}] | x ∈ X[E1]}).   (1)

Since X1 = A⁻¹[{ℓ1}], we can rewrite this as

PTP(A) = Pr({x ∈ X1 | x ∈ X[E1]}).   (2)

If PTP(A) equals 1 or is close to 1, then the classifier is a good classifier for determining true positive detection. If PTP(A) equals 0 or is close to 0, the classifier is not a good classifier for determining true positive detection. Other measures of interest are PFP(A), the probability of false positive detection, defined by

PFP(A) = Pr[A(x) = ℓ1 | x ∈ X[E2]] = Pr({x ∈ X1 | x ∈ X[E2]});   (3)

PTN(A), the probability of true negative detection,

PTN(A) = Pr[A(x) = ℓ2 | x ∈ X[E2]] = Pr({x ∈ X2 | x ∈ X[E2]});   (4)

and PFN(A), the probability of false negative detection,

PFN(A) = Pr[A(x) = ℓ2 | x ∈ X[E1]] = Pr({x ∈ X2 | x ∈ X[E1]}).   (5)

If Pr(X[E1]) ≠ 0 and Pr(X[E2]) ≠ 0, these conditional probabilities can be written as

PTP(A) = Pr(X1 ∩ X[E1]) / Pr(X[E1]),    PFP(A) = Pr(X1 ∩ X[E2]) / Pr(X[E2]),
PTN(A) = Pr(X2 ∩ X[E2]) / Pr(X[E2]),    PFN(A) = Pr(X2 ∩ X[E1]) / Pr(X[E1]).

Notice that PFP(A) is the measure of the type I error and PFN(A) is the measure of the type II error of the classifier A. Since {X1, X2} forms a partition of X, then

X[E1] = (X1 ∩ X[E1]) ∪ (X2 ∩ X[E1]),   (6)

so applying Pr gives

Pr(X[E1]) = Pr(X1 ∩ X[E1]) + Pr(X2 ∩ X[E1]),   (7)

and dividing by Pr(X[E1]) ≠ 0 yields

1 = PFN(A) + PTP(A),   (8)

so that

PFN(A) = 1 − PTP(A).   (9)

Similarly, we have that

PTN(A) = 1 − PFP(A);   (10)

hence, we need only consider PTP(A) and PFP(A), since the other two will follow.

3 ROC Curve Theory

Usually, the classifier depends on a parameter θ called a threshold. Let Θ be a set of thresholds. For each θ ∈ Θ let Aθ be a classifier Aθ : X → L = {ℓ1, ℓ2}. Clearly, the collection of classifiers A = {Aθ : θ ∈ Θ} is a family of classifiers. We wish to measure the "goodness" of the family of classifiers A. That is, we seek a measure m defined on A such that m(A) quantifies the "goodness". To this end, for each θ ∈ Θ consider the set

X1,θ ∩ X[E1] ≡ Aθ⁻¹[{ℓ1}] ∩ X[E1]   (11)

and consider the probability of true positive and false positive detection by the classifier Aθ, given by

PTP(Aθ) = Pr(X1,θ ∩ X[E1]) / Pr(X[E1]),    PFP(Aθ) = Pr(X1,θ ∩ X[E2]) / Pr(X[E2]).

Define the set of triples

tA = {(θ, PFP(Aθ), PTP(Aθ)) : θ ∈ Θ}.   (12)

Projecting this "trajectory" onto the second and third coordinates yields the set of pairs

fA = {(PFP(Aθ), PTP(Aθ)) : θ ∈ Θ}.   (13)

If Θ is homeomorphic to the real numbers R, then the trajectory tA will be a curve in R³ and the projection fA will be a curve in R² (specifically, in [0, 1] × [0, 1] ⊂ R²). This curve is called the receiver operating characteristic (ROC)
curve for the classifier family A. For brevity, we will write PTP(θ) and PFP(θ) in place of PTP(Aθ) and PFP(Aθ) and define the 2-vector

P(θ) ≡ (PFP(θ), PTP(θ)).   (14)
Also, we will drop the subscript A on fA where it is clear. Now, the ROC curve can be defined in the more compact notation

f = {P(θ) : θ ∈ Θ}.   (15)

The ROC curve f is a relation. In some cases, f is a function, i.e., if (p, q) ∈ f then, as usual, we write f(p) = q. For example, let F1 and F2 denote the cumulative distribution functions of the scalar feature x for Class 1 and Class 2 data, respectively. Let X1 and X2 denote random variables associated with these distributions. When X is discrete, the ROC function is a set of discrete points. When F1 and F2 are continuous, a closed-form expression for the ROC function f can be written [10] as

f(p) = 1 − F2(F1⁻¹(1 − p))   (16)

for all p ∈ [0, 1]. Lloyd [10] points out that for both the discrete and continuous cases, f is nothing more than the distribution function of 1 − F1(X2). Statistically, this is the non-null distribution function of the p-value [11], 1 − F1(x), for testing the null hypothesis that a given feature x comes from Class 1 [10]. A ROC curve is generated by varying the decision threshold θ over all possible values. As θ is varied from a conservative value, defined here as a low decision threshold value, to an aggressive value, defined here as a high decision threshold value, PTP(θ) and PFP(θ) both take on values between 0 and 1. To ensure that a proper ROC curve is generated [9], the following definition is made.

Definition 1 Let Θ ⊂ R, and let a = inf{θ ∈ Θ} and b = sup{θ ∈ Θ}. The set Θ is said to be an admissible threshold set for the classifier family A if

lim θ→a+ PFP(θ) = 0 and lim θ→a+ PTP(θ) = 0, and
lim θ→b− PFP(θ) = 1 and lim θ→b− PTP(θ) = 1.
These conditions ensure that both PFP and PTP take on values from 0 to 1, so a proper ROC curve is generated.
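The threshold sweep and the closed-form expression f(p) = 1 − F2(F1⁻¹(1 − p)) can be sketched numerically. The Gaussian score model below (Class-1/clutter scores N(0, 1) with CDF F1, target scores N(1.5, 1) with CDF F2) and the rule "output the target label when x > θ" are illustrative assumptions, not the paper's construction.

```python
import numpy as np
from statistics import NormalDist

# Assumed score model: F1 = CDF of the null/clutter class, F2 = CDF of
# the target class (shifted right). A_theta outputs "target" iff x > theta.
F1, F2 = NormalDist(0.0, 1.0), NormalDist(1.5, 1.0)

thetas = np.linspace(-5.0, 6.5, 201)                 # decision-threshold sweep
p_fp = np.array([1.0 - F1.cdf(t) for t in thetas])   # PFP(theta)
p_tp = np.array([1.0 - F2.cdf(t) for t in thetas])   # PTP(theta)

# The same curve from the closed form f(p) = 1 - F2(F1^{-1}(1 - p)):
f = np.array([1.0 - F2.cdf(F1.inv_cdf(1.0 - p)) for p in p_fp[1:-1]])
assert np.allclose(f, p_tp[1:-1], atol=1e-7)

# The sweep attains both endpoints (0, 0) and (1, 1), as required of an
# admissible threshold set.
```

The sweep and the closed form agree because 1 − p recovers F1(θ), so F1⁻¹(1 − p) recovers θ itself.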
How can one combine two different classifiers acting on different feature sets to produce results better than the individual classifiers separately? We now form Cartesian products and define the feature set Z = X × Y and the label set N = L × M.
3.2 ROC Curve Fusion
Oxley and Bauer presented a novel approach for classifier system evaluation by analytically constructing the ROC curve for an MCS based on AND and OR rules using only data from the ROC curves for the individual classifiers [12]. The purpose of the classifier systems researched in their work was to determine whether the system was in one of two states (e.g., friendly or hostile). Their work resulted in four primary contributions. First, they defined the difference between fusion within and across target types. A system of classifiers that are fused within is a system in which all classifiers are trained to detect a particular type of target. Thus, they share the same prior probability of detection. Moreover, there are only two possibilities for truth in such a system: either the target is present, or it is not. A system that is fused across target types includes classifiers trained to detect a number of target types. Each of these types of targets may have a different prior probability of detection, and since the system seeks different types of targets, it can accidentally arrive at the correct conclusion if a classifier seeking target type A incorrectly detects a target when a target of type B is present. For reasons such as these, an across system may be more difficult to analyze than a within system. The second contribution was the derivation of formulas for PD and PFA for logical AND and OR rules in within and across systems. The third and fourth contributions were very closely related. Rather than a traditional definition for a ROC curve (PD vs. PFA), Oxley and Bauer defined a ROC curve as the maximum value of PD for each possible PFA for that particular classifier. Although this contribution may seem trivial, it allowed Oxley and Bauer to analytically determine the ROC curves for logical AND and OR rules.
The example used was a system designed to solve a two-class problem in which there were (a) two classifiers, (b) each classifier could output two labels, and (c) the system could output two labels. However, later in this document we show that the results for AND and OR can be extended to any number of classifiers and labels.
Let Θ denote an admissible threshold set for the classifier family A throughout this paper.
3.1 System of Two Classifiers

Consider the case when two sensors, X and Y, observe events occurring in the same event set E. Assume they produce feature vectors in different feature sets X and Y; in particular, assume X : E → X and Y : E → Y.

4 Main Results

4.1 Single Classifier Performance

4.1.1 Conditional Performance Matrix

One can summarize the performance of a classifier operating at a particular decision threshold θ in terms of the conditional probabilities PTP(θ), PFP(θ), PTN(θ) and PFN(θ) by recording them in a matrix equivalent to the
"performance matrix" defined by Ralston [13]. For specificity, we call this matrix the Conditional Performance Matrix (CPM). One could also build a CPM in which each cell contains the conditional probabilities as functions of θ. For each classifier k let Ck denote the CPM corresponding to the classifier k. Table 1 shows that each column corresponds with truth and each row corresponds with a particular output label. To be consistent, the first row should correspond with the clutter label and the last row should correspond with the target label. Similarly, the first and last columns should correspond with instances of clutter and targets, respectively. The set of CPMs for all θ ∈ Θ can be used to construct the ROC curve for that classifier.

Table 1: Conditional Performance Matrix for a Classifier with Two Labels.

                         Feature Set (Truth)
  Output Label      No Target      Target
  clutter           PTN(θ)         PFN(θ)
  target            PFP(θ)         PTP(θ)

Definition 2 A Conditional Performance Matrix (CPM) for a classifier k is a matrix in which the columns correspond with truth, the rows correspond with the classifier's output labels, and the (i, j) cell is the conditional probability of the classifier outputting label i when the true state of the system is j. The sum of each column of the CPM is unity (i.e., the CPM is column stochastic).

4.1.2 Prior Probabilities Matrix

Using the definition of conditional probability,

Pr{A | B} = Pr{A ∩ B} / Pr{B},

one can compute the joint probability Pr{A ∩ B} by simply multiplying the conditional probability by the a priori probability Pr{B}. Consequently, one can multiply each column of the CPM by the appropriate a priori probability to determine the unconditional probability of each output state. Thus, one could multiply the first column of the CPM by Pr{No Target} and the second column by Pr{Target} to determine the joint probabilities of the output labels coinciding with a particular true state. Let α = Pr{Target} and (1 − α) = Pr{No Target} for a particular classifier k. Now we can define a 2 × 2 diagonal matrix ρk as follows:

ρk = [ (1 − α)   0
          0      α ].

Definition 3 The Prior Probabilities Matrix (PPM) for a particular type of target is a 2 × 2 diagonal matrix in which the (2, 2) cell is the probability of observing that type of target, and the (1, 1) cell is the complementary value. Thus, the trace of a PPM is unity.
4.1.3 Joint Performance Matrix

Now we can define the Joint Performance Matrix (JPM) for a two-label classifier as follows:

Jk = Ck ρk = [ Pr{clutter ∩ No Target}   Pr{clutter ∩ Target}
               Pr{target ∩ No Target}    Pr{target ∩ Target}  ].

The events associated with each element of the JPM are mutually exclusive and exhaustive, and the probabilities define the entire set of outcomes for the classifier k. Note that the trace of a square JPM represents the classification accuracy for the classifier.

Definition 4 The Joint Performance Matrix (JPM) for a classifier k is a matrix in which the columns correspond with truth, the rows correspond with the classifier's output labels, and cell (i, j) is the probability of the classifier outputting label i when the system is in state j. The JPM gives the probabilities of all possible outcomes for the classifier. One can construct the JPM from the CPM and PPM using the formula Jk = Ck ρk, and the sum of the elements of the JPM is unity.
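The CPM/PPM/JPM relationships above can be checked numerically. The matrix entries and the prior α below are made-up values, used only to illustrate the stated properties.

```python
import numpy as np

# CPM: columns are truth (No Target, Target); rows are output labels
# (clutter, target); each column sums to one. Numbers are illustrative.
C_k = np.array([[0.90, 0.20],    # PTN, PFN
                [0.10, 0.80]])   # PFP, PTP
alpha = 0.30                              # assumed prior Pr{Target}
rho_k = np.diag([1.0 - alpha, alpha])     # PPM: trace is unity

J_k = C_k @ rho_k     # JPM: joint probabilities Pr{label ∩ truth}

assert np.allclose(C_k.sum(axis=0), 1.0)  # CPM is column stochastic
assert np.isclose(np.trace(rho_k), 1.0)   # PPM trace is unity
assert np.isclose(J_k.sum(), 1.0)         # JPM elements sum to unity
# trace(J_k) is the classification accuracy of the classifier:
assert np.isclose(np.trace(J_k), 0.90 * 0.70 + 0.80 * 0.30)
```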
4.2 Across Fusion System Performance
Consider a system made up of three classifiers with similar constructions. Each classifier in the system detects a different type of target (e.g., one is trained to detect trucks, another is trained to detect tanks, etc.). Reconsider the set of events E, which is now partitioned into 4 subsets: Ek, where k ∈ {1, 2, 3}, consists of all targets of the k-th target type, and E4 includes all events that are considered clutter. The decisions from each classifier are sent to a fusion center or combiner, where a fusion rule is applied to the labels. The result is the decision for the classifier system (clutter or target) in terms of the system label set LS = {cS, tS}. The following sections describe a method for computing the CPM and JPM for such a system of classifiers. For a more detailed derivation see [14] and [15].

4.2.1 Joint State Probabilities Matrix

A detection system like the one described above could potentially observe 2³ = 8 different combinations of events (e.g., all three types of target are present; target types 1 and 2 are present but target type 3 is not; no targets are present; etc.). There are also 2³ = 8 possible combinations of output labels from the individual classifiers in the example system. Thus, there are potentially 64 possible output states for the three-classifier system. If one assumes that the individual classifiers are statistically independent, it can be shown that the Kronecker product of the individual JPMs results in an 8 × 8 matrix, called the Joint State Probabilities Matrix (JSPM), in which each cell gives the joint probability of a particular combination of output labels and a particular combination of events.
The Kronecker product, or tensor product, of two matrices multiplies each element of one matrix by each element of the other matrix in the following manner [16]. Assume A and B are 2 × 2 matrices. The Kronecker product of A and B is

A ⊗ B = [ a11 B   a12 B
          a21 B   a22 B ]

      = [ a11 b11   a11 b12   a12 b11   a12 b12
          a11 b21   a11 b22   a12 b21   a12 b22
          a21 b11   a21 b12   a22 b11   a22 b12
          a21 b21   a21 b22   a22 b21   a22 b22 ].

A ⊗ B consists of all possible products of an A-matrix entry with a B-matrix entry. Some fundamental properties of the Kronecker product are given in [16] and [17]. Van Loan notes the widening use of the operation and lists some areas where Kronecker product research is thriving: signal processing, image processing, semidefinite programming, quantum computing, and fast Fourier transforms [18]. The Kronecker product is defined for pairs of matrices of any dimensions, but for this example we will only be working with the 2 × 2 CPMs and JPMs. Just as the rows and columns of the JPM correspond with labels and truth, respectively, the rows of the JSPM correspond with specific combinations of labels, and the columns correspond with specific combinations of true events. The JSPM will be denoted SJ:

SJ = J1 ⊗ J2 ⊗ J3.

Definition 5 The Joint State Probabilities Matrix (JSPM), denoted SJ, for a system of classifiers {1, 2, ..., K} seeking K different targets is a matrix in which the columns correspond with truth, the rows correspond with combinations of output labels from the individual classifiers, and cell (i, j) gives the probability of the classifier system outputting the combination of labels i when the true state of the system is j. The JSPM gives the probabilities of all possible states for the classifier system, and the sum of the elements of the JSPM is unity.

4.2.2 Combined Prior Probabilities Matrix

Note that the JSPM is equal to

J1 ⊗ J2 ⊗ J3 = C1 ρ1 ⊗ C2 ρ2 ⊗ C3 ρ3 = (C1 ⊗ C2 ⊗ C3)(ρ1 ⊗ ρ2 ⊗ ρ3).
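A minimal sketch of the JSPM construction and the factorization above, using illustrative CPMs and PPMs for three assumed independent classifiers:

```python
import numpy as np

# Illustrative JPMs for three statistically independent classifiers,
# built as J_k = C_k @ rho_k (Section 4.1.3). Numbers are made up.
C = [np.array([[0.90, 0.20], [0.10, 0.80]]),
     np.array([[0.85, 0.25], [0.15, 0.75]]),
     np.array([[0.95, 0.30], [0.05, 0.70]])]
rho = [np.diag([0.8, 0.2]), np.diag([0.7, 0.3]), np.diag([0.9, 0.1])]
J = [c @ r for c, r in zip(C, rho)]

# JSPM: Kronecker product of the individual JPMs (8 x 8 for K = 3).
S_J = np.kron(np.kron(J[0], J[1]), J[2])

assert S_J.shape == (8, 8)
assert np.isclose(S_J.sum(), 1.0)   # all 64 system states covered

# Mixed-product factorization S_J = (C1⊗C2⊗C3)(rho1⊗rho2⊗rho3):
S_C = np.kron(np.kron(C[0], C[1]), C[2])
P = np.kron(np.kron(rho[0], rho[1]), rho[2])
assert np.allclose(S_J, S_C @ P)
```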
We call the matrix (ρ1 ⊗ ρ2 ⊗ ρ3) the Combined Prior Probabilities Matrix (CPPM) and represent it with P. The elements of P represent the a priori probabilities of each combination of events defined by the columns of the JSPM. The CPPM for the example system, in which three classifiers seek three types of targets, is given by

P = ρ1 ⊗ ρ2 ⊗ ρ3.

Definition 6 The Combined Prior Probabilities Matrix (CPPM) for a system of classifiers {1, 2, ..., K} seeking K
different targets is a 2^K × 2^K diagonal matrix in which the diagonal element (j, j) gives the a priori probability of the true state combination defined in the j-th column of the JSPM. The trace of a CPPM is unity.

4.2.3 Conditional State Probabilities Matrix

The matrix (C1 ⊗ C2 ⊗ C3) is called the Conditional State Probabilities Matrix (CSPM) and will be represented by SC. The relationship between the CSPM and the CPM is analogous to the relationship between the JSPM and the JPM. The columns of SC correspond with combinations of event sets, and the rows correspond with combinations of labels:

SC = C1 ⊗ C2 ⊗ C3.

4.2.4 Truth Matrix

Recall that the last column of each JPM and CPM corresponds with the presence of a target. Consequently, the last 7 columns of the 8 × 8 JSPM correspond with the presence of at least one type of target. If the goal of the system is to determine whether any targets are present, we can add the last 7 columns together to arrive at the probability of a target being present under each possible label combination. Conversely, the first column gives us the probability of no targets being present. We can calculate both of these values by post-multiplying SJ by a truth matrix T, which takes the form

Tᵀ = [ 1 0 0 0 0 0 0 0
       0 1 1 1 1 1 1 1 ].

Now we are left with an 8 × 2 matrix in which the rows correspond with each possible combination of labels, the first column corresponds with the absence of targets, and the second column corresponds with the presence of at least one target.

Definition 7 A Truth Matrix T for an MCS combining decisions across target types is a 2^K × 2 matrix containing binary values in which row i corresponds with a column in the JSPM, the first column corresponds with the absence of hostile targets, and the second column corresponds with the presence of hostile targets. The columns of the Truth Matrix must be orthogonal.
That is, if the i-th column of the JSPM corresponds with the presence of at least one target, the (i, 2) cell of the Truth Matrix will contain a one; otherwise, the (i, 1) cell of the Truth Matrix will contain a one.

4.2.5 Fusion Rule Matrix

Recall that a logical fusion rule selects combinations of labels for which the system concludes that a target is present. For example, a system using the OR rule will conclude that a target is present if any of the classifiers detects a target. Thus, the OR rule corresponds with all rows of the JSPM except the first. All other Boolean rules can be defined in this manner.
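The truth matrix and an OR-rule fusion vector can be sketched as follows, assuming three identical, independent classifiers with a made-up JPM (the numbers are illustrative only):

```python
import numpy as np

# One assumed 2x2 JPM (sums to one); three identical independent
# classifiers give an 8x8 JSPM via the Kronecker product.
J = np.array([[0.72, 0.06],
              [0.08, 0.14]])
S_J = np.kron(np.kron(J, J), J)

# Truth matrix T: column 1 of the JSPM = no targets present,
# columns 2..8 = at least one target present.
T = np.zeros((8, 2))
T[0, 0] = 1.0
T[1:, 1] = 1.0

# OR-rule fusion vector: every label combination except "all clutter".
r_or = np.array([0., 1., 1., 1., 1., 1., 1., 1.])

M = S_J @ T              # 8x2; equivalently S_C P T, since S_J = S_C P
p_detect = r_or @ M[:, 1]   # Pr{system outputs target ∩ target present}
p_false = r_or @ M[:, 0]    # Pr{system outputs target ∩ no target present}

assert M.shape == (8, 2)
assert np.isclose(M.sum(), 1.0)
```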
If we want to determine the probability of identifying a target under a particular fusion rule, we can add the appropriate cells from the second column of the 8 × 2 matrix given by SC P T. If we want to determine the probability of identifying a target when none exists, we can add the corresponding cells from the first column. Note that the OR rule corresponds with the last 7 rows of the 8 × 8 JSPM. Now we can define a fusion vector as a column vector of zeros and ones, the ones corresponding to the rows appropriate for that rule. The fusion vector for the OR rule for our example is rORᵀ = (0, 1, 1, 1, 1, 1, 1, 1). This fusion vector is similar to the "rule of engagement" defined in [13] and the rule vectors defined in [19].

Pre-multiplying the matrix computed with the formula SC P T by the transpose of the fusion vector (to preserve dimensionality) gives us a vector containing the probabilities of correctly identifying a hostile target and misclassifying a friendly object, and pre-multiplying SC P T by the complementary vector to rOR, r̄ORᵀ = (1, 0, 0, 0, 0, 0, 0, 0), gives us the probabilities of misclassifying a hostile target and correctly identifying a friendly object:

rORᵀ SC P T = [ Pr{tS ∩ No Target}   Pr{tS ∩ Target} ],
r̄ORᵀ SC P T = [ Pr{cS ∩ No Target}   Pr{cS ∩ Target} ].

When the two vectors are augmented in the form [r̄OR ⋮ rOR], the result is the fusion rule matrix

RORᵀ = [r̄OR ⋮ rOR]ᵀ = [ 1 0 0 0 0 0 0 0
                        0 1 1 1 1 1 1 1 ].

A more general definition is given by the following.

Definition 8 A Fusion Rule Matrix R is a 2^K × mS matrix containing binary values in which row i corresponds with a row in the JSPM, the first column corresponds with combinations of output labels for which the system concludes there is no target present, the last column corresponds with combinations of output labels for which the system concludes a hostile target is present, and the columns in between (if mS > 2) correspond with intermediate fuzzy labels. For example, if the system is to conclude that a hostile target is present for a combination of output states corresponding with the i-th row of the JSPM, the (i, mS) cell of the Fusion Rule Matrix will contain a 1. The columns of a Fusion Rule Matrix must be orthogonal.

4.2.6 System Joint Performance Matrix

Consider an MCS in which each individual classifier is charged with identifying a different type of target. If the classifiers are statistically independent, the system JPM, JS, can be computed with the formula

JS = Rᵀ SC P T,

where SC and P are computed with the formulas presented before.

4.2.7 System Prior Probabilities Matrix

Recall that the JPM for a classifier k can be computed with Jk = Ck ρk. Thus, one can post-multiply JS by the inverse of the system prior probabilities matrix to compute CS, but ρS has not yet been computed. Recall the CPPM P = ρ1 ⊗ ρ2 ⊗ ρ3, and note that each row/column of the diagonal CPPM corresponds with a particular combination of true events defined by the columns of the CSPM and JSPM. The last 7 columns correspond with any event where a target is present, and the first column corresponds with instances where no target is present. Pre- and post-multiplying by the truth matrix T computes the sums of the appropriate mutually exclusive probabilities, which correspond to the probabilities summarized in the PPM:

ρS = Tᵀ P T = Tᵀ (ρ1 ⊗ ρ2 ⊗ ρ3) T.

4.2.8 System Conditional Performance Matrix

Using the previously developed formulas, one can now compute

CS = JS ρS⁻¹ = Rᵀ SC P T ρS⁻¹ = [ Pr{cS | No Target}   Pr{cS | Target}
                                  Pr{tS | No Target}   Pr{tS | Target} ].

4.3 Within Fusion System Performance
One can also compute the CPM, JPM and PPM for a system in which all classifiers are trained to detect the same type of target, i.e., fusing within a type of target. The key difference is that the CSPM and JSPM must have only two columns (equivalent to the first and last columns of the JSPM and CSPM for a system fusing across target types). Logically, since each classifier is trained to detect one type of event there are only two possibilities: either the target is present or it is not.

4.3.1 Conditional State Probabilities Matrix

One can use the Kronecker product to compute the possible combinations of the elements from the CPMs, but because only two true states are possible, the columns of the product corresponding to mixed combinations of true events are impossible, and the result must be modified to account for these impossibilities. A simple way of removing them is to post-multiply the Kronecker product of the CPMs by an 8 × 2 matrix with ones in the (1, 1) and (8, 2) cells. The result is an 8 × 2 matrix consisting of the first and last columns of the Kronecker product term.

Definition 9 The Conditional State Probabilities Matrix (CSPM) for a system of classifiers {1, 2, ..., K} seeking one type of target is a 2^K × 2 matrix in which the first column corresponds with instances where the target is absent, the second column corresponds with instances where the target is present, the rows correspond with combinations of output labels from the individual classifiers, and
cell (i, j) gives the conditional probability of the classifier system outputting the combination of labels i when the true state of the system is j (i.e., Pr{AS(xS) = (l1, l2, ..., lK) | xS ∈ XS,j}). The sum of the elements in each column of the CSPM is unity.

4.3.2 Combined Prior Probabilities Matrix

In a within fusion system, computation of the CPPM is trivial. Note that the CSPM is 2^K × 2. Therefore, one needs to post-multiply the CSPM by a 2 × 2 CPPM to arrive at a properly dimensioned JSPM. Since each classifier seeks the same type of target, the a priori probability of the system being in the hostile state is the same as the a priori probabilities for all the classifiers in the system. Thus, the 2 × 2 matrix required is simply the PPM shared by the individual classifiers in the system (i.e., P = ρ).

Definition 10 The Combined Prior Probabilities Matrix (CPPM) for a system of classifiers {1, 2, ..., K} seeking one type of target is equivalent to the PPM for each classifier in the system.
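The within-fusion CSPM and JSPM construction can be sketched as follows; the CPMs and the shared PPM are illustrative assumptions:

```python
import numpy as np

# Within fusion: all classifiers seek the same target, so only the
# first and last columns of the Kronecker product of the CPMs are
# possible true-event combinations. Numbers are made up.
C = [np.array([[0.9, 0.2], [0.1, 0.8]]) for _ in range(3)]
rho = np.diag([0.7, 0.3])        # shared PPM, so P = rho

kron_C = np.kron(np.kron(C[0], C[1]), C[2])   # 8 x 8

# 8 x 2 selector with ones in the (1, 1) and (8, 2) cells keeps only
# the first and last columns of the Kronecker product term.
E = np.zeros((8, 2))
E[0, 0] = 1.0
E[7, 1] = 1.0
S_C = kron_C @ E                 # CSPM: 2^K x 2
S_J = S_C @ rho                  # JSPM for the within system

assert np.allclose(S_C.sum(axis=0), 1.0)   # CSPM columns sum to unity
assert np.isclose(S_J.sum(), 1.0)          # JSPM elements sum to unity
```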
4.3.3 Joint State Probabilities Matrix

The JSPM SJ can now be computed with

SJ = SC P.

Definition 11 The Joint State Probabilities Matrix (JSPM) for a system of classifiers {1, 2, ..., K} seeking one type of target is a 2^K × 2 matrix in which the first column corresponds with instances when the target is absent, the second column corresponds with instances when the target is present, the rows correspond with combinations of output labels from the individual classifiers, and cell (i, j) gives the probability of the classifier system outputting the combination of labels i when the true state of the system is j (i.e., Pr{AS(xS) = (l1, l2, ..., lK) ∩ xS ∈ XS,j}). The sum of the elements in the JSPM is unity.

4.3.4 Truth and Fusion Rule Matrices

Using the definitions above for a within system, the JSPM is a 2^K × 2 matrix in which each column corresponds with truth. Thus, the truth matrix is no longer necessary because of the structure of this special case. Fusion rule matrices are defined in exactly the same way as in an across fusion system, and the system PPM is equivalent to the PPM shared by the individual classifiers. The formulae for computing the CPM and JPM are identical:

JS = Rᵀ SC P T and CS = JS ρS⁻¹.

4.4 Estimating ROC Curves

This section suggests methods for estimating the ROC curve using the system CPM for an MCS combining decisions within or across target types. Oxley and Bauer's method of ROC fusion can be applied for logical AND and OR rules; however, other techniques are necessary for more complex rules (e.g., a majority vote). Liggins showed that there are 7 relevant rules other than the AND and OR for combining the decisions of three classifiers [19]. Moreover, as the number of classifiers in the system gets larger, there are even more relevant rules besides the AND and OR rules. One might hope to find a way to estimate the ROC curves for systems combined using more complex fusion rules. Of these, the majority vote seems to be the most complex.

4.4.1 ROC Fusion

Logical AND. One can employ Oxley and Bauer's method to analytically estimate the optimal system ROC curve for a within or across MCS of any size if the system only outputs two labels (target or clutter) [12]. The key to their formula for the AND rule was the observation that, under an AND rule, the system probability for assigning a target label is equal to the product of the probabilities that each of the individual classifiers assigns a hostile label:

Pr{tS} = Pr{t1} · Pr{t2} · Pr{t3}.

Maximizing both sides of the separable equation with respect to the individual threshold values and manipulating the result allows one to derive a formula for the maximum PD,S for a given PFA,S. The property can easily be adapted to account for any number of classifiers by using the property

Pr{tS} = ∏_{k=1}^{K} Pr{tk}.

Logical OR. Oxley and Bauer derived a similar formula for the OR rule. The key to this formula was the observation that, under an OR rule, the system probability for assigning a clutter label is equal to the product of the probabilities that each of the individual classifiers assigns a clutter label:

Pr{cS} = Pr{c1} · Pr{c2} · Pr{c3}.

Minimizing both sides of the separable equation with respect to the individual threshold values and manipulating the result gives a formula for min(−PD,S), which is equivalent to max PD,S. This property can be adapted in a similar manner such that

Pr{cS} = ∏_{k=1}^{K} Pr{ck}.

4.4.2 Other Boolean Rules
The ROC Fusion method can also be used to estimate ROC curves for Boolean rules other than the simple AND and OR under certain conditions. Specifically, the ROC Fusion method can be implemented in a piecewise manner for
821
[6] C. E. Metz, “Some practical issues of experimental design and data analysis in radiological ROC studies,” Investigative Radiology 24, pp. 234–245, 1989.
any rule that can be represented with a simple Boolean sentence (i.e., one in which the decision for each classifier is listed only once). For classifiers A, B and C, the rule “A or (B and C)” (A is a dominant sensor) is a simple sentence, but “(A and B) or (A and C) or (B and C)” (majority vote) is not.
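To make the piecewise evaluation of a simple Boolean sentence concrete, the sketch below computes the fused operating point of the rule "A or (B and C)" from hypothetical individual operating points. It assumes, additionally, that the classifiers' decisions are conditionally independent given the truth; the helper functions and the numbers are illustrative, not taken from the paper.

```python
# Illustrative sketch: fusing operating points through the simple Boolean
# sentence "A or (B and C)", assuming conditional independence given truth.

def p_and(p1, p2):
    # probability that both classifiers declare, under independence
    return p1 * p2

def p_or(p1, p2):
    # probability that at least one classifier declares, under independence
    return p1 + p2 - p1 * p2

# Hypothetical individual operating points (P_FA, P_D), one per classifier
pfa = {"A": 0.10, "B": 0.20, "C": 0.30}   # false-alarm probabilities
pd  = {"A": 0.90, "B": 0.80, "C": 0.70}   # detection probabilities

# Because each classifier appears exactly once in "A or (B and C)",
# the rule can be evaluated one connective at a time (piecewise).
fused_pd  = p_or(pd["A"],  p_and(pd["B"],  pd["C"]))   # ≈ 0.956
fused_pfa = p_or(pfa["A"], p_and(pfa["B"], pfa["C"]))  # ≈ 0.154
print(fused_pfa, fused_pd)
```

Sweeping the individual thresholds and repeating this evaluation traces out candidate system operating points; a majority-vote sentence cannot be evaluated this way because each decision variable appears more than once.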
5 Conclusions

The primary contribution of this paper is a matrix algebraic formula for computing the Conditional and Joint Performance Matrices for a Boolean Multiple Classifier System. The paper is also the first known use of the Kronecker product in evaluating classifier system performance. Also presented were definitions and derivations for the following: Conditional Performance Matrix (CPM), Prior Probabilities Matrix (PPM), Joint Performance Matrix (JPM), Combined Prior Probabilities Matrix (CPPM), Conditional State Probabilities Matrix (CSPM), Joint State Probabilities Matrix (JSPM), Truth Matrix, and Fusion Rule Matrix. Furthermore, several methods were presented for estimating an upper bound of the ROC curve for the MCS using the system CPM. The individual CPMs were used previously to determine optimal fusion rules [13]; however, that work did not take into account the possibility of varying the decision thresholds for the individual classifiers, nor did it provide a methodology for analyzing systems fusing across target types. Lastly, this work characterized the Sensor Corroboration rules considered by Liggins [19].

6 Acknowledgements

We wish to acknowledge the financial support of the Air Force Research Laboratory, Sensor Directorate, Automatic Target Recognition Branch (AFRL/SNAT) at Wright-Patterson AFB, Ohio, and also the Air Force Office of Scientific Research (AFOSR), Mathematical Sciences Division.

References

[1] A. Wald, Sequential Analysis, John Wiley & Sons, New York, 1947.

[2] W. W. Peterson, T. G. Birdsall, and W. C. Fox, "The theory of signal detectability," Transactions of the IRE Professional Group on Information Theory PGIT-4, pp. 171–212, 1954.

[3] D. M. Green and J. A. Swets, Signal Detection Theory and Psychophysics, John Wiley & Sons, New York, 1966.

[4] J. A. Swets, Signal Detection and Recognition by Human Observers, John Wiley & Sons, New York, 1964.

[5] C. E. Metz, "ROC methodology in radiologic imaging," Investigative Radiology 21, pp. 720–733, 1986.

[6] C. E. Metz, "Some practical issues of experimental design and data analysis in radiological ROC studies," Investigative Radiology 24, pp. 234–245, 1989.

[7] S. G. Alsing, The Evaluation of Competing Classifiers, Ph.D. dissertation, Air Force Institute of Technology, Wright-Patterson AFB, OH, March 2000.

[8] J. P. Egan, Signal Detection Theory and ROC Analysis, Academic Press, New York, 1975.

[9] J. A. Swets and R. M. Pickett, Evaluation of Diagnostic Systems: Methods from Signal Detection Theory, Academic Press, New York, 1982.

[10] C. J. Lloyd, "Using smoothed receiver operating characteristic curves to summarize and compare diagnostic systems," Journal of the American Statistical Association 93, pp. 1356–1364, Dec 1998.

[11] D. C. Montgomery, Design and Analysis of Experiments, John Wiley & Sons, New York, fourth ed., 1997.

[12] M. E. Oxley and K. W. Bauer, "Classifier fusion for improved system performance," AFIT/ENS Working Paper 02-02, Air Force Institute of Technology, January 2002.

[13] J. M. Ralston, "Bayesian sensor fusion for minimum-cost I.D. declaration," Tech. Rep. IDA Paper P-3441, Institute for Defense Analyses, June 1999.

[14] J. M. Hill, "Evaluating the performance of multiple classifier systems: A matrix algebra representation of Boolean fusion rules," Master's thesis, Air Force Institute of Technology, Wright-Patterson AFB, OH, March 2003.

[15] J. M. Hill, M. E. Oxley, and K. W. Bauer, "Evaluating the fusion of multiple classifiers via ROC curves," in Proceedings of SPIE AeroSense, 5096, 2003.

[16] A. Graham, Kronecker Products and Matrix Calculus, Ellis Horwood Limited, Chichester, UK, 1981.

[17] J. W. Brewer, "Kronecker products and matrix calculus in system theory," IEEE Transactions on Circuits and Systems 25, pp. 772–781, September 1978.

[18] C. F. Van Loan, "The ubiquitous Kronecker product," Journal of Computational and Applied Mathematics 123, pp. 85–100, 2000.

[19] M. E. Liggins II, M. A. Nebrich, M. S. Berlin, and M. Lazaroff, "Adaptive multi-image decision fusion analysis," 2000.