SOURCE DIVERSITY AND FEATURE-LEVEL FUSION

Mark D Bedworth
Defence Evaluation and Research Agency, St. Andrew’s Road, Malvern, Worcestershire, WR14 3PS, UK
[email protected]
ABSTRACT

We briefly review the various models proposed for data fusion systems. A common theme of these models is the existence of multiple levels of processing within the data fusion process. We highlight some of the issues which emerge from using such a layered approach, in particular the selection of sources at each level which are both relevant and complementary. The balance between relevance and complementarity is shown to be present at all levels of the data fusion process. Each strand of processing cannot afford to rely too heavily on other information sources, since the system needs to be robust to sensor or communications failures. For the purposes of illustration we develop a number of small data fusion systems which carry out simple fusion at the feature level. We use a multi-layer perceptron neural network and show how a mixed error criterion, incorporating both local performance and fused performance, leads to a selection of sources which is both relevant (in a local sense) and complementary (in a global sense).

Keywords: data fusion, process models, feature fusion, neural networks.

British Crown Copyright 1999 / DERA. Printed with the permission of the Controller of Her Britannic Majesty’s Stationery Office.

1. INTRODUCTION

Various process models have been proposed for combining sensor information using data fusion [1]. In the JDL model, proposed by the US Joint Directors of Laboratories group in 1985 [2] and recently updated, the processing is
divided into five levels. As depicted in Figure 1, level 0 is associated with pre-detection activities, level 1 with object refinement, level 2 with situation refinement and level 3 with threat refinement. Level 4 is used to close the loop by re-tasking resources (e.g. sensors and communications).
Figure 1: The JDL data fusion process model as updated in 1997.

The Boyd control loop [3] was first used for modelling the military command process but has since been used for data fusion. The Boyd (or OODA) loop possesses four phases, as shown in Figure 2. The JDL levels map directly onto the Boyd loop: Observe (JDL level 0), Orient (JDL levels 1 and 2), Decide (JDL level 3) and Act (JDL level 4). The models thus show clear similarities, although the Boyd model makes the iterative nature of the problem more explicit.
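The correspondence just described can be summarised as a simple lookup (a reading aid we have added; it is not part of either model's definition):

```python
# Mapping from Boyd (OODA) phases to the JDL levels they subsume,
# as stated in the text above.
BOYD_TO_JDL = {
    "Observe": [0],      # pre-detection activities
    "Orient":  [1, 2],   # object and situation refinement
    "Decide":  [3],      # threat refinement
    "Act":     [4],      # re-tasking of resources
}
```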
Figure 2: The Boyd or OODA loop, which has been used as a data fusion process model.

Figure 3: The Waterfall data fusion process model originally proposed by the author.

The Waterfall model proposed in [4] and endorsed in [5] places emphasis on the processing functions at the lower levels (see Figure 3). Again, similarities exist with the other models. Sensing and signal processing correspond to JDL level 0, feature extraction and pattern processing to JDL level 1, situation assessment to JDL level 2 and decision making to JDL level 3. In the Waterfall model the feedback is not explicitly depicted. This appears to be the major limitation of the Waterfall model, which otherwise divides the data fusion process more finely than either the JDL or the Boyd models.

The common theme of all these models is the hierarchical style in which the data is processed. Individual sources may be partially processed separately before fusion occurs. Thus, fusion may occur at several levels, as has been addressed by other researchers. What is generally not taken into account is the feedback mechanism by which the sources and their associated processing are tasked. This may include sensor selection / sensor deployment (as described in [6]) or feature selection / feature extraction. The problem introduced by this type of process model is illustrated in Figure 4.
Figure 4: An illustration of the dilemma introduced by a hierarchical fusion process (F) in determining the “optimal” pre-processing (P). The system has three potential consumers (C), each of which has a different criterion function.

Figure 4a shows a typical hierarchical data fusion system in which two sources of data are separately processed by modules marked P. The outcome of these pre-fusion processes may be used autonomously by the individual consumers, C. The performance at this lower consumer level should be as high as possible. If one or more of the channels communicating information from these consumers to the fusion process, F, should fail, then the system may still retain usable performance.
For optimum fused performance, however, the pre-fusion processing P should be optimised by error-correcting feedback from the final consumer, as illustrated in Figure 4b. Which of these two criterion functions, local or global, should be used for optimising the pre-fusion processes? We shall demonstrate in later sections that a hybrid criterion function may be employed which contains a user-controllable parameter, ψ. This parameter varies the emphasis placed on system robustness and is related to the expected failure rate of the communications.

2. FEATURE LEVEL FUSION

In all these hierarchical approaches it is generally acknowledged that fusing information which has been extensively abstracted from the raw data can lead to poor performance. Several researchers have shown that probability-level fusion (after pattern processing but before decision making in the Waterfall model) gives higher performance than decision-level fusion [7][8], and that feature-level fusion is usually better still. These last two fusion levels yield performance benefits without incurring large communications overheads. Despite this promise, research on probability-level fusion has, with a few exceptions (such as [9], [10] and [11]), concentrated on the fusion of independent sources, and activity in feature-level fusion has been almost entirely lacking [12].
Perhaps this is because of the difficulty of selecting compact representations which are both highly discriminating and mutually diverse. The problem is further complicated by the coupling between the sensors introduced by the fusion process when the information is not conditionally independent.

3. A NEURAL ARCHITECTURE FOR FEATURE-LEVEL FUSION

Here we use a multi-layer perceptron (MLP) neural network [15] to obtain trade-offs between local and fused optimality. We show that appropriate use of architectures and error functions can lead to coupling at the feature level, and that the feature sets constructed can be both locally discriminating and globally complementary. We describe the algorithm used for performing the fusion and illustrate the method using realistic application data. To simplify the analysis we use a non-standard MLP in which the input-to-hidden weights are subjected to a strong penalty for having other than binary values. We do this by introducing an additional error term for these weights:

E = \sum_k (O_k - T_k)^2 + \lambda \sum_{ij} \begin{cases} (w_{ij} - 1)^2, & w_{ij} > \frac{1}{2} \\ w_{ij}^2, & w_{ij} \le \frac{1}{2} \end{cases}

where λ is a parameter weighting the binary-weight term (in our experiments we start λ at zero and slowly increase it to 10 during training, so that the weights are driven to binary values once a near-optimum has been found). In this way we may use the MLP for feature selection rather than feature extraction. Initially, these networks are trained separately on the data from each sensor using the conjugate gradient optimisation algorithm. Fusion occurs at the feature level by combining these networks as shown in Figure 5.
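As a concrete illustration, the penalised error term might be coded as follows (a minimal sketch of our own; the function and variable names are assumptions, not the paper's code):

```python
import numpy as np

def binary_weight_penalty(W, lam):
    """Penalty driving each input-to-hidden weight towards 0 or 1.

    A weight w is penalised by (w - 1)^2 when w > 1/2 and by w^2
    otherwise, so the zero-penalty values are exactly 0 and 1.
    """
    return lam * np.sum(np.where(W > 0.5, (W - 1.0) ** 2, W ** 2))

def total_error(outputs, targets, W, lam):
    # Squared output error plus the annealed binary-weight term; lam is
    # ramped from 0 to 10 during training, as described in the text.
    return np.sum((outputs - targets) ** 2) + binary_weight_penalty(W, lam)
```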
Figure 5: The network configuration used for the feature fusion experiments.
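A minimal sketch of a forward pass through this configuration (our own illustration; the layer shapes and names are assumptions):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sensor_hidden(x, W, b):
    # Hidden layer of one sensor network; with the binary-weight penalty
    # driving W towards 0/1 values, this layer acts as feature selection.
    return sigmoid(W @ x + b)

def fused_output(x1, x2, net1, net2, Wf, bf):
    # The hidden-unit activations of the two sensor networks are
    # concatenated and fed to the fusion network as its inputs.
    h = np.concatenate([sensor_hidden(x1, *net1), sensor_hidden(x2, *net2)])
    return sigmoid(Wf @ h + bf)
```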
The fusion network was initialised by fixing the weights of the separate sensor networks and training the fusion network using the concatenated hidden unit values as inputs. The whole network was then optimised using the conjugate gradient algorithm once more. We were able to alter the degree to which the network arranged for the sensor features to support both good local recognition and good fused recognition by setting a parameter ψ which weighted the error criterion thus:
E = \sum_k (O_k - T_k)^2 + \psi \sum_m (O_m - T_m)^2

where k indexes the output units of the fusion network and m those of the separate sensor networks. With ψ set to zero the network optimises the fused performance only. With ψ set to unity, equal weight is given to the fused performance and the local performance; this leads to a fusion process which is somewhat robust to local sensor failure, since either sensor can then be used alone to perform recognition.

We illustrated this method using simple Gaussian data, as shown in Figure 6. The data comprised 1,000 patterns from three synthetic measurements, each Gaussian with unit variance and with separations between the two classes of 1.0 (measurement 1), 0.75 (measurement 2) and 0.5 (measurement 3). Sensor 1 provided measurements 1 and 2; sensor 2 provided measurements 1 and 3. Training the separate networks resulted in the selection of measurement 1 in both cases, as expected, giving an error rate of approximately 31%. Training the fusion network with the separate networks fixed did not, in this case, increase performance. The reason is easily understood: the discriminatory power of both pre-fusion processing modules is based on the same information. Optimising the entire network with ψ set to zero resulted in the selection for sensor 1 flipping from measurement 1 to measurement 2. This was observed consistently once some symmetry-breaking noise was added to the network weights. The fused error rate in this configuration fell to 27%. This improvement is not dramatic but it is statistically significant. By increasing ψ we were able to observe a transition from fused optimality to local optimality. The value of ψ at which the transition occurred varied slightly from one experiment to the next but was generally near 0.37. This value appears to have no special significance, since the exact value depended on the class separations used in generating the data.
Figure 6: The synthetic data used to evaluate the hybrid optimisation approach.

4. DESCRIPTION OF IRIS DATA

Originally created by Fisher [13] from measurements made by Anderson, the Iris data has been widely used (for example [14]). The Iris database is obtainable from the UCI Repository of Machine Learning Databases and Domain Theories. The data set contains three classes of 50 instances each, where each class refers to a type of iris plant growing in California in the 1930s. One type (Iris Setosa) is linearly separable from the other two classes (Iris Versicolor and Iris Virginica), which are not linearly separable from each other. As supplied there are four attributes per pattern (the lengths and widths of the petals and sepals, measured in centimetres). Table 1 shows the correlation between these attributes and the class indicator.
Figure 7: The Iris data projected onto the two most discriminative axes – petal length and petal width.

Attribute       Min.   Max.   S1   S2   Class correlation
Sepal length    4.3    7.9    N    Y    0.78
Sepal width     2.0    4.4    Y    N    0.42
Petal length    2.0    6.9    N    Y    0.95
Petal width     0.1    2.5    Y    N    0.96

Table 1: Summary statistics of the Iris features (in centimetres) and their allocation to sensors S1 and S2.

For our data fusion experiments we allocated the attributes as shown in Table 1: sensor 1 produced the width measurements and sensor 2 the length measurements. We restricted each of the separate networks to produce a single feature (using one hidden unit). In general the individual sensors each selected a petal attribute when optimised separately (77 out of 100 experiments with random initial weight values). When optimising the sensor / fusion network as a whole there was a trend towards increasing diversity (for example, sensor 2 switching to a sepal attribute), although this was not as consistently observed as for the Gaussian data (for example, 53 experiments out of 100 diversified the sensor feature selection with ψ = 0.25).
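A minimal sketch of this sensor allocation, assuming the UCI data as packaged by scikit-learn (the variable layout is ours):

```python
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target
# Column order: sepal length, sepal width, petal length, petal width.

# As in Table 1: sensor 1 receives the width attributes and sensor 2
# the length attributes; each sensor network then has a single hidden
# unit, so it must select just one of its two features.
sensor1 = X[:, [1, 3]]  # sepal width, petal width
sensor2 = X[:, [0, 2]]  # sepal length, petal length
```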
5. CONCLUSIONS

We have described the preponderance of hierarchical data fusion process models and illustrated the sensor diversity issue which all of these models introduce. We went on to describe a neural network approach to feature-level fusion which allows both the separate sensor processing and the fusion processing to be optimised together. For simple synthetic data the behaviour of the method agreed closely with our expectations and was easily understood from a feature selection standpoint. These experiments are encouraging. Preliminary experiments with real data, however, indicate some problems: the separately trained sensor networks and fusion network represent a local optimum for the sensor / fusion network as a whole. Heuristic approaches which perturb the weights sometimes work, but too much weight noise can result in very slow convergence, which we believe results from the number of hidden layers (two) in the sensor / fusion network.

6. FUTURE WORK

The results described in this paper were produced from work in progress, and several avenues are yet to be explored. The interaction of the non-binary weight penalty with the sensor diversity penalty has largely been ignored, and a set of methodical experiments is needed. The method has also yet to be properly evaluated on a challenging real-world task with more than two simple sensors; the convergence problems encountered with the Iris data may worsen as problem complexity increases. We intend to pursue these issues over the coming months and to report our findings in future papers.
7. REFERENCES

[1] “Review of Multisensor Data Fusion Architectures”, Kokar and Kim, Proc. IEEE, 1993.
[2] “Data Fusion and Multisensor Correlation”, Hall and Llinas, Technology Training Corporation course, 1985.
[3] “A Discourse on Winning and Losing”, Boyd, Maxwell AFB lecture, 1987.
[4] “Probability Moderation for Multilevel Information Processing”, Bedworth, personal communication, 1992.
[5] “Technology Foresight on Data Fusion and Data Processing”, Markin, Harris, Bernhardt, Austin, Bedworth, Greenway, Johnston, Little and Lowe, publication of The Royal Aeronautical Society, 1997.
[6] “The Automatic Management of Multi-Sensor Systems”, Penny, Proc. FUSION’98, 1998.
[7] “Benefits of Soft Sensors and Probabilistic Fusion”, Buede and Waltz, Proc. SPIE 1096, Signal and Data Processing for Small Targets, 1989.
[8] “Optimal Decision Fusion in Multiple Sensor Systems”, Thomopoulos, Viswanathan and Bougoulias, IEEE Trans. AES-23(5), 1987.
[9] “Data Fusion for Object Classification”, Bedworth and Heading, Proc. IEEE SMC, 1991.
[10] “An Algorithm for the Fusion of Correlated Data”, O’Brien, Proc. FUSION’98, 1998.
[11] “Fusing Dependent Information from Multiple Classes”, O’Brien, IDC-99 Information, Decision and Control, Adelaide, 1999.
[12] “Optimal Features-In Feature-Out (FEIFEO) Fusion for Decisions in Multisensor Environments”, Dasarathy, Proc. SPIE 3376, Sensor Fusion: Architectures, Algorithms and Applications II, 1998.
[13] “The Use of Multiple Measurements in Taxonomic Problems”, Fisher, Annals of Eugenics (7) part II, 1936.
[14] “Nosing Around the Neighbourhood: A New System Structure and Classification Rule for Recognition in Partially Exposed Environments”, Dasarathy, IEEE Trans. PAMI-2(1), 1980.
[15] “Parallel Distributed Processing”, McClelland and Rumelhart, MIT Press, 1986.