Image Database Assisted Classification

Simone Santini1, Marcel Worring2, Edd Hunter1, Valentina Kouznetsova1, Michael Goldbaum3, and Adam Hoover1

1 Visual Computing Lab, University of California San Diego
2 Intelligent Sensory Information Systems, University of Amsterdam
3 Department of Ophthalmology, University of California San Diego

Ref: Simone Santini, Marcel Worring, Edd Hunter, Valentina Kouznetsova, Michael Goldbaum, Adam Hoover, "Image Database Assisted Classification," Proceedings of Visual '99, International Conference on Visual Information Management Systems, Amsterdam, the Netherlands, June 1999
Abstract. Image similarity can be defined in a number of different semantic contexts. At the lowest common denominator, images may be classified as similar according to geometric properties, such as color and shape distributions. At the mid-level, a deeper image similarity may be defined according to semantic properties, such as scene content or description. We propose an even higher level of image similarity, in which domain knowledge is used to reason about semantic properties, and similarity is based on the results of the reasoning. At this level, images with only slightly different (or similar) semantic descriptions may be classified as radically different (or similar), based upon the application of the domain knowledge. For demonstration, we show experiments performed on a small database of 300 images of the retina, classified according to fourteen diagnoses.
1 Introduction
Image databases aim at retrieving images that have a certain meaning for the user asking the query. Finding an image with the right meaning is a provably difficult problem. Classification techniques attach meaning to images by categorizing them into a fixed set of classes. Image databases avoid categorization by defining appropriate similarity measures between pairs of images, and by ranking the answers by similarity to the query. The underlying assumption is that image similarity will induce a soft categorization of some significance. Image databases can be classified according to the level of abstraction and the amount of domain knowledge used in computing the similarity. A possible taxonomy of approaches along a semantic axis is reported in fig. 1. A number of image databases assume that meaningful similarity can be expressed by a distance measure in a geometric feature space, obtained from the image data through purely data-driven image processing operations that extract shape, texture, and color features [1, 2]. This assumption is approximately valid under certain circumstances.
Fig. 1. Overview of the different spaces in which similarity can be defined.
Typically, it holds when the database is a collection of disparate images not connected by a particular field or purpose, and retrieval is incidental. However, in domains where images play an integral role in daily practice, there is much more to meaning than the simple features describing image content. Meaning then depends largely on the context and purpose of image retrieval. Therefore, some systems use low-level visual features as the basis for a reasoning subsystem that tries to extract higher-level semantics meaningful in the specific application domain [7]. Other systems apply image processing operations to the visual features, in order to transform them into other visual features that are semantically more meaningful in the domain of discourse [5, 6]. Both approaches result in what we call a Visual Semantic Space. The difference between the two is the amount of domain knowledge required and the way in which knowledge steers the extraction process. In the case of visual processing, knowledge only dictates the nature of the features to be extracted, while in reasoning it determines the methods and algorithms.

In this paper, we take the idea one step further, and use a reasoning system on top of the visual semantic space. The output of this reasoning system defines features in a domain semantic space. The specific domain we consider is that of retinal images. In this domain, the image similarity of interest is diagnostic: two images are similar if the same diagnosis can be ascribed to them with the same confidence. Our approach consists of two steps. In the first step we derive visual labels in the visual semantic space which are of interest in this particular domain. In the current system the labels are assigned by an expert; they are all based on pure visual information and could hence potentially be derived from the image automatically, using domain-specific techniques. In the second step, we use a Bayesian network whose probabilities were set using domain knowledge. The output of the network consists of the marginal probabilities of each of the possible classifications. It could be used for image classification without the help of a database. In our case, however, we use the vector of marginal probabilities as a feature vector to form the domain semantic space, and define a novel probability-based measure for comparing two such feature vectors to establish the required similarity in this space.
The rationale for using image database techniques to assist the classification is that in certain cases the output of the classifier may not be discriminative enough to allow a sharp classification. However, there may be images in the database with the same pattern of probabilities. We retrieve these images and assign their labels to the unknown image. Ideally, this method should help decide dubious cases while leaving unchanged the cases in which the label is already decidable. The paper is organized as follows. Section 2 introduces the semantic spaces, the associated reasoning methods, and their definitions of similarity. Section 3 reports results with the proposed method and compares performance at the different levels.
2 Methods

2.1 Semantic spaces
In our application the domain semantic space is formed by a set of 14 relevant diagnoses. The visual semantic space contains a set of 44 visual cues sufficient for discriminating amongst those fourteen diagnoses. These 44 cues were determined by questioning expert ophthalmologists about the information they look for while observing an image for diagnostic purposes. The cues are symbolic, and each one takes values in a small unordered set of possible values. As an example, the visual semantic feature microaneurysm or dot hemorrhage takes values from {absent, few anywhere, many anywhere}. The number of possible values is cue dependent and varies between two and eight. Additionally, any cue may take on the value "unknown" if for a specific image it cannot be identified [3]. Separating the two semantic spaces allows us to decouple the identification of visual cues from the judgment of the causes of the findings. The findings are based entirely on appearance, while the judgment process takes into account previously learned knowledge and expert training. As a practical example of the difference between the two spaces, one of the authors, who has worked on retinal images for two years but has no medical training or experience, is capable of assigning the right values to the visual semantic cues with fairly high accuracy, but is incapable of making diagnoses.

2.2 Image Database
Our database consists of 300 retinal images, digitized from 35mm slide film and stored at 605 × 700 × 24-bit (RGB) resolution. The retinal expert on our team determined, by hand, the set of diagnoses for each image in domain semantic space. Since diagnoses are not mutually exclusive, any individual image may have more than one diagnosis. This often occurs when a primary diagnosis is accompanied by one or more secondary diagnoses. Of our 300 images, 250 have one diagnosis, 46 have two diagnoses, and 4 have three diagnoses.
Fig. 2. Example images in the database
Example images are shown in fig. 2. It is important to notice that in this domain simple features in geometric feature space (color histograms, etc.) are quite meaningless. To the untrained eye, all images already look more or less the same; summarizing the data using geometric features only makes them more similar.

2.3 Similarity in visual semantic space
Defining the similarity of a pair of images in our visual semantic space requires comparing two vectors containing symbolic values. The admissible symbolic values of an element have no direct relation to numeric values; in fact, the different values do not necessarily have an ordering. These properties motivate the following similarity metric. Let $F = \{F_1, F_2, \ldots, F_M\}$ represent a feature vector consisting of $M$ symbolic elements. Given two feature vectors $F_A$ and $F_B$, the distance $d(F_A, F_B)$ between them is defined as

$$d(F_A, F_B) = \sum_{i=1}^{M} \begin{cases} 1 & F_{A_i} \neq F_{B_i} \\ 0 & F_{A_i} = F_{B_i} \end{cases} \qquad (1)$$
Note that if all features could only assume two values this would reduce to the Hamming distance. Using this metric, the similarity of two images is proportional to the number of semantic features that have the same symbolic value in both images.
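As an illustration, the following is a minimal Python sketch of this symbolic distance; the function name and the cue values in the example are hypothetical, not part of the original system.

```python
def symbolic_distance(fa, fb):
    """Distance of eq. (1): the number of positions at which two
    symbolic feature vectors disagree (a generalized Hamming distance
    over unordered symbolic values, with "unknown" as just another value)."""
    assert len(fa) == len(fb)
    return sum(1 for a, b in zip(fa, fb) if a != b)

# Hypothetical cue vectors (the real vectors have M = 44 cues).
image_a = ("absent", "few anywhere", "unknown")
image_b = ("absent", "many anywhere", "unknown")
print(symbolic_distance(image_a, image_b))  # -> 1
```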
2.4 Reasoning in visual semantic space
To obtain values in domain semantic space requires a reasoning process based on expert knowledge. In this paper a Bayesian network based approach is followed. The Bayesian network computes probabilities for classifications based upon Bayes' rule:

$$P(d_i \mid m) = \frac{P(m \mid d_i)\,P(d_i)}{\sum_{j=1}^{N} P(m \mid d_j)\,P(d_j)} \qquad (2)$$
where $m$ is the vector of 44 visual semantic cues, and $d_i$ is the $i$-th of the $N = 14$ diagnoses. For a Bayesian network to operate, it must be supplied with the set of probabilities $P(m \mid d_i)$. These are supplied by the expert in the application domain, and are commonly called beliefs. For our application, given an average of three values for each manifestation, this seemingly requires $44 \times 3 \times 14 \approx 1850$ estimated probabilities. However, many of the beliefs have a value of zero, and so may be supplied implicitly. Additionally, each probability $P(m \mid d_i)$ defines a combined probability

$$P(m \mid d_i) = P(m_1 = s_1, m_2 = s_2, \ldots, m_{44} = s_{44} \mid d_i) \qquad (3)$$
where each value $s_j$ is an allowable state value for the feature $m_j$. Rather than supply these combined probabilities, which can be difficult to estimate, individual probabilities $P(m_j \mid d_i)$ may be supplied and, assuming mutual independence, combined as follows:

$$P(m \mid d_i) = P(m_1 \mid d_i)\,P(m_2 \mid d_i)\,P(m_3 \mid d_i) \cdots P(m_{44} \mid d_i) \qquad (4)$$
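Putting eqs. (2) and (4) together gives a naive-Bayes-style computation. A minimal sketch follows; the belief tables, priors, function name, and cue values are hypothetical stand-ins for the expert-supplied quantities, not the actual network of [4].

```python
def posteriors(cues, beliefs, priors):
    """Marginal posteriors P(d_i | m) via eqs. (2) and (4): per-cue
    beliefs P(m_j = s | d_i) are multiplied under the mutual
    independence assumption and normalized over all diagnoses.
    beliefs[d][j] maps the states of cue j to P(m_j = state | d)."""
    scores = {}
    for d, prior in priors.items():
        p = prior
        for j, state in enumerate(cues):
            p *= beliefs[d][j].get(state, 0.0)  # zero beliefs are implicit
        scores[d] = p
    z = sum(scores.values())  # denominator of eq. (2)
    return {d: p / z for d, p in scores.items()} if z > 0 else scores

# Hypothetical two-cue, two-diagnosis example (the real network has
# 44 cues and 14 diagnoses).
beliefs = {
    "d1": [{"absent": 0.9, "few": 0.1}, {"absent": 0.7, "many": 0.3}],
    "d2": [{"absent": 0.2, "few": 0.8}, {"absent": 0.4, "many": 0.6}],
}
priors = {"d1": 0.5, "d2": 0.5}
print(posteriors(["few", "many"], beliefs, priors))  # d2 dominates
```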
Finally, the prior probabilities $P(d_i)$ are also supplied by the expert. We used commercially available tools [4] to construct and operate the Bayesian network. A graph of the structure of the Bayesian network in our application is shown in fig. 3 (only non-zero links are represented). For each diagnosis $d_i$ and each manifestation $m_j$ we have
Fig. 3. A graph of the Bayes network. Cues and diagnoses are represented by nodes, while links denote non-zero conditional probabilities.
$$\sum_{k=1}^{K_j} P(m_j = s_k \mid d_i) = 1 \qquad (5)$$
where $K_j$ is the number of possible states of the manifestation $m_j$. The network and the associated probabilities define the domain knowledge we utilize. It should be noted that one of the nodes in the network is the age of the patient; this is clearly a non-visual cue, but one that is important for finding the proper diagnosis.

Given an image with an unknown diagnosis, and its visual semantic features, the Bayesian network computes the probabilities of each individual diagnosis using eq. (2), given the set of manifestations. As indicated earlier, a doctor classifies the image into a limited set of diagnoses only. In order to separate the list of derived probabilities into a set of likely diagnoses and a set of unlikely ones, we perform an adaptive clustering. A threshold is found which maximizes the Fisher criterion of class separation $(\mu_1 - \mu_2)^2/(\sigma_1^2 + \sigma_2^2)$, where $\mu$ and $\sigma^2$ are the sample means and variances of the probabilities in the two respective output categories. To perform the clustering, the output list of probabilities is sorted. The threshold is taken at the maximum value of the criterion encountered while incrementally moving diagnoses from the unlikely category to the likely category, in sorted order. Since the number of diagnoses per image is limited to three in this application, the output is in any event limited to between the one and three most likely diagnoses.
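A minimal sketch of this adaptive thresholding, assuming variances are added in the denominator of the Fisher criterion; the function name and the cap implementation are ours, not from the original system.

```python
from statistics import mean, pvariance

def likely_diagnoses(marginals, max_out=3):
    """Adaptive clustering of the sorted marginal probabilities: try each
    cut that moves one more diagnosis into the 'likely' set, keep the cut
    maximizing the Fisher criterion (mu1 - mu2)^2 / (var1 + var2), and
    return at most max_out likely diagnoses."""
    ranked = sorted(marginals, reverse=True)
    best_crit, best_k = -1.0, 1
    for k in range(1, min(max_out, len(ranked) - 1) + 1):
        likely, unlikely = ranked[:k], ranked[k:]
        denom = pvariance(likely) + pvariance(unlikely)
        crit = (mean(likely) - mean(unlikely)) ** 2 / denom if denom else float("inf")
        if crit > best_crit:
            best_crit, best_k = crit, k
    return ranked[:best_k]

# Hypothetical network output over five diagnoses.
print(likely_diagnoses([0.02, 0.61, 0.05, 0.55, 0.01]))  # -> [0.61, 0.55]
```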
2.5 Similarity in domain semantic space
The output of the Bayesian network can be considered a 14-dimensional feature vector, and used for indexing the database of images. Given a query image, the 14 marginal probabilities of the diagnoses are used to retrieve images with similar marginal probabilities from the database; the diagnoses of these images are returned by the system. The rationale behind this choice is that sometimes the output of the Bayesian network is not sufficient to make a clear choice regarding the diagnoses to be assigned to an unknown image; in other words, the classes may not be well separated. In these cases, however, the pattern of probabilities can be indicative of the diagnoses, and finding an image in the database with a similar pattern of probabilities can give us the right result. Formally, let $I_i$ be the $i$-th image in the database, $D(I_i)$ the set of diagnoses associated with $I_i$, and $p_i = B(I_i)$ the output of the Bayesian network when image $I_i$ is presented as input. We define a distance measure $\delta$ between outputs of the Bayesian network, $\delta(B(I_i), B(I_j))$ being the distance between the $i$-th and the $j$-th image in the database. When an unknown image $J$ is presented, we determine a permutation $\pi$ of the database such that

$$\delta(B(J), B(I_{\pi_i})) \leq \delta(B(J), B(I_{\pi_{i+1}})) \qquad (6)$$
to rank the images, and retain the $K$ images closest to the query: $\{I_{\pi_1}, \ldots, I_{\pi_K}\}$. The union of the diagnoses of these images is assumed as the set of diagnoses of the unknown image:

$$D(J) = \bigcup_{i=1}^{K} D(I_{\pi_i}) \qquad (7)$$
The definition of the distance function $\delta$ is obviously important to ensure the correct behavior of the system. We again use a function that can be seen as a generalization of the Hamming distance. If $0 \leq p \leq 1$ is a probability value, we define its negation as $\bar{p} = 1 - p$. Given two vectors of marginal probabilities $x$ and $y$, we define their distance as

$$\delta(x, y) = \frac{1}{N}\,(x \cdot \bar{y} + \bar{x} \cdot y) \qquad (8)$$
The normalization factor guarantees that $0 \leq \delta \leq 1$. It is immediate to see that if the elements of $x$ and $y$ are in $\{0, 1\}$, $\delta$ is the (normalized) Hamming distance between them. This distance also has another interesting interpretation. Consider a single component $x_i$ and $y_i$ of the vectors, and assume that the "true" values of those components can only be 0 or 1 (i.e., a disease is either present or not). Because of uncertainty, the actual values of $x_i$ and $y_i$ lie in $[0, 1]$. In this case

$$x_i \bar{y}_i + \bar{x}_i y_i = x_i (1 - y_i) + (1 - x_i)\,y_i \qquad (9)$$
is the probability that $x_i$ and $y_i$ are different. The choice of the value $K$ should be made using engineering judgment. High values of $K$ increase the number of true diagnoses incorporated in the answer of the system; that is, increasing $K$ reduces the number of false negatives. At the same time, however, increasing $K$ increases the number of false positives. In all our experiments we used $K = 1$, considering only the database image closest to the query.
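A sketch of the distance of eq. (8) and the retrieval of eqs. (6)–(7) together; the function names, the toy database, and the diagnosis codes are hypothetical.

```python
def delta(x, y):
    """Generalized Hamming distance of eq. (8) between vectors of
    marginal probabilities; reduces to the normalized Hamming distance
    when every entry is 0 or 1."""
    return sum(xi * (1 - yi) + (1 - xi) * yi for xi, yi in zip(x, y)) / len(x)

def retrieve_diagnoses(query_probs, database, k=1):
    """Rank database entries by delta to the query (eq. 6) and return
    the union of the diagnoses of the k nearest images (eq. 7).
    database: list of (marginal_probabilities, set_of_diagnoses) pairs."""
    ranked = sorted(database, key=lambda entry: delta(query_probs, entry[0]))
    result = set()
    for _, diagnoses in ranked[:k]:
        result |= set(diagnoses)
    return result

# Toy two-image database with hypothetical diagnosis codes.
db = [([0.9, 0.1], {"d1"}), ([0.2, 0.8], {"d2"})]
print(retrieve_diagnoses([0.8, 0.3], db, k=1))  # -> {'d1'}
```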
3 Results
Let $C$ define a classification for an image, consisting of a set of one or more diagnoses:

$$C = \{D_i, \ldots\} \qquad (10)$$

where each $D_i$ is one of the 14 diagnosis codes. Let $C_1$ and $C_2$ define two different classifications for an image. Typically, $C_1$ will be our "ground truth" (expert classification) and $C_2$ the classification obtained with one of the three methods above. We define the quality of the match as

$$Q = \frac{|C_1 \cap C_2|}{|C_2|} \qquad (11)$$
A value $Q = 0$ means that no correct diagnosis was reported; a value $Q = 1$ means that all correct diagnoses and no extra diagnoses were reported. Note that the normalization factor is $|C_2|$ and not $|C_1|$, to penalize giving too many diagnoses.
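For completeness, a one-line sketch of the match quality of eq. (11); the diagnosis codes in the example are hypothetical.

```python
def match_quality(c1, c2):
    """Q = |C1 ∩ C2| / |C2| (eq. 11); normalizing by |C2| penalizes
    reporting spurious extra diagnoses."""
    return len(set(c1) & set(c2)) / len(set(c2))

print(match_quality({"d1"}, {"d1", "d7"}))  # -> 0.5
```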
We performed a rotation (leave-one-out) experiment on our image database: each image was in turn removed from the database and treated as the unknown image. The values of $Q$ were collected for all images and their average computed. Nearest neighbors in visual semantic space yielded an average of 0.52, and the Bayesian classifier 0.53. The method using the Bayesian classifier together with the database (search in the domain semantic space) yielded 0.57. The variance was approximately 0.16 in all three cases.
4 Discussion
In this paper we have proposed a new framework for performing database searches by introducing a semantically meaningful feature space. A reasoning system, such as a Bayesian network, can provide this space. Reasoning alone does not always provide sufficient information to classify images; in these cases, comparing the pattern of marginal probabilities with those of images classified earlier can aid proper classification. The new similarity measure we defined generalizes the Hamming distance of binary patterns. The results indicate that the performance of the nearest neighbor and Bayesian classifiers is indistinguishable, while there is some evidence that the combination of the classifier and the database yields improved results. We note that the improvement is small; we hypothesize that the complexity of the semantic network is not on par with the small database of 300 images. Furthermore, the results are only as good as the coverage of the database: if we give the system an image with certain diseases, and the database contains no image with the same diseases, we will not be able to obtain a correct answer. Thus, we can expect the performance of the system to increase with the size of the database.

Acknowledgments

We gratefully acknowledge Prof. Ken Kreutz-Delgado for the many fruitful discussions and for suggesting the generalized Hamming distance.
References

1. M. Flickner, H. Sawhney, W. Niblack, J. Ashley, Q. Huang, B. Dom, M. Gorkani, J. Hafner, D. Lee, D. Petkovic, D. Steele, and P. Yanker. Query by image and video content: the QBIC system. IEEE Computer, 28(9), 1995.
2. A. Gupta and R. Jain. Visual information retrieval. Communications of the ACM, 40(5):70–79, 1997.
3. A. Hoover, M. Goldbaum, A. Taylor, J. Boyd, T. Nelson, S. Burgess, G. Celikkol, and R. Jain. Schema for standardized description of digital ocular fundus image contents. In ARVO Investigative Ophthalmology and Visual Science, Fort Lauderdale, FL, 1998. Abstract.
4. F. Jensen. Hugin API Reference Manual, Version 3.1. Hugin Expert A/S, 1997.
5. V.E. Ogle and M. Stonebraker. Chabot: retrieval from a relational database of images. IEEE Computer, 28(9), 1995.
6. G.W.A.M. van der Heijden and M. Worring. Domain concept to feature mapping for a plant variety image database. In A.W.M. Smeulders and R. Jain, editors, Image Databases and Multimedia Search, volume 8 of Series on Software Engineering and Knowledge Engineering, pages 301–308. World Scientific, 1997.
7. N. Vasconcelos and A. Lippman. A Bayesian framework for semantic content characterization. In Proceedings of the CVPR, pages 566–571, 1998.