In The ICML 2000 Workshop on Machine Learning of Spatial Knowledge, pp. 109–115, 2000
Geometric Patterns: Algorithms and Applications
Stephen Scott (sscott@cse.unl.edu), Dept. of Computer Science & Engineering, 115 Ferguson Hall, University of Nebraska, Lincoln, NE 68588-0115
Abstract

We review definitions of various concept classes called geometric patterns, which are based on a measure of visual similarity called the Hausdorff metric. These classes generalize multiple-instance learning models, and have possible applications to areas including robot vision, drug activity prediction, and content-based image retrieval. We also briefly describe algorithms (which have provable guarantees of time complexity and predictive performance) to learn these concept classes. We then summarize ongoing empirical work to evaluate these algorithms on real data.
1. Introduction

This paper summarizes results in learning geometric patterns and describes possible applications of such algorithms. In our concept classes, each example pattern is a set of points in a d-dimensional space, where the space is either real (for d = 1) or bounded and discretized (for some constant d > 1). Loosely speaking, each concept in the class of geometric patterns can be thought of as a set of patterns that visually resemble each other. We formalize the notion of resemblance between two patterns P and Q with the Hausdorff metric (Gruber, 1983), which is commonly used in computer vision applications. Informally, two geometric patterns P and Q resemble each other under the Hausdorff metric if every point in one pattern is "close" to some point in the other pattern. Most of the classes we present have binary labels, but fuzzy (real-valued) labels are also possible. Our concept classes generalize multiple-instance concepts (Dietterich et al., 1997), where each example is a bag of points rather than a single point. The generalizations include new rules for classifying a bag as positive, and multiple-instance learning when fuzzy labels are used. We also describe potential applications of these concept classes, as well as algorithms to learn them. These algorithms were developed in a learning-theoretic framework; they are provably tolerant of noise, and they are agnostic in the sense that their performance guarantees make no assumptions about the target concept to be learned. Finally, we summarize our ongoing work to empirically evaluate our algorithms in various application areas.

2. The Concept Classes

2.1 One-Dimensional Patterns on the Real Line

We start with a concept class defined by Goldberg and Goldman (1994) and Goldberg et al. (1996), and studied by Goldman and Scott (1999b). In these concept classes, the instance space X_n consists of all configurations (or bags) of ≤ n points on the real line¹. A concept is the set of all bags from X_n within some distance γ, under the Hausdorff metric, of some "ideal" bag of ≤ k points. The Hausdorff distance between bags P and Q, denoted HD(P, Q), is

HD(P, Q) = max{ max_{p∈P} min_{q∈Q} dist(p, q),  max_{q∈Q} min_{p∈P} dist(p, q) },
where dist(p, q) is the distance (under some norm) between points p and q. In other words, if each point in P reports the distance to its nearest neighbor in Q and each point in Q reports the distance to its nearest neighbor in P, then the Hausdorff distance is the maximum of these distances.

Let P be any bag of points on the real line. Then we define the concept C_P that corresponds to P by C_P = {X ∈ X_n : HD(P, X) ≤ γ}. Figure 1 illustrates such a concept. The concept class C_{k,n} that we study is defined as follows: C_{k,n} = {C_P : P is a bag of ≤ k points on the real line}.

At first glance, there may appear to be some similarities between C_{k,n} and the class of unions of at most k intervals over the real line. However, the class of one-dimensional geometric patterns is very different from, and significantly more complex than, the class of unions of intervals. One major difference is that C_{k,n} is a multiple-instance learning class, whereas for the union of intervals, each instance is a single point. Also, for the class of unions of intervals, an instance is a positive example simply when the single point provided is contained within one of the k intervals. For C_{k,n}, a bag is positive if and only if it satisfies the following two conditions: (1) each of the n points in the bag is contained within one of the k width-(2γ) intervals defined by the k target points, and (2) at least one of the n points in the bag is contained within the width-(2γ) interval defined by each of the k target points (see Figure 1).
1. Note that throughout this paper, the word "point" will refer to a single point, and we shall use the term "a bag of points" or "bag" when speaking of an example.
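To make these definitions concrete, here is a minimal sketch (our own illustration, not code from the papers reviewed; all names and values are illustrative) that computes the Hausdorff distance between two bags of points on the real line and tests membership in a concept C_P:

```python
# Sketch: Hausdorff distance between two bags of real-line points, and
# membership in the concept C_P = {X : HD(P, X) <= gamma}.

def hausdorff(P, Q):
    """HD(P, Q): the larger of the two directed nearest-neighbor distances."""
    d_PQ = max(min(abs(p - q) for q in Q) for p in P)
    d_QP = max(min(abs(p - q) for p in P) for q in Q)
    return max(d_PQ, d_QP)

def in_concept(P, X, gamma):
    """X is a positive example of C_P iff HD(P, X) <= gamma."""
    return hausdorff(P, X) <= gamma

# A small example in the spirit of Figure 1 (k = 3 target points, n = 7, gamma = 1);
# the numeric values are made up for illustration.
target = [2.0, 5.0, 9.0]
bag = [1.5, 2.3, 4.8, 5.1, 8.7, 9.0, 9.4]
print(in_concept(target, bag, gamma=1.0))   # True: HD(target, bag) = 0.5
```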
Figure 1. An example concept from C_{3,7} with γ = 1. The top line shows the target pattern (k = 3 target points; each bag has n = 7 points). Around each target point is an interval that covers all points within distance γ from that point. Of the three example bags shown, X1 is positive and X2 and X3 are negative.
Finally, we note that Goldberg (1992) has shown that, given a set S of examples labeled according to some one-dimensional pattern of k points, it is NP-complete to find a pattern of any number of points that correctly classifies all examples in S. So it is necessary to use a more expressive hypothesis space to learn C_{k,n}. To give even further evidence that C_{k,n} is more complex than the union of intervals, observe that the consistency problem for the latter class is trivial to solve.

2.2 d-Dimensional Patterns

Goldman et al. (2000) later extended C_{k,n} to d dimensions. We started by discretizing and bounding the instance space, which allowed us to develop an algorithm to learn a generalization of C_{k,n} in any constant² dimension d. Specifically, we change the space from the real line ℝ to ∏_{i=1}^{d} {1, . . . , s_i}, where s_i bounds the size of the space in dimension i. Now rather than representing each concept as a set of ≤ k points, we represent each concept as a set of ≤ k axis-parallel boxes, where the size of each box in each dimension can vary. In this class (denoted C_{k,n,d}^finite), an example is positive if and only if it satisfies the following two conditions: (1) each of the n points in the bag is contained within one of the k axis-parallel boxes, and (2) at least one of the n points in the bag is contained within each of the k boxes. Note that a d-dimensional version of C_{k,n} could represent its concepts as axis-parallel squares (all of the same size) if the L∞ norm is used in the Hausdorff metric. Thus C_{k,n,d}^finite generalizes a bounded, discretized, d-dimensional version of C_{k,n} under the L∞ norm, because in C_{k,n,d}^finite we allow the boxes' sizes to vary.

2. It is believed that there is no algorithm for learning this class that runs in time poly(d), since this would yield an algorithm for efficiently learning DNF formulas, solving a major open problem in learning theory (Goldman et al., 2000; Auer et al., 1997).

2.3 Fuzzy d-Dimensional Patterns

Goldman and Scott (1999a) then generalized C_{k,n,d}^finite. In Section 2.2, we assumed that each target box either contained or did not contain a point from a given bag, i.e. containment was measured by a binary function. Now we instead associate with each box b a membership function μ_b : {1, 2, . . . , s}^d → [0, 1] that indicates the amount of membership a point p has in b for any p ∈ {1, 2, . . . , s}^d. A reasonable choice for μ_b is a function that is 1 at b's center and monotonically decreases as distance from the center increases, taking the value 0 at b's defining edges, e.g.

μ_b(p) = 0 if p ∉ b, and μ_b(p) = 1 − ‖p − c_b‖_ℓ / max_{p′∈b} ‖p′ − c_b‖_ℓ otherwise,
where c_b is the center point of b and ‖·‖_ℓ denotes the ℓ-norm. The above equation measures the distance from p to b's center and normalizes by dividing by the radius of b under ‖·‖_ℓ. Other possibilities include Gaussian-shaped functions and unnormalized linear functions (Lin & Lee, 1996).

When assigning a label to a bag by a set of target fuzzy boxes, we let each target box assign a label to each point in the bag, and then apply an aggregation operator such as max, min, or average to combine the boxes' predictions into a single prediction. Let C be the set of boxes in the target concept and P be the bag, which has n points p_1, . . . , p_n. Goldman and Scott (1999a) used the following aggregation operators in their work.

(1) max-max: max_{c∈C} max_{p∈P} μ_c(p). Here, each box in C finds the point in P that maximizes its membership function, and then the boxes are combined by taking a maximum. This matches the semantics of learning the union of boxes with multi-instance examples, where an example is positive if some box from C contains at least one of the points from P.

(2) minimax: There are two versions here, depending on whether one wants to first assign a value to each box and then combine the boxes (min_{c∈C} max_{p∈P} μ_c(p)), or to first assign a value to each point and then combine the points (min_{p∈P} max_{c∈C} μ_c(p)). The former corresponds to criterion 2 in Section 2.2 and the latter corresponds to criterion 1. So we apply both operators and define the final membership value as the minimum of these two values. This definition is an interpretation of geometric patterns with fuzzy boxes, related to the Hausdorff metric. It finds the box with the minimum membership value (having picked the best point for the box) and the point with the minimum membership value (having picked the best box for the point), and then takes the smaller of the two.

(3) average: (1/n) Σ_{i=1}^{n} max_{c∈C} μ_c(p_i). This definition is motivated by a desire to tolerate noise in the data, and is described further in Section 3.1.
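As an illustration of the fuzzy machinery above, the following sketch (our own, with illustrative names; it assumes axis-parallel boxes given by low/high corners and uses the L∞ norm for ‖·‖_ℓ) implements the linear membership function and the three aggregation operators:

```python
# Sketch (illustrative, not the authors' code): fuzzy box membership with an
# L-infinity norm, plus the max-max, minimax, and average aggregation operators.
import numpy as np

def membership(p, low, high):
    """Linear membership: 1 at the box center, decreasing to 0 toward the edges."""
    p, low, high = map(np.asarray, (p, low, high))
    if np.any(p < low) or np.any(p > high):
        return 0.0                              # outside the box: membership 0
    center = (low + high) / 2.0
    radius = np.max((high - low) / 2.0)         # L-infinity radius of the box
    if radius == 0:
        return 1.0
    return 1.0 - np.max(np.abs(p - center)) / radius

def max_max(boxes, bag):
    return max(max(membership(p, lo, hi) for p in bag) for (lo, hi) in boxes)

def minimax(boxes, bag):
    per_box = min(max(membership(p, lo, hi) for p in bag) for (lo, hi) in boxes)
    per_point = min(max(membership(p, lo, hi) for (lo, hi) in boxes) for p in bag)
    return min(per_box, per_point)

def average(boxes, bag):
    return sum(max(membership(p, lo, hi) for (lo, hi) in boxes) for p in bag) / len(bag)
```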
3. Potential Applications

3.1 The Landmark Matching Problem

As a possible application of algorithms that can learn the above concept classes, consider the problem of recognizing, from a visual image of a robot's current location, whether or not it is in the vicinity of a known landmark (where a landmark is a location that is visually different from other locations). Such an algorithm is needed as one piece of a complete navigation system in which the robot navigates by planning a path between known landmarks, tracking the landmarks as it goes. Because of inaccuracies in effectors and possible errors in its internal map, when the robot believes it is at landmark L, it should check that it is really near L before heading to the next landmark. Adjustments can then be made if the robot is not at L by re-homing to L and/or updating its map. For efficiency's sake, one can use one-dimensional visual data (of a 360° view) to do the matching (Hong et al., 1992; Levitt & Lawton, 1990; Pinette, 1993; Suzuki & Arimoto, 1988). To apply our learning algorithms, we pre-process the images, placing points on the real line where there are significant changes in light intensity, producing one-dimensional geometric patterns. We then take patterns produced from images taken near the landmark to act as positive examples and patterns produced from images taken not near the landmark as negative examples. The idea is that we can learn a hypothesis from these examples that can accurately predict whether a new pattern came from near or not near the landmark.

A major issue revealed by Goldman and Scott (1999b) in using one-dimensional patterns for this problem is that valuable information is lost when mapping the robot vision data to one-dimensional geometric patterns. All that a one-dimensional geometric pattern tells us is where the absolute value of the derivative of the input signal exceeds some threshold. What are lost are the direction and the magnitude of that change. To capture this information, a two-dimensional pattern is necessary: the x direction represents the position in the input signal as before, and the y direction represents the derivative of the signal at that point. The concept class to learn could then be represented by circles (if the L2 norm is used with the Hausdorff metric), rhombi (if the L1 norm is used), axis-aligned squares (if the L∞ norm is used), or arbitrary axis-aligned rectangles (if the L∞ norm is used and a different value of γ is used in each dimension for each box). The latter class is C_{k,n,d}^finite if we allow for a bounded, discrete space, which causes no difficulties since, e.g., for robot navigation the input data always consist of one-dimensional arrays of light intensities, where the length of the array and the number of values each element of the array can take are known a priori. Finally, of course, if the input data consisted of two-dimensional information, we could map this to three-dimensional patterns, and so on to d-dimensional input mapped to (d + 1)-dimensional patterns. This allows us to handle other types of constant-dimensional data, e.g. amplitudes of a waveform, sonar data, or temporal difference information.

We now motivate the use of fuzzy geometric patterns in this problem. Imagine, e.g., a one-dimensional waveform W that is considered a positive instance because it visually resembles other positive instances. By taking W's derivative at different sampling points, we can map it to a bag of points P (Figure 2), whose values in the y direction correspond to the changes in W.
Figure 2. An example of mapping a one-dimensional waveform to a two-dimensional pattern, and of how the average aggregation operator can be more appropriate for pattern-recognition problems than minimax. The dashed boxes indicate components of the waveform W that must be modified so W can approximate the "ideal" signal. The pattern P below W is the result of the mapping. The boxes around P's points represent the target concept C. In the justification of the average aggregation operator, W′ is the waveform that includes the dashed spike and P′ the pattern that includes the open circles.
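A minimal sketch of the signal-to-pattern mapping described above (our own illustration; the finite-difference derivative, the threshold, and all names are assumptions rather than details from the papers):

```python
# Sketch: map a uniformly sampled 1-D signal to a bag of (position, derivative)
# points, keeping points only where the change is significant.
import numpy as np

def waveform_to_pattern(signal, threshold):
    """A point is emitted wherever |discrete derivative| exceeds the threshold.

    The y-coordinate keeps the signed derivative, so the direction and
    magnitude of the change are not lost (unlike the 1-D pattern mapping).
    """
    deriv = np.diff(np.asarray(signal, dtype=float))
    positions = np.nonzero(np.abs(deriv) > threshold)[0]
    return [(int(x), float(deriv[x])) for x in positions]

# Example: a signal with one sharp rise and one sharp fall.
signal = [0, 0, 0, 5, 5, 5, 1, 1]
print(waveform_to_pattern(signal, threshold=2.0))   # [(2, 5.0), (5, -4.0)]
```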
For a standard (non-fuzzy) geometric pattern C, a bag P is positive iff (1) each of the ≤ k boxes c ∈ C contains a point from P, and (2) each point in P lies in some box c ∈ C. For this application, one can view each box as an area in which one expects the target image to have certain behavior. In the fuzzy generalization, the label³ of example P measures the degree to which criteria 1 and 2 are satisfied. Ideally, in a positive example, there would be a point in the center of each box. However, part of the motivation for using a learning approach for this pattern-matching problem is the flexibility provided by the geometric concepts for handling variations between images of the same object obtained from slightly different locations and/or conditions (one can think of these concept classes as generalizations of the Hausdorff metric where weighted norms are used, and the weights can vary for each point). Under the standard (non-fuzzy) formulation, a binary classification is made for each point (based on whether it is inside a box or not) and then these are combined to give a classification for the n-point multi-instance example.

3. The labels could come from a human expert or signal processing system that we are trying to approximate.
A problem with this formulation is that one could have an example X1 in which all n points are very near the centers of the boxes and another example X2 in which all n points are along the borders of the boxes. One would like a classification scheme that reflects that X1 is more similar to the target than X2. Our work resolves this important problem by using a fuzzy model of membership for a point inside a box and then using fuzzy aggregation operators to combine the points in the instance.

In this application area, the target boxes can be thought of as indicators of where "components" of a waveform W should lie. If the target boxes are fuzzy, then each box can indicate how much the components of W must be translated, stretched, compressed, and scaled so W can approximate the "ideal" pattern. Using a minimax approach in the target concept is tantamount to saying that the similarity is based on the amount of transformation required in the "worst" component, independent of the others, whereas averaging over all points tells, on average, the amount of transformation required. The average measure tends to degrade more gracefully than minimax as W becomes less ideal. For example, if W′ = W with one noisy spike (the dashed line and open circles of Figure 2), then setting the label using an average over the points will yield a value farther from 1 for P′ than for P, but still > 0. If instead we use the original minimax definition, the label of P′ would be 0, losing much information. This is the justification for the average aggregation operator of Section 2.3.

3.2 Learning from Multiple-Instance Examples and Drug Activity Prediction

Motivated by drug discovery, Dietterich et al. (1997) introduced the notion of learning from multiple-instance examples, where the target concept is simply a boolean function, each example is a collection of instances, and the example (collection) is classified as positive if and only if at least one of its elements is mapped to 1 by the target concept. Their model is primarily motivated by the problem of predicting whether a molecule would bind at a particular site. They argued empirically that axis-parallel rectangles are good hypotheses for this and other similar learning problems. Subsequently, multi-instance learning has been extensively studied (e.g. Long & Tan, 1998; Auer et al., 1997; Auer, 1997; Blum & Kalai, 1998; Maron & Lozano-Pérez, 1998; Maron, 1998; Maron & Ratan, 1998). In several of these papers, the target concept is a single axis-parallel box and a bag is classified as positive if at least one of the points in the bag is inside the target box. But any algorithm for learning geometric patterns (e.g. Section 4.3) learns a union of axis-parallel boxes where a multiple-instance example is classified as
positive if and only if (1) each point is classified as positive by some box and (2) every box contains at least one point. Furthermore, the algorithm of Section 4.3 is easily adapted to learn the union of axis-parallel boxes (of constant dimension) in the multiple-instance model under the rule that an example is positive if at least one of its points is inside some target box. Other variations of the rule for when an example is positive can be used, such as: an example is positive iff every target box contains some point from the example, but not every point must lie in a box. Such a concept class is potentially useful when predicting "antagonist drugs", whose job is not to bind at a single site of another molecule (which has been the focus of recent research, e.g. Dietterich et al., 1997; Auer, 1997; Maron & Lozano-Pérez, 1998), but instead to block access to multiple binding sites in a single molecule by fitting into several of them simultaneously. Such a concept cannot be represented by a single axis-parallel box, but can be represented by the class just described.

In their work on drug discovery, Dietterich et al. (1997) used a boolean classification as to whether or not a molecule is a musk molecule. But in many cases it was hard for the experts to decide between a classification of musk or non-musk. (They handled this by using only the data for which the experts had no uncertainty.) By using fuzzy geometric concepts, the fuzzy classification can indicate the degree to which the example satisfies the target concept. So, for example, for the musk-odor prediction task, compounds could be assigned a degree of "muskness" rather than being forced into a classification as a musk (1) or a non-musk (0). The main drawback to applying our algorithms (Sections 4.3 and 4.4) to the drug discovery problem is that their time complexity blows up exponentially in the dimension d. Since much of the drug discovery data (e.g. the Musk UCI data set) has very high dimension, some pruning or remapping of features is required.

3.3 Content-Based Image Retrieval

Maron and Ratan (1998) demonstrated the feasibility of using multiple-instance learning for content-based image retrieval (CBIR) for images of natural scenes. The system was query-by-example: the user presented to the system examples of desired images, and the system's job was to determine the commonalities among the query images. Maron and Ratan filtered and subsampled the images and then extracted "blobs" (groups of m adjacent pixels), which they mapped to a (3m)-dimensional space (one attribute for each red, green, and blue value of each pixel). Each blob was mapped to one point in a bag, which consisted of all possible blobs from an image. All bags derived from query images were labeled as positive. Then the system used the multiple-instance learning algorithm diverse density
(Maron, 1998) to learn the commonalities, find more positive and negative examples in the database, and then present those to the user for labeling for another iteration of learning.

We are exploring the application of multiple-instance learning algorithms to CBIR, including our pattern-learning algorithms (Goldman et al., 2000; Goldman & Scott, 1999a) and those of Auer (1997) and Maron and Ratan (1998). We are using blobs as features as well as other features such as line-segment descriptors, Fourier coefficients of shapes, centralized moments, and elongatedness. Primitive descriptors like line segments and elongatedness are not expressive enough to be used in a purely disjunctive multiple-instance learning model where the target concept is a single box or point in real space and a bag is positive if any of its points is positive. Instead, these primitive descriptors require conjunctive multiple-instance concepts, like geometric patterns or the class described in Section 3.2 for antagonist drugs. (This is because the primitive descriptors require conjunctive operators to represent meaningful higher-level concepts like shape, whereas Maron and Ratan's blobs implicitly represent spatial information.) Thus we expect our algorithms (Sections 4.3 and 4.4 and their variants), which can learn more expressive multiple-instance concepts, to be more successful than other algorithms when using primitive descriptors. We also expect the primitives (when used with conjunctive operators) to represent shapes invariant to rotation, translation, and scale, which cannot be handled with blobs (experiments are planned to verify this). The downside to our algorithms is that their time complexity is exponential in the dimension, so few attributes can be used per instance.
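To summarize the different bag-classification rules discussed in Sections 2.2, 3.2, and 3.3, here is an illustrative sketch (ours; the box representation and names are assumptions) contrasting the standard multiple-instance rule, the geometric-pattern rule, and the "antagonist" variant:

```python
# Sketch contrasting bag-classification rules discussed above (illustrative only).
# Each box is a (low, high) pair of coordinate tuples; a bag is a list of points.

def contains(box, p):
    low, high = box
    return all(l <= x <= h for l, x, h in zip(low, p, high))

def standard_mi_positive(boxes, bag):
    """Dietterich et al.: positive iff some target box contains some point."""
    return any(contains(b, p) for b in boxes for p in bag)

def pattern_positive(boxes, bag):
    """Geometric patterns: every point lies in some box AND every box contains a point."""
    every_point_covered = all(any(contains(b, p) for b in boxes) for p in bag)
    every_box_hit = all(any(contains(b, p) for p in bag) for b in boxes)
    return every_point_covered and every_box_hit

def antagonist_positive(boxes, bag):
    """Variant for antagonist drugs: every box is hit; stray points are allowed."""
    return all(any(contains(b, p) for p in bag) for b in boxes)
```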
4. The Algorithms

4.1 A Noise-Intolerant PAC Algorithm for C_{k,n}

Our first two algorithms work within the PAC (probably approximately correct) model of learning theory (Valiant, 1984; Valiant, 1985; Kearns & Vazirani, 1994). A PAC algorithm is given inputs ε and δ and a set of examples of the target concept randomly drawn according to a fixed, arbitrary, and unknown probability distribution D. It outputs a hypothesis that, with probability at least 1 − δ, has error at most ε on examples randomly drawn according to the same distribution D.

We now review a PAC algorithm (Goldberg & Goldman, 1994; Goldberg et al., 1996) for learning C_{k,n}. This algorithm is called an Occam algorithm (Blumer et al., 1987; 1989) because, in the spirit of Occam's Razor, its hypothesis is a shorter representation of the training sample. It draws a sample of size

m = O( (k^{5/2} log(kn) / ε) log(1/δ) + (k^{5/2} log(kn) / ε) log(1/ε) )

and finds a hypothesis consistent with all examples. It builds the hypothesis using a greedy covering algorithm that is based on the observation that it is possible, in polynomial time, to find a concept from C_{k+1,n} consistent with all the positive examples and with a 1/(2(k+1)) fraction of the negative examples. The negative examples accounted for are then removed and the procedure is repeatedly applied (at most 2(k + 1) lg m times) until all negative examples have been eliminated. The intersection of all concepts obtained by doing this is consistent with the sample and thus forms a hypothesis that, with probability at least 1 − δ, has error at most ε.
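A sketch of the greedy covering loop just described (our own illustration; the subroutine find_cover, which returns a concept from C_{k+1,n} consistent with all positives and with at least a 1/(2(k+1)) fraction of the remaining negatives, is assumed as a black box and is not implemented here):

```python
# Sketch of the Occam-style greedy cover. Each concept returned by find_cover
# is a callable c(x) -> bool (True = classified positive); it is consistent
# with all positive examples and correctly rejects a fraction of the negatives.

def occam_greedy_cover(positives, negatives, find_cover):
    """Repeatedly cover negatives; the hypothesis is the intersection of the covers."""
    covers = []
    remaining = list(negatives)
    while remaining:
        c = find_cover(positives, remaining)         # consistent with all positives
        covers.append(c)
        remaining = [x for x in remaining if c(x)]   # keep negatives c still misclassifies
    # Final hypothesis: positive iff every cover says positive, so every positive
    # example is accepted and every training negative is rejected by some cover.
    return lambda x: all(c(x) for c in covers)
```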
4.2 A Noise-Tolerant PAC Algorithm for C_{k,n}

A key drawback of the Occam algorithm for learning C_{k,n} is that it is very sensitive to even a small amount of noise in the data. A single negative example misclassified as positive could make it impossible to find a pattern that is consistent with all the positively labeled examples. Thus Goldman and Scott (1999b) gave an algorithm that is tolerant of certain types of noise. This algorithm, rather than examining the labels of individual examples, asks questions relating to statistics about the examples. Such statistical query (SQ) algorithms (Kearns, 1993; Aslam & Decatur, 1993; Aslam & Decatur, 1998; Decatur, 1993) can also be shown to meet the PAC criteria, even in the presence of many types of noise, making them appealing tools for working with real data. To meet the PAC criteria, our SQ algorithm requires a sample of size

O( (Dk²) / (ε²(1 − 2η_b)²) · ( log(k / (ε(1 − 2η_b))) + log(1/δ) ) ),

where D = O( k² log n · log(1/ε) · (log k + log log(1/ε)) ) is the VC dimension of our algorithm's hypothesis class and η_b is an upper bound on the classification noise rate. This algorithm starts with a hypothesis that has low false-negative error (as measured by SQs) and then performs a greedy "covering" of the false-positive error (also measured by SQs). Thus the algorithm is similar to the Occam algorithm, except that its use of statistical measurements in lieu of labeled examples gives it tolerance of large amounts of different types of noise.

Goldman and Scott (1999b) also gave experimental results on simulated data showing not only that the empirical data requirements are substantially less than the theoretical requirements (as with the Occam algorithm), but also that the SQ algorithm performs nearly as well as the Occam algorithm on noise-free data. This provides empirical evidence that SQ algorithms may be effective tools in practice. Also given were preliminary experimental results on real one-dimensional robot vision data that revealed the loss of information when mapping to one-dimensional patterns, showing that the two-dimensional mapping of Section 3.1 is necessary for this application.

4.3 An On-Line, Agnostic Algorithm for C_{k,n,d}^finite

For our next algorithm (to learn C_{k,n,d}^finite), we change learning models. Instead of learning in batch mode, our algorithm
works on-line (Angluin, 1988; Littlestone, 1988), meaning that there is no separate "training phase" in which to develop a complete hypothesis. An on-line learner must learn as it goes, and all errors it makes count against it. Instead of bounding the probability of error of an on-line learner (as in the PAC model⁴), we bound the number of prediction mistakes that the learner makes on the sequence of examples it sees when these examples are adversarially generated. Goldman et al. (2000) give an on-line algorithm (based on the algorithm Winnow of Littlestone, 1988) to learn geometric patterns. This algorithm is agnostic (Haussler, 1992; Kearns et al., 1994) in the sense that its error bounds make no assumptions whatsoever about the target concept to be learned (as opposed to our PAC algorithms, whose error bounds break if our assumptions are violated). Our algorithm is also tolerant of concept shift: if the target concept changes over time, our algorithm can adapt its hypothesis.

The algorithm of Goldman et al. (2000) reduces the problem of learning geometric patterns to that of learning a disjunction of attributes. Recall the two criteria from Section 2.2 for an example (bag) to be positive. These are easily mapped to two criteria that indicate when a bag is negative. Namely, a bag is negative iff (1) there exists a box completely disjoint from all target boxes that contains a point, or (2) there exists a box in the target concept that does not contain a point. Thus, given the set C of k target boxes and a set C̄ of k_comp boxes whose union is the complement of the union of the target boxes, an example is negative iff (1) some box in C̄ contains a point or (2) some box in C is empty. So if we associate two boolean attributes with each box (one attribute is 1 if the box is empty, and one is 1 if it contains a point), then an example is negative iff the monotone disjunction of the K = k + k_comp relevant attributes is 1. Thus we can associate two attributes with every box⁵ (for a total of N attributes) and use Winnow to learn the target disjunction, which it is guaranteed to do while making only O(K log N) prediction mistakes (Littlestone, 1988). This algorithm is also agnostic in the sense that if there does not exist a set of target boxes that correctly classifies all examples, it makes at most O(K M_opt + K log N) prediction mistakes, where M_opt is the number of mistakes made by the best collection.

One more issue we must contend with is that a direct implementation of Winnow in our setting requires that the number of attributes (and thus the computation time) be exponential in the number of bits required to represent an instance and a concept. We circumvent this problem by applying the virtual weight technique of Maass and Warmuth (1998) to implicitly maintain the weights.
4. On-line algorithms with a bounded number of mistakes can be mapped to PAC algorithms (Angluin, 1988; Littlestone, 1989).
5. Recall from Section 2.2 that our concept class is bounded and discretized, so the number of boxes is finite.
The basic idea is to simulate Winnow by grouping boxes that "behave alike" into blocks. For each block, only one weight has to be computed, and we construct the blocks so that both the number of boxes in each block and the weight for the block can be computed efficiently. Using the virtual weights technique allows our time complexity to be polynomial in the size of the examples and the target concept.

4.4 Learning Fuzzy Geometric Patterns

Goldman and Scott (1999a) developed algorithms for learning the concept classes described in Section 2.3. Those algorithms are similar to those of Section 4.3, except that since real-valued predictions are required, the square loss function is used and the problem is reduced to that of learning real-valued functions using the exponentiated gradient algorithm (EG; Kivinen & Warmuth, 1997). We also give a novel application of the virtual weights technique that enables our predictions to be made in polynomial time. In typical applications of the virtual weights technique, all of the concepts in a group have the same weight and prediction, which allows a single "representative" concept from each group to be tracked. In our application, however, there are an exponential number of different weights (and possible predictions). Hence boxes in the same group can have different weights and predictions, which makes computing a group's contribution significantly more involved. We are nonetheless able to keep the number of groups polynomial in the number of trials and to compute the overall prediction efficiently.
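Since the reduction of Section 4.3 ultimately hands a monotone-disjunction problem to Winnow, the following minimal sketch of Winnow on explicit boolean attribute vectors may help fix ideas (our own illustration; it omits the virtual-weight grouping that makes the real algorithm efficient, and the parameter names are assumptions):

```python
# Minimal Winnow sketch for monotone disjunctions over explicit boolean
# attributes. In the reduction of Section 4.3, each box contributes two
# attributes ("box is empty" / "box contains a point") and the learned
# disjunction predicts the bag's negative label.

def winnow(examples, n_attrs, alpha=2.0):
    """examples: iterable of (x, y) with x a 0/1 list of length n_attrs, y in {0, 1}.
    Returns the final weight vector; mistakes drive multiplicative updates."""
    w = [1.0] * n_attrs
    theta = float(n_attrs)                 # standard threshold choice
    for x, y in examples:
        pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) >= theta else 0
        if pred == 1 and y == 0:           # false positive: demote active attributes
            w = [wi / alpha if xi else wi for wi, xi in zip(w, x)]
        elif pred == 0 and y == 1:         # false negative: promote active attributes
            w = [wi * alpha if xi else wi for wi, xi in zip(w, x)]
    return w
```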
5. Conclusions and Future Work

The concept classes of geometric patterns are very expressive and flexible. Thus algorithms to learn these classes are potentially quite useful for many application areas in which spatial information is important. Ongoing work in our research group applies Section 4.3's algorithm to content-based image retrieval (CBIR) and robot vision. We also plan to contrast this algorithm's performance with other multiple-instance learning algorithms in CBIR, and to evaluate our algorithms in domains such as drug discovery (especially detection of antagonist drugs) and computational biology. (The main issue to overcome in some of these applications is the high dimensionality of the data; our algorithms' time complexity grows exponentially in d.) We also wish to implement and test Section 4.4's algorithm on fuzzy-labeled data; e.g., in CBIR, a user would give a [0, 1]-valued confidence in how well an example image matches the user's true query.

References

Angluin, D. (1988). Queries and concept learning. Machine Learning, 2, 319–342.
Aslam, J. A., & Decatur, S. E. (1993). General bounds on statistical
query learning and PAC learning with noise via hypothesis boosting. Proceedings of the 34th Annual Symposium on Foundations of Computer Science (pp. 282–291).
Kearns, M. (1993). Efficient noise-tolerant learning from statistical queries. Proc. 25th Annu. ACM Sympos. Theory Comput. (pp. 392–401). ACM Press, New York, NY.
Aslam, J. A., & Decatur, S. E. (1998). Specification and simulation of statistical query algorithms for efficiency and noise tolerance. Journal of Computer and System Sci., 56, 191–208.
Kearns, M., & Vazirani, U. (1994). An introduction to computational learning theory. Cambridge, Massachusetts: MIT Press.
Auer, P. (1997). On learning from multi-instance examples: Empirical evaluation of a theoretical approach. Proc. 14th International Conference on Machine Learning (pp. 21–29). Auer, P., Long, P. M., & Srinivasan, A. (1997). Approximating hyper-rectangles: Learning and pseudo-random sets. Proceedings of the Twenty-Ninth Annual ACM Symposium on Theory of Computing (pp. 314–323). ACM.
Kearns, M. J., Schapire, R. E., & Sellie, L. M. (1994). Toward efficient agnostic learning. Machine Learning, 17, 115–142. Kivinen, J., & Warmuth, M. K. (1997). Exponentiated gradient versus gradient descent for linear predictors. Information and Computation, 132, 1–63. Levitt, T., & Lawton, D. (1990). Qualitative navigation for mobile robots. Artificial Intelligence, 44, 305–360.
Blum, A., & Kalai, A. (1998). A note on learning from multipleinstance examples. Machine Learning, 30, 23–29.
Lin, C.-T., & Lee, C. S. G. (1996). Neural fuzzy systems: A neurofuzzy synergism to intelligent systems. Prentice Hall.
Blumer, A., Ehrenfeucht, A., Haussler, D., & Warmuth, M. K. (1987). Occam’s razor. Inform. Proc. Lett., 24, 377–380.
Littlestone, N. (1988). Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm. Machine Learning, 2, 285–318.
Blumer, A., Ehrenfeucht, A., Haussler, D., & Warmuth, M. K. (1989). Learnability and the Vapnik-Chervonenkis dimension. J. ACM, 36, 929–965.
Littlestone, N. (1989). From on-line to batch learning. Proc. 2nd Annu. Workshop on Comput. Learning Theory (pp. 269–284).
Decatur, S. E. (1993). Statistical queries and faulty PAC oracles. Proc. 6th Annu. Workshop on Comput. Learning Theory (pp. 262–268). ACM Press, New York, NY.
Long, P. M., & Tan, L. (1998). PAC learning axis-aligned rectangles with respect to product distributions from multipleinstance examples. Machine Learning, 30, 7–21.
Dietterich, T. G., Lathrop, R. H., & Lozano-Perez, T. (1997). Solving the multiple-instance problem with axis-parallel rectangles. Artificial Intelligence, 89, 31–71.
Maass, W., & Warmuth, M. K. (1998). Efficient learning with virtual threshold gates. Inf. and Comp., 141, 66–83.
Goldberg, P. (1992). PAC learning geometrical figures. PhD dissertation, University of Edinburgh.
Goldberg, P. W., & Goldman, S. A. (1994). Learning one-dimensional geometric patterns under one-sided random misclassification noise. Proc. 7th Annu. ACM Workshop on Comput. Learning Theory (pp. 246–255).
Goldberg, P. W., Goldman, S. A., & Scott, S. D. (1996). PAC learning of one-dimensional patterns. Machine Learning, 25, 51–70.
Goldman, S. A., Kwek, S. K., & Scott, S. D. (2000). Agnostic learning of geometric patterns. Journal of Computer and System Sciences. To appear. Early version in COLT '97.
Goldman, S. A., & Scott, S. D. (1999a). Multi-instance learning of fuzzy geometric concepts (Technical Report UNL-CSE-99006). Dept. of Computer Science, University of Nebraska.
Goldman, S. A., & Scott, S. D. (1999b). A theoretical and empirical study of a noise-tolerant algorithm to learn geometric patterns. Machine Learning, 37, 5–49.
Gruber, P. M. (1983). Approximation of convex bodies. In P. M. Gruber and J. M. Willis (Eds.), Convexity and its applications. Birkhäuser Verlag.
Haussler, D. (1992). Decision theoretic generalizations of the PAC model for neural net and other learning applications. Inform. Comput., 100, 78–150.
Hong, J., Tan, X., Pinette, B., Weiss, R., & Riseman, E. (1992). Image-based homing. IEEE Control Systems Mag., 12, 38–45.
Maron, O. (1998). Learning from ambiguity. PhD dissertation, Dept. of Electrical Engineering and Computer Science, M.I.T.
Maron, O., & Lozano-Pérez, T. (1998). A framework for multiple-instance learning. Advances in Neural Information Processing Systems 10.
Maron, O., & Ratan, A. L. (1998). Multiple-instance learning for natural scene classification. Proc. 15th International Conf. on Machine Learning (pp. 341–349).
Pinette, B. (1993). Image-based navigation through large-scaled environments. PhD dissertation, University of Massachusetts, Amherst.
Suzuki, H., & Arimoto, S. (1988). Visual control of autonomous mobile robot based on self-organizing model for pattern learning. Journal of Robotic Systems, 5, 453–470.
Valiant, L. G. (1984). A theory of the learnable. Commun. ACM, 27, 1134–1142.
Valiant, L. G. (1985). Learning disjunctions of conjunctions. Proceedings of the 9th International Joint Conference on Artificial Intelligence, vol. 1 (pp. 560–566). Los Angeles, California.