Cofe: A Scalable Method for Feature Extraction from Complex Objects

Gabriela Hristescu and Martin Farach-Colton*
{hristesc,farach}@cs.rutgers.edu
Rutgers University, Dept. of Computer Science, Piscataway NJ 08854, USA

Abstract. Feature Extraction, also known as Multidimensional Scaling, is a basic primitive associated with indexing, clustering, nearest neighbor searching and visualization. We consider the problem of feature extraction when the data-points are complex and the distance function is very expensive to evaluate. Examples of expensive distance evaluations include those for computing the Hausdorff distance between polygons in a spatial database, or the edit distance between macromolecules in a DNA or protein database. While feature extraction is a well-studied problem in the databases and statistics communities, almost all known methods require that the distance between every pair of points be evaluated. This is prohibitive, even for small databases, when the distance function is expensive. We propose Cofe, a method for sparse feature extraction which is based on novel random non-linear projections. We evaluate Cofe on real data and find that it performs very well in terms of quality of features extracted, number of distances evaluated, number of database scans performed and total run time. We further propose Cofe-GR, which matches Cofe in terms of distance evaluations and run time, but outperforms it in terms of quality of features extracted.
1 Introduction
Feature Extraction, also known as Multidimensional Scaling (MDS), is a basic primitive associated with indexing, clustering, nearest neighbor searching and visualization. The simplest instance of feature extraction arises when the points of a data set are defined by a large number, k, of features. We say that such points are embedded in k-dimensional space. Picking out k′ ≪ k features to represent the data-points, while preserving distances between points, is a feature extraction problem called the dimensionality reduction problem. The most straightforward and intuitively appealing way to reduce the number of dimensions is to pick some subset of size k′ of the k initial dimensions. However, taking k′ linear combinations of the original k dimensions can often produce substantially better features than this naïve approach. Such approaches are at the heart of methods like Singular Value Decomposition (SVD) [10] or the Karhunen-Loève transform [8]. Of course, linear combinations of original dimensions are but one way to pick features. Non-linear functions of features have the potential to give even better embeddings, since they are more general than linear combinations.
* Supported by NSF Award CCR-9820879
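As a concrete illustration of the linear approach, the following minimal sketch (in Python with NumPy; not part of the original paper) projects n points in k dimensions onto their top k′ singular directions. The random data matrix X and the choice k′ = 2 are assumptions made purely for illustration.

import numpy as np

# Hypothetical illustration: reduce n points in k dimensions to k' dimensions
# by projecting onto the top singular directions (the idea behind SVD/PCA).
rng = np.random.default_rng(0)
n, k, k_prime = 100, 50, 2                   # sizes chosen only for illustration

X = rng.standard_normal((n, k))              # n points embedded in k-dimensional space
X_centered = X - X.mean(axis=0)              # center the data before decomposing

# SVD factors X_centered = U @ diag(S) @ Vt; the rows of Vt are orthonormal
# directions, ordered by how much of the data's variation they capture.
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

# Each extracted feature is a linear combination of the original k dimensions.
X_reduced = X_centered @ Vt[:k_prime].T      # shape (n, k_prime)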
In many other cases the points are complex objects which are not embedded. For example, if the dataset consists of DNA or protein sequences, then there is no natural notion of a set, large or otherwise, of orthogonal features describing the objects. Similarly, in multimedia applications, the data may include polygons as part of the description of an object. Once again, polygonal shapes can be described as a sequence of points along the convex hull, but such a description does not constitute a description of the object in a feature space. While such data types are not represented in a feature space, they are typically endowed with a distance function, which together with the dataset of objects defines a distance space. For example, the distance between biological macromolecules is taken to be some variant of the edit distance between them. For geometric objects in 2 or 3 dimensions, the distance is often measured as the Hausdorff distance. These distance functions are very expensive to compute. The edit distance between two sequences of length m takes O(m^2) time to compute (see the sketch at the end of this section), while the Hausdorff distance between two geometric objects, each with m points, takes O(m^5) time to compute. Even though complex objects have a finite representation in the computer, the natural feature space this representation describes does not preserve distances between objects. For example, the k points of a polygon can be trivially represented by O(k) dimensions via a straightforward O(k)-bit computer representation, but the vector distance between such embeddings is unrelated to any natural geometric distance between the polygons.

The Complex Object Multidimensional Scaling (COMDS) problem is then the problem of extracting features from objects given an expensive distance function between them. A good solution to the COMDS problem has to have good quality:

Quality: The k features extracted should reflect, as closely as possible, the underlying distances between the objects. Furthermore, the extracted features should be good even for small k. If we are interested in visualization, k = 2, 3. Clustering and nearest neighbor searching become prohibitively expensive, or the quality of the clustering degrades, if k is more than about 10. Thus the quality of a COMDS algorithm depends on the quality of a small number of extracted features.

There is a tradeoff between the quality and the scalability of a solution to the COMDS problem. A scalable solution should have the following characteristics:

Sparsity: Since the distance function is very expensive to evaluate, it is not feasible to compute all n^2 pairwise distances, where n is the number of elements in the database. Thus, the method must compute only a sparse subset of all possible distances.

Locality: As many databases continue to grow faster than memory capacity, the performance of any COMDS solution will ultimately depend on the number of accesses to secondary storage. Therefore, a COMDS solution should have good locality of object references. This can be measured by the number of database scans necessary to compute the embedding; this number should be low.

We address these issues in designing Cofe, an algorithm for the COMDS problem. We evaluate Cofe and compare it with FastMap [7], a previously proposed solution for COMDS.
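To make the cost of such distance functions concrete, here is a minimal sketch (not from the paper) of the classic O(m^2) dynamic-programming edit distance between two sequences. The unit edit costs are an assumed simplification; biological applications typically use weighted variants.

def edit_distance(s, t):
    # Classic O(|s| * |t|) dynamic program over prefix pairs; unit costs
    # are an assumed simplification of weighted biological edit distances.
    m, n = len(s), len(t)
    prev = list(range(n + 1))    # row 0: distances from the empty prefix of s
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            curr[j] = min(prev[j] + 1,                          # delete s[i-1]
                          curr[j - 1] + 1,                      # insert t[j-1]
                          prev[j - 1] + (s[i - 1] != t[j - 1])) # substitute
        prev = curr
    return prev[n]

print(edit_distance("ACGT", "AGT"))  # prints 1 (delete the 'C')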
Features define a metric space. The standard way to define the distance between two k-dimensional feature vectors is through their l2 (Euclidean) distance; that is, if point p has features p1, ..., pk and point q has features q1, ..., qk, we can interpret the "feature distance" as

d′(p, q) = \sqrt{\sum_{i=1}^{k} (p_i − q_i)^2}.

Taken in this view, we seek the embedding of the real distance function d(·,·) into
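As a worked instance of the formula above, the following sketch computes the l2 feature distance; the two 3-dimensional feature vectors are hypothetical, chosen only to show the formula in use.

import math

def feature_distance(p, q):
    # l2 (Euclidean) distance between two k-dimensional feature vectors.
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

# Hypothetical 3-dimensional feature vectors:
print(feature_distance([1.0, 2.0, 3.0], [4.0, 6.0, 3.0]))  # sqrt(9 + 16 + 0) = 5.0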