Transitivity-Preserving Skylines for Partially Ordered Domains
Henning Köhler, Kai Zheng, Jing Yang, and Xiaofang Zhou
The University of Queensland, Brisbane, Australia
{henning,kevin,jingyang,zxf}@itee.uq.edu.au
Abstract. The skyline of a set P of multi-dimensional points (tuples) consists of those points in P for which no clearly better point in P exists, using component-wise comparison on domains of interest. The guiding idea is to prune large data sets to a more manageable size, while ensuring that points of interest are preserved. However, when domains are only partially ordered, it easily happens that the skyline is nearly as large as the original set (or at least of the same order of magnitude), since most of the time points are incomparable in at least some dimension. To obtain a smaller, more useful skyline set which better reflects actual user preferences, we propose a richer notion of dominance, based on two assumptions: that preference specifications are often incomplete, and that actual preferences are transitive.
1 Introduction
There are many applications where a user is interested in viewing the ‘best’ objects chosen from a large collection, based on multiple criteria, e.g. price and mileage for used cars. There are multiple ways to approach this problem. Top-k queries [5] require a user to define a ranking function over the object collection, and return the k top-ranked objects. In contrast, skylines [4] only require users to express their preferences for each domain, e.g. price and mileage should both be low. They then return all objects for which no object exists that is “clearly better”: at least as good on every domain and strictly better in at least one. The resulting skyline can provide a user with a better understanding of the trade-offs involved, without requiring any precise ranking functions to be specified.
However, skyline sets can grow very large, for different reasons:
– large and/or anti-correlated data sets
– too many domains
– preferences are only partial
As a result the user is overwhelmed with information, which is particularly aggravating if only a small portion of the skyline is of actual interest to the user. In the case of large and/or anti-correlated data sets, we can expect that many skyline points are really interesting, but often very similar. Here sampling and clustering approaches [11] can provide an overview. In [10] points are chosen to maximize the number of points dominated.
H. Kitagawa et al. (Eds.): DASFAA 2010, Part II, LNCS 5982, pp. 109–115, 2010. © Springer-Verlag Berlin Heidelberg 2010
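The “clearly better” test from the skyline definition above can be sketched in a few lines of Python. This is only an illustration for totally ordered numeric domains where lower is better; the function names `dominates` and `skyline` and the sample `cars` data (price, mileage) are assumptions made for this example and do not come from the paper. The quadratic scan is chosen for readability, not efficiency.

```python
def dominates(a, b):
    """True if a is at least as low as b on every domain and strictly lower on one."""
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better


def skyline(points):
    """Keep every point for which no dominating point exists (naive O(n^2) scan)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical used cars as (price, mileage); both should be low.
cars = [
    (5000, 120000),
    (7000, 60000),
    (7500, 90000),   # dominated by (7000, 60000)
    (9000, 30000),
    (5000, 150000),  # dominated by (5000, 120000)
]

if __name__ == "__main__":
    print(skyline(cars))
```

The three surviving cars are pairwise incomparable: each trades a lower price against a higher mileage, which is exactly the kind of trade-off a skyline is meant to expose.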
When skylines grow large because many domains are considered or because preferences are partial, the expectation is that only a small portion of the skyline points will really be interesting. Thus the usefulness of any filtering method must be judged by how well it manages to eliminate ‘bad’ points without eliminating ‘good’ points as well. Here the classic notion of Pareto dominance often fares poorly, since it is too restrictive.
A number of approaches have been suggested for dealing with large numbers of domains. k-dominance [6] only requires that a point is better than or equal to another in at least k dimensions. The ε-skyline [12] allows a dominating point to be worse in some dimensions as well, though only by a small pre-defined ε value. Approximately dominating representatives [9] allow dominance by being no worse than a factor 1 + ε in any dimension, and the objective is to find small sets which approximately dominate all points. In [8] user-defined preference rankings between domains allow dominance by being better on preferred domains. User-defined preference comparisons between instances on subspaces are considered in [1]. The strong skyline [14] combines skyline points from subspaces where the skyline is small. Skyline frequency [7] ranks skyline points by how often they appear as skyline points in subspaces, while top-k-skyline [13] ranks them by the number of points they dominate in the given data set.
For partially ordered domains, Balke et al. [3] proposed a new notion of dominance, called weak Pareto dominance, which treats incomparability on a domain as equality. While this idea is easy to understand, and results in a much smaller skyline, it suffers from cyclicity and intransitivity (as do k-dominance and ε-skyline). As already pointed out in [2], such properties are undesirable. They contradict the intuition that preferences are intrinsically transitive, and can lead to unexpected behavior.
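The cyclicity of weak Pareto dominance can be made concrete with a small Python sketch. It follows one reading of the description above (incomparable values count as equal), not the exact formulation in [3]. The three-valued domain, the relation `STRICTLY_BETTER`, the points `A`, `B`, `C` and all function names are hypothetical and chosen only for illustration.

```python
# Hypothetical partial order used on every domain: "p" is preferred to "q",
# while "r" is incomparable to both. (A single pair is trivially transitive.)
STRICTLY_BETTER = {("p", "q")}


def better(x, y):
    """True if value x is strictly preferred to value y."""
    return (x, y) in STRICTLY_BETTER


def pareto_dominates(a, b):
    """Classic Pareto dominance: equal or better everywhere, better somewhere."""
    no_worse = all(x == y or better(x, y) for x, y in zip(a, b))
    return no_worse and any(better(x, y) for x, y in zip(a, b))


def weak_pareto_dominates(a, b):
    """Incomparable values count as equal: never worse, better somewhere."""
    never_worse = not any(better(y, x) for x, y in zip(a, b))
    return never_worse and any(better(x, y) for x, y in zip(a, b))


def skyline(points, dominates):
    """Points that no other point dominates under the given relation."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


A = ("p", "r", "q")
B = ("q", "p", "r")
C = ("r", "q", "p")
points = [A, B, C]

if __name__ == "__main__":
    print(skyline(points, pareto_dominates))       # all three points remain
    print(skyline(points, weak_pareto_dominates))  # no point remains
```

In this example A weakly dominates B, B weakly dominates C, and C weakly dominates A, so every point is eliminated and the weak skyline is empty, whereas classic Pareto dominance keeps all three points because each pair is incomparable on some domain.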
To avoid this, we will propose a new notion of dominance for partially ordered domains, based on the assumptions that
– preference specifications may be incomplete, but
– actual (hidden) preferences are transitive
Theoretical arguments as well as experimental evaluation (which had to be omitted due to space constraints) suggest that this leads to more useful skyline sets, i.e., that our new dominance notion reflects user preferences more closely than classic or weak Pareto dominance. As a nice side-effect, transitivity of dominance allows us to employ efficient algorithms for computing it.
2 Dominance
At the heart of the skyline problem lies the notion of dominance, indicating that a tuple A = (a₁, …, aₙ) is strictly better than a tuple B = (b₁, …, bₙ). Here the traditional definition, called Pareto dominance, is that aᵢ ≤ᵢ bᵢ for all i and aⱼ