Vague Spatial Objects Using Shapelet Distributions

Jim Bosch, Daniel Zinn

{jfbosch,dzinn}@ucdavis.edu
Dept. of Computer Science, University of California at Davis, One Shields Ave, Davis, CA 95616, USA

Abstract. Research into spatial database systems in the past few years has been primarily focused on traditional, vector-based, precisely defined regions. More recent work, however, has begun to address the need for data types that can model certain spatial phenomena, for example “hills and valleys”, that cannot be modeled using objects having sharp boundaries. Other work has focused on point objects whose location is not precisely known, and is described using a probability distribution. We show that many previously proposed concepts of those fuzzy or probabilistic objects are equivalent to the usage of distribution functions and the definition of a few appropriate operators upon them. Adapting an image compression method based on orthogonal function expansions (“Shapelets”), we describe a framework that efficiently implements these functions and operators, providing a representation of these distributions and operators appropriate for smoothly-varying vague and probabilistic data. We also implement these concepts as a new data type in the open source database management system PostgreSQL.

1 Introduction

Spatial database systems are used to store, maintain, and process data that is associated with a location and/or physical extent. Precisely-located “vector” primitives like points and lines are used to define and store exact locations and extensions of spatial objects. This model fits many man-made artifacts, such as buildings, streets or legislative country borders. However, for certain spatial objects, it is hard or even meaningless to define exact locations and boundaries. On a map showing San Francisco, it is, for example, hard to define sharp regions denoting urban and suburban areas. As noted by Erwig and Schneider [12], natural, social, or cultural phenomena such as soil quality, vegetation, oceans, biotopes, or English speaking areas are hard to capture using crisp regions with sharp boundaries. Schneider [15] also notes that there are many spatial phenomena that exhibit a very smooth fuzziness. Some examples are air pollution, temperature zones, magnetic fields, and storm intensity.

Fig. 1. Finite representations for the objects shown in the first column (from top): a bivariate normal probability distribution, a digital elevation map of Mount Shasta, CA, and an arbitrary, hand-drawn figure. Each representation is limited to the same amount of memory (36 shapelet coefficients, 18 polygon vertices (two coordinates per vertex), 36 pixels).

Fig. 2. Shapelet approximation with a varying number of coefficients: original, 1, 6, 15, 36, 55, 120

In this paper, we present both a new framework unifying the implementation requirements of various types of vague/fuzzy data, and a new data type for efficient storage and manipulation of smooth fuzzy regions in this framework. Rather than representing fuzzy regions as combinations of traditional regions associated discretely with different levels of fuzziness [15], we adapt a technique developed for astronomical image analysis that handles smooth data in a more natural manner. This method is optimized for localized smooth spatial data (data which is more or less concentrated near a central point), but it is capable of representing any arbitrary distribution. Localized data is an extremely common form of smooth spatial data, and includes most probabilistic applications, remote sensor data, wi-fi/radio antenna coverage, many storm patterns, and most astronomical objects. Beyond describing the data type on an abstract level, we present a prototype low-level implementation for the open source database management system PostgreSQL [5].

This paper is organized as follows. In section 2, we discuss previous work on vague spatial data. In section 3, we present distributions, a framework for unifying the implementation requirements for all existing types of vague spatial data, and in section 4, we discuss existing implementation types and motivate the need for a new type of implementation based on a series representation of functions. In section 5, we review the mathematics of series expansions with a focus on the requirements for distributions. In section 6, we present Shapelets, an image analysis technique developed in the astronomical literature, and demonstrate how it can be adapted to fit the requirements of distributions. Then, in section 7 we describe our implementation. In section 8 we present several applications of Shapelet-based distributions, followed by a discussion of possible future research in section 9, and a short conclusion in section 10.

2 Related Work

As noted above, most work on vague and fuzzy regions has focused primarily on the definition of a complete “fuzzy region” algebra using fuzzy set theory (Schneider in [12]). Implementation details have generally been more implied than described, with most methods assuming a representation of fuzzy regions using sets of traditional (“crisp”) regions, which is a straightforward, if somewhat complicated, task. This representation, however, produces very poor approximations to smoothly varying spatial data, which perhaps comprises most of the applications of vague or fuzzy regions.

Another area of research focuses on objects whose location is not precisely known, and can only be described by a probability density function (PDF). This is a particularly important concept in moving object databases, where some uncertainty is always associated with the location of an object. In probabilistic applications, the PDF is generally assumed to be some standard analytic function (perhaps a Gaussian or Cauchy distribution). Many necessary operations on these functions, in particular definite integrals, are nevertheless assumed to be extremely expensive to calculate, requiring purely numerical methods as pointed out by Cheng and Prabhakar [11]. Recent work in this area has focused on query types and appropriate index structures that can reduce the number of integrals that must be calculated to execute certain types of queries. Adaptations of these concepts, and in particular the U-Tree, presented by Tao et al. [16], may provide excellent methods of indexing and handling queries on smooth spatial objects in general, but little work has focused on the actual implementation of arbitrary PDFs or on optimizing the calculation of the integrals themselves.

Purely field-based models can be used to model certain land-based phenomena like temperature distribution over a specific area. Here, parts of the field are not associated with distinct objects. Secondary attributes (such as names for mountains, or the channel numbers in a radio antenna coverage map) cannot easily be included in the field-based view, as field-based data has no concept of distinct objects. Associating different (and potentially overlapping) regions of a field with distinct objects produces, then, a more general form of vague/fuzzy spatial data, in which the interpretation of field values is not limited to membership in a region or statistical probability.

3 Vague Regions As Distributions

All three of the above approaches to vague spatial data share an important feature: all represent spatial objects in N-d space with distributions, here defined simply as functions of N variables. For the fuzzy regions defined using fuzzy set algebra, this distribution is the membership function, which is restricted to values between 0 and 1, with 1 denoting complete membership in a fuzzy set, 0 representing non-membership, and values in-between representing fuzzy membership. In a probabilistic interpretation, the distribution is simply the PDF that represents the differential probability of finding the object at a certain point in space, and the distribution is normalized such that its integral over all space is one. In object-associated field data, there are no restrictions on the value of the distribution, and the value of the function at a point simply represents the value of the field attribute at that point.

Manipulation of vague spatial objects in all three cases thus reduces in practice to operations on distributions, and we will show here that a small set of operations on these distributions will allow us to meet the requirements of virtually all vague spatial data. In particular, any method of representing distributions that provides these operations thus provides a way of realizing any of the above types of vague spatial data, and we can address the suitability of various data structures and concepts without the encumbrance of potential interpretations, such as fuzzy set theory, probability, or any number of types of object-associated field data.

For fuzzy regions traditionally defined with fuzzy set theory, the most important operations are topological: extensions to the traditional region-algebra operations intersection and union. If we take F and G to be two fuzzy sets represented by the membership functions (distributions) $f(x, y)$ and $g(x, y)$, these operations are traditionally defined as:

$$F \cap G \iff \min(f(x, y),\, g(x, y)) \tag{1}$$

$$F \cup G \iff \max(f(x, y),\, g(x, y)) \tag{2}$$

A complete discussion of the mapping between fuzzy set operations and arithmetic operations on distributions as membership functions is beyond the scope of this paper; in particular, this requires the development of distributions of lower dimensionality to meet the requirements of “fuzzy lines” and “fuzzy points”. A complete mapping clearly exists, however, as membership functions are present in the very definition of fuzzy sets. For our purposes, it is sufficient to note that meeting the most important requirements of set-based fuzzy regions requires the minimum and maximum operators above.

In probabilistic applications, the most important operations by far are variations on the basic probabilistic query: what is the probability of finding an object in a given region? With distributions modeling the PDF, this simply amounts to a definite integral of the distribution over an arbitrary region of space. This provides, then, a second requirement for a useful representation of distributions: evaluation of definite integrals.

We can also take the integral requirement one step further: many operations on spatial regions are metric operations concerned with the spatial extent of individual objects. For distributions this generalizes in a very straightforward way to integral moments, with all integrals taken over the whole plane.

x center ⇐⇒ mean:

$$\langle x \rangle = \frac{\iint x\, f(x, y)\, dx\, dy}{\iint f(x, y)\, dx\, dy} \tag{3}$$

x extent (width) ⇐⇒ square root of the variance:

$$\sigma = \sqrt{\langle x^2 \rangle - \langle x \rangle^2} = \sqrt{\frac{\iint x^2 f(x, y)\, dx\, dy}{\iint f(x, y)\, dx\, dy} - \left( \frac{\iint x\, f(x, y)\, dx\, dy}{\iint f(x, y)\, dx\, dy} \right)^{2}} \tag{4}$$
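As a rough illustration of (3) and (4), the sketch below approximates both moments for a sampled distribution with a midpoint sum. The function names, the integration window and the test Gaussian are illustrative assumptions, not part of the framework described here.

```python
import math

def moments_x(f, lo, hi, n=400):
    """Return (mean, sigma) of x for a 2-D distribution f(x, y).

    The integrals in (3) and (4) are approximated by a midpoint sum over
    the square [lo, hi] x [lo, hi]; this assumes f is smooth and negligible
    outside that window.
    """
    h = (hi - lo) / n
    m0 = m1 = m2 = 0.0
    for i in range(n):
        x = lo + (i + 0.5) * h
        for j in range(n):
            y = lo + (j + 0.5) * h
            w = f(x, y) * h * h      # "volume" of one cell
            m0 += w
            m1 += x * w
            m2 += x * x * w
    mean = m1 / m0
    return mean, math.sqrt(m2 / m0 - mean * mean)

def gaussian(x, y, cx=1.0, cy=-0.5, sx=0.8, sy=1.5):
    """Unnormalized bivariate Gaussian (hypothetical test data)."""
    return math.exp(-0.5 * (((x - cx) / sx) ** 2 + ((y - cy) / sy) ** 2))

mean_x, sigma_x = moments_x(gaussian, -10.0, 10.0)
print(mean_x, sigma_x)   # close to the chosen center 1.0 and width 0.8
```

Because both moments are ratios of integrals, the distribution does not need to be normalized for this to work.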

A useful representation of distributions thus requires that simple integral moments be straightforward to calculate.

Probabilistic operations also make use of operations similar to intersection and union, but here these are defined simply as the product of two distributions (representing the probability of both distributions simultaneously) and the sum of two distributions (representing the probability of either distribution). We also thus require simple arithmetic operations on distributions: sums and products.

It turns out that arithmetic operations may also be sufficient for fuzzy set theory applications. Returning to distributions as fuzzy membership functions, we may also define intersection and union in the following way:

$$F \cap G \iff f(x, y)\, g(x, y) \tag{5}$$

$$F \cup G \iff f(x, y) + g(x, y) - f(x, y)\, g(x, y) \tag{6}$$

These definitions are not exactly equivalent to (1) and (2). However, they do meet the most important requirements of intersection and union operations for fuzzy sets: they reproduce the same three-valued logic in the mapping f = 0 → false, f = 1 → true, 0 < f < 1 → maybe, and, because membership values lie between 0 and 1, they satisfy the conditions:

$$f(x, y)\, g(x, y) \le f(x, y) \tag{7}$$

$$f(x, y)\, g(x, y) \le g(x, y) \tag{8}$$

$$f(x, y) + g(x, y) - f(x, y)\, g(x, y) \ge f(x, y) \tag{9}$$

$$f(x, y) + g(x, y) - f(x, y)\, g(x, y) \ge g(x, y) \tag{10}$$
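The conditions (7) to (10) and the three-valued logic can be checked numerically on a grid of membership values. The following is a minimal sketch; the function names and dictionary keys are hypothetical.

```python
def fuzzy_ops(f, g):
    """Return both definitions of intersection and union for two
    membership values f and g in [0, 1]."""
    return {
        "and_min": min(f, g),        # definition (1)
        "or_max": max(f, g),         # definition (2)
        "and_prod": f * g,           # definition (5)
        "or_prob": f + g - f * g,    # definition (6)
    }

def check_conditions(steps=20):
    """Check (7)-(10) for all pairs on a regular grid in [0, 1] x [0, 1]."""
    eps = 1e-12   # tolerance for floating point rounding
    for i in range(steps + 1):
        for j in range(steps + 1):
            f, g = i / steps, j / steps
            r = fuzzy_ops(f, g)
            if r["and_prod"] > min(f, g) + eps:   # violates (7) or (8)
                return False
            if r["or_prob"] < max(f, g) - eps:    # violates (9) or (10)
                return False
    return True

conditions_hold = check_conditions()
crisp = fuzzy_ops(1.0, 0.0)   # "true" combined with "false"
maybe = fuzzy_ops(0.5, 0.5)   # two "maybe" values
print(conditions_hold, crisp, maybe)
```

For crisp inputs the two definitions agree exactly; for two “maybe” values of 0.5 the arithmetic intersection gives 0.25 instead of 0.5 and the arithmetic union gives 0.75 instead of 0.5.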

The only difference is that (5) gives smaller “maybe” values for the intersection of two “maybe” regions relative to (1), and (6) gives higher “maybe” values for the union of two “maybe” regions relative to (2) (these differences are shown in figure 3).

Fig. 3. Comparison of min-max definitions of intersection and union with arithmetic definitions of the same. Curves shown on a vertical axis from 0 to 1: f(x), g(x), intersection, union, [min(f,g)] - [f·g], and [f+g-f·g] - [max(f,g)].

As these arithmetic intersection and union operations are nearly equivalent for the purposes of fuzzy-set regions and arguably more appropriate for other types of data, minimum and maximum operations are not strictly necessary for a useful representation of distributions. As it turns out, this is an important adjustment, because these operations are not easy to implement for smoothly varying data, and in fact inherently disturb the “smoothness” of such data.

The final set of necessary operations for a distribution representation are the simple geometric operators necessary for performing the coordinate transformations and projections necessary in almost every representation of spatial data: rotation, translation, shear, et cetera.

To summarize, implementing any of the three described models of vague spatial data in N-dimensional space can be reduced to implementing a concept of N-dimensional distributions that efficiently performs the following computations:

– Basic arithmetic operations
– Integrals and integral moments
– Standard geometric transforms
– Minimum and maximum operations (optional)

The extension of distribution algebra to higher dimensions is implicit in its definition; 3-dimensional data may be modeled using distributions that are functions of 3 variables, and so on. For the remainder of this paper, however, we will restrict ourselves to the case of two-dimensional distributions, as this will simplify some concepts and allow us to use the only slightly misleading terms “height” and “volume” to refer to the value of a distribution and its integral, respectively.

4 Representations of Distributions

In this section we present and compare several methods for representing distributions, many of which are already well-understood. We also motivate the need for a new representation optimized for smooth data. For illustration purposes, we have applied three of these methods, namely raster images, sets of determinate regions, and Shapelets, to the same raw data to compare their approximation “quality”. The results achieved using the same amount of memory (36 pixels, 18 vertices, 36 coefficients, respectively) are shown in figure 1. It can be seen that the Shapelet method works well for smooth data like the Gaussian distribution (first), an elevation map of Mount Shasta, California (second), and the hand-drawn third example.

4.1 Images

Raster images are perhaps the most straightforward method of representing distributions. All arithmetic operations can be done pixel-by-pixel, integrations are simply sums over pixels, and methods for geometric transforms are very well understood. The disadvantages of image-based representations are equally obvious, however: images are often more expensive than other methods in both computation and storage, especially when interpolation schemes must be employed to perform operations on multiple images whose pixels may not line up precisely. Low resolution images can be used to decrease these costs, but force greater degrees of approximation in almost all types of distributions.
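A minimal sketch of the pixel-by-pixel operations described above, assuming two rasters that already share the same grid (so no interpolation is needed). The nested-list layout and function names are illustrative choices only.

```python
def add_images(a, b):
    """Pixel-by-pixel sum of two rasters defined on the same grid."""
    return [[pa + pb for pa, pb in zip(row_a, row_b)]
            for row_a, row_b in zip(a, b)]

def multiply_images(a, b):
    """Pixel-by-pixel product of two rasters defined on the same grid."""
    return [[pa * pb for pa, pb in zip(row_a, row_b)]
            for row_a, row_b in zip(a, b)]

def integrate(image, pixel_area=1.0):
    """Integral of the distribution: the sum of pixel values times pixel area."""
    return pixel_area * sum(sum(row) for row in image)

# Two tiny 2 x 2 membership rasters (hypothetical data).
a = [[0.0, 0.5],
     [1.0, 0.5]]
b = [[0.5, 0.5],
     [0.0, 1.0]]

total = add_images(a, b)
product = multiply_images(a, b)
volume = integrate(a, pixel_area=0.25)
print(total, product, volume)
```

The cost of every operation grows with the number of pixels, which is the storage and computation trade-off noted above.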

4.2 Sets of Determinate Regions

Distributions that do not vary smoothly can be well represented by sets of the standard determinate (“crisp”) regions mentioned above, with different regions being associated with different values for the distribution. Arithmetic operations reduce to combinations of set algebra and standard arithmetic on real numbers, integrals reduce to set algebra and summation, and geometric operations are already well-established as operations on the points that compose regions. Sets of determinate regions do not model smooth or continuous data well, however (which is precisely what vague data most often is), and the computational and storage requirements only increase as more regions are required to describe such data more accurately.

4.3 Standard Analytic Functions

It is particularly common in probabilistic work to use well-known analytic functions to represent distributions, such as a Gaussian (normal), Cauchy, or Binomial function. These represent smooth data extremely well, and integrals and moments are usually well understood even when they result in non-elementary functions such as the standard error function. Two-dimensional extensions, however, are much more complex; even the bivariate Gaussian distribution cannot be analytically integrated over an arbitrary region. Arithmetic and geometric operations are also extremely complex, and require separate implementations for different analytic functions. In fact, such operations are not closed for a single analytic form: the sum of two Gaussians with different centers or widths is not a Gaussian, and their product, although Gaussian in shape, is no longer a normalized distribution.

4.4 Generalized Analytic Functions

To best represent smooth data, then, we need a fully-arbitrary set of analytic functions. This can be done through series representations, in which a function is expanded onto a set of known basis functions. The most familiar series representation is the Maclaurin series, in which a function $f(x)$ can be represented as a set of constants $a_n$ in the expansion:

$$f(x) = \sum_{n=0}^{\infty} a_n \frac{x^n}{n!} \tag{11}$$
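To make the idea of storing a function as a list of constants concrete, the sketch below evaluates a truncated version of (11). The helper name is hypothetical, and the exponential function is chosen only because all of its coefficients are known to equal 1.

```python
import math

def maclaurin_eval(coeffs, x):
    """Evaluate sum_n a_n * x**n / n! for a finite list of coefficients a_n."""
    total = 0.0
    term = 1.0                      # holds x**n / n!, starting at n = 0
    for n, a in enumerate(coeffs):
        if n > 0:
            term *= x / n           # update x**(n-1)/(n-1)! to x**n/n!
        total += a * term
    return total

# Every derivative of exp at 0 equals 1, so all a_n are 1.
exp_coeffs = [1.0] * 20
approx = maclaurin_eval(exp_coeffs, 1.5)
print(approx, math.exp(1.5))
```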

This sort of expansion can be generalized to higher dimensions, but expansions even more useful than the Maclaurin series can be used. The Fourier Series, for example, has several advantages that would make it an excellent choice in modeling distributions. We will concentrate, however, on Shapelets, a set of basis functions in use in astronomical image analysis that are optimized to deal with distributions that are localized and “almost” Gaussian, but can deal (in theory) with truly arbitrary distributions (the complexity of operations increases with the deviation from Gaussian). This will provide great advantages for probabilistic work, where the Gaussian curve is of utmost importance (and of course astronomy, where galaxies are wonderful examples of vague spatial objects), but they may be used in any application in which distributions are concentrated near a single point.

5 Orthogonal Function Expansions

Before introducing Shapelets, it is useful to consider first the properties of series representations of functions from a more general perspective. Many textbooks cover what follows in greater mathematical rigor (in particular, see Arfken and Weber, chapter 9 [7]), but we will review here the most important elements of series representation and address their importance for the representation of distributions.

The branch of mathematics dealing with the representation of general functions as series expansions can be best described using the language of linear algebra. In this framework, functions take on the role of vectors, and linear operators (such as the differential operator) assume the role of matrices. Addition and subtraction are easily defined for functions, and a scalar product can be defined as the integral over the product of one function and the complex conjugate of the other. If both functions are real, the scalar product will be real and the operation is commutative.

$$f \cdot g \iff \int f(x)^{*}\, g(x)\, dx \tag{12}$$

The integral is over some interval where the scalar product is defined; for some valid functions this is all space, but for others (sin and cos, for example) a smaller interval (of length 2π) is more appropriate. Just as with vectors, a set $\phi_n(x)$ of functions is said to be complete on an interval (the set spans the interval) if any function $A(x)$ can be represented on that interval as a linear combination of the $\phi_n(x)$ as below:

$$A(x) = \sum_{n=0}^{\infty} a_n \phi_n(x) \tag{13}$$

(where $a_n$ are constants). Note that this is identical to the Maclaurin expansion with $\phi_n(x) = x^n / n!$. This completeness is all that is necessary for a function expansion to exist; but it is also useful if that expansion is unique (among other things). For this, we require the set of functions to be orthogonal:

$$\phi_i \cdot \phi_j = \int \phi_i(x)^{*}\, \phi_j(x)\, dx = 0 \quad \text{for } i \neq j \tag{14}$$

It is also useful to require normality:

$$\phi_i \cdot \phi_i = \int \phi_i(x)^{*}\, \phi_i(x)\, dx = 1 \tag{15}$$

What we require, then, is a set of functions that is complete and orthonormal over some interval. We can then represent any function on this interval by projecting onto the basis vectors $\phi_n(x)$ in an expansion of the form of (13). The projection operation necessary to determine the constants $a_n$ for some general function $A(x)$ is quite simple if the basis set $\phi_n(x)$ is orthonormal:

$$a_n = \int A(x)\, \phi_n(x)\, dx \tag{16}$$

One can compose higher-dimensional basis functions simply as the tensor product of one-dimensional basis functions:

$$\phi_{n,m}(x, y) = \phi_n(x)\, \phi_m(y) \tag{17}$$

If the $\phi_n(x)$ are orthonormal, the expected relations hold:

$$A(x, y) = \sum_{n=0}^{\infty} \sum_{m=0}^{\infty} a_{n,m}\, \phi_n(x)\, \phi_m(y) \tag{18}$$

$$a_{n,m} = \iint A(x, y)\, \phi_n(x)\, \phi_m(y)\, dx\, dy \tag{19}$$
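The projection (16) and the truncated expansion (13) can be tried out numerically. The sketch below uses an orthonormal sine basis on [0, π] purely as a convenient stand-in (it is not the Shapelet basis), with a simple midpoint rule for the integrals; all names are illustrative.

```python
import math

def phi(n, x):
    """Orthonormal sine basis on [0, pi], indexed n = 1, 2, ... (an assumed example basis)."""
    return math.sqrt(2.0 / math.pi) * math.sin(n * x)

def integrate(func, a, b, steps=2000):
    """Midpoint-rule approximation of the integral of func over [a, b]."""
    h = (b - a) / steps
    return h * sum(func(a + (k + 0.5) * h) for k in range(steps))

def project(func, n_terms):
    """Coefficients a_n = integral of A(x) * phi_n(x) dx, as in (16)."""
    return [integrate(lambda x, n=n: func(x) * phi(n, x), 0.0, math.pi)
            for n in range(1, n_terms + 1)]

def reconstruct(coeffs, x):
    """Truncated form of the expansion (13)."""
    return sum(a * phi(n, x) for n, a in enumerate(coeffs, start=1))

def A(x):
    """Smooth test function on [0, pi]."""
    return x * (math.pi - x)

coeffs = project(A, 9)
value = reconstruct(coeffs, math.pi / 2)

# Orthonormality checks corresponding to (14) and (15).
gram_12 = integrate(lambda x: phi(1, x) * phi(2, x), 0.0, math.pi)
gram_11 = integrate(lambda x: phi(1, x) * phi(1, x), 0.0, math.pi)
print(coeffs[:3], value, A(math.pi / 2))
```

With only nine terms the reconstruction at the midpoint already agrees with the test function to about three decimal places, which illustrates why a well-matched basis needs few coefficients for smooth data.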

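Using an orthonormal basis, many of the necessary operations for distributions become remarkably simple. If $A(x)$, $B(x)$, and $C(x)$ are distributions expanded onto the basis set $\phi_n(x)$ with coefficients $a_n$, $b_n$, and $c_n$, respectively, then arithmetic operations on functions reduce to arithmetic operations on the coefficients. Sums are computed term by term:

$$C(x) = A(x) + B(x) \iff c_n = a_n + b_n \tag{20}$$

Products are also computed from the coefficients alone, but the product of two basis functions is in general not itself a basis function. Each output coefficient therefore combines pairs of input coefficients through fixed overlap integrals of the basis, obtained by projecting the product as in (16):

$$C(x) = A(x)\, B(x) \iff c_n = \sum_{i,j} a_i\, b_j \int \phi_n(x)\, \phi_i(x)\, \phi_j(x)\, dx \tag{21}$$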

Integrals and integral moments are also fairly tractable, as long as integrals and integral moments of the basis functions are:

$$\int A(x)\, dx = \sum_{n=0}^{\infty} a_n \int \phi_n(x)\, dx \tag{22}$$

$$\int x^p A(x)\, dx = \sum_{n=0}^{\infty} a_n \int x^p \phi_n(x)\, dx \tag{23}$$
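Relations (22) and (23) say that once the integrals of the basis functions are known, integrals of any expanded function become weighted sums of its coefficients. The sketch below illustrates this with an orthonormal sine basis on [0, π] (an assumed stand-in basis, not Shapelets) and the test function A(x) = x(π − x), whose coefficients have a simple closed form.

```python
import math

C = math.sqrt(2.0 / math.pi)   # normalization of phi_n(x) = C * sin(n x) on [0, pi]

def basis_integral(n):
    """Integral of phi_n(x) over [0, pi]."""
    return C * (1.0 - math.cos(n * math.pi)) / n

def basis_first_moment(n):
    """Integral of x * phi_n(x) over [0, pi]."""
    return -C * math.pi * math.cos(n * math.pi) / n

def coefficient(n):
    """Closed-form a_n for A(x) = x * (pi - x): 4C / n**3 for odd n, 0 for even n."""
    return 4.0 * C / n ** 3 if n % 2 else 0.0

N = 200   # number of stored terms (arbitrary truncation)
coeffs = [coefficient(n) for n in range(1, N + 1)]

# (22): integral of A as a weighted sum of coefficients.
integral = sum(a * basis_integral(n) for n, a in enumerate(coeffs, start=1))
# (23) with p = 1: first moment of A.
moment = sum(a * basis_first_moment(n) for n, a in enumerate(coeffs, start=1))

print(integral, math.pi ** 3 / 6.0)    # exact integral of x (pi - x) is pi^3 / 6
print(moment, math.pi ** 4 / 12.0)     # exact first moment is pi^4 / 12
```

No quadrature over A itself is needed here; the cost is linear in the number of stored coefficients.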

The simplicity of geometric transformations is not as automatic, unfortunately, and depends on the explicit choice of basis functions. With the appropriate choice, however, most geometric transformations can be represented as matrix multiplications on the coefficient vector $a_n$. With the appropriate choice of basis (one that is not only complete and orthonormal, but also has tractable integrals and geometric transformations) we can then perform all of the necessary operations for a useful distribution representation directly on the coefficients, just as all operations were calculated pixel-by-pixel for images. A choice of basis functions that bears some similarity to the data being modeled can thus have huge advantages over an image-based representation, as only a few terms may be necessary to provide an approximation to a smooth distribution that is equivalent to an image with a very large number of pixels.

5.1 Finite Series and Parameterized Basis Functions

While most series representations converge exactly with an infinite number of terms, in practice only a finite number may be used. As noted above, an “appropriate” series representation is thus one that represents the data well with a small number of terms. In general, series representations tend to converge well to a function at a given point, and diverge more at large distances from that point. As a result, when selecting an appropriate basis for a distribution, it is best to use a set of basis functions that are optimized at the center of the distribution in space. Since different objects have different centers, this would require using different actual basis functions for different distributions; the basis functions would have the same general form, but would be parameterized by the center of the distribution. In some cases (including the Shapelet basis described below), an additional parameter is included in order to match the extent of the basis functions to the extent of the distribution as well. These parameters (such as a point describing the center of the expansion) must be stored along with the coefficients in the representation of a distribution. These parameterizations are practically necessary for any set of distributions that cover an area large compared to the size of individual distributions; otherwise the number of terms necessary to describe various distributions would grow too large, producing (at least) the same sort of expense in calculations that hinders image representations. Such parameterizations, however, invalidate all of the term-by-term arithmetic necessary for the definition of our most useful operators. Unless two distributions use the exact same parameterization, any term-by-term operation is meaningless. There are two solutions to this problem, depending on how different the parameterizations of two basis functions are. 
For sets of basis functions with only slight differences in parameterizations, the solution can be found in the geometric operators, which now take on even greater importance: the two objects are simply translated and scaled until they are in the same parameterization. To add two distributions centered at different (but nearby) points, for instance, we move the expansion center of each distribution halfway toward the other object and compensate by translating its coefficients by the opposite vector, so that the distribution itself is unchanged. These new coefficients then represent the expansions about the point located midway between the two original points, and as the parameterizations are now the same, they may be dealt with term-by-term. Scaling operations can be used in a similar manner to deal with differences in scale parameterization. This highlights an important feature of geometric transforms on parameterized bases: they exist both as changes in parameterization (often a trivial operation, changing, for instance, the point associated with the center of a distribution) and as changes in the coefficients themselves without changing the parameterization point (highly non-trivial in some cases). While it is the former that will more often be used to, for instance, translate an object, the latter is also necessary to perform term-by-term operations.

Another solution exists for sets of basis functions with parameterizations that are not similar: we can simply define “compound” representations composed of several sets of coefficients and bases, combined by simple arithmetic operations. General operations on these compound objects then reduce to first performing the operation on the members of the compound object, and then combining them in the appropriate manner. This can quickly lead to very complicated compound objects, but the overall framework is much more appropriate than the choice of either large numbers of terms or large degrees of approximation that would be necessary in the previous solution if parameterizations differ significantly.

Finally, there is one important class of operation for which a simple shortcut around both of these solutions exists. In many applications, the intersection or union of two regions is not particularly meaningful, but the degree of intersection (the “overlap”) is. For distributions in general, this amounts to simply integrating over the output of the intersection operation (assumed to be a simple product as discussed earlier), but as noted above, this can be an expensive operation when parameterizations differ. Computing the integral of the intersection directly can be much simpler. The convolution $h(x)$ of two functions $f(x)$ and $g(x)$ is defined as:

$$h(x) = (f * g)(x) = \int_{-\infty}^{\infty} f(x')\, g(x - x')\, dx' \tag{24}$$

This produces a new function of x, which is essentially a smoothing filter. It is straightforward to extend the definition to higher dimensions, with the integral taken over all space:

$$h(\mathbf{r}) = (f * g)(\mathbf{r}) = \int f(\mathbf{r}')\, g(\mathbf{r} - \mathbf{r}')\, d^3 r' \tag{25}$$

Computing the overlap of two distributions (when intersection is defined as a product) thus amounts to evaluating the convolution at the vector that connects the center of one object to the center of the other. It turns out that many classes of series representations (including Fourier Series and Shapelets) have a simple analytic form for the convolution even when the two objects are in differently parameterized bases, allowing overlap to be computed much more simply than performing a full intersection and integration operation.
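The overlap shortcut can be illustrated in one dimension with two Gaussian PDFs, for which the convolution is known in closed form: convolving two centered Gaussians of widths s1 and s2 gives a centered Gaussian of width sqrt(s1² + s2²). The sketch below compares that shortcut with a direct numerical integral of the product; the function names and parameter values are assumptions for illustration, and because Gaussians are symmetric no reflection of either function is needed.

```python
import math

def normal_pdf(x, center, sigma):
    """Normalized 1-D Gaussian PDF."""
    z = (x - center) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def overlap_numeric(c1, s1, c2, s2, steps=20000):
    """Integrate the product of the two PDFs directly with a midpoint sum."""
    lo = min(c1 - 10.0 * s1, c2 - 10.0 * s2)
    hi = max(c1 + 10.0 * s1, c2 + 10.0 * s2)
    h = (hi - lo) / steps
    total = 0.0
    for k in range(steps):
        x = lo + (k + 0.5) * h
        total += normal_pdf(x, c1, s1) * normal_pdf(x, c2, s2)
    return h * total

def overlap_by_convolution(c1, s1, c2, s2):
    """Evaluate the convolution of the two centered Gaussians at the
    separation c2 - c1, using its closed form."""
    return normal_pdf(c2 - c1, 0.0, math.hypot(s1, s2))

numeric = overlap_numeric(0.0, 1.0, 1.5, 0.5)
analytic = overlap_by_convolution(0.0, 1.0, 1.5, 0.5)
print(numeric, analytic)
```

The closed-form route needs a single function evaluation, whereas the direct route needs an intersection followed by an integration.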

6 The Shapelet Basis

The Shapelet basis is actually two complementary sets of basis functions developed by Refregier and Massey [14, 13], and independently but partially by Jarvis and Bernstein [8], as a method for astronomical image analysis and compression. Both sets of basis functions describe perturbations to a circular Gaussian function, and differ only in the coordinate system used: one set uses a Cartesian (x,y) system, and one uses a polar system (r,θ). The sets are complementary because one can convert between the two series representations in a simple and lossless (up to floating point errors) manner, and many of the necessary operations discussed above can be performed in only one of the bases. In essence, the Cartesian form uses Hermite polynomials weighted by a Gaussian, and the polar form uses Laguerre polynomials. Both are parameterized with the same scale factor β and a center point (x, y). Being perturbations of the Gaussian function makes them well-suited to model localized, smooth objects. As mentioned above, using Gaussian-like functions is especially beneficial for representing probabilistic distribution functions, because of the popularity of Gaussian functions in probability theory.

Using Shapelets, we are able to efficiently calculate all three classes of important operations. Simple arithmetic operations can be done directly on the Shapelet coefficients since the Shapelet functions form an orthonormal basis (after appropriate geometric transforms as discussed in 5.1). Furthermore, this basis allows computing integrals and integral moments as sums over the coefficients. Doing standard geometric transformations reduces to matrix multiplications in most cases, and sometimes even simpler operations, though some are defined only for one of the two bases, so some conversion may be necessary as well. Convolution of two Shapelets also reduces to matrix multiplications over their coefficients.

In this section, we define the basis functions and provide the necessary background at a high level. More mathematical details and explicit formulae are given in [14] and [13], where most of the material reviewed below is developed.

6.1 Cartesian Basis Functions

The 1-dimensional Cartesian basis functions are defined as

$$\phi_n(x) = \left[ 2^n \pi^{1/2} n! \right]^{-1/2} H_n(x)\, e^{-x^2/2}, \tag{26}$$

where $H_n(x)$ denotes a Hermite polynomial of order n. Generalizing these to the 2-dimensional case, and adding a scale factor β gives:

$$\phi_{n_1,n_2}(x, y; \beta) = \frac{H_{n_1}(\beta^{-1} x)\, H_{n_2}(\beta^{-1} y)\, e^{-\frac{x^2 + y^2}{2\beta^2}}}{\beta \sqrt{2^{n} \pi\, n_1!\, n_2!}}, \qquad n = n_1 + n_2. \tag{27}$$
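A small sketch of how (26) and (27) might be evaluated, using the standard three-term recursion for the Hermite polynomials, together with a numerical check that the 1-dimensional functions are orthonormal. The function names and the quadrature settings are illustrative assumptions, not taken from the implementation described in this paper.

```python
import math

def hermite(n, x):
    """Physicists' Hermite polynomial H_n(x) via the three-term recursion
    H_k = 2 x H_{k-1} - 2 (k - 1) H_{k-2}."""
    if n == 0:
        return 1.0
    h_prev, h = 1.0, 2.0 * x
    for k in range(2, n + 1):
        h_prev, h = h, 2.0 * x * h - 2.0 * (k - 1) * h_prev
    return h

def phi(n, x):
    """Dimensionless 1-D basis function of (26)."""
    norm = math.sqrt(2.0 ** n * math.sqrt(math.pi) * math.factorial(n))
    return hermite(n, x) * math.exp(-0.5 * x * x) / norm

def shapelet(n1, n2, x, y, beta):
    """2-D Cartesian basis function of (27), centered on the origin."""
    return phi(n1, x / beta) * phi(n2, y / beta) / beta

def inner(i, j, half_width=12.0, steps=4000):
    """Midpoint-rule scalar product of phi_i and phi_j over [-half_width, half_width]."""
    h = 2.0 * half_width / steps
    total = 0.0
    for k in range(steps):
        x = -half_width + (k + 0.5) * h
        total += phi(i, x) * phi(j, x)
    return h * total

print(inner(3, 3), inner(2, 4))           # close to 1 and 0
print(shapelet(0, 0, 0.0, 0.0, 2.0))      # peak of the pure Gaussian term for beta = 2
```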

In an actual implementation, we can only store a finite number of Shapelet coefficients, namely those with $n_1 + n_2 \le n_{max}$. However, truncating the series at $n_{max}$ allows us to capture non-Gaussian features only on scales in the approximate range between $\beta (n_{max} + 1)^{-1/2}$ and $\beta (n_{max} + 1)^{+1/2}$. Therefore, the factor β is needed to scale the Gaussian-like shape to the size of the object, and even with an optimal choice of β, more coefficients must be used to achieve a good approximation of objects that are less smooth.

6.2 Polar Basis Functions

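In the polar Shapelet expansion, 2-dimensional functions are described in a polar coordinate system using the following complex-valued basis functions:

$$\chi_{n,m}(r, \theta; \beta) = \frac{(-1)^{\frac{n-|m|}{2}}}{\beta^{|m|+1}} \left[ \frac{\left( \frac{n-|m|}{2} \right)!}{\pi \left( \frac{n+|m|}{2} \right)!} \right]^{1/2} r^{|m|}\, L^{|m|}_{\frac{n-|m|}{2}}\!\left( \frac{r^2}{\beta^2} \right) e^{-\frac{r^2}{2\beta^2}}\, e^{-im\theta}. \tag{28}$$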
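The index structure of the polar basis can be explored with a short enumeration. The sketch below assumes the usual convention that n starts at 0, that m runs from −n to n with the same parity as n, and that a truncated Cartesian expansion keeps the terms with n1 + n2 ≤ n_max; the function names are hypothetical.

```python
def polar_indices(n_max):
    """All (n, m) pairs with 0 <= n <= n_max, |m| <= n and n - m even."""
    return [(n, m)
            for n in range(n_max + 1)
            for m in range(-n, n + 1)
            if (n - m) % 2 == 0]

def cartesian_indices(n_max):
    """All (n1, n2) pairs with n1 + n2 <= n_max."""
    return [(n1, n2)
            for n1 in range(n_max + 1)
            for n2 in range(n_max + 1 - n1)]

polar = polar_indices(7)
cartesian = cartesian_indices(7)
counts = [len(polar_indices(k)) for k in (0, 2, 4, 7, 9, 14)]
print(len(polar), len(cartesian), counts)
```

Under these assumptions both bases hold the same number of coefficients for a given n_max, namely (n_max + 1)(n_max + 2)/2, which is consistent with a lossless conversion between them. The counts for n_max = 0, 2, 4, 7, 9 and 14 also coincide with the coefficient numbers 1, 6, 15, 36, 55 and 120 quoted in the caption of Fig. 2.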

The two parameters n and m are integers where n ≥ 0 and −n ≤ m ≤ n, and both must be odd or both must be even (there is no state with n = 2 and m = 1, for instance). The $L_p^q$ are the associated Laguerre polynomials. Shapelet coefficients are the complex numbers denoted as $f_{n,m}$. Many geometrical operations are very simple when using these polar Shapelet functions. Fortunately, once a function representation is available in the Cartesian form, it can be converted into the polar form and vice versa. The conversion is, again, a nested sum over the Shapelet coefficients. We omit the exact conversion here and refer to [13] for more mathematical details. For now, it is sufficient to state that the sometimes more useful polar representation can be efficiently computed from the Cartesian representation and vice versa.

6.3 Shapelet Operations

As arithmetic operations reduce simply to term-by-term operations after appropriate transforms, we restrict our attention to integral moments and geometric operations. Both the Hermite and Laguerre polynomials are discussed extensively in the mathematical literature, and in particular obey several important recursion relations. For the Hermite polynomials:

$$H_n(x) = 2x H_{n-1}(x) - 2(n-1) H_{n-2}(x) \tag{29}$$

$$\frac{dH_n(x)}{dx} = 2n H_{n-1}(x) \tag{30}$$

The first can be used to calculate the value of a basis function $\phi_n(x)$ at x with a complexity of at most O(n). In addition, together they can be used to analytically compute any integral moment using integration by parts. In particular, the simple definite integral over a one-dimensional basis function from a to b, where $\phi_n(x)$ now denotes the β-scaled function $\beta^{-1/2}\phi_n(x/\beta)$, is:

$$\begin{aligned}
I_n &:= \int_a^b \phi_n(x)\,dx = \left[ -\beta \sqrt{\tfrac{2}{n}}\; \phi_{n-1}(x) \right]_a^b + \sqrt{\tfrac{n-1}{n}}\; I_{n-2}, \\
I_0 &= \sqrt{\tfrac{1}{2}\,\beta\,\pi^{1/2}}\; \left[ \operatorname{erf}\!\left( \tfrac{x}{\sqrt{2}\,\beta} \right) \right]_a^b, \\
I_1 &= -\sqrt{2}\,\beta\, \left[ \phi_0(x) \right]_a^b.
\end{aligned} \tag{31}$$

These can be applied separately to the independent x and y pieces of the full Cartesian basis, giving the integral over any arbitrary rectangular region (including the infinite region of all space). The integral may then be computed with a complexity at most linear in the number of terms. Similar formulae may be derived for higher moments.
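The recursion in (31) translates directly into code. The sketch below is one possible reading of it, assuming that $\phi_n(x)$ in (31) stands for the β-scaled function $\beta^{-1/2}\phi_n(x/\beta)$ and that the error function is evaluated at $x/(\sqrt{2}\beta)$. The names basisValues and basisIntegrals are hypothetical and are not part of the library described here.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Values phi_0(x; beta) ... phi_nmax(x; beta) of the beta-scaled 1-D basis
// functions, computed in O(nmax) with the normalized three-term recurrence.
// (Hypothetical helper; the scaling convention is an assumption.)
std::vector<double> basisValues(int nmax, double x, double beta) {
    assert(nmax >= 0 && beta > 0.0);
    const double pi = std::acos(-1.0);
    const double u = x / beta;
    std::vector<double> phi(nmax + 1);
    phi[0] = std::exp(-0.5 * u * u) / std::sqrt(beta * std::sqrt(pi));
    for (int n = 1; n <= nmax; ++n) {
        const double lower = (n >= 2) ? phi[n - 2] : 0.0;
        phi[n] = std::sqrt(2.0 / n) * u * phi[n - 1]
               - std::sqrt((n - 1.0) / n) * lower;
    }
    return phi;
}

// Integrals I_0 ... I_nmax of the basis functions over [a, b], following (31).
std::vector<double> basisIntegrals(int nmax, double a, double b, double beta) {
    const std::vector<double> pa = basisValues(nmax, a, beta);
    const std::vector<double> pb = basisValues(nmax, b, beta);
    const double pi = std::acos(-1.0);
    const double s = std::sqrt(2.0) * beta;
    std::vector<double> I(nmax + 1);
    I[0] = std::sqrt(0.5 * beta * std::sqrt(pi))
         * (std::erf(b / s) - std::erf(a / s));
    for (int n = 1; n <= nmax; ++n) {
        const double lower = (n >= 2) ? I[n - 2] : 0.0;  // I_1 has no I_{-1} term
        I[n] = -beta * std::sqrt(2.0 / n) * (pb[n - 1] - pa[n - 1])
             + std::sqrt((n - 1.0) / n) * lower;
    }
    return I;
}
```

Under these assumptions, the integral of a 2-dimensional basis function over a rectangle is the product of one entry from the x integrals and one entry from the y integrals, so both tables need to be built only once per rectangle.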

As any general region can be represented as a set of rectangles (to some arbitrary precision, based on the number of rectangles used), any integral moment of a Shapelet can be computed by summing the integral moment over such a collection of rectangles. For many useful moments to be calculated over all space, the polar representation often provides even simpler methods [13]. For the explicit forms of geometric operations, we again defer to Massey and Refregier [13]. For our purposes it is sufficient to state that nearly all of these operations are simplest (linear in the number of coefficients) in the polar basis, which means that most of these operations may require conversion from the otherwise simpler (and non-complex valued!) Cartesian form.

General Input Methods Shapelet representations can be created using several methods, including the following:

Explicit Definition Any Gaussian-shaped fuzzy region whose center point is at (x, y), with a standard deviation of β, and whose integral over all space is a can be input as a Cartesian Shapelet at position (x, y) with scaling factor β and only one Shapelet coefficient $f_{1,1} = a$. A completely general bivariate Gaussian (with different scales along different axes, and potentially some rotation) is more complicated, requiring one of the methods below.

Operation Using basic operators like rescaling, transposition, or any other geometric transformation, new Shapelets can be created out of already existing ones.

Analytic Functions If the original objects are given in the form of an analytical function f, numerical methods can be used to solve the integrals which describe the projection of f onto the basis functions as given in (19).

Raster Image Data Quite often, a vague spatial object is given as an image where each pixel value corresponds to the relative “vagueness” of the square covered by this pixel. This raw data can be transformed into Shapelet representations using “discrete” versions of the basis functions together with a least-squares fit. The details of this method are described in the following section.

6.4

Raster Image Input

For input of raster image data we again refer to Massey and Refregier [13] for a complete discussion, though another method has been proposed by Berry et al. [9]. We will quickly review the more important aspects here. For each basis function and each pixel in the x-y plane, we compute the volume under the curve and above the respective pixel. The linear combination of the basis functions that approximates the pixel image then reduces to a single matrix multiplication. Consider a pixel image with width w and height h. Furthermore, assume we want to use n1 basis functions in the x direction and n2 basis functions in the y direction to approximate the pixel data. Let M then be a (w · h × n1 · n2)-matrix with its elements defined as follows:

$$m_{p,(n_i,n_j)} := \iint_{\text{pixel } p} \phi_{n_i,n_j}(x, y; \beta)\,dx\,dy = \iint_{\text{pixel } p} \phi_{n_i}(x; \beta)\,\phi_{n_j}(y; \beta)\,dx\,dy \tag{32}$$

Let us put all the Shapelet coefficients into one vector $f_{n_1,n_2}$ of length n1 · n2 and all original pixel values into a vector $f_{x,y}$ of length w · h. Then, the linear combination in (13) for the first n1 · n2 Shapelet coefficients is the following matrix multiplication:

$$f_{x,y} = M f_{n_1,n_2} \tag{33}$$

This linear equation system is over-determined; therefore, we use a least-squares fit to calculate $f_{n_1,n_2}$. By including a w · h × w · h covariance matrix V, it is even possible to include error estimates for the original pixel values. If no error estimates are available, the identity matrix can be used as V. The equation for calculating the Shapelet coefficients $f_{n_1,n_2}$ is then:

$$f_{n_1,n_2} = (M^T V^{-1} M)^{-1} M^T V^{-1} f_{x,y} \tag{34}$$

The integrals that compose the M matrix may be evaluated using the recursion relation integrals given in 6.3. Note that each set of one-dimensional integrals (an n1 · w or n2 · h vector) needs to be calculated only once; the matrix M is then the tensor product of the two vectors.

Output as Pixeled Data To visualize vague spatial objects, it is necessary to convert the Shapelet representation back to pixel images. Such a raster image can then be used to display the vague objects as a shading layer on top of other layers, for example in a map showing sharp objects. By (33), converting the Shapelet representation back into a raster image is just a multiplication with an appropriate M matrix. Again, this M matrix can be computed using the integrals given in 6.3.
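To make the fit in (34) and the output step in (33) concrete, here is a small self-contained sketch. It assumes a diagonal covariance matrix V (independent per-pixel variances) and solves the normal equations with a hand-written Gaussian elimination instead of the GNU Scientific Library, purely to stay self-contained. Matrix, solveLinear, fitCoefficients, and renderPixels are hypothetical names, not the library's interface.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

using Matrix = std::vector<std::vector<double>>;  // row-major, one row per pixel

// Solve A x = b for a small square system (Gaussian elimination, partial pivoting).
std::vector<double> solveLinear(Matrix A, std::vector<double> b) {
    const std::size_t n = b.size();
    for (std::size_t c = 0; c < n; ++c) {
        std::size_t p = c;
        for (std::size_t r = c + 1; r < n; ++r)
            if (std::fabs(A[r][c]) > std::fabs(A[p][c])) p = r;
        std::swap(A[c], A[p]);
        std::swap(b[c], b[p]);
        assert(A[c][c] != 0.0);  // singular systems are not handled in this sketch
        for (std::size_t r = c + 1; r < n; ++r) {
            const double f = A[r][c] / A[c][c];
            for (std::size_t k = c; k < n; ++k) A[r][k] -= f * A[c][k];
            b[r] -= f * b[c];
        }
    }
    std::vector<double> x(n);
    for (std::size_t i = n; i-- > 0;) {  // back substitution
        double s = b[i];
        for (std::size_t k = i + 1; k < n; ++k) s -= A[i][k] * x[k];
        x[i] = s / A[i][i];
    }
    return x;
}

// Equation (34) with V = diag(variance): solve (M^T V^-1 M) f = M^T V^-1 d.
std::vector<double> fitCoefficients(const Matrix& M,
                                    const std::vector<double>& variance,
                                    const std::vector<double>& pixels) {
    assert(!M.empty() && M.size() == variance.size() && M.size() == pixels.size());
    const std::size_t cols = M[0].size();
    Matrix A(cols, std::vector<double>(cols, 0.0));
    std::vector<double> rhs(cols, 0.0);
    for (std::size_t p = 0; p < M.size(); ++p) {
        const double w = 1.0 / variance[p];
        for (std::size_t i = 0; i < cols; ++i) {
            rhs[i] += M[p][i] * w * pixels[p];
            for (std::size_t j = 0; j < cols; ++j) A[i][j] += M[p][i] * w * M[p][j];
        }
    }
    return solveLinear(A, rhs);
}

// Equation (33): pixel values reconstructed from the coefficients.
std::vector<double> renderPixels(const Matrix& M, const std::vector<double>& coeffs) {
    std::vector<double> out(M.size(), 0.0);
    for (std::size_t p = 0; p < M.size(); ++p)
        for (std::size_t i = 0; i < coeffs.size(); ++i) out[p] += M[p][i] * coeffs[i];
    return out;
}
```

Passing a variance of 1 for every pixel corresponds to using the identity matrix as V. A full covariance matrix would need a general matrix inverse or factorization, which this sketch deliberately leaves out.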

7

Implementation

We have implemented Shapelets and operations on them as a C++ library. We chose C/C++ primarily for performance reasons; in addition, a good C/C++ library can act as a back-end for PostgreSQL, as well as for scripting languages such as Perl or Python. We chose to use the GNU Scientific Library [2] for expensive matrix multiplications and inversions. This is not only convenient, as we do not need to implement them on our own, but the GNU Scientific Library is itself a sophisticated, well-tested, and optimized framework for mathematical operations. Furthermore, the library supports different back-end libraries for matrix inversions, some of which can leverage special hardware features that can speed up the operations considerably.

We extended the open source database management system PostgreSQL by adding Shapelets as a new data type, together with some useful operations as operators. PostgreSQL was an ideal candidate because of its extensible framework for adding new database functions, data types, and operations. Furthermore, PostgreSQL allows database functions to be defined as wrappers around existing C functions, which made it very easy to combine the database system with our C++ library. Apart from this, functions can also be defined as Perl [3] code. Since we used Swig [6] to create a Perl module out of our Shapelet library, we could even use preliminary Shapelet routines defined only in Perl during database queries. In this section we first describe the architecture of our C++ implementation, including some pitfalls and how we decided to solve them. We then give an overview of the operations that our Shapelet data type currently supports. The source code (still under development), including the C++ library, the code for the PostgreSQL extension, and the Perl binding using Swig for rapid prototyping, can be downloaded at http://www.daniel-zinn.de/plone/studies/winter-2006/ecs-289f/project.

7.1

The Shapelet C++ Library

The main class in our Shapelet C++ library is Shapelet. This class defines the abstract Shapelet data type and provides a high-level interface to the user. In particular, the Shapelet class provides the following features:

– Memory management for the Shapelet’s data members, including its center position (x, y), scaling factor β, and the Shapelet coefficients.
– Transparent conversion and representation of the Shapelet using either polar or Cartesian basis functions.
– Changing the Shapelet’s resolution. The class provides a “resizing” constructor that allows creating Shapelets with more or fewer coefficients, and thus better or worse approximations.
– Input/output handling, as well as other operations on the Shapelet such as moving, scaling, performing arithmetic calculations, and computing convolutions or integrals.
– Standard geometric transforms.
– Serialization. For permanent and machine-independent storage of Shapelet data, the class provides routines for converting Shapelets to strings and vice versa.

Memory Management We use a plain C struct called RawShapelet as data container. This struct can contain either polar or Cartesian Shapelet data. The definition is shown in Listing 1.1. The members beta, x, and y denote the scaling factor β and the Shapelet’s center position (x, y), respectively. If the polar basis functions are used, polar is set to true, otherwise to false. Since the number of Shapelet coefficients can vary from Shapelet to Shapelet, this C struct has a variable size, which is always stored in size¹. All the Shapelet coefficients are stored as a plain double array which starts at the member position data; i.e., data is the first element of this array².

    typedef struct RawShapelet {
        int size;
        double beta;
        double x;
        double y;
        bool polar;
        double data; // starting element for data array
    } RawShapelet;

Listing 1.1. Definition of RawShapelet

¹ This convention for variable-sized data types is also used in PostgreSQL.
² Since C does not support growing data structures, the struct does not explicitly contain the data array. However, in our code data is treated as an array using pointer arithmetic.

For simplicity, we decided to use the same number nmax of coefficients for the two dimensions. The Shapelet coefficients therefore form a square matrix. As indicated in [13], only the upper left coefficients of the $f_{n_1,n_2}$ matrix are usually significantly different from 0. These are the coefficients for which n1 + n2 ≤ nmax. We therefore use and store only this upper left part of the matrix. To map these ½ nmax · (nmax + 1) coefficients into our flat data array, we wrote a separate Addresser class that implements an index mapping for storing triangular matrices in a packed fashion. We used a scheme similar to those in linear algebra packages such as LAPACK, where $a_{ij}$ is stored in data[i + j · nmax − j(j − 1)/2] for j ≤ nmax − i (note that both i and j start at 0). For embedding the complex coefficients of polar Shapelets into the flat array, we leveraged the fact that, when approximating real-valued functions, some coefficients are the complex conjugates of others ($f_{n,m} = f_{n,-m}^{*}$). We were therefore able to store the polar Shapelet coefficients in an array equal in size to the array for the Cartesian case.

Apart from this, we allow users of the library to allocate and free the necessary memory for the RawShapelet structures themselves if desired. Inside the Shapelet C++ class we provide functionality for wrapping the C++ class around an existing RawShapelet. Because the raw data is used directly inside the C++ class, this method is very fast and does not result in expensive memory operations. The user can thus take care of the storage of the simple C-struct-like raw Shapelets (possibly including features like check-pointing, fail-safety, etc.), but is also able to create a “user-friendly” C++ version in case this is needed.

Cartesian vs. Polar Shapelets The C++ class offers a general “Shapelet” abstraction to the user. Since there are operations that should be performed on polar Shapelets and other operations that should be performed on Cartesian Shapelets, the C++ class transparently transforms between these two representations when needed. These transformations are done in a “lazy” fashion, i.e., they are only performed when needed.

Operations Besides supporting basic arithmetic operations like addition, multiplication, or scaling of the values, the C++ class also provides methods for integration and calculation of different moments. Furthermore, standard geometric operations like move, rotate, or shear are also supported. For those operations, the Shapelet class performs the necessary pre-processing steps, like polar vs. Cartesian conversion or β/center-point unification, on its own if needed.

Shapelets can be created based on image data. We designed and implemented a basic pixel-image class that contains the data, the weights (covariance matrix), and a list of bad pixels (those that should not be considered during the least-squares fit). Based on this class, Shapelets can be created that approximate the given data using the least-squares fitting method described above. Furthermore, Shapelets and sets of Shapelets can also be exported as raster-image data. The resolution and position of the image can be chosen by the user. Currently our basic image class has specializations that support the PNG [4] as well as the FITS (Flexible Image Transport System) [1] image formats.

For storing and transferring Shapelet data machine-independently, we added serialization methods that transform Shapelets to ASCII strings and vice versa. These routines are, for example, used in the context of PostgreSQL input/output functions.

7.2

The Shapelets Perl Module

We used the Swig [6] framework to create a Perl [3] module that provides a convenient way of accessing the C++ library as Perl shadow classes. Besides making our framework available as a Perl module, we used the Perl binding for rapid prototyping and testing.

7.3

Integration Into PostgreSQL

We added a new Shapelet data type to PostgreSQL. We built some small C wrapper functions to bind the most important C++ class methods to operators inside PostgreSQL, using the C binding feature of PostgreSQL. As “in” and “out” functions we used the Shapelet’s string conversion routines. Currently supported operators are get/set scaling factor β, get/set the Shapelet’s center point (x, y), and display the Shapelet as a PNG image. Since PostgreSQL is able to use Perl as a language for implementing database functions, we could even use Perl and our module inside the database system to define higher-level operations.

Indexing Mechanisms Since our basis functions are scaled using the Gaussian function $e^{-x^2/(2\beta^2)}$, the approximated function’s values are usually low at large distances from the center point. In fact, for each threshold t we can calculate an r based on β and the Shapelet coefficients such that ∀x, |x| > r : f(x) < t holds. If the Shapelets model fuzzy regions, it is therefore possible to find minimum bounding circles (or boxes) for each threshold certainty t. Based on a given threshold, say 0.05, it is then possible to materialize minimum bounding boxes for all objects. Existing index structures for crisp regions, like R-trees, can then be used to efficiently access and query all Shapelets in the database system. If the Shapelets model PDFs, it often makes more sense to define regions for which the probability that the point is inside is more than a threshold t. This probability corresponds to the volume under the PDF f. If f is a PDF, then σ², defined in (4), is indeed the variance. We can then use Chebyshev’s inequality to find a radius r for each threshold t, such that

$$\iint_{|x| > r} f(x)\, d^2x < t.$$
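For the PDF case, the Chebyshev bound can be sketched as follows. The code assumes that σ² is the expected squared distance of the point from the Shapelet center, so that the probability mass outside radius r is at most σ²/r²; whether this matches the definition in (4) exactly is an assumption. BoundingBox, chebyshevRadius, and boundingBox are hypothetical names.

```cpp
#include <cassert>
#include <cmath>

// Axis-aligned box that could be handed to an index for crisp regions,
// such as an R-tree. (Hypothetical type, not part of the described library.)
struct BoundingBox {
    double xmin, ymin, xmax, ymax;
};

// Chebyshev's inequality: P(|X - c| >= r) <= sigma2 / r^2.
// Choosing r = sqrt(sigma2 / t) bounds the mass outside radius r by t.
// Assumption: sigma2 is the expected squared distance from the center c.
double chebyshevRadius(double sigma2, double t) {
    assert(sigma2 >= 0.0 && t > 0.0 && t <= 1.0);
    return std::sqrt(sigma2 / t);
}

// Smallest axis-aligned box containing the circle of radius r around (cx, cy).
BoundingBox boundingBox(double cx, double cy, double r) {
    assert(r >= 0.0);
    return BoundingBox{cx - r, cy - r, cx + r, cy + r};
}
```

The bound is distribution-free and therefore conservative: for a nearly Gaussian PDF the actual mass outside r is far smaller than t, so the resulting boxes are larger than strictly necessary but never too small.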
