Data Modeling for Virtual Observatory Data Mining

Data Modeling of deep sky images

James Handley* (a), Holger Jaenisch (a,b), Albert Lim (a), Graeme White (a), Alex Hons (a), Miroslav Filipovic (c,a), Matthew Edwards (b)

(a) James Cook University, Centre for Astronomy, Townsville QLD 4811, Australia
(b) Alabama Agricultural and Mechanical University, Department of Physics, Huntsville, AL 35811
(c) University of Western Sydney, Locked Bag 1797 PENRITH SOUTH DC NSW 1797, Australia

ABSTRACT

We present a method for simulating CCD focal plane array (FPA) images of extended deep sky objects using Data Modeling. Data Modeling is a process of deriving functional equations from measured data. These tools are used to model FPA fixed pattern noise, shot noise, non-uniformity, and the extended objects themselves. The mathematical model of the extended object is useful for correlation analysis and other image understanding algorithms used in Virtual Observatory Data Mining. We apply these tools to the objects in the Messier list and build a classifier that achieves 100% correct classification.

Keywords: Data Modeling, Virtual Observatory, astronomy, deep-sky, image modeling, noise modeling, Messier classification, functional modeling, component modeling

1. INTRODUCTION

Data Modeling [1,2] is the process of deriving a mathematical expression from measured data. Two approaches for modeling focal plane array (FPA) images are component based modeling and functional modeling. In component based modeling, the deep sky object and each noise effect are modeled independently, and the arrays for each effect are stacked into a final image. Component based modeling uses simple equations or statistical models to represent deep sky objects and noise effects. In contrast, functional modeling of deep sky objects is done strictly using univariate or multivariate equations. These equations are continuous, and their nth order derivatives are available for analysis. Functional modeling also provides a common basis for robust image analysis.

2. COMPONENT BASED DATA MODELING

In component based Data Modeling, deep sky objects and noise effects (thermal, shot, fixed pattern, FPA nonuniformity, and hot/dead pixels) are generated independently. The independent models are then stacked to build a Component Based Image Data Model. The Moffat function [3]

    f(x, y) = \left[ \frac{(x - x_0)^{\alpha} + (y - y_0)^{\gamma}}{\rho} + 1 \right]^{-\beta}    (1)

is used for modeling point-spread functions. The control parameters α, β, γ, and ρ modify the size, shape, and extent of the 2-D distribution. Deep sky objects are modeled as a sum of Moffat functions with suitably selected coefficients. This Data Model of M51 used 2 Moffat functions, with coefficients obtained using a genetic algorithm (GA) [4], yielding

    f(x, y) = 3.0 \left[ \frac{(x - 35)^{1.55} + ((y - 20)(0.707))^{1.55}}{0.6} + 1 \right]^{-0.5} + 1.5 \left[ \frac{(x - 28)^{1.4} + ((y - 45)(-0.342))^{1.4}}{0.6} + 1 \right]^{-0.6}    (2)

*[email protected]; phone 1 256 337 3769; James Cook University; Centre for Astronomy; Townsville, QLD 4811; AUSTRALIA.
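As a rough illustration of Equation (2), the two-component model can be evaluated on a pixel grid. This is a minimal sketch rather than the authors' implementation: the function names, the 64 x 64 grid, and the use of absolute values (so that fractional powers of negative offsets stay real) are assumptions that the paper does not state.

```python
def moffat(x, y, x0, y0, amp, power, beta, rho=0.6, y_scale=1.0):
    """One Moffat-style component in the form used by Equation (2).

    abs() is an assumption added here so that fractional powers of
    negative offsets remain real numbers.
    """
    dx = abs(x - x0) ** power
    dy = abs((y - y0) * y_scale) ** power
    return amp * ((dx + dy) / rho + 1.0) ** (-beta)


def m51_model(x, y):
    """Sum of the two components with the coefficients quoted in Equation (2)."""
    core_a = moffat(x, y, 35, 20, amp=3.0, power=1.55, beta=0.5, y_scale=0.707)
    core_b = moffat(x, y, 28, 45, amp=1.5, power=1.4, beta=0.6, y_scale=-0.342)
    return core_a + core_b


def render(size=64):
    """Evaluate the model on a size x size grid (the grid size is an assumption)."""
    return [[m51_model(x, y) for x in range(size)] for y in range(size)]
```

Each component peaks at its own center, (35, 20) and (28, 45), which corresponds to the two bright cores of M51 described in the text.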

Next, noise effects on the FPA and the surrounding field stars are modeled using different types of random number distributions. The use of random numbers yields statistical rather than functional models. To eliminate random number generators, we propose using sin x^e (the Sxe function) as a pseudo random number approximating function [5]. The six noise components, Equations (3) through (8), are

Nonuniformity:
    f(i, j) = D \left[ \frac{\sum\sum_x \frac{\sin x^e}{2} - \mathrm{Min}}{\mathrm{Max} - \mathrm{Min}} \right]

Thermal:
    f(i, j) = C \left[ \frac{\sum_x \frac{\sin x^e}{2} - \mathrm{Min}}{\mathrm{Max} - \mathrm{Min}} \right]

Fixed Pattern:
    f(i, j) = A \, \frac{\sin x^e + 1}{2}

Shot Noise and Field Stars:
    f(i, j) = B \left( \sqrt{-2 \log\left( \frac{\sin x^e + 1}{2} \right)} \, \sigma \right) \sin\left( 2\pi \, \frac{\sin x^e + 1}{2} \right), \qquad N = \frac{\sin x^e + 1}{2} \, size

Dead pixels:
    f(i, j) = 0, \qquad N = E \, \frac{\sin x^e + 1}{2} \, size^2

Hot pixels:
    f(i, j) = 255, \qquad N = F \, \frac{\sin x^e + 1}{2} \, size^2

These functions use only an x-index value, but map into 2-D (i,j) using Hilbert sequencing. The number of Moffat functions needed in the final model and their coefficients are determined using a genetic algorithm (GA). Component models yield short numerical equations, but these are of limited fidelity and are not obtained in real time. Figure 1 depicts a Component Data Model of M51. Because M51 exhibits two bright cores, the genetic algorithm chose two Moffat functions to represent it. Comparison of the original image on the bottom left with the final model shows good representation of the salient image features: similar placement of the main cores, similar magnitude and extent of the cores, a similar number of field stars, and even hints of spiral structure.
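The Sxe idea from Section 2, where sin(x^e) stands in for a random number generator, can be sketched as follows. This is an interpretation under stated assumptions, not the authors' code: the integer index starting at 1, the function names, and the reading of the thermal term as a min/max-normalized running sum are all assumptions.

```python
import math


def sxe(x):
    """Sxe function sin(x^e), used in place of a random number generator."""
    return math.sin(x ** math.e)


def sxe_unit(x):
    """Map sin(x^e) from [-1, 1] onto [0, 1]."""
    return (sxe(x) + 1.0) / 2.0


def fixed_pattern(n, amplitude):
    """Fixed pattern term: amplitude * (sin(x^e) + 1) / 2 for x = 1..n."""
    return [amplitude * sxe_unit(x) for x in range(1, n + 1)]


def normalized_running_sum(n, amplitude):
    """Thermal-style term: running sum of sin(x^e)/2, rescaled by its min and max.

    Reading the summation as a running (cumulative) sum is an assumption.
    """
    total, sums = 0.0, []
    for x in range(1, n + 1):
        total += sxe(x) / 2.0
        sums.append(total)
    lo, hi = min(sums), max(sums)
    return [amplitude * (s - lo) / (hi - lo) for s in sums]
```

Both functions return 1-D sequences indexed by x; placing them into the 2-D (i, j) array would then follow the Hilbert sequencing mentioned in the text. Because the sequence is a deterministic function of x, the same index always reproduces the same "noise" value.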

3. FUNCTIONAL DATA MODELING

3.1 Univariate model - Turlington polynomial

The Turlington polynomial yields one equation with continuous derivative per data set of the form

    T(x) = y_1 + m_1 (x - x_1) + \sum_{j=2}^{n-1} 0.001 \, (m_j - m_{j-1}) \log_{10}\left[ 1 + 10^{(x - x_j)/0.001} \right], \qquad m_j = \frac{y_{j+1} - y_j}{x_{j+1} - x_j}    (9)

where x_j and y_j are the original (x, y) data points and n is the number of points from the original data used for building the Turlington polynomial. The variable n can cover either all of the points or a sub-sampled set of the original. The Turlington polynomial is built in a piecewise linear fashion, one point at a time, as data arrives, which lends itself to real-time construction. The logarithm term makes Turlington polynomials both orthogonal and differentiable. One drawback of the Turlington polynomial is its use of n terms, equal to the number of points in the data set being modeled. This yields fast, streaming, real-time derivatives of terms, but an exceptionally large final model [6,7].

3.2 Univariate model - Eigenfunction

Eigenfunctions [1] approximate T(x) in compact form. T(x) models require that coefficients describing each data segment be stored, so their size depends on the data-sampling rate. Using eigenfunctions and the method of residuals, T(x) is approximated by

    T(x) \cong \sum_{j=1}^{n} A_j \cos \omega_j x + i B_j \sin \omega_j x    (10)

In Equation (10), A_j is the amplitude of the jth eigenfunction term, B_j is the phase of the jth eigenfunction term, and ω_j is 2π times the frequency f defined by the jth derived dominant Fourier frequency term. Dominant eigenfunction terms are added one at a time until the correlation between T(x) in Equation (9) and the eigenfunction sum in Equation (10) converges. Pseudo code used for creating eigenfunction models is shown in Figure 2. These smaller models are orthogonal, differentiable, and Lebesgue integrable. However, they require multiple Fourier Transforms, making them memory and compute intensive.

3.3 Multivariate functional Data Modeling

A functional is a function whose variables are themselves functions. We approximate the Kolmogorov-Gabor polynomial with Ivakhnenko's Group Method of Data Handling [8] (GMDH), using nested functionals of the form

    y(x_1, x_2, \ldots, x_L) = f(y(b_1(y(b_2(\ldots(y(b_n(y(x_i, x_j, x_k)))))))))

    O[y(x_1, x_2, \ldots, x_L)] = 3^n    (11)

This structured approach forms intermediate meta-variables, each a single fused output built from a combination of three inputs. The fused meta-variable becomes a new input variable available at the next layer. Since the algorithm only uses the inputs necessary to achieve convergence, pruning of inputs is automatic and requires no external intervention, enabling unsupervised learning under proper circumstances. Like T(x), this Functional Data Model is derived in near real-time, and like the eigenfunction model, it yields a substantially shorter final model. If derived from derivative data, it will yield a differential equation model. The algorithm for this process is listed in Reference 1.

3.4 2-D image to 1-D conversion

Functional modeling of images requires transformation from 2-D into 1-D without 2-D decorrelation. Several methods exist, including raster scanning and fixed pattern readout such as zigzag sequencing. However, only Hilbert sequencing preserves 2-D correlations at dyadic sample sizes. Hilbert sequencing is illustrated graphically in Figure 3 and is given by

    H_{n+1} = w_1(P_n) \cup w_2(P_n) \cup w_3(P_n) \cup w_4(P_n)
    P_{n+1} = w_1(H_n) \cup w_2(P_n) \cup w_3(P_n) \cup w_4(P_n)    (12)

subject to Lindenmayer's L-system grammars represented and defined by

    L \to +RF-LFL-FR+
    R \to -LF+RFR+FL-
    F \to F
    + \to +
    - \to -
    F \Rightarrow (x, y, \alpha) \to (x + l\cos\alpha, \; y + l\sin\alpha, \; \alpha)
    + \Rightarrow (x, y, \alpha) \to (x, y, \alpha - \delta)
    - \Rightarrow (x, y, \alpha) \to (x, y, \alpha + \delta)
    \delta = 90^\circ    (13)

Pseudo code for generating the Hilbert sequence is readily available [3].
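For readers who want a concrete starting point, Hilbert sequencing of a square image can be sketched with the common iterative index-to-coordinate conversion. This is a swapped-in technique, not the L-system formulation of Equations (12) and (13) and not the pseudo code of Reference 3; the function names and the orientation of the curve are assumptions.

```python
def hilbert_d2xy(side, d):
    """Convert position d along a Hilbert curve into (x, y) on a side x side grid.

    side must be a power of two (a dyadic size such as 64).
    """
    x = y = 0
    t = d
    s = 1
    while s < side:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:
            # Rotate or flip the quadrant so sub-curves join end to end.
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y


def hilbert_flatten(image):
    """Read a square 2-D list of pixels in Hilbert order and return a 1-D list."""
    side = len(image)
    order = (hilbert_d2xy(side, d) for d in range(side * side))
    return [image[y][x] for x, y in order]
```

Consecutive positions along the curve are always adjacent pixels, which is why neighboring samples in the 1-D waveform remain neighbors in the 2-D image.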

4. APPLICATIONS

4.1 Training cases

We chose a bitmap database of 110 Messier objects for use in classifier construction. The classifier flowcharts in Figures 4 and 5 are described in Sections 4.4 and 4.5. In our database, M102 (a repeat of M101) was removed and replaced with NGC 5866 [9]. Each bitmap is a 24-bit color image, and the bitmaps vary in size. Bitmaps were resized to 64x64, generating a thumbnail of each image. Thumbnails were then converted to gray scale and Hilbert sequenced into 1-D.
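The thumbnail and gray scale steps described above could look like the following sketch. The paper does not say which resampling method or gray scale weights were used, so nearest-neighbor sampling and the common 0.299/0.587/0.114 luminance weights are assumptions, as are the function names.

```python
def to_gray(pixel):
    """Convert one (r, g, b) pixel to a gray level using assumed luminance weights."""
    r, g, b = pixel
    return round(0.299 * r + 0.587 * g + 0.114 * b)


def resize_nearest(image, size=64):
    """Resize a 2-D list of pixels to size x size by nearest-neighbor sampling."""
    rows, cols = len(image), len(image[0])
    return [
        [image[(y * rows) // size][(x * cols) // size] for x in range(size)]
        for y in range(size)
    ]


def thumbnail_gray(image, size=64):
    """Resize first, then convert each pixel to gray scale."""
    return [[to_gray(p) for p in row] for row in resize_nearest(image, size)]
```

The resulting 64 x 64 gray scale array is the input that would then be Hilbert sequenced into a 4096-point 1-D waveform.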

4.2 Turlington Data Models

Turlington polynomials given in Equation (9) were used to construct Data Models of the entire image. Because Turlington polynomials are continuous functions, they can be interpolated to any resolution. Figure 6 compares the original and Turlington Data Models for M13 (cluster), M20 (nebula), and M101 (galaxy). T(x) models can suppress noise if the fitting parameter, currently set to 0.001 in Equation (9), is increased.

4.3 Eigenfunction Data Models

Eigenfunction models of Messier objects consisting of 5, 10, and 20 terms were built. Using more than 5 terms gave only marginal improvement in correlation, yet doubled or quadrupled the model size. Therefore, we limited our models to 5 terms.
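Equation (9), used for the Turlington Data Models of Section 4.2, can be evaluated directly. The following is a minimal sketch, not the authors' code: the function names are illustrative, and the rewritten logarithm (an algebraically equivalent form that avoids overflow when (x - x_j)/0.001 is large) is an implementation choice made here.

```python
import math


def _log10_one_plus_pow10(z):
    """Compute log10(1 + 10**z) without overflow for large |z|."""
    if abs(z) > 30.0:
        # 10**(-|z|) is negligible at double precision beyond this point.
        return max(z, 0.0)
    return max(z, 0.0) + math.log10(1.0 + 10.0 ** (-abs(z)))


def turlington(xs, ys, d=0.001):
    """Build T(x) from Equation (9); d is the fitting parameter (0.001 in the text)."""
    slopes = [(ys[j + 1] - ys[j]) / (xs[j + 1] - xs[j]) for j in range(len(xs) - 1)]

    def T(x):
        value = ys[0] + slopes[0] * (x - xs[0])
        # One term per interior data point, matching the sum from j = 2 to n - 1.
        for j in range(1, len(slopes)):
            change = slopes[j] - slopes[j - 1]
            value += d * change * _log10_one_plus_pow10((x - xs[j]) / d)
        return value

    return T
```

With d = 0.001 the curve follows the piecewise linear data closely and only rounds the corners within a few thousandths of each knot; a larger d widens that rounding, which is the noise suppression effect mentioned in Section 4.2.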

Figures 7-9 contain a library of eigenfunction Data Models, one for each of the 110 Messier objects. Each object is plotted next to 3 data graphs. The middle graph is the Hilbert sequenced original waveform, the bottom graph is the eigenfunction model, and the top graph is a special histogram derived from the eigenfunction model that is discussed in more detail in Section 4.4. Data Models of images with the object centrally located and somewhat symmetrical (except for open clusters) contained three dominant peaks. These peaks were generally wider for extended objects. Open clusters did not show these peaks. Rather, their waveforms displayed 1/f structure [5] (Figures 7-9).

4.4 Change detection Data Model

The top graphs in Figures 7-9 are created by generating a double histogram with mode subtraction. First, the histogram of the data is calculated with the number of bins equal to the number of points. This histogram is normalized to 0-255, and a new histogram (same number of bins) is calculated from it. The mode of this second histogram is removed, leaving the modified 2nd order histograms shown. Characterization of these histograms using descriptive statistics provides features for a hybrid change detector. Reference 1 describes change detection theory and descriptive features in detail. Our statistical feature classifier correctly identified 107 out of 110 objects. Figure 4 is a flowchart of the change detector, and the details are as follows:

Our first change detector, an 8 layer [O(3)^8] polynomial, was constructed to identify clusters as nominal and other objects (nebulae and galaxies) as off-nominal. When this classifier was tested against cluster data, it achieved 100% correct classification. However, when galaxies and nebulae were presented, 19 were mislabeled as clusters. A classifier was therefore constructed to resolve differences between clusters and the mislabeled objects (galaxies and nebulae). This resolver is a 10 layer [O(3)^10] polynomial Data Model that only post-processes the data sets labeled as "clusters". Our resolver correctly identified all nebulae, all clusters except M35, and all galaxies except M108. With clusters removed, the remaining objects are passed through a second change detector, an [O(3)^8] polynomial, that correctly identified all presented galaxies and all nebulae except M78 (which was mislabeled as a galaxy).

Additional unusual and potentially difficult test cases were selected to explore the performance envelope of the classifier. We chose NGC869 (double cluster), the Large Magellanic Cloud (LMC), Small Magellanic Cloud (SMC), Comet Hale-Bopp, and Comet Neat, shown in Figure 10. These cases were presented to the change detector for assignment:

Object           Cluster CD     Resolver       Galaxy CD      Interpretation
Double Cluster   0.5002         -5.2 x 10^8    N/A            Unlike any Messier list object
LMC              0.5            0.9558         N/A            Cluster
SMC              0.5517         N/A            3.1 x 10^7     Nebula
LMC+SMC          3.1 x 10^9     N/A            3.4 x 10^4     Nebula
Hale-Bopp        4.7 x 10^10    N/A            0.5            Galaxy
Neat             9.9 x 10^9     N/A            4.0 x 10^7     Nebula
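The double histogram with mode subtraction described in Section 4.4 can be sketched as below. The binning convention (equal-width bins between the minimum and maximum) and the reading of "mode removed" as zeroing the most populated bin are assumptions; the function names are illustrative.

```python
def histogram(values, bins):
    """Count values into equal-width bins spanning min(values)..max(values)."""
    lo, hi = min(values), max(values)
    span = hi - lo
    counts = [0] * bins
    for v in values:
        k = int((v - lo) / span * bins) if span else 0
        counts[min(k, bins - 1)] += 1
    return counts


def rescale_0_255(values):
    """Linearly rescale values onto the range 0-255."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [255.0 * (v - lo) / (hi - lo) for v in values]


def double_histogram(signal):
    """Histogram of the normalized histogram, with its mode bin set to zero."""
    bins = len(signal)  # number of bins equals number of points, as in the text
    first = histogram(signal, bins)
    second = histogram(rescale_0_255(first), bins)
    mode_bin = second.index(max(second))
    second[mode_bin] = 0  # assumed meaning of "the mode of this histogram is removed"
    return second
```

Descriptive statistics of the returned sequence would then serve as the change detector features; which statistics were used is described in Reference 1 and is not reproduced here.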

4.5 Stellar object classification

5-term eigenfunction models are constructed for each Messier object [10,11], yielding 16 coefficients shown as a cluster plot in Figure 11 and given in tabular form in Figure 12. These features are used to build a Data Model classifier. First, we reduced the dimension of the output to two classes, forming a cascade. The first classifier separates clusters from the initial pool of objects (clusters, nebulae, and galaxies). Once clusters are identified, a second classifier distinguishes between nebulae and galaxies. Figure 5 is a flowchart of this classifier.

Using this approach, a 10 layer Data Model ([O(3)^10] polynomial) was constructed that correctly distinguished all 57 clusters from galaxies and nebulae. Next, a 2 layer Data Model ([O(3)^2] polynomial) was constructed that correctly classified the remaining 40 galaxies and 13 nebulae into their proper classes. The total classifier uses both Data Models: the first identifies clusters, while the second determines whether the object is a galaxy or a nebula. We obtained 100% correct classification for the Messier list of objects. The 2 combined Data Models needed only 13 of the 16 available features from Figure 12. The three features not used were B(1), w(4)/2π, and B(5). Our classifier also detects novelties or changes. If the equation yields a value x outside the bounds -0.0862 < x < 1.1313, the object is flagged as unique, meaning that it exhibits characteristics not observed in the Messier training set.
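The novelty test at the end of the preceding paragraph is a simple bounds check on the Data Model output. A minimal sketch, with an illustrative function name and the bounds taken from the text:

```python
LOWER_BOUND = -0.0862
UPPER_BOUND = 1.1313


def is_novel(dm_value, lower=LOWER_BOUND, upper=UPPER_BOUND):
    """Return True when the Data Model output falls outside the training bounds."""
    return not (lower < dm_value < upper)
```

For example, the outputs reported in this section for the Double Cluster (1.5 x 10^14) and the combined LMC and SMC (3.5 x 10^9) fall outside the bounds, while values such as 0.2843 or 1.0269 fall inside them.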

Figure 11 shows a graph using the best two discriminating features (A0 and A1) from Figure 12. These features were automatically selected by calculating the separability between the 3 classes of objects in all 16 dimensions of the feature space, and plotting the features that maximized the minimum cluster separability using a K-factor defined as

    K = \frac{\mu_2 - \mu_1}{\sqrt{\frac{\sigma_2^2}{N_2} + \frac{\sigma_1^2}{N_1}}}    (14)

where µi is the mean of each group, Ni is the number of points in each group, and σi is the standard deviation of each group. Unusual cases were presented to the classifier to score. Our classifier assigned them as follows:

Object                          DM Value       Interpretation
NGC 869 (Double Cluster)        1.5 x 10^14    Unlike any Messier list object
Large Magellanic Cloud (LMC)    0.2843         Nebula
Small Magellanic Cloud (SMC)    1.0173         Galaxy
Combined LMC and SMC            3.5 x 10^9     Unlike any Messier list object
Comet Hale-Bopp                 0.9991         Cluster
Comet Neat                      1.0269         Cluster

These cases are unusual because they do not resemble any Messier object. The Double Cluster and combined LMC and SMC were flagged as novelties. We found our Data Model classifier can determine when new class definitions are required without supervision.
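The feature selection behind Figure 11, which ranks features by the minimum pairwise K-factor of Equation (14), can be sketched as follows. Reading the denominator of Equation (14) as a square root, using the sample standard deviation, and taking the absolute value of K when ranking are assumptions; the function names and the data layout are hypothetical.

```python
import math
import statistics
from itertools import combinations


def k_factor(group1, group2):
    """K-factor of Equation (14) for one feature measured in two groups."""
    mu1, mu2 = statistics.mean(group1), statistics.mean(group2)
    var1, var2 = statistics.variance(group1), statistics.variance(group2)
    return (mu2 - mu1) / math.sqrt(var2 / len(group2) + var1 / len(group1))


def rank_features(classes):
    """Rank feature indices by their minimum pairwise |K| across all classes.

    classes maps a class name to a list of equal-length feature vectors.
    """
    groups = list(classes.values())
    n_features = len(groups[0][0])

    def worst_pair(f):
        return min(
            abs(k_factor([v[f] for v in a], [v[f] for v in b]))
            for a, b in combinations(groups, 2)
        )

    return sorted(range(n_features), key=worst_pair, reverse=True)
```

Taking the first two indices returned by rank_features corresponds to choosing the two features that maximize the minimum class separability. Each group needs at least two points with some spread, otherwise the variance call or the division fails.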

5. SUMMARY

In conclusion, two approaches for modeling deep sky images were successfully demonstrated. Component based modeling allows very simple equations to be built. Functional modeling was demonstrated using two different techniques: Turlington polynomials for real-time applications, and eigenfunctions for short models. Change Detectors exhibited very good exception handling of novel examples. Functional Data Modeling resulted in a classifier that correctly identified all 110 of the Messier objects and performed reasonably well when classifying unusual objects.

ACKNOWLEDGMENTS The authors would like to thank Scott McPheeters, Tim Aden, and John Deacon for their continued support during the course of this work.

REFERENCES

1. Jaenisch, H., Handley, J., Lim, A., Filipovic, M.D., White, G., Hons, A., Crothers, S., Deragopian, G., Schneider, M., Edwards, M., “Data Modeling for Virtual Observatory data mining”, Proceedings of SPIE Vol. 5493 (2004).
2. Lim, A., Jaenisch, H., Handley, J., Berrevoets, C., White, G., Deragopian, G., Payne, J., Schneider, M., “Image Resolution and Performance Analysis of Webcams for Ground Based Astronomy”, Proceedings of SPIE Vol. 5489 (2004).
3. Jaenisch, H.M., Handley, J.W., Scoggins, J., Carroll, M.P., “ISIS: An IR Seeker Model Incorporating Fractal Concepts”, Proceedings of SPIE Vol. 2225 (1994).
4. Jaenisch, H.M., Handley, J.W., “Automatic Differential Equation Data Modeling for UAV Situational Awareness”, Society for Computer Simulation, Huntsville Simulation Conference 2003 (October 2003).
5. Jaenisch, H., Handley, J., “Data Modeling of 1/f noise sets”, Proceedings of SPIE Vol. 5114 (2003).
6. Jaenisch, H.M., Handley, J.W., “Data Modeling for Radar Applications”, Proceedings of the IEEE Radar Conference 2003 (May 2003).
7. Jaenisch, H.M., Handley, J.W., Faucheux, J.P., “Data Driven Differential Equation Modeling of fBm processes”, Proceedings of SPIE Vol. 5204 (2003).
8. Madala, H.R., Ivakhnenko, A.G., Inductive Learning Algorithms for Complex Systems Modeling, Boca Raton, FL: CRC Press, 1994.
9. “Messier Objects”, http://www.3towers.com/messier.htm, June 1, 2004.
10. Jaenisch, H.M., Filipovic, M.D., “Classification of Jacoby Stellar Spectra Using Data Modeling”, Proceedings of SPIE Vol. 4816 (2002).
11. Jaenisch, H.M., Collins, W.J., Handley, J.W., Hons, A., Filipovic, M.D., Case, C.T., Songy, C.G., “Real-time visual astronomy using image intensifiers and Data Modeling”, Proceedings of SPIE Vol. 4796 (2002).

FIGURES

Fig. 1: Component based Image Data Model for M51, including surrounding star field and FPA noise effects. Panels: Nonuniformity; Fixed Pattern Noise; M51; M51 Data Model; Thermal Noise; Shot Noise & Hot/Dead Pixels; M51 with Noise Effects.

rem initialize
redim xdata(n),ydata(n),extreme(n),totaldat(n),r(n),c(n),newdata(n),newdata1(n),r1(n),c1(n),d(n)
rem open x and y data array files
call open_data_files(xdata,n)
call open_data_files(ydata,n)
rem sort x into ascending order and sort y with x to retain x associations
call sort2(n,xdata,ydata)
rem put ydata into newdata; newdata holds the residual still to be modeled
for i=1 to n
  newdata(i)=ydata(i)
next i
rem specify objective correlation value
cobj=0.99
rem calculate Data Modeling eigenfunction model of y(x)
rem final result held in totaldat; initialize k1 to hold number of terms
k1=0
10 continue
k1=k1+1
rem generate Fourier transform of the residual and corresponding power spectra
rem use linear regression to fit a line through the dB power spectra
rem identify maxima in power spectra that occur above the linear fit
rem pull out maxima locations in array extreme
rem find maximum value in array extreme, and use location to extract real
rem and imaginary Fourier terms held in r (real) and c (imaginary) for use
rem in eigenfunction model
call calc_extremes(newdata,n,extreme,r,c)
rem find maximum value location; only look at first half of data since symmetric
rem ignore zeroth order location
n1=int(n/2)
dmax = extreme(1)
dloc = 1
FOR i = 2 TO n1 - 2
  IF extreme(i) > dmax THEN
    dmax = extreme(i)
    dloc = i
  END IF
NEXT i
rem r1 and c1 hold eigenfunction coefficients, d the frequency term
r1(k1)=r(dloc)
c1(k1)=c(dloc)
d(k1)=dloc
rem construct sine and cosine representation of dloc frequency at sampling equal to original
rem clear the current term before building it
FOR i = 1 TO n
  newdata1(i) = 0
  newdata1(i) = newdata1(i) + CCUR(r(dloc)) * COS(dloc * 2# * pi * (i / N))
  newdata1(i) = newdata1(i) + CCUR(c(dloc)) * SIN(dloc * 2# * pi * (i / N))
  newdata1(i) = newdata1(i) / SQR(N)
NEXT i
rem calculate residual between original and summation of all terms generated so far
FOR i = 1 TO n
  newdata(i) = newdata(i) - newdata1(i)
NEXT i
rem add current term to previous terms
FOR i = 1 TO n
  totaldat(i) = totaldat(i) + newdata1(i)
NEXT i
rem calculate correlation between original and current model (totaldat)
CALL correl(totaldat,ydata,N,ycorrel)
rem test to see if correlation criteria met
rem test to see if maxterms criteria met; if neither is met, add another term
IF ycorrel < cobj AND k1 < .125 * N THEN
  GOTO 10
END IF
rem output final model
CALL output_eigen(r1,c1,d,k1,N)
END

Fig. 2: Pseudo-code for Data Modeling eigenfunction model construction.
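The loop in Fig. 2 can also be sketched in Python. This is a simplified illustration under stated assumptions, not a port of the authors' routine: it uses a plain discrete Fourier transform written out in pure Python, and it picks the largest-magnitude bin of the residual spectrum rather than maxima above a linear fit to the dB power spectrum. The function names are hypothetical; the 0.99 correlation target and the limit of 0.125 N terms come from Fig. 2.

```python
import cmath
import math


def dft(signal):
    """Plain O(n^2) discrete Fourier transform of a real sequence."""
    n = len(signal)
    return [
        sum(signal[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def correlation(a, b):
    """Pearson correlation between two equal-length sequences."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((p - ma) * (q - mb) for p, q in zip(a, b))
    va = sum((p - ma) ** 2 for p in a)
    vb = sum((q - mb) ** 2 for q in b)
    return cov / math.sqrt(va * vb) if va and vb else 0.0


def eigenfunction_model(y, target=0.99, max_fraction=0.125):
    """Add one dominant frequency term at a time using the method of residuals."""
    n = len(y)
    residual = list(y)
    model = [0.0] * n
    terms = []  # each entry is (cosine coefficient, sine coefficient, frequency index)
    while len(terms) < max_fraction * n:
        spectrum = dft(residual)
        # Only the first half is searched (the spectrum is symmetric); bin 0 is skipped.
        k = max(range(1, n // 2), key=lambda i: abs(spectrum[i]))
        a = 2.0 * spectrum[k].real / n
        b = -2.0 * spectrum[k].imag / n
        term = [
            a * math.cos(2 * math.pi * k * t / n) + b * math.sin(2 * math.pi * k * t / n)
            for t in range(n)
        ]
        residual = [r - s for r, s in zip(residual, term)]
        model = [m + s for m, s in zip(model, term)]
        terms.append((a, b, k))
        if correlation(model, y) >= target:
            break
    return terms, model
```

For a 4096-point Hilbert sequenced image the pure Python transform is slow; a fast Fourier transform would normally be substituted, with the rest of the loop unchanged.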

Fig. 3: Hilbert sequence scanning maintains dyadic neighbor correlation.

Fig. 4: Flowchart for Data Modeling Change Detector. Flow: START -> Cluster Change Detector -> Cluster? If yes -> Resolver Classifier -> Cluster? (yes: Cluster; no: Galaxy Change Detector). If no -> Galaxy Change Detector -> Galaxy? (yes: Galaxy; no: Nebula).

Fig. 5: Flowchart for Data Modeling Classifier. Flow: START -> Cluster Classifier -> Cluster? (yes: Cluster; no: Galaxy Classifier) -> Galaxy? (yes: Galaxy; no: Nebula).

Fig. 6: Comparison of Turlington Data Model output and original for M13 (top), M20 (middle), and M101 (bottom). Panels for each object: Original (64 x 64), Data Model (64 x 64), Original, Data Model.

Fig. 7: Data Models for Messier objects M1-M48.

Figs. 8-9: Data Models for Messier objects M49-M110.

Fig. 10: Comparison of test objects: M51 (left), NGC869 (left center), LMC and SMC (right center), Comet Hale-Bopp (right top), and Comet Neat (right bottom).

Legend (Fig. 11): Galaxy, Cluster, Nebula

Fig. 11: Cluster plot showing distribution of Data Modeling features. Features plotted are A(0) on the x-axis and A(1) on the y-axis. Bottom plot shows Messier number designation, and top plot coded by object classification.

Fig. 12: Table of coefficients for the 5-term eigenfunction based Data Model for each Messier object.
