Exploratory graphics for functional data

Han Lin Shang∗ and Rob J Hyndman
Department of Econometrics and Business Statistics, Monash University, Clayton, Australia
August 3, 2010

Abstract

We survey some graphical tools for visualizing large sets of functional data represented by smooth curves. These graphical tools include the phase-plane plot, singular value decomposition plot, rainbow plot, functional variants of the bagplot and the highest density region boxplot. The latter two techniques utilize the first two robust principal component scores, Tukey’s halfspace location depth and highest density regions. The computer code and datasets are collected in the rainbow package for R, which is available at the Comprehensive R Archive Network (CRAN).

Keywords: Highest density regions, Kernel density estimation, Robust principal component analysis, Singular value decomposition, Tukey’s halfspace location depth.

1 Introduction

Functional data are becoming increasingly common in many scientific fields, ranging from astronomy to genomics. In essence, functional data sets are collections of functions, usually smooth curves, images or shapes (e.g., Locantore et al. 1999, Ramsay & Silverman 2005). There is a need to develop new statistical tools for exploring and analyzing such data. In this review paper, we focus on the problem of visualizing functional data comprising smooth curves. Examples of such data are age-specific mortality rates and fertility rates (Hyndman & Shang 2009), climatology data (Delaigle & Hall 2010), spectrometry data (Ferraty & Vieu 2002, Reiss & Ogden 2007) and term-structured yield curves (Kargin & Onatski 2008); other examples are described in Ramsay & Silverman (2002).

Visualization methods help in the discovery of characteristics that might not be apparent from statistical models and summary statistics, yet this area of research has received little attention in the functional data literature to date. A notable exception is the phase-plane plot of Ramsay & Ramsey (2002), which highlights important distributional features from the first and second derivatives of functional data. Another exception is the singular value decomposition plot of Zhang et al. (2007), which displays the changes in singular columns and singular rows as the sample size and dimensionality increase. Hyndman & Shang (2010) recently proposed the rainbow plot, functional bagplot and functional highest density region (HDR) boxplot to visualize functional data and simultaneously identify functional outliers.

In this paper, we review these five graphical tools and discuss their strengths and weaknesses. The five methods are introduced in the next five sections, and Section 7 concludes.

2 Rainbow plot

Figure 1 displays annual smoothed age-specific mortality curves for French males between 1899 and 2005. The data were taken from the Human Mortality Database (www.mortality.org) and were smoothed using penalized splines with a partial monotonic constraint (for details, see Hyndman & Ullah 2007). The mortality rates are defined as the ratios of death counts to population exposure in the relevant year for the given age (based on one-year age groups). In this example, $y_i(x)$ denotes the logarithm of the mortality rate in year $i$ for males of age $x$. Some years show large increases in the mortality rates between ages 20 and 40. These are due to increased deaths as a result of the First and Second World Wars and the Spanish flu pandemic in 1919.

∗ The first author thanks the participants of Interface 2010 for many stimulating conversations, especially Professors David van Dyk and Adalbert Wilhelm.

Figure 1: French male age-specific log mortality rates for ages between 0 and 100 from years 1816 to 2006. The oldest years are shown in red, with the most recent years in violet. Curves are ordered chronologically according to the colors of the rainbow. Click on the graph to begin the animation.

Figure 1 is an example of an animated rainbow plot, where the colors of the curves follow the order of a rainbow, with the oldest data in red and the most recent data in violet. The plot has been animated (click on the graph) to emphasise the time-varying features of these data. The animation can be controlled with the control buttons beneath the plot. The simple rainbow plot (the final plot in the above animation) was introduced by Hyndman & Shang (2010) and shows all the data using a rainbow color palette based on an ordering of the data. For data that are ordered by time, the rainbow plot can be particularly useful for highlighting time order, as seen in Figure 1.

Figure 2 shows another example using annual Australian fertility rates from 1921 to 2006 for ages between 15 and 49, obtained from the Australian Demographic Data Bank (Hyndman 2007). These are defined as the number of live births during each calendar year, according to the age of the mother, per 1000 female resident population of the same age at 30 June. Figure 2 reflects the changing social conditions affecting fertility rates. For instance, there was an increase in fertility rates in all age groups around the end of the First and Second World Wars, reaching a peak in 1961, followed by a rapid decrease during the 1970s due to the increasing use of contraceptive pills, and then an increase in fertility rates at higher ages in the most recent years caused by a tendency to postpone child-bearing while pursuing careers.

Figures 1 and 2 show data in chronological order, but for many data sets it is desirable to use alternative orderings of data based on the values of the data themselves. Some methods for doing this are discussed in Hyndman & Shang (2010).
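The color assignment behind a rainbow plot can be sketched in a few lines. The following is a minimal illustration in Python using only the standard library, not the implementation in the rainbow package: the function name `rainbow_colors` and the hue cut-off `max_hue = 0.8` (chosen so that the last curve is close to violet) are assumptions made for this example.

```python
import colorsys

def rainbow_colors(n, max_hue=0.8):
    """Return n RGB tuples running from red (first curve) to near-violet (last curve)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    colors = []
    for i in range(n):
        # Position of curve i in the chosen ordering, scaled to [0, 1].
        frac = i / (n - 1) if n > 1 else 0.0
        # Hue 0 is red; max_hue is an assumed cut-off close to violet.
        colors.append(colorsys.hsv_to_rgb(frac * max_hue, 1.0, 1.0))
    return colors

# Ordering index: here the calendar year, as for the fertility data.
years = list(range(1921, 2007))
palette = dict(zip(years, rainbow_colors(len(years))))
```

Any other ordering of the curves (for example, one based on the data values) could be substituted for the calendar year by changing the order of the keys.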

Figure 2: Australian female age-specific fertility rates for ages between 15 and 49 from years 1921 to 2006. The oldest years are shown in red, with the most recent years in violet. Curves are ordered chronologically according to the colors of the rainbow.

3 Functional bagplot

The functional bagplot was also introduced by Hyndman & Shang (2010) and is based on the bivariate bagplot (Rousseeuw et al. 1999) applied to the first two robust principal component scores obtained from $\{y_i(x)\}$. There are several robust principal component algorithms. We use the algorithm of Croux & Ruiz-Gazen (2005), which uses a form of projection pursuit. This approach can be applied even when the number of variables is significantly greater than the number of observations, which is typically the case for functional data. By applying the algorithm, we obtain a set of principal components $\{\phi_k(x)\}$ and a set of principal component scores $\{z_{i,k}\}$. Much of the information inherent in the original data $\{y_i(x)\}$ is captured in the first few principal components and their associated scores (Jones & Rice 1992, Sood et al. 2009). Therefore, we consider the first two score vectors $(z_{1,1}, \dots, z_{n,1})$ and $(z_{1,2}, \dots, z_{n,2})$, and let $z_i = (z_{i,1}, z_{i,2})$.

The bivariate principal component scores can be ordered using Tukey’s (1975) halfspace location depth, denoted by $d(\theta, Z)$ for some point $\theta \in \mathbb{R}^2$ relative to the bivariate data cloud $Z = \{z_i, i = 1, \dots, n\}$. The depth region $D_k$ is the set of all $\theta$ with $d(\theta, Z) \ge k$. Since the depth regions form a series of nested convex hulls, we have $D_{k_2} \subset D_{k_1}$ for $k_2 > k_1$. The Tukey median is defined as the value of $\theta$ which maximizes $d(\theta, Z)$ if a unique $\theta$ exists; otherwise it is defined as the center of gravity of the deepest region.

The bagplot displays the center (also known as the median), an inner region (the “bag”), and an outer region (the “fence”), for the principal component scores, beyond which outliers are shown as individual points. The “bag” is defined as the smallest depth region containing 50% of the total number of observations. The outer region of the bagplot is the convex hull obtained by inflating the “bag” by a factor of $\rho = 2.58$ (by default). The functional bagplot maps the features of the bagplot applied to the principal component scores onto the original functional data.
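To make the depth ordering concrete, the sketch below approximates the halfspace depth of a point relative to a small bivariate cloud. It is an illustration under stated assumptions rather than the algorithm used for the bagplot: the minimum is taken over a finite grid of directions (so the true depth may be slightly overestimated), the deepest point is searched for among the sample points only, and the function names and the toy scores are hypothetical.

```python
import math

def halfspace_depth(theta, points, n_directions=360):
    """Approximate Tukey halfspace depth of theta relative to 2-D points.

    The depth is the smallest number of points in any closed halfplane whose
    boundary passes through theta. Only a grid of directions is searched here.
    """
    best = len(points)
    for k in range(n_directions):
        angle = 2.0 * math.pi * k / n_directions
        ux, uy = math.cos(angle), math.sin(angle)
        # Count the points lying in the closed halfplane with inward normal (ux, uy).
        count = sum(
            1 for (x, y) in points
            if (x - theta[0]) * ux + (y - theta[1]) * uy >= 0.0
        )
        best = min(best, count)
    return best

def deepest_point(points, n_directions=360):
    """Return the sample point with the largest approximate depth."""
    return max(points, key=lambda p: halfspace_depth(p, points, n_directions))

# Toy scores: four corners, a central point and one remote point.
scores = [(-1.0, -1.0), (-1.0, 1.0), (1.0, -1.0), (1.0, 1.0), (0.0, 0.0), (6.0, 5.0)]
depths = {z: halfspace_depth(z, scores) for z in scores}
```

In this toy cloud the central point has the largest depth and the remote point the smallest, which matches the ordering from the center outwards described above.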

To demonstrate the functional bagplot, we consider a time series of average monthly sea surface temperatures from January 1951 to December 2007, available online at www.cpc.ncep.noaa.gov/data/indices. These temperatures are measured by moored buoys in the “Niño region” defined by the coordinates 0–10° South and 90–80° West. In the left panel of Figure 3, the dark gray region shows the 50% “bag” and the light gray region exhibits the 99% “fence”. Functional curves that are outside the fence region are considered outliers. The dotted blue lines give 95% pointwise confidence intervals for the median curve. In the right panel, the functional bagplot is shown.

Figure 3: Bivariate bagplot and functional bagplot for the sea surface temperatures. The dark and light grey regions show the bag and fence regions respectively. The red asterisk is the Tukey depth median. In the right panel, the black line is the median curve surrounded by 95% pointwise confidence intervals. The curves outside the outer region are shown as outliers of different colors.

4 Functional highest density region boxplot

A similar plot is obtained by computing a bivariate kernel density estimate (Scott 1992) on the first two robust principal component scores, applying the bivariate HDR boxplot of Hyndman (1996), and then mapping the features of the HDR boxplot back into the functional space. This gives the functional highest density region boxplot introduced by Hyndman & Shang (2010).

We define the bivariate kernel density estimate of the scores as

$$\hat{f}(z_1, z_2) = \frac{1}{n} \sum_{i=1}^{n} K_{h_1}(z_1 - z_{i,1}) \, K_{h_2}(z_2 - z_{i,2}),$$

where $(z_{i,1}, z_{i,2})$ are the first two principal component scores associated with the $i$th observation, $K_{h_j}(\cdot) = K(\cdot/h_j)/h_j$ is a kernel function, and $h_j$ is the bandwidth for the $j$th dimension ($j = 1, 2$). The bandwidths can be selected by the smoothed cross-validation method (Duong & Hazelton 2005). Using the kernel density estimate, an HDR is defined (Hyndman 1996) as

$$R_\alpha = \{z : \hat{f}(z) \ge f_\alpha\},$$

where $f_\alpha$ is such that $\int_{R_\alpha} \hat{f}(z) \, dz = 1 - \alpha$. The highest density regions can be treated as density contours, with an expanding coverage as $\alpha$ decreases.
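The two definitions above can be sketched directly. The code below is a minimal illustration, assuming a Gaussian kernel for $K$, fixed bandwidths supplied by the user (no smoothed cross-validation is performed), and an approximation of $f_\alpha$ by the empirical $\alpha$-quantile of the density values at the observed scores. The function names and the toy scores are hypothetical.

```python
import math

def gaussian_kernel(u):
    """Standard normal density, used here as the kernel K (an assumption)."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def kde2d(z, scores, h1, h2):
    """Product-kernel density estimate at z = (z1, z2)."""
    total = 0.0
    for zi1, zi2 in scores:
        total += (gaussian_kernel((z[0] - zi1) / h1) / h1
                  * gaussian_kernel((z[1] - zi2) / h2) / h2)
    return total / len(scores)

def hdr_outliers(scores, h1, h2, alpha):
    """Return (outliers, f_alpha), flagging scores with density below f_alpha.

    f_alpha is approximated by the empirical alpha-quantile of the density
    values at the observed scores, which is a simplification.
    """
    dens = [kde2d(z, scores, h1, h2) for z in scores]
    ranked = sorted(dens)
    k = min(int(math.floor(alpha * len(scores))), len(ranked) - 1)
    f_alpha = ranked[k]
    return [z for z, d in zip(scores, dens) if d < f_alpha], f_alpha

# Two tight clusters with a single isolated point between them.
left = [(-3.0, 0.0), (-3.2, 0.1), (-2.8, -0.1), (-3.1, -0.2), (-2.9, 0.2)]
right = [(-x, y) for x, y in left]
scores = left + right + [(0.0, 0.0)]
outliers, f_alpha = hdr_outliers(scores, h1=0.5, h2=0.5, alpha=0.1)
```

In this toy example the isolated point lies in the middle of the scatterplot, yet it has the lowest estimated density and is the only point flagged.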


The functional HDR boxplot displays the mode curve (the curve with the highest density), and the inner and outer regions. The inner region contains 50% of the total number of curves, whereas the coverage probability of the outer region must be specified by the user. Points outside the outer region are outliers. It is noteworthy that these outliers are not necessarily at the edge of the scatterplot of $\{(z_{i,1}, z_{i,2})\}$. It is possible to have an outlier which is in the interior of this scatterplot, but which has no other points nearby, and will hence have a low density value.

We demonstrate the use of the functional HDR boxplot with the sea surface temperature data described in the previous section. In Figure 4, the dark and light gray regions show the 50% HDR and the outer region, respectively. The curves that are outside the outer region are treated as outliers.

Figure 4: Bivariate HDR boxplot and functional HDR boxplot for the sea surface temperatures. The dark and light grey regions show 50% and 92% HDR respectively. The black line is the modal curve. The curves outside the outer region are outliers.

5 Singular value decomposition plot

Zhang et al. (2007) proposed a plot for visualizing patterns of functional data and high-dimensional multivariate data. They utilize the singular value decomposition (SVD) to find low-dimensional projections that expose interesting features of the functional data. They first discretize the functional data on a dense grid, denoted by $f(x_i) = [f_1(x_i), \dots, f_n(x_i)]$, for $i = 1, \dots, p$, where $p$ is the number of dimensions (grid points) and $n$ is the number of curves. Let $\{r_i; i = 1, \dots, p\}$ and $\{c_j; j = 1, \dots, n\}$ be the row and column vectors of the $(p \times n)$ matrix $[f(x_i)]$, respectively. The SVD of $[f(x_i)]$ is defined as

$$[f(x_i)] = s_1 u_1 v_1' + s_2 u_2 v_2' + \cdots + s_K u_K v_K',$$

where the singular columns $u_1, \dots, u_K$ form a set of orthonormal basis functions for the column space spanned by $\{c_j\}$, and the singular rows $v_1, \dots, v_K$ form a set of orthonormal basis functions for the row space spanned by $\{r_i\}$. The scalars $s_1, \dots, s_K$ are the singular values and the matrices $\{s_k u_k v_k'\}$, $k = 1, \dots, K$, are the SVD components.

The SVD plot of Zhang et al. (2007) simultaneously presents the column and row information of the matrix $[f(x_i)]$, showing how the singular columns and singular rows change, and highlighting any interactions between the columns and rows of the matrix. Figure 5 shows image plots of the SVD approximation for the sea surface temperature data set. The plot labelled SVD1 shows $s_1 u_1 v_1'$, SVD2 shows $s_2 u_2 v_2'$, SVD3 shows $s_3 u_3 v_3'$, “Reconstruction” shows the sum of the first three SVD components and “Residual” shows the difference between the data and the reconstruction. The first SVD component captures the seasonal pattern, while the second and third SVD components show the contrasts of sea surface temperatures among different months. It is clear from the Reconstruction and Residual plots that the first three SVD components are sufficient to capture the features of the original data.
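The decomposition into SVD components and a residual can be illustrated with a small numerical sketch. The code below extracts the leading component $s_1 u_1 v_1'$ by power iteration and subtracts it from the data matrix; applying the same step to the residual would give the next component. Power iteration is a stand-in chosen to keep the example dependency-free, not the routine used by Zhang et al. (2007), and the function names and toy matrix are hypothetical. The sketch assumes a non-zero matrix and a starting vector that is not orthogonal to $v_1$.

```python
import math

def matvec(a, x):
    """Multiply matrix a (list of rows) by vector x."""
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def norm(x):
    return math.sqrt(sum(v * v for v in x))

def leading_svd_component(a, n_iter=200):
    """Return (s, u, v) for the largest singular value of a, by power iteration."""
    at = transpose(a)
    v = [1.0] * len(a[0])  # starting vector, assumed not orthogonal to v1
    for _ in range(n_iter):
        w = matvec(at, matvec(a, v))  # one power-iteration step on A'A
        nw = norm(w)
        v = [x / nw for x in w]
    av = matvec(a, v)
    s = norm(av)
    u = [x / s for x in av]
    return s, u, v

def subtract_component(a, s, u, v):
    """Return a - s * u v', the residual after removing one SVD component."""
    return [[a[i][j] - s * u[i] * v[j] for j in range(len(v))]
            for i in range(len(u))]

# A (p x n) = (3 x 2) toy matrix of rank one: rows are grid points, columns are curves.
data = [[3.0, 4.0], [6.0, 8.0], [9.0, 12.0]]
s1, u1, v1 = leading_svd_component(data)
residual = subtract_component(data, s1, u1, v1)
```

Because the toy matrix has rank one, a single component reconstructs it and the residual is numerically zero; for real data the residual shows what the retained components fail to capture.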

Figure 5: The image plots of the SVD for the sea surface temperature data set. The first three components provide similar underlying features.

Figure 6: Dynamic movies of the SVD plot.

Figure 6 shows an alternative way of plotting by utilizing the rainbow-like plots with animations, enabling the separate curves from each year to be visualized.

6 Phase-plane plot

The phase-plane plot of Ramsay & Ramsey (2002) is a powerful tool for exploring functional data with harmonic variation. It is a graphical technique that plots the first derivative of a function (the velocity curve) against its second derivative (the acceleration curve). It can be very useful in detecting non-linearity and highlighting a harmonic cycle. When the phase-plane plot moves from high velocity to high acceleration, the rate of change goes to zero. When the phase-plane plot is close to the origin, both the velocity and the acceleration are low. The larger the radius of the cycle, the more significant the rate of change. These characteristics allow us not only to display the derivative information of a curve, but also to compare the shapes of the cycles among curves.

Figure 7 presents the phase-plane plots of the sea surface temperatures from years 1951 to 2007. As an example, the sea surface temperature in year 2007 starts from January (labeled as 1) with a strong velocity and deceleration. The velocity decreases in February with an increased acceleration. The acceleration reaches zero in April, when the velocity is at its maximum. From May to September, the acceleration generally increases while the velocity decreases. From October to December, the velocity increases. The acceleration increases between October and November, and is approximately constant between November and December. The radius of the phase-plane plot of the sea surface temperatures in 2007 is comparatively larger than that of the sea surface temperatures in 1951. This indicates that the rate of change of the sea surface temperatures in 2007 is more significant than in 1951.
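The relationship between the radius of the cycle and the rate of change can be checked on a purely harmonic curve. The sketch below computes velocity and acceleration by central finite differences, which is an assumption made to keep the example self-contained; the function names, the step size `h` and the sinusoidal test curves are hypothetical and are not taken from the sea surface temperature analysis.

```python
import math

def phase_plane(f, t_values, h=1e-3):
    """Return (velocity, acceleration) pairs of f at each t by central differences."""
    pairs = []
    for t in t_values:
        velocity = (f(t + h) - f(t - h)) / (2.0 * h)
        acceleration = (f(t + h) - 2.0 * f(t) + f(t - h)) / (h * h)
        pairs.append((velocity, acceleration))
    return pairs

def harmonic(amplitude):
    """A pure harmonic curve with unit angular frequency."""
    return lambda t: amplitude * math.sin(t)

# Twelve evaluation points over one cycle, by analogy with twelve months.
t_grid = [2.0 * math.pi * k / 12 for k in range(12)]
small = phase_plane(harmonic(1.0), t_grid)
large = phase_plane(harmonic(2.0), t_grid)
radius_small = [math.hypot(v, a) for v, a in small]
radius_large = [math.hypot(v, a) for v, a in large]
```

For these curves the (velocity, acceleration) pairs trace a circle whose radius equals the amplitude, so the curve with the larger swing produces the larger cycle.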

Figure 7: The velocity curve is plotted against the acceleration curve for the sea surface temperatures from the years 1951 to 2007.

7 Conclusion

We have reviewed some graphical methods for visualizing and exploring functional time series. Each of these methods has its own advantages for revealing the features of functional time series. Some of the methods enable us to identify abnormal observations (namely the SVD plot, functional bagplot and functional HDR boxplot), while others can be very useful in highlighting a harmonic cycle and comparing different features among curves. Overall, these graphical methods present a summary of functional time series, and should be considered as a preliminary step for any functional data analysis.

Supplemental materials

R package for rainbow: The R package “rainbow” contains functions for constructing rainbow plots, functional bagplots, functional HDR boxplots, and the singular value decomposition plot as described in this article. The package also contains all datasets used as examples in this article. The R package can be obtained from CRAN (cran.r-project.org/web/packages/rainbow/). The phase-plane plot is constructed using the “fda” package (cran.r-project.org/web/packages/fda/).

References

Croux, C. & Ruiz-Gazen, A. (2005), ‘High breakdown estimators for principal components: the projection-pursuit approach revisited’, Journal of Multivariate Analysis 95, 206–226.
Delaigle, A. & Hall, P. (2010), ‘Defining probability density for a distribution of random functions’, The Annals of Statistics 38(2), 1171–1193.
Duong, T. & Hazelton, M. L. (2005), ‘Cross-validation bandwidth matrices for multivariate kernel density estimation’, Scandinavian Journal of Statistics 32(3), 485–506.
Ferraty, F. & Vieu, P. (2002), ‘The functional nonparametric model and application to spectrometric data’, Computational Statistics 17(4), 545–564.
Hyndman, R. J. (1996), ‘Computing and graphing highest density regions’, The American Statistician 50(2), 120–126.
Hyndman, R. J. (2007), addb: Australian demographic data bank. R package version 3.222. URL: http://robjhyndman.com/software/addb
Hyndman, R. J. & Shang, H. L. (2009), ‘Forecasting functional time series (with discussion)’, Journal of the Korean Statistical Society 38(3), 199–221.
Hyndman, R. J. & Shang, H. L. (2010), ‘Rainbow plots, bagplots, and boxplots for functional data’, Journal of Computational and Graphical Statistics 19(1), 29–45.
Hyndman, R. J. & Ullah, M. S. (2007), ‘Robust forecasting of mortality and fertility rates: a functional data approach’, Computational Statistics & Data Analysis 51(10), 4942–4956.
Jones, M. C. & Rice, J. A. (1992), ‘Displaying the important features of large collections of similar curves’, The American Statistician 46(2), 140–145.
Kargin, V. & Onatski, A. (2008), ‘Curve forecasting by functional autoregression’, Journal of Multivariate Analysis 99(10), 2508–2526.
Locantore, N., Marron, J. S., Simpson, D. G., Tripoli, N., Zhang, J. T. & Cohen, K. L. (1999), ‘Robust principal component analysis for functional data’, Test 8(1), 1–73.
Ramsay, J. O. & Ramsey, J. B. (2002), ‘Functional data analysis of the dynamics of the monthly index of nondurable goods production’, Journal of Econometrics 107(1-2), 327–344.
Ramsay, J. O. & Silverman, B. W. (2002), Applied functional data analysis: methods and case studies, Springer, New York.
Ramsay, J. O. & Silverman, B. W. (2005), Functional data analysis, 2nd edn, Springer, New York.
Reiss, P. T. & Ogden, R. T. (2007), ‘Functional principal component regression and functional partial least squares’, Journal of the American Statistical Association 102(479), 984–996.
Rousseeuw, P. J., Ruts, I. & Tukey, J. W. (1999), ‘The bagplot: a bivariate boxplot’, The American Statistician 53(4), 382–387.
Scott, D. W. (1992), Multivariate density estimation: theory, practice, and visualization, Wiley, New York.
Sood, A., James, G. M. & Tellis, G. J. (2009), ‘Functional regression: a new model for predicting market penetration of new products’, Marketing Science 28(1), 36–51.
Tukey, J. W. (1975), Mathematics and the picturing of data, in R. D. James, ed., ‘Proceedings of the International Congress of Mathematicians’, Vol. 2, Canadian mathematical congress, Aug. 21-29, 1974, Vancouver, pp. 523–531.
Zhang, L., Marron, J. S., Shen, H. & Zhu, Z. (2007), ‘Singular value decomposition and its visualization’, Journal of Computational and Graphical Statistics 16(4), 833–854.