Visualization and Functional Analysis of Time Varying Histogram Data
Carlo Drago, Germana Scepi, Carlo Lauro
Dipartimento di Matematica e Statistica, Università “Federico II” di Napoli
Workshop in Symbolic Data Analysis, October 18th–20th 2009, Wienerwaldhof
Our aim: searching for non-parametric or parametric density functions for visualizing and analysing histogram time series of complex, high-frequency data.
Two tools: the beanplot (visualization) and model data (parametric modelling of histogram data).
Histogram Data and High Frequency Time Series Data: Characteristics
• the number of observations in high-frequency data sets can be overwhelming
• data are often recorded with errors and need to be cleaned and corrected prior to direct analysis
• transaction-by-transaction data on trades and quotes are, by nature, irregularly spaced time series with random daily numbers of observations
• high-frequency data typically exhibit periodic (intra-day and intra-week) patterns in market activity
Hence the need for aggregation arises.
Histogram Data: for a generic variable, the i-th histogram datum is a model representing an empirical distribution as a set of H ordered pairs

Y(i) = {(I_hi, π_hi)}, h = 1, …, H

where I_hi = [z̲_hi, z̄_hi] with z̲_hi ≤ z̄_hi, the intervals are ordered and non-overlapping, π_hi ≥ 0 for h = 1, …, H, and Σ_{h=1}^{H} π_hi = 1; z̲_hi (min) and z̄_hi (max) represent the lower and upper bounds of the h-th interval.
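The definition above can be made concrete as a small data structure; this is a minimal sketch (the class name and field names are our own, not from the paper), checking the conditions on the intervals and weights.

```python
from dataclasses import dataclass

@dataclass
class HistogramDatum:
    """A histogram-valued observation: H ordered intervals with weights."""
    bounds: list   # [(z_lower, z_upper), ...] ordered, non-overlapping
    weights: list  # [pi_1, ..., pi_H], each pi_h >= 0, summing to 1

    def validate(self, tol=1e-9):
        assert len(self.bounds) == len(self.weights)
        for lo, hi in self.bounds:
            assert lo <= hi                      # z_lower <= z_upper
        for (_, hi), (lo2, _) in zip(self.bounds, self.bounds[1:]):
            assert hi <= lo2                     # ordered, non-overlapping
        assert all(w >= 0 for w in self.weights)
        assert abs(sum(self.weights) - 1.0) < tol  # weights sum to 1
        return True
```

For example, `HistogramDatum([(0, 1), (1, 2)], [0.4, 0.6]).validate()` passes, while overlapping intervals or weights that do not sum to one raise an `AssertionError`.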
Histogram data for visualizing time series. In the context of high-frequency data, Arroyo and Maté (2009) and González (2009b) suggest substituting time series of point observations with histogram time series. Advantages: a simple and flexible structure, and the capacity to describe the essential features of the data with reasonable accuracy, without imposing any distribution.
But multiple histograms are difficult to handle and to compare when many of them are plotted in one graph.
Visualizing and analyzing high-frequency data: two different approaches. 1. Histogram visualization by a non-parametric density function, to detect the data structure. 2. Modelling histogram data by a parametric density function, in order to synthesize and classify high-frequency data described by their model parameters.
Different tools for visualizing Time Series data Some examples of different plots
Boxplots
Beanplots
Beanplots: new symbolic data!
Beanplot construction. Beanplot: Kampstra (2009); density trace: Hintze, Nelson (1998).
• The beanplot combines a 1-d scatterplot and a (mirrored) density trace. The density trace is

f(x) = (1/(n·h)) Σ_{i=1}^{n} δ_i(x),  with δ_i(x) = 1 if the i-th data value lies in the interval [x − h/2, x + h/2], and 0 otherwise,

where n is the sample size and h the interval width. Bandwidth selection: the Sheather–Jones method is used to select a bandwidth per batch, which seems to be preferred and close to optimal (Kampstra 2009).
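The density trace formula above translates directly into code. A minimal sketch (Kampstra's beanplot uses a Gaussian kernel with a Sheather–Jones bandwidth; here we implement only the uniform-kernel density trace as defined, with the bandwidth h supplied by the caller):

```python
def density_trace(data, x_grid, h):
    """Uniform-kernel density trace (Hintze & Nelson 1998):
    f(x) = (1/(n*h)) * #{data values in [x - h/2, x + h/2]}."""
    n = len(data)
    return [sum(1 for v in data if x - h / 2 <= v <= x + h / 2) / (n * h)
            for x in x_grid]
```

For instance, with `data = [0.0, 0.0, 1.0]` and `h = 1.0`, the trace at `x = 0.0` counts the two values in [−0.5, 0.5] and returns 2/3. Mirroring this trace around the time axis gives one "bean" of the beanplot.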
Some advantages of the beanplot representations
Visualize general features of the series: trend, seasonality, …; visualize structural changes.
Visualize outliers; visualize the intra-day variability.
The beanplot allows one to detect change-points in the time series, visualized as intra-object mixture distributions.
Example I
Example II based on different temporal intervals
the choice is directly linked with the information we are interested in
Considering more than one Time Series
[Figure: beanplot representation of three series (Time Series 1, Time Series 2, Time Series 3); x-axis: Time (0–100), y-axis: x (−0.006 to 0.006).]
The units are already represented here.
Double beanplots
When there is more than one variable, we can combine the beanplots of the first principal components of the histogram data.
Cointegration visualization
Cointegration visualization: adjustment
Boxplot visualization
Generali and Alleanza (source: Yahoo Finance). The data are collected daily from 19/12/2006 to 18/12/2009. Beanplot visualization
The symbolic data representation allows one to distinguish short-run shocks (outliers) from structural breaks, helping the interpretation of the adjustment process.
From time varying histograms to “model data”: several statistical techniques for exploring a data structure are naturally interpretable in the context of the following operational model (Caussinus, 1986):
Data=Structure+Noise in which a random part is combined with a structural one.
Symbolic data / mathematical model:
HISTOGRAM = MODEL + ERROR
WHAT FUNCTION TO USE FOR THE HISTOGRAM APPROXIMATION?
BASIS FUNCTIONS
• Polynomial basis
• Piecewise polynomial basis
• Spline basis
• B-spline basis
From the time varying histograms to “model data”: we transform the histogram data by means of an approximation function, in order to control the error deriving from the empirical data and to characterize the data structure, and then apply multivariate techniques to a model representation of the histograms. The type of parametric function used to approximate the histograms determines the parameters of the model. Among approximation functions, we choose B-splines (De Boor, 1978): spline functions with minimal support for a given degree, smoothness, and domain partition, chosen for the simplicity of their construction, their ease and accuracy of evaluation, and their capacity to approximate complex shapes with rather smooth curves. Since B-splines adjust themselves to represent such curves, we take the so-called control points as the parameters carrying information on the histogram form.
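A minimal numpy sketch of this step, assuming the histogram is fitted at its bin midpoints (the exact fitting scheme, knot placement, and number of control points used by the authors are not specified, so the choices below are illustrative):

```python
import numpy as np

def bspline_basis(x, t, k):
    """Cox-de Boor recursion: B-spline basis of degree k on knot vector t,
    evaluated at points x assumed to lie in [t[k], t[-k-1])."""
    x, t = np.asarray(x, float), np.asarray(t, float)
    # degree 0: indicator of each half-open knot span
    B = np.array([(t[i] <= x) & (x < t[i + 1])
                  for i in range(len(t) - 1)], float).T
    for d in range(1, k + 1):
        nb = len(t) - d - 1
        Bn = np.zeros((len(x), nb))
        for i in range(nb):
            left = (x - t[i]) / (t[i + d] - t[i]) if t[i + d] > t[i] else 0.0
            right = ((t[i + d + 1] - x) / (t[i + d + 1] - t[i + 1])
                     if t[i + d + 1] > t[i + 1] else 0.0)
            Bn[:, i] = left * B[:, i] + right * B[:, i + 1]
        B = Bn
    return B  # shape (len(x), len(t) - k - 1)

def fit_histogram_bspline(mids, freqs, n_ctrl=6, k=3):
    """Least-squares cubic B-spline approximation of histogram frequencies at
    the bin midpoints; returns the control points (the model parameters) and
    the fitted values."""
    a, b = mids[0], mids[-1] + 1e-9          # pad so every midpoint is in a span
    interior = np.linspace(a, b, n_ctrl - k + 1)[1:-1]
    t = np.r_[[a] * (k + 1), interior, [b] * (k + 1)]  # clamped knot vector
    B = bspline_basis(mids, t, k)
    coef, *_ = np.linalg.lstsq(B, freqs, rcond=None)
    return coef, B @ coef
```

The returned `coef` vector plays the role of the control points discussed above: a fixed-length parametric summary of the histogram shape that can then feed the multivariate analyses.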
Alternative representations from the initial histogram data (at times t1, t2, …):
• from the histograms, a non-parametric approach uses the density traces, giving the beanplots as visualization tool;
• model data, obtained by functional approximation.
Analysis of the Model Data Matrix. After the histograms have been suitably processed, the new data matrix consists of as many parameter blocks as there are histogram variables. The methodologies for analysing a partitioned matrix are Multiple Factor Analysis, to extract the main features of the approximated histogram variables (symbolic models) and their weighting system, and, in addition, a cluster analysis of the symbolic models to define temporal or asset profiles. To classify symbolic models we use the inter-model distance proposed by Romano, Giordano and Lauro (2006), based on a linear combination of two distances embedding information both on the estimated model parameters, referring to the location, shape and size of the histograms, and on the model's fit index.
Model Data Matrix: the b's are the parameter estimates of the models; the I's are the models' goodness-of-fit indices. The matrix is partitioned into blocks Model 1, Model 2, …, Model p. A model may refer to a histogram time series, to variables, or to both (three-way data: observations × variables × times).
Model Data Analysis (Functional Data)
SYNTHESIS
Multiple Factor Analysis (Escofier, Pagès, 1982)
Multiple Factor Analysis allows one to extract the principal factors of a partitioned data matrix (of model parameters).
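The core of MFA can be sketched in a few lines of numpy: each parameter block is down-weighted by its first singular value, so that no block dominates, and a global PCA is then run on the concatenated blocks. This is a simplified sketch of the standard MFA scheme, not the authors' implementation (which is not given):

```python
import numpy as np

def mfa(blocks, n_factors=2):
    """Sketch of Multiple Factor Analysis on a partitioned data matrix.
    blocks: list of (n_models x n_params) arrays, one per histogram variable."""
    weighted = []
    for X in blocks:
        Xc = X - X.mean(axis=0)                       # center each block
        s1 = np.linalg.svd(Xc, compute_uv=False)[0]   # first singular value
        weighted.append(Xc / s1)                      # balance block influence
    Z = np.hstack(weighted)                           # concatenated matrix
    U, S, Vt = np.linalg.svd(Z, full_matrices=False)  # global PCA via SVD
    scores = U[:, :n_factors] * S[:n_factors]         # factor scores of models
    return scores, S
```

The rows of `scores` are the coordinates of the symbolic models on the principal factors, which can then be inspected or passed to the classification step.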
CLASSIFICATION
Definition of a distance between model data. Hierarchical or partitioning methods can be used to classify models, but suitable distances that take the model fit into account have to be considered (Lauro et al. 2007).
Distance between model data. The idea (Lauro, Romano, Giordano 2005) is to combine the structural and the random information through a convex combination of two quantities: 1) the Euclidean distance between coefficients and 2) a relative index of goodness of fit.
The inter-model distance for model data is the sum of three components: a convex linear combination of two quantities (the control points and the error term); a distance between the location terms; and a distance between the size terms.
where the coefficients are individually optimized to define the best distance which discriminates among the individual models, for each variable.
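The convex-combination part of the distance can be sketched as follows. This shows only the first component (the full inter-model distance of Romano, Giordano and Lauro also adds the location and size terms); `alpha` is the convex weight, a user choice that the paper optimizes per variable:

```python
import numpy as np

def inter_model_distance(b1, b2, fit1, fit2, alpha=0.5):
    """Convex combination of (1) the Euclidean distance between the estimated
    model coefficients (structural part) and (2) the difference between the
    goodness-of-fit indices (random part), with weight alpha in [0, 1]."""
    structural = np.linalg.norm(np.asarray(b1, float) - np.asarray(b2, float))
    random_part = abs(fit1 - fit2)
    return alpha * structural + (1 - alpha) * random_part
```

With `alpha = 1.0` the distance reduces to the Euclidean distance between coefficient vectors; with `alpha = 0.0` only the fit indices matter.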
In choosing the value of this parameter we maximize the function Coph, the cophenetic correlation coefficient, which measures how faithfully the classification preserves the original distances, i.e. how well the data fit the structure suggested by the classification.
Coph = Σ_{i&lt;j} (D_ij − d̄)(D′_ij − d̄′) / √( Σ_{i&lt;j} (D_ij − d̄)² · Σ_{i&lt;j} (D′_ij − d̄′)² )

The vector D contains the distances D_ij between each pair of rows in the matrix; the vector D′, of generic element D′_ij, contains the corresponding linkage distances between the objects paired in the clusters; d̄ and d̄′ are the averages of D and D′ respectively.
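The formula above is a Pearson correlation between the two condensed distance vectors, and can be computed directly:

```python
import numpy as np

def cophenetic_coefficient(D, Dp):
    """Cophenetic correlation between the original pairwise distances D_ij and
    the cophenetic (linkage) distances D'_ij, both as condensed vectors
    over the pairs i < j."""
    D, Dp = np.asarray(D, float), np.asarray(Dp, float)
    d, dp = D.mean(), Dp.mean()                     # the averages d and d'
    num = np.sum((D - d) * (Dp - dp))
    den = np.sqrt(np.sum((D - d) ** 2) * np.sum((Dp - dp) ** 2))
    return num / den
```

Values near 1 indicate that the dendrogram reproduces the original inter-model distances faithfully, which is the criterion used here to tune the convex weight of the distance.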
Prototypes of a model based classification of financial data
Final remarks and open questions. Substituting histogram data by beanplots or model data seems a very promising approach for analyzing high-frequency data.
What next? Look for suitable intervals for the histogram representation and for the beanplots. Evaluate other parametric approximation functions for histograms whose parameters have a statistical interpretation (Dirichlet, Generalized Lambda). Work on three-way symbolic data (observations × variables × times).
Some references
• Arroyo J., Maté C. (2009) “Forecasting Histogram Time Series with K-Nearest Neighbours Methods”, International Journal of Forecasting, 25, pp. 192–207.
• Drago C., Scepi G. (2009) “Univariate and Multivariate Tools for Visualizing Financial Time Series”, Proceedings of Cladag 2009, Cleup.
• De Boor C. (1978) A Practical Guide to Splines, Springer-Verlag, New York.
• Engle R.F., Russell J.R. (2004) “Analysis of High Frequency Financial Data”, working paper.
• Romano E., Giordano G., Lauro N.C. (2006) “An inter-models distance for clustering utility functions”, Statistica Applicata – Italian Journal of Applied Statistics, 17(2).
• Signoriello S. (2008) Contributions to Symbolic Data Analysis: A Model Data Approach, Ph.D. thesis, Dept. of Mathematics and Statistics, University of Naples Federico II.