Visualization and Functional Analysis of Time Varying Histogram Data
Carlo Drago, Germana Scepi, Carlo Lauro
Dipartimento di Matematica e Statistica, Università “Federico II” di Napoli
Workshop in Symbolic Data Analysis, October 18th–20th 2009, Wienerwaldhof
Our aim: searching for non-parametric or parametric density functions for visualizing and analysing histogram time series of complex, high-frequency data.
Two tools: the beanplot (visualization) and model data (parametric modelling of histogram data).
Histogram Data and High Frequency Time Series Data: Characteristics
• the number of observations in high-frequency data sets can be overwhelming
• data are often recorded with errors and need to be cleaned and corrected prior to direct analysis
• transaction-by-transaction data on trades and quotes are, by nature, irregularly spaced time series with random daily numbers of observations
• high-frequency data typically exhibit periodic (intra-day and intra-week) patterns in market activity
Hence the need for aggregation arises.
Histogram Data: for a generic variable, the i-th histogram datum is a model representing an empirical distribution as a set of H ordered pairs

Y(i) = {(I_hi, π_hi)}, h = 1, …, H

where I_hi = [z̲_hi, z̄_hi] with z̲_hi ≤ z̄_hi, the intervals are ordered and non-overlapping, π_hi ≥ 0 for h = 1, …, H, and Σ_{h=1}^{H} π_hi = 1; z̲_hi (min) and z̄_hi (max) represent the lower and upper bounds of the h-th interval.
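The definition above can be made concrete as a small data structure; this is a minimal sketch (the class name and field names are our own, not from the paper), checking the conditions on the intervals and weights.

```python
from dataclasses import dataclass

@dataclass
class HistogramDatum:
    """A histogram-valued observation: H ordered intervals with weights."""
    bounds: list   # [(z_lower, z_upper), ...] ordered, non-overlapping
    weights: list  # [pi_1, ..., pi_H], each pi_h >= 0, summing to 1

    def validate(self, tol=1e-9):
        assert len(self.bounds) == len(self.weights)
        for lo, hi in self.bounds:
            assert lo <= hi                      # z_lower <= z_upper
        for (_, hi), (lo2, _) in zip(self.bounds, self.bounds[1:]):
            assert hi <= lo2                     # ordered, non-overlapping
        assert all(w >= 0 for w in self.weights)
        assert abs(sum(self.weights) - 1.0) < tol  # weights sum to 1
        return True
```

For example, `HistogramDatum([(0, 1), (1, 2)], [0.4, 0.6]).validate()` passes, while overlapping intervals or weights that do not sum to one raise an `AssertionError`.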
Histogram data for visualizing time series. In the context of high-frequency data, Arroyo and Maté (2009) and González (2009b) suggest substituting time series of point observations with histogram time series. Advantages: a simple and flexible structure, and the capacity to describe the essential features of the data with reasonable accuracy, without imposing any distribution.
But multiple histograms are difficult to handle and to compare when many of them are plotted in one graph.
Visualizing and analyzing high-frequency data: two different approaches. 1. Histogram visualization by a non-parametric density function, to detect the data structure. 2. Modelling histogram data by a parametric density function, in order to synthesize and classify high-frequency data described by their model parameters.
Different tools for visualizing Time Series data Some examples of different plots
Boxplots
Beanplots
Beanplots: new symbolic data!
Beanplot construction. Beanplot: Kampstra (2009); density trace: Hintze, Nelson (1998).
• The beanplot combines a 1-d scatterplot and a (mirrored) density trace. The density trace is

f(x) = (1/(n·h)) Σ_{i=1}^{n} δ_i(x),  with δ_i(x) = 1 if the i-th data value lies in the interval [x − h/2, x + h/2], and 0 otherwise,

where n is the sample size and h the interval width. Bandwidth selection: the Sheather–Jones method is used to select a bandwidth per batch, which seems to be preferred and close to optimal (Kampstra 2009).
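The density trace formula above translates directly into code. A minimal sketch (Kampstra's beanplot uses a Gaussian kernel with a Sheather–Jones bandwidth; here we implement only the uniform-kernel density trace as defined, with the bandwidth h supplied by the caller):

```python
def density_trace(data, x_grid, h):
    """Uniform-kernel density trace (Hintze & Nelson 1998):
    f(x) = (1/(n*h)) * #{data values in [x - h/2, x + h/2]}."""
    n = len(data)
    return [sum(1 for v in data if x - h / 2 <= v <= x + h / 2) / (n * h)
            for x in x_grid]
```

For instance, with `data = [0.0, 0.0, 1.0]` and `h = 1.0`, the trace at `x = 0.0` counts the two values in [−0.5, 0.5] and returns 2/3. Mirroring this trace around the time axis gives one "bean" of the beanplot.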
Some advantages of the beanplot representations
Visualize general features of the series: trend, seasonality, …; visualize structural changes.
Visualize outliers; visualize the intra-day variability.
The beanplot allows one to detect change-points in the time series, visualized as intra-object mixture distributions.
Example I
Example II based on different temporal intervals
the choice is directly linked with the information we are interested in
Considering more than one Time Series
[Figure: beanplot representation of three series (Time Series 1, Time Series 2, Time Series 3); x-axis: Time (0–100), y-axis: x (−0.006 to 0.006).]
The units are already represented here.
Double beanplots
When there is more than one variable, we can combine the beanplots of the first principal components of the histogram data.
Cointegration visualization
Cointegration visualization: adjustment
Boxplot visualization
Generali and Alleanza (source: Yahoo Finance). The data are collected daily from 19/12/2006 to 18/12/2009. Beanplot visualization
The symbolic data representation allows one to distinguish short-run shocks (outliers) from structural breaks, helping the interpretation of the adjustment process.
From time varying histograms to “model data”: several statistical techniques for exploring a data structure are naturally interpretable in the context of the following operational model (Caussinus, 1986):
Data=Structure+Noise in which a random part is combined with a structural one.
Symbolic data / mathematical model:
HISTOGRAM = MODEL + ERROR
WHAT FUNCTION TO USE FOR THE HISTOGRAM APPROXIMATION?
BASIS FUNCTIONS
• Polynomial basis
• Piecewise polynomial basis
• Spline basis
• B-spline basis
From the time varying histograms to “model data”: we transform the histogram data by means of an approximation function, in order to control the error deriving from the empirical data and to characterize the data structure, and then apply multivariate techniques to a model representation of the histograms. The type of parametric function used to approximate the histograms determines the parameters of the model. Among approximation functions, we choose B-splines (De Boor, 1978): spline functions with minimal support for a given degree, smoothness, and domain partition, chosen for the simplicity of their construction, their ease and accuracy of evaluation, and their capacity to approximate complex shapes with rather smooth curves. Since B-splines adjust themselves to represent such curves, we take the so-called control points as the parameters carrying information on the histogram form.
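A minimal numpy sketch of this step, assuming the histogram is fitted at its bin midpoints (the exact fitting scheme, knot placement, and number of control points used by the authors are not specified, so the choices below are illustrative):

```python
import numpy as np

def bspline_basis(x, t, k):
    """Cox-de Boor recursion: B-spline basis of degree k on knot vector t,
    evaluated at points x assumed to lie in [t[k], t[-k-1])."""
    x, t = np.asarray(x, float), np.asarray(t, float)
    # degree 0: indicator of each half-open knot span
    B = np.array([(t[i] <= x) & (x < t[i + 1])
                  for i in range(len(t) - 1)], float).T
    for d in range(1, k + 1):
        nb = len(t) - d - 1
        Bn = np.zeros((len(x), nb))
        for i in range(nb):
            left = (x - t[i]) / (t[i + d] - t[i]) if t[i + d] > t[i] else 0.0
            right = ((t[i + d + 1] - x) / (t[i + d + 1] - t[i + 1])
                     if t[i + d + 1] > t[i + 1] else 0.0)
            Bn[:, i] = left * B[:, i] + right * B[:, i + 1]
        B = Bn
    return B  # shape (len(x), len(t) - k - 1)

def fit_histogram_bspline(mids, freqs, n_ctrl=6, k=3):
    """Least-squares cubic B-spline approximation of histogram frequencies at
    the bin midpoints; returns the control points (the model parameters) and
    the fitted values."""
    a, b = mids[0], mids[-1] + 1e-9          # pad so every midpoint is in a span
    interior = np.linspace(a, b, n_ctrl - k + 1)[1:-1]
    t = np.r_[[a] * (k + 1), interior, [b] * (k + 1)]  # clamped knot vector
    B = bspline_basis(mids, t, k)
    coef, *_ = np.linalg.lstsq(B, freqs, rcond=None)
    return coef, B @ coef
```

The returned `coef` vector plays the role of the control points discussed above: a fixed-length parametric summary of the histogram shape that can then feed the multivariate analyses.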
Alternative representations from the initial histogram data (at times t1, t2, …):
• from the histograms, a non-parametric approach uses the density traces, giving the beanplots as visualization tool;
• model data, obtained by functional approximation.
Analysis of the Model Data Matrix. After the histograms have been suitably processed, the new data matrix consists of as many parameter blocks as there are histogram variables. The methodologies for analysing a partitioned matrix are Multiple Factor Analysis, to extract the main features of the approximated histogram variables (symbolic models) and their weighting system, and, in addition, a cluster analysis of the symbolic models to define temporal or asset profiles. To classify symbolic models we use the inter-model distance proposed by Romano, Giordano and Lauro (2006), based on a linear combination of two distances embedding information both on the estimated model parameters, referring to the location, shape and size of the histograms, and on the model's fit index.
Model Data Matrix: the b's are the parameter estimates of the models; the I's are the models' goodness-of-fit indices. The matrix is partitioned into blocks Model 1, Model 2, …, Model p. A model may refer to a histogram time series, to variables, or to both (three-way data: observations × variables × times).
Model Data Analysis (Functional Data)
SYNTHESIS
Multiple Factor Analysis (Escofier, Pagès, 1982)
Multiple Factor Analysis allows one to extract the principal factors of a partitioned data matrix (of model parameters).
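The core of MFA can be sketched in a few lines of numpy: each parameter block is down-weighted by its first singular value, so that no block dominates, and a global PCA is then run on the concatenated blocks. This is a simplified sketch of the standard MFA scheme, not the authors' implementation (which is not given):

```python
import numpy as np

def mfa(blocks, n_factors=2):
    """Sketch of Multiple Factor Analysis on a partitioned data matrix.
    blocks: list of (n_models x n_params) arrays, one per histogram variable."""
    weighted = []
    for X in blocks:
        Xc = X - X.mean(axis=0)                       # center each block
        s1 = np.linalg.svd(Xc, compute_uv=False)[0]   # first singular value
        weighted.append(Xc / s1)                      # balance block influence
    Z = np.hstack(weighted)                           # concatenated matrix
    U, S, Vt = np.linalg.svd(Z, full_matrices=False)  # global PCA via SVD
    scores = U[:, :n_factors] * S[:n_factors]         # factor scores of models
    return scores, S
```

The rows of `scores` are the coordinates of the symbolic models on the principal factors, which can then be inspected or passed to the classification step.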
CLASSIFICATION
Definition of a distance between model data. Hierarchical or partitioning methods can be used to classify models, but suitable distances that take the model fit into account have to be considered (Lauro et al. 2007).
Distance between model data. The idea (Lauro, Romano, Giordano 2005) is to combine the structural and the random information through a convex combination of two quantities: 1) the Euclidean distance between coefficients and 2) a relative index of goodness of fit.
The inter-model distance for model data is the sum of three components: a convex linear combination of two quantities (the control points and the error term); a distance between the location terms; and a distance between the size terms.
where the coefficients are individually optimized to define the best distance which discriminates among the individual models, for each variable.
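The convex-combination part of the distance can be sketched as follows. This shows only the first component (the full inter-model distance of Romano, Giordano and Lauro also adds the location and size terms); `alpha` is the convex weight, a user choice that the paper optimizes per variable:

```python
import numpy as np

def inter_model_distance(b1, b2, fit1, fit2, alpha=0.5):
    """Convex combination of (1) the Euclidean distance between the estimated
    model coefficients (structural part) and (2) the difference between the
    goodness-of-fit indices (random part), with weight alpha in [0, 1]."""
    structural = np.linalg.norm(np.asarray(b1, float) - np.asarray(b2, float))
    random_part = abs(fit1 - fit2)
    return alpha * structural + (1 - alpha) * random_part
```

With `alpha = 1.0` the distance reduces to the Euclidean distance between coefficient vectors; with `alpha = 0.0` only the fit indices matter.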
In choosing the value of this parameter we maximize the function Coph, the cophenetic correlation coefficient, which measures how faithfully the classification preserves the original distances, i.e. how well the data fit the structure suggested by the classification.
Coph = Σ_{i&lt;j} (D_ij − d̄)(D′_ij − d̄′) / √( Σ_{i&lt;j} (D_ij − d̄)² · Σ_{i&lt;j} (D′_ij − d̄′)² )

The vector D contains the distances D_ij between each pair of rows in the matrix; the vector D′, of generic element D′_ij, contains the corresponding linkage distances between the objects paired in the clusters; d̄ and d̄′ are the averages of D and D′ respectively.
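The formula above is a Pearson correlation between the two condensed distance vectors, and can be computed directly:

```python
import numpy as np

def cophenetic_coefficient(D, Dp):
    """Cophenetic correlation between the original pairwise distances D_ij and
    the cophenetic (linkage) distances D'_ij, both as condensed vectors
    over the pairs i < j."""
    D, Dp = np.asarray(D, float), np.asarray(Dp, float)
    d, dp = D.mean(), Dp.mean()                     # the averages d and d'
    num = np.sum((D - d) * (Dp - dp))
    den = np.sqrt(np.sum((D - d) ** 2) * np.sum((Dp - dp) ** 2))
    return num / den
```

Values near 1 indicate that the dendrogram reproduces the original inter-model distances faithfully, which is the criterion used here to tune the convex weight of the distance.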
Prototypes of a model based classification of financial data
Final remarks and open questions. Substituting histogram data by beanplots or model data seems a very promising approach for analyzing high-frequency data.
What next? Look for suitable intervals for the histogram representation and for the beanplots. Evaluate other parametric approximation functions for histograms whose parameters have a statistical interpretation (Dirichlet, Generalized Lambda). Work on three-way symbolic data (observations × variables × times).
Some references
• Arroyo J., Maté C. (2009) “Forecasting Histogram Time Series with K-Nearest Neighbours Methods”, International Journal of Forecasting, 25, pp. 192–207.
• Drago C., Scepi G. (2009) “Univariate and Multivariate Tools for Visualizing Financial Time Series”, Proceedings of Cladag 2009, Cleup.
• De Boor C. (1978) A Practical Guide to Splines, Springer-Verlag, New York.
• Engle R.F., Russell J.R. (2004) “Analysis of High Frequency Financial Data”, working paper.
• Romano E., Giordano G., Lauro N.C. (2006) “An inter-models distance for clustering utility functions”, Statistica Applicata – Italian Journal of Applied Statistics, 17(2).
• Signoriello S. (2008) Contributions to Symbolic Data Analysis: A Model Data Approach, Ph.D. thesis, Dept. of Mathematics and Statistics, University of Naples Federico II.