Computational Statistics & Data Analysis 47 (2004) 353–372
www.elsevier.com/locate/csda

Clustering financial time series: an application to mutual funds style analysis

Francesco Pattarin (a), Sandra Paterlini (b,*), Tommaso Minerva (c)

(a) Dipartimento di Economia Aziendale, viale Berengario 51, Modena, 41100 Italy
(b) Dipartimento di Economia Politica, Viale Berengario 51, Modena, 41100 Italy
(c) Dipartimento di Scienze Sociali, Cognitive e Quantitative, Via Giglioli Valle 9, Reggio E., 42100 Italy

Received 3 November 2003; received in revised form 3 November 2003; accepted 8 November 2003

Abstract

Classification can be useful in giving a synthetic and informative description of contexts characterized by high degrees of complexity. Different approaches could be adopted to tackle the classification problem: statistical tools may contribute to increase the degree of confidence in the classification scheme. A classification algorithm for mutual funds style analysis is proposed, which combines different statistical techniques and exploits information readily available at low cost. Objective, representative, consistent and empirically testable classification schemes are strongly sought for in this field in order to give reliable information to investors and fund managers who are interested in evaluating and comparing different financial products. Institutional classification schemes, when available, do not always provide consistent and representative peer groups of funds. A "return-based" classification scheme is proposed, which aims at identifying mutual funds' styles by analysing time series of past returns. The proposed classification procedure consists of three basic steps: (a) a dimensionality reduction step based on principal component analysis, (b) a clustering step that exploits a robust evolutionary clustering methodology, and (c) a style identification step via a constrained regression model first proposed by William Sharpe. The algorithm is tested on a sample of Italian mutual funds and achieves satisfactory results with respect to (i) the agreement with the existing institutional classification and (ii) the explanatory power of out-of-sample variability in the cross-section of returns.
(c) 2003 Elsevier B.V. All rights reserved.

Keywords: Classification; Clustering; Genetic algorithms; Mutual funds style analysis



* Corresponding author. Tel.: +39-059-2056848; fax: +39-059-2056947.
E-mail addresses: [email protected] (F. Pattarin), [email protected] (S. Paterlini), [email protected] (T. Minerva).

0167-9473/$ - see front matter (c) 2003 Elsevier B.V. All rights reserved.
doi:10.1016/j.csda.2003.11.009


1. Introduction

In recent years, large financial databases have become conveniently available to investors. This has motivated researchers in quantitative finance to investigate new data analysis tools in which classical statistical methodologies and soft computing ones, such as genetic algorithms and artificial neural networks, are merged. A branch of this research aims to develop objective, representative, consistent and empirically testable mutual funds classification schemes in order to give reliable information to investors and financial practitioners.

A mutual fund is a portfolio of financial assets managed by a professional investor on behalf of his clients. Clients commit their money to the fund by buying its shares, thereby accepting that it be invested according to the provisos stated in the contract they agree upon with the fund manager. These are the same for every shareholder of the fund and typically define the kind of securities the manager might buy or sell, which financial markets he can trade on and the maximal risks he can take. These features, together with the manager's investment style, define the nature of the fund. By "investment style" it is broadly meant whether the fund is actively or passively managed and, in the former case, what kind of investment strategy the manager follows.

In order to provide investors with a guide to the mutual funds market, specialized financial analysts issue classifications of existing funds according to the investment objectives stated by fund managers. In Italy, as well as in other European countries, the national association of mutual funds managers ("Assogestioni") keeps and publishes its own classification, which is based on the periodical screening of funds' portfolio holdings. Each of Assogestioni's classes is defined by specific asset allocation limits that must be satisfied by any fund belonging to it. Managers declare the class they want their fund to be attributed to, thus committing themselves not to violate the stated asset allocation limits; Assogestioni periodically checks whether these are met by querying managers about their portfolio holdings. In case of violation, the manager is compelled to rebalance his portfolio or to choose a different class for the fund; if the manager defaults, mandatory reclassification is undertaken by Assogestioni itself.

These classification procedures have many drawbacks. First, since they rely on information provided by managers, there might be instances of misclassification due to intentional misreporting. Because investors compare a fund's performance with those of competitors that share the same investment objective (i.e. those in the same "peer group"), a manager may benefit from having his fund allocated to the group where he knows it will achieve a good rank. Enforcing a correct classification by monitoring the holdings of funds, as Assogestioni does, is a possible solution to this problem, but a costly one. Second, fund analysts sell their classifications for a profit and are either unwilling or cannot afford to spend too much money and time on checking fund managers' reports. Finally, even where it applies, the control and revision process may take some months, leaving investors without reliable guidance in the meantime.

Statistical classification procedures based on the past returns of funds are a low-cost solution to these issues. Contrary to portfolio holdings, time series of returns can be easily and cheaply gathered and updated on a daily basis from public sources and cannot be made up for too long.


In this paper, a classification scheme that integrates different statistical methodologies in order to identify mutual funds styles by analysing time series of past returns is introduced. Such a classification scheme consists of three basic steps: dimensionality reduction, clustering and style identification. The dimensionality reduction is based on principal component analysis. Its purpose is to reduce the size of the matrix of fund return time series without losing relevant information, by retaining the main principal components. The clustering step uses a robust evolutionary algorithm to group mutual funds. Then, using Sharpe's (1992) constrained regression, it is possible to estimate the sensitivities of each group's benchmark with respect to different market indices in order to identify different mutual fund styles.

The method is applied to a sample data set of Italian equity funds, and its results are checked against Assogestioni's classification. The two classifications are broadly similar, in the sense that the Genetic Algorithm for Medoid Evolution (GAME)-based partitioning of funds into groups turns out to be easily interpretable in terms of Assogestioni's styles. Since the latter is based on a wider information set, this result is taken as evidence that the new scheme performs well in extracting relevant information from the time series of returns; furthermore, a comparison of the explanatory power in terms of cross-sectional variability of out-of-sample returns reveals that it provides an improvement over Assogestioni's.

The paper is organized as follows. Section 2 introduces cluster analysis and the main concepts about genetic algorithms. In particular, Section 2.1 describes the clustering problem in a formal way, while Section 2.2 describes the structure of the canonical genetic algorithm and reports a short literature overview on applications of genetic algorithms to non-hierarchical clustering. Section 3 describes the evolutionary clustering algorithm, GAME (Genetic Algorithm for Medoid Evolution), which is used for the proposed mutual funds style classification scheme. Section 4 reports the main contributions in mutual funds style analysis and describes the proposed three-step mutual funds classification procedure. Section 5 reports the results for a sample of Italian equity funds and Section 6 discusses the main results and concludes the paper.

2. Technical framework

The clustering problem is set out in a formal way and the basic structure of genetic algorithms is described; a short review of the relevant literature on applications of genetic algorithms to non-hierarchical clustering is also provided.

2.1. The clustering problem

Let O = {o_1, o_2, ..., o_n} be a set of n objects and let X_{n×q} be the profile data matrix, with n rows and q columns. Each object o_i is characterized by a real-valued q-dimensional profile row vector x_i (i = 1, ..., n) of X_{n×q}; each element x_{ij} of x_i corresponds to the jth real-valued feature (j = 1, ..., q) of the ith object (i = 1, ..., n). Non-hierarchical cluster analysis algorithms try to determine a partition


G = {C_1, C_2, ..., C_g}, with C_k ≠ ∅ for all k, C_k ∩ C_h = ∅ for all k ≠ h, and ∪_{k=1}^{g} C_k = O, into g clusters C_k (k = 1, ..., g) such that the similarity among objects within the same cluster and the dissimilarity between objects in different clusters are both maximized. It is then necessary to define a measure of adequacy to rank different feasible partitions and then to identify the one which allows the best grouping of the objects. The aim in cluster analysis is the identification of the partition of the objects that has optimal adequacy.

In general, the clustering of a set of objects works as follows. First, the p ≤ q most relevant object features must be chosen (feature selection); second, it is usually required to set a priori the number of clusters in which to partition the set of objects, unless the procedure can automatically detect it; third, a statistical criterion to rank different feasible partitions must be defined. Then, the goal is to determine the partition G* that has optimal adequacy with respect to all the other feasible solutions G_g = {G^1, G^2, ..., G^{N(n,g)}} (i.e. G^i ≠ G^j for i ≠ j), where

N(n, g) = \frac{1}{g!} \sum_{k=0}^{g-1} (-1)^k \binom{g}{k} (g - k)^n

is the number of all the feasible partitions when the number of clusters is equal to g. Therefore, a search strategy must be defined in order to determine the optimal clustering solution for the set of objects under investigation. Hence, after selecting the p ≤ q most relevant features, defining the number of clusters g and a statistical-mathematical criterion f(·) that quantifies the goodness of a partition, the clustering task can be considered as an optimization problem with the objective of finding the partition G* in g clusters such that

optimize_{G ∈ G_g} f(X_{n×p}, G),

where G = {C_1, C_2, ..., C_g} corresponds to a single partition in the set G_g = {G^1, G^2, ..., G^{N(n,g)}} of all feasible solutions when the number of clusters is equal to g. Given its combinatorial nature, it has been shown that such a problem is NP-hard when the number of clusters exceeds three (Brucker, 1978). If the number of clusters g is not known a priori, the cardinality of the set of all feasible solutions increases further and corresponds to

N(n; 1, ..., g_max) = \sum_{g=1}^{g_max} N(n, g),

where g_max is the maximum number of clusters in which to partition the set of objects.

The choice of the criterion f(·) can be crucial in determining the size, the shape, the volume and the orientation of the clusters in the space of features (Banfield and Raftery, 1993, p. 805, Table 1). Moreover, some criteria can help in identifying the optimal number of groups g*. Many different statistical criteria have been proposed in the literature (Marriott, 1982). These criteria usually minimize a transformation of the within-group scatter matrix W, such as the trace or the determinant, in order to have objects that are as similar as possible within the same cluster. Equivalently, a transformation of the between-group scatter matrix B is maximized, aiming to achieve the highest dissimilarity among objects belonging to different clusters.
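The size of the search space is what makes exhaustive enumeration infeasible: the count N(n, g) defined above is a Stirling number of the second kind and can be evaluated directly. The following is a minimal sketch in Python; the function name is illustrative and not from the paper.

import math

def n_partitions(n: int, g: int) -> int:
    """Number of partitions of n objects into exactly g non-empty clusters,
    N(n, g) = (1/g!) * sum_{k=0}^{g-1} (-1)^k * C(g, k) * (g - k)^n."""
    total = sum((-1) ** k * math.comb(g, k) * (g - k) ** n for k in range(g))
    return total // math.factorial(g)   # the sum is always divisible by g!

print(n_partitions(10, 3))     # 9330 partitions for a toy problem
print(n_partitions(186, 5))    # astronomically large (roughly 1e128)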


The pooled-within scatter matrix, W, is defined as

W = \sum_{k=1}^{g} W_k,

where W_k is the covariance matrix of the features of all objects allocated to cluster C_k (k = 1, ..., g). Thus, if x_l^{(k)} indicates the lth object feature vector in cluster C_k and n_k the number of objects in cluster C_k,

W_k = \sum_{l=1}^{n_k} (x_l^{(k)} - \bar{x}^{(k)}) (x_l^{(k)} - \bar{x}^{(k)})',

where

\bar{x}^{(k)} = \frac{1}{n_k} \sum_{l=1}^{n_k} x_l^{(k)}

is the centroid vector for cluster C_k.

Fig. 1. GAME encoding. The dotted line virtually segments fragments 1 and 2. If the number of features, q, is 4, then four binary cells determine which features to consider and each medoid is identified by four coordinates. If the number of groups, g, is 3, then there are three blocks of four coordinates. In the depicted example the feature-selection fragment is 0 0 1 1 (the third and fourth features are considered) and the medoid coordinates of the three groups are (m_11, m_12, m_13, m_14), (m_21, m_22, m_23, m_24) and (m_31, m_32, m_33, m_34).

The between-group scatter matrix, B, is defined as

B = \sum_{k=1}^{g} n_k (\bar{x}^{(k)} - \bar{x}) (\bar{x}^{(k)} - \bar{x})',

where \bar{x} = \left( \sum_{i=1}^{n} x_i \right) / n.

Then the total scatter matrix T of the n observations can be decomposed as T = B + W. The two following criteria are considered:

• Variance ratio criterion (VRC)

VRC = \frac{\operatorname{trace}(B)/(g-1)}{\operatorname{trace}(W)/(n-g)},

where (n - g) are the degrees of freedom of the within-group scatter matrix and (g - 1) are the degrees of freedom of the between-group scatter matrix. The optimal partition is the one for which the VRC is maximized. Since the trace is the sum of the main diagonal elements of a matrix, the VRC does not consider the covariances between features, i.e. clusters are treated as if they were spherical in the feature space. Also, because minimizing trace(W) is equivalent to minimizing the sum of the eigenvalues of W, orthogonal transformations of the data are admissible (Calinski and Harabasz, 1974; Friedman and Rubin, 1967).


• Marriott's criterion (MC)

MC = g^2 \frac{\det(W)}{\det(T)}.

The optimal partition is the one for which MC achieves its minimum. Since det(T) is constant given X_{n×q}, this is equivalent to minimizing g^2 det(W). Marriott's criterion is invariant with respect to linear and non-singular transformations of the data (Friedman and Rubin, 1967). It allows the correlation between variables to be taken into account and elliptical clusters with axes not parallel to the coordinates to be detected. Minimizing det(W) is equivalent to minimizing the product of the eigenvalues of W. Marriott's criterion is commonly used to search for clusters characterized by such a strong internal correlation that one or more eigenvalues are equal to zero (Marriott, 1971, 1982). A small numerical illustration of both criteria is given at the end of this subsection.

The MC and the VRC criteria depend on the parameter g. Apart from determining the optimal solution for a given g, both can be optimized with respect to g itself in order to select the best number of groups among the optimal partitions. However, these criteria can fail to identify a best number of groups: in fact, it is possible that VRC (MC) values monotonically increase (decrease) as g varies (Fabbris, 1997, pp. 337-338). For further information about these and other criteria the reader is referred to Everitt (1993).

When the aim is to build a classification scheme for mutual funds style analysis by considering time series of returns, the set of objects is that of the mutual funds to be classified. Each fund is characterized by a monthly time series x_t (t = 1, ..., T), which can be considered as an object described by T features, i.e. the features are the observations at different points in time. The clustering of financial time series datasets can be a difficult task because of the size of the data matrix. Since each point in time corresponds to a feature, as further information is added by extending the sample period, the complexity of the classification problem also increases. Different approaches have been proposed to tackle these issues, such as ad hoc distance measures (Bohte et al., 1988) or data reduction techniques intended to summarize most of the information in fewer features (e.g. Discrete Fourier Transform, Discrete Wavelet Transform and Principal Component Analysis).

Here, the problem of classifying mutual funds styles is tackled by applying principal component analysis to the matrix of mutual fund return time series. The loadings associated with the largest eigenvalues are kept as characterizing features of each mutual fund. Considering only the main principal components instead of the whole time series data matrix reduces the size of the problem and speeds up the classification procedure significantly. Moreover, as shown in Appendix B, relevant information does not seem to be lost, since there is no remarkable difference between the classifications performed on the whole time series matrix and those based on the 10 principal component loadings.
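As a small numerical illustration of the two criteria above, the pooled within-group and between-group scatter matrices can be computed from a candidate labelling and plugged into the VRC and MC formulas. The data and function names below are illustrative only, not taken from the paper.

import numpy as np

def scatter_matrices(X, labels):
    """Pooled within-group (W) and between-group (B) scatter matrices."""
    xbar = X.mean(axis=0)
    W = np.zeros((X.shape[1], X.shape[1]))
    B = np.zeros_like(W)
    for k in np.unique(labels):
        Xk = X[labels == k]
        ck = Xk.mean(axis=0)                       # cluster centroid
        D = Xk - ck
        W += D.T @ D                               # within-cluster scatter
        B += len(Xk) * np.outer(ck - xbar, ck - xbar)
    return W, B

def vrc(X, labels):
    """Variance ratio criterion (Calinski-Harabasz); larger is better."""
    n, g = len(X), len(np.unique(labels))
    W, B = scatter_matrices(X, labels)
    return (np.trace(B) / (g - 1)) / (np.trace(W) / (n - g))

def mc(X, labels):
    """Marriott's criterion g^2 det(W)/det(T), with T = W + B; smaller is better."""
    g = len(np.unique(labels))
    W, B = scatter_matrices(X, labels)
    return g ** 2 * np.linalg.det(W) / np.linalg.det(W + B)

# two well-separated synthetic groups of 50 objects with 3 features each
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(m, 1.0, size=(50, 3)) for m in (0.0, 4.0)])
labels = np.repeat([0, 1], 50)
print(round(vrc(X, labels), 2), round(mc(X, labels), 4))

Note that, as in the formulas above, W_k is used here as an unnormalized scatter matrix, so that the decomposition T = W + B holds exactly.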


2.2. Genetic algorithms in non-hierarchical clustering

Genetic algorithms (GAs) are stochastic search heuristics that explore the search space by simultaneously evolving candidate solutions of the optimization problem through operators inspired by natural selection and gene inheritance mechanisms. A genetic algorithm consists of a population of individuals, or chromosomes. Each individual encodes a candidate solution to the problem under investigation and is composed of genes, with binary or real values (alleles). A genetic algorithm starts by randomly generating a population of individuals. Each individual is evaluated by a fitness function that quantifies the goodness of the encoded solution. The algorithm then explores the space of feasible solutions in order to determine the optimal one by maintaining, mutating, recombining and comparing several candidate solutions simultaneously.

The population of individuals is evolved iteratively through different generations by applying evolutionary operators, such as selection, crossover and mutation. The selection operator selects a subset of individuals from the current population, according to their fitness values, and then alters them by crossover and mutation. The selection operator may also preserve the best individual (a feature known as elitism). The crossover operator recombines randomly selected fragments of individuals (parents) to create new individuals (offspring), while mutation randomly alters genes. Many different operators have been proposed in the literature in order to improve the effectiveness and robustness of GAs. The reader is referred to Goldberg (1989) for an introduction to GAs.

The evolutionary process of a GA is iterated until a termination condition is satisfied, such as that a certain number of iterations have been executed or that the individual with the best fitness value does not change for a fixed number of generations. When the algorithm stops, the individual with the best fitness value encodes the optimal solution of the problem under investigation (Holland, 1975; Goldberg, 1989). Genetic algorithms have been successfully applied in many fields, including statistics (Chatterjee et al., 1996).

Why GAs instead of local search heuristics? Local search heuristics can be strongly dependent on the initial conditions and are poor at coping with local optima, because they only refine a single solution. On the contrary, genetic algorithms, which are population-based stochastic search heuristics, have turned out to be successful in dealing with multimodal landscapes and NP-hard optimization problems. This is why research on the application of GAs to non-hierarchical clustering is very active. In addition, GAs can be used to search for clusters with different shapes by using different criteria, in contrast to other non-hierarchical algorithms, such as the k-means, which can detect only spherical-type clusters.
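As a concrete reference point, the canonical loop just described (random initialization, fitness evaluation, fitness-based selection, single-point crossover, mutation and elitism) can be sketched in a few lines. This is a generic binary-string GA on a toy "one-max" fitness, not the GAME algorithm of Section 3; all parameter values are arbitrary.

import numpy as np

rng = np.random.default_rng(42)

def fitness(bits):
    # toy "one-max" fitness: count the number of 1-genes
    return bits.sum()

def evolve(pop_size=50, length=30, generations=100, pc=0.7, pm=0.02):
    pop = rng.integers(0, 2, size=(pop_size, length))
    for _ in range(generations):
        fit = np.array([fitness(ind) for ind in pop])
        elite = pop[fit.argmax()].copy()                  # elitism
        # fitness-proportional selection of parents
        probs = fit / fit.sum()
        parents = pop[rng.choice(pop_size, size=pop_size, p=probs)]
        # single-point crossover on consecutive pairs
        for i in range(0, pop_size - 1, 2):
            if rng.random() < pc:
                cut = rng.integers(1, length)
                parents[[i, i + 1], cut:] = parents[[i + 1, i], cut:]
        # bit-flip mutation
        mask = rng.random(parents.shape) < pm
        parents[mask] = 1 - parents[mask]
        parents[0] = elite                                # re-insert the best individual
        pop = parents
    fit = np.array([fitness(ind) for ind in pop])
    return pop[fit.argmax()]

print(evolve())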


The first study on the use of GAs in non-hierarchical clustering is by Raghavan and Birchand (1979). They propose to use the genetic encoding to allocate n objects to g clusters directly: n genes with integer values in the interval [1, g] form each individual and determine the allocation of each object to a specific cluster. The ith object (i = 1, ..., n) is allocated to the cluster labelled by the ith value of the individual (e.g. individual 1122 allocates the first and second objects to the first cluster and the third and fourth objects to the second cluster, so that G = ({1, 2}, {3, 4})). Many papers on applications of GAs to non-hierarchical clustering have been published in the last 10 years, as new research perspectives have arisen thanks to the availability of increasingly powerful and faster computational tools.

The research on using GAs for non-hierarchical clustering is variegated. Three different approaches may be distinguished by considering the different genetic encodings and search strategies proposed in the literature. Many authors have followed the approach of Raghavan and Birchand, using different statistical criteria as fitness functions. The clustering of simulated and real datasets with this approach has been shown to be more robust in the convergence toward the optimal partition than standard clustering algorithms, such as the k-means (e.g. Krishna and Murthy, 1999). However, the encoding proposed by Raghavan and Birchand has been criticized because of its redundancy (e.g. individuals 1122 and 2211 determine the same partition G = ({1, 2}, {3, 4})). This approach has been further refined through ad hoc genetic operators and improved encodings that avoid such redundancy and other shortcomings of the original approach (e.g. Falkenauer, 1998).

The second approach is to encode candidate solutions as relevant points in the space of features, such as medoids, centroids, barycentres or seed points. Each cluster is represented by one such point in the encoding of candidate solutions, so that GA individuals are strings of g × q cells. An association of objects to points is then determined for each candidate solution, according to some measure of proximity (e.g. Euclidean distance), and the resulting data partition determines the fitness function value of the solution (Maulik and Bandyopadhyay, 2000; Chiou and Lan, 2001; Bandyopadhyay and Maulik, 2002; Paterlini and Minerva, 2003). Other researchers have proposed to encode relevant points taken as the origins, together with the axis lengths and orientations, of different ellipsoids in the feature space; such an encoding partitions the objects by allocating those belonging to the same ellipsoid to the same cluster (Srinkanth et al., 1995).

The third approach consists of using the GA encoding to determine hyperplanes that divide the space of the features characterizing the objects, such that objects belonging to different clusters are separated by the hyperplanes (Bandyopadhyay et al., 1995, 1998). Finally, there have been some studies on hybrid algorithms, which merge GAs with other standard clustering algorithms, such as the k-means or fuzzy c-means, in order to improve robustness and obtain a better exploration of the search space (e.g. Hall et al., 1998; Cowgill et al., 1999; Krishna and Murthy, 1999).

3. A genetic algorithm for medoids evolution (GAME)

The clustering algorithm GAME is briefly explained here; the reader is referred to Paterlini and Minerva (2003) for a complete description and a comparison with the k-means, fuzzy c-means and EM algorithms in the analysis of simulated and real datasets. In the three-step classification procedure, the GAME algorithm is used for clustering the set of mutual funds.


GAME is preferred to other algorithms, such as the k-means, because previous analysis has shown that it converges steadily towards the globally optimal grouping solution over different runs, avoiding getting stuck at local extrema (Paterlini and Minerva, 2003). The GAME algorithm aims to determine the most relevant p ≤ q features to be considered for clustering, the optimal number of clusters and their composition. For example, in this paper the set of objects consists of a number of Italian equity funds, and the features of each fund are the first 10 principal components of their past returns data matrix.

More specifically, each individual in the GA may be considered as composed of two fragments. The first fragment of each individual can be activated or not, depending on whether the researcher is interested in using all the q features in the profile data matrix or wants the GA to perform the feature selection automatically. This fragment consists of q binary cells: if the ith (i = 1, ..., q) cell (or allele) has unitary value, then the corresponding measurement is considered in the clustering phase, otherwise it is not. The second fragment of each individual is composed of Gray-encoded binary cells, which are transformed into g groups of q real-valued cells. Each group of q cells corresponds to the medoid of a specific cluster. The domain of the medoids is equal to the domain of the features: each medoid coordinate is a real number between the minimum and maximum values of the corresponding feature. Fig. 1 shows an example with q = 4 and g = 3. The first fragment is made up of four binary cells (in the example, the third and fourth measurements are considered), while the second consists of three groups of four cells that correspond to the medoid coordinates of the different groups. Therefore, m_11, m_21, m_31 can assume values between the minimum and maximum of the first column of the profile matrix X_{n×q}; m_12, m_22, m_32 can assume values between the minimum and maximum of the second column, and so on.

The algorithm starts by randomly generating 200 individuals. Then the first individual is considered. If the first fragment is activated, a feature is selected for clustering if its value is one. Then, the Euclidean distances between each of the n objects in the dataset and the g different medoids determined by the genetic string are computed. Each object is then allocated to the group with the closest medoid. Once all the observations have been allocated to the g clusters, the algorithm computes a fitness value, which quantifies the degree of optimality of the identified clustering solution with respect to the a priori defined criterion, either MC or VRC. This evaluation procedure is repeated for all individuals in the population.

Then, the population is evolved by iteration of selection, crossover, mutation and evaluation. Eighty per cent of the current population is selected to create an intermediate population by using a stochastic universal sampling (SUS) selection method with ordinal ranking (p_s = 0.7). Single-point crossover (p_c = 0.7) and mutation (p_m = 0.7/Length(GA string)) operators are then applied to the intermediate population, which is then passed to the next generation. The individual with the best fitness value and a 10% share of the current population selected by the SUS selection method are directly re-inserted into the population for the next generation without any transformation. Also, some new individuals are randomly generated so that the size of the population does not change from one generation to the next. The new population is then evaluated with respect to the fitness function.
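The evaluation step just described can be sketched as follows: an individual is decoded into a feature-selection mask and g candidate medoids, each object is allocated to the nearest medoid in Euclidean distance, and the resulting partition is scored, here with the trace-based variance ratio criterion of Section 2.1. This is an illustrative reconstruction rather than the authors' code; the empty-cluster penalty, the real-coded medoids standing in for the Gray-encoded fragment, and all names are assumptions.

import numpy as np

def decode_and_score(feature_mask, medoids, X):
    """feature_mask: length-q 0/1 array (first fragment); medoids: (g, q) array of
    candidate medoid coordinates (second fragment); X: (n, q) profile data matrix."""
    active = feature_mask.astype(bool)
    Xa, Ma = X[:, active], medoids[:, active]
    # allocate each object to the closest medoid (Euclidean distance)
    d = np.linalg.norm(Xa[:, None, :] - Ma[None, :, :], axis=2)
    labels = d.argmin(axis=1)
    n, g = len(Xa), len(Ma)
    if len(np.unique(labels)) < g:
        return -np.inf, labels                  # penalize solutions with empty clusters
    # fitness: variance ratio criterion on the active features
    xbar = Xa.mean(axis=0)
    trW = trB = 0.0
    for k in range(g):
        Xk = Xa[labels == k]
        ck = Xk.mean(axis=0)
        trW += ((Xk - ck) ** 2).sum()
        trB += len(Xk) * ((ck - xbar) ** 2).sum()
    return (trB / (g - 1)) / (trW / (n - g)), labels

rng = np.random.default_rng(1)
X = rng.normal(size=(186, 10))                  # placeholder: 186 funds x 10 components
mask = np.ones(10, dtype=int)                   # first fragment not activated: use all features
medoids = X[rng.choice(len(X), size=5, replace=False)]    # g = 5 random starting medoids
print(round(decode_and_score(mask, medoids, X)[0], 2))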
The procedure is repeated iteratively until the fitness value of the best individual has been constant for 400 generations or the genetic algorithm has evolved for 8000 generations; the individual from the last generation with the best fitness value corresponds to the best identified solution of the clustering problem.

Either the MC or the VRC criterion is used as the fitness function in order to build an iterative procedure that searches for the optimal number of clusters in which to partition the set of objects. The procedure consists in running GAME consecutively for an increasing number of groups, from two to a user-defined upper limit. The optimal number of groups is set to the first minimum (maximum) of the MC (VRC) criterion found in the process. If the upper limit is reached without finding any extremum, the algorithm fails to identify a single best solution. In this case, a heuristic choice of the preferred classification can be performed by the user through statistical examination of the partitions determined by each run of GAME. It is recommended to carry out this kind of analysis even if an extremum is found, because the goodness of a classification scheme is hardly established by maximization tools alone: an assessment of how data features are distributed among groups, on the basis of the analyst's knowledge of the phenomenon, is crucial.

The time complexity of the GAME algorithm is equal to the sum of the orders of magnitude of the computational times of its elementary tasks. Measuring the distance of any object from each medoid takes O(ng) time, as does finding the closest medoid to each object. Since the computation of the scatter matrices takes O(n^2) time, the time complexity for each GA individual in a single generation is O(2ng + n^2). If there are N individuals and M generations, the overall time complexity of GAME is O(NM(2ng + n^2)). The time complexity of the k-means algorithm is O(ng).

4. Return-based style analysis

Research on mutual funds style analysis has been developing over the last 10 years. The main contributions are those proposed by Sharpe (1992) and Brown and Goetzmann (1997). Sharpe's approach consists in estimating a linear asset class factor model with a time series linear regression of a fund's returns on a set of given market or style indices. Regression slopes (or style weights) are constrained to be non-negative and to sum to unity, as typical mutual funds asset class weights are. The fund's investment policy is determined by looking at the combination of those indices with the highest style weights, which are also commonly used to infer risk and to form peer groups (Cucurachi, 2000; Lucas and Riepe, 1996). The main problem with Sharpe's model is that returns on market indices are often highly contemporaneously correlated, which might hinder a correct model specification. In these instances, coefficient estimates are quite unstable and it is difficult to assess whether any particular style index is significant.

Brown and Goetzmann (1997) propose a second approach based on clustering time series of monthly returns with the k-means algorithm. Their procedure does not rely on the specification of a factor model, and therefore the analyst is not committed to the choice of a set of indices as in Sharpe's regression. Taken as it comes, the k-means output does not provide any clue about the association of styles to groups, but auxiliary qualitative information on funds' characteristics (e.g. their declared investment objectives) or a second-step asset class regression on centroids may be exploited to gain further insight. A shortcoming of the k-means is that it is very likely to converge to a local optimum, an outcome that the analyst cannot check without performing many trials with different starting values.

A classification tool based on several statistical methodologies (principal component analysis, cluster analysis, constrained regression analysis) which allowed not only classes of similar funds to be identified but also the style that characterizes each class to be determined would be a significant improvement on the state of the art. Hence, Sharpe's and Brown and Goetzmann's approaches are jointly considered in order to build such a classification scheme.

The first step of the classification procedure consists in reducing the dimension of the time series data matrix X_{n×T} (n = number of funds, T = number of time periods) by extracting its main q principal components,

U_{n×q} ≡ (X_{n×T} - \bar{x}_{1×T} ⊗ 1_{n×1}) B_{T×q},

where q ≪ T, B_{T×q} is the matrix of the eigenvectors associated with the q largest eigenvalues of the sample covariance matrix of the transpose of X, and the superimposed bar denotes the sample average taken along the columns of X. The choice of q may either be done by heuristics, such as variance attribution or visual examination of the scree plot of the covariance matrix (see Appendix A), or with the well-known eigenvalue chi-square tests if the joint normality of the time series of returns is a tenable assumption (Flury, 1988). Since principal component analysis has turned out to perform well in extracting relevant information from the given time series, the first fragment of each GAME string is not activated.

In the second step of the procedure, GAME is applied to the U matrix of size n × q, where n = number of funds and q = number of main principal components considered, to find the optimal grouping solution for the funds in the sample. Note that, since the q principal component loadings in matrix B are proxies for the realization of risk factors that affect portfolios' returns, the rows of U represent the exposures of each of the n portfolios to unobserved risks; Connor and Korajczyk (1986) show that as n increases B converges to the true risk factor realizations. The algorithm runs for consecutive values of the number of clusters g. The best fitness value is stored for each run, and the procedure is stopped either when the first optimum of the criterion is found, or when g = 10. Once the classification has been formed, the groups' centroids are computed to provide class-specific time series of benchmark returns.

In the final step, Sharpe's regression of the benchmarks against a set of market indices is run in order to assess the investment style that characterizes each class. Note that since the regression slopes are constrained to be non-negative, their significance cannot be evaluated using standard t-tests. The stability of the estimated investment styles may be gauged by plotting Sharpe's style weights against time to check their variability (e.g. see Fig. 2). Alternatively, Kim et al. (2000) suggested computing the associated confidence intervals by Monte Carlo simulations.
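A style regression of this kind can be illustrated as a constrained least-squares problem: the weights are restricted to be non-negative and to sum to one, and are chosen to minimize the tracking error between the class benchmark and the weighted indices. The sketch below uses scipy's SLSQP solver; it is a generic implementation of Sharpe's (1992) formulation rather than the authors' code, and the data and names are placeholders.

import numpy as np
from scipy.optimize import minimize

def style_weights(benchmark, indices):
    """benchmark: (T,) class-benchmark returns; indices: (T, k) market-index returns.
    Returns non-negative weights summing to one and the style R-squared."""
    T, k = indices.shape

    def tracking_error(w):
        resid = benchmark - indices @ w
        return resid @ resid

    w0 = np.full(k, 1.0 / k)
    res = minimize(tracking_error, w0, method="SLSQP",
                   bounds=[(0.0, 1.0)] * k,
                   constraints=[{"type": "eq", "fun": lambda w: w.sum() - 1.0}])
    w = res.x
    resid = benchmark - indices @ w
    r2 = 1.0 - resid.var() / benchmark.var()      # R^2 as 1 - Var(residual)/Var(benchmark)
    return w, r2

# toy example: a benchmark that is 70% index 0 and 30% index 2 plus noise
rng = np.random.default_rng(3)
R = rng.normal(0.0, 0.05, size=(60, 8))           # 60 months, 8 market indices
b = 0.7 * R[:, 0] + 0.3 * R[:, 2] + rng.normal(0, 0.005, 60)
w, r2 = style_weights(b, R)
print(np.round(w, 3), round(r2, 3))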

Notice that running the style regressions after the funds have been classified does not influence their membership attribution, as would be the case in Sharpe's approach, since the regression results are not used to classify funds but only to assess which portfolios of market indices best replicate the class benchmarks.
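Before turning to the application, the data-reduction step defined by the U matrix above can also be sketched: the returns matrix is column-centred, the eigenvectors associated with the q largest eigenvalues of the T × T sample covariance matrix are retained, and the funds are projected onto them. Variable names and the random placeholder data are illustrative only.

import numpy as np

def pca_scores(X, q):
    """X: (n, T) matrix of fund return time series (n funds, T months).
    Returns U = (X - mean) B, the n x q reduced profile matrix, and the variance share kept."""
    Xc = X - X.mean(axis=0, keepdims=True)        # column-centre: subtract the 1 x T mean vector
    C = np.cov(Xc, rowvar=False)                  # T x T sample covariance matrix
    eigval, eigvec = np.linalg.eigh(C)            # eigenvalues in ascending order
    order = np.argsort(eigval)[::-1][:q]          # indices of the q largest eigenvalues
    B = eigvec[:, order]                          # T x q eigenvector matrix
    U = Xc @ B                                    # n x q matrix fed to the clustering step
    share = eigval[order].sum() / eigval.sum()
    return U, share

rng = np.random.default_rng(7)
X = rng.normal(0.01, 0.05, size=(186, 60))        # placeholder for 186 funds x 60 monthly returns
U, share = pca_scores(X, q=10)
print(U.shape, round(share, 3))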

5. An application to Italian equity funds

The GAME-based three-step classification procedure is applied to a data set of 186 equity funds that were present in the Italian market between 1996 and 2000, for which the end-of-period Assogestioni classification by investment objectives is known. The distribution of the sampled funds by Assogestioni classes is shown in Table 1. Sixty monthly returns for each fund, collected from the Thomson Financial Datastream database for the 1996-2000 period, have been considered for the classification procedure.

The principal components approximation in the data-reduction step is based on the 10 loadings associated with the highest eigenvalues, which account for 88.2% of the total sample variation; the shares of the first and the 10th eigenvalues are, respectively, 51.9% and 1.0%. Although the latter accounts for less variance than the average single feature would if features were independent, 10 eigenvalues have been kept because minor factors may be important in discriminating between funds; also, in this way almost 90% of the overall variance is accounted for, which is a sensible threshold for time series data. Indeed, as shown in Appendix B, relevant information does not seem to be lost: the optimal partitions identified when the whole time series matrix is considered (each fund has T = 60 features corresponding to different observations in time) and those identified when the reduced principal component matrix is considered (each fund has q = 10 < T features, the corresponding principal component values) are very similar. Appendix A reports the largest 20 eigenvalues, their shares of variability and the corresponding scree plot.

GAME has been run with both the variance ratio and the Marriott criteria; neither criterion identified a single optimal grouping solution.

Table 1. Assogestioni classification of sampled funds as of 31st December 2000.

Assogestioni investment style    Code     Number of funds   Percentage
Euro area                        AZAE       7                  3.76
America                          AZAM      19                 10.22
European                         AZEU      24                 12.90
International NT&T               AZINNT     2                  1.08
International all sectors        AZINTS    40                 21.51
Italy                            AZIT      40                 21.51
Pacific                          AZPA      19                 10.22
Emerging                         AZPE      11                  5.91
Other equity styles              AZAS      24                 12.90
Total                                     186                100.00

The second column displays the short names of each class used in the tables below. The third and fourth columns show the number of funds in each class and the corresponding percentage.


Table 2. Style weights for class benchmarks. The regressors are the eight market indices described in the text (MSCI Italy, Europe, US, Pacific and Emerging Markets; JPM E.M.U. Government Bonds, World Government Bonds and Italy Cash 1 m); zero style weight estimates are omitted, and the prevalent weight, which determines the class label, is reported together with its index. Number of observations = 60.

Investment style (GA cluster)   Prevalent weight (%)   Other style weights (%)         R-squared   N. of funds
(1) Pacific                     Pacific 64.11          12.54, 23.35                    0.929       20
(2) Emrg. Mkts                  Emrg. Mkts 73.06       14.19, 0.72, 12.03              0.958       13
(3) International               US 38.57               29.68, 8.49, 2.83, 19.83        0.969       58
(4) Europe                      Europe 71.82           10.99, 2.93, 6.41, 7.84         0.919       42
(5) Italy                       Italy 73.38            Europe 23.02, Cash 1 m 3.60     0.896       53

However, GAME converged to an optimal grouping solution for every value of g from two to 10, and the two criteria led to very similar solutions. A number of groups g = 5 is selected because this is the smallest number of fund classes for which the results of Sharpe's regressions performed on the groups' benchmarks deliver a clear and meaningful investment style characterization.

The regressions are run against eight market indices. The equity indices are from Morgan Stanley Capital International and cover the Italian, European (excluding Italy), United States, Pacific basin and emerging countries equity markets. Three J.P. Morgan fixed income indices have also been included, since equity managers are allowed to invest a limited portion of the fund in fixed income and money markets: E.M.U. Government Bonds, World Government Bonds and the one-month Italy Cash Index. Table 2 shows the results of the style regressions for all five benchmarks. Classes (1) to (5) are labelled according to the prevalent estimated style weights. The characterization of the GAME classes is quite clear-cut, with significant over-weighting of a limited number of different key indices for each class; determination coefficients are also high, ranging from 89.6% to 96.9%. The reliability of the class description has been checked by performing the style regression on 36 rolling 24-month windows of the benchmarks' time series. For all classes the prevalent weights appear to be stable, with minor variability of the smaller ones. Fig. 2 shows a typical result.

Further, two checks on the results of the procedure were performed. Since the Assogestioni end-of-period classification is known, a two-way table of the GAME-based classification output against it is provided in Table 3 (see Appendix B for this comparison when g varies from two to 10). The two classification schemes appear to be similar in several respects, although GAME's is more parsimonious than Assogestioni's.


Fig. 2. Monthly updated style weights from December 1998 to December 2000 for the Italy class. For each displayed month, estimates are made on the previous 24 monthly returns. Overall style weights for the class are MSCI Italy = 73.38%, MSCI Europe = 23.2% and JPM 1 m cash = 3.60%.

Table 3. Scatter matrix of GAME vs. Assogestioni classification: joint absolute frequencies of funds' memberships, listed by Assogestioni class (class totals in parentheses). A legend of the Assogestioni codes is given in Table 6 and Appendix B. GAME class totals: Pacific 20, Emrg. Mkts 13, International 58, Europe 42, Italy 53.

AZAE (7): International 1, Europe 1, Italy 5
AZAM (19): International 17, Europe 2
AZAS (24): Pacific 3, Emrg. Mkts 1, International 9, Europe 8, Italy 3
AZEU (24): Europe 19, Italy 3, remaining 2 funds in another GAME class
AZINNT (2): both funds allocated to a single GAME class
AZINTS (40): International 29, Europe 10, Italy 1
AZIT (40): Italy 40
AZPA (19): Pacific 17, remaining 2 funds in another GAME class
AZPE (11): Emrg. Mkts 10, Italy 1

Memberships overlap considerably for the Italy, Emerging Markets and Pacific classes. The International GAME class includes almost all of Assogestioni's America funds, and 72.5% of the International ones. GAME's Europe class catches many funds from Assogestioni's International and Other specialization classes, which are quite broad and include a variety of different styles, since the investment limits for these classes are very loose.


Table 4. Cross-sectional variability of annual returns explained by different classification schemes.

Classification scheme    Period   R-squared
Assogestioni             1998     0.8258
                         1999     0.5826
                         2000     0.5375
GAME-VRC on PCs          1998     0.8590
                         1999     0.6021
                         2000     0.6290
GAME-MC on PCs           1998     0.8760
                         1999     0.6010
                         2000     0.5880

Coefficients of determination of the regressions of annual returns against class membership indicators. Class memberships for GAME are based on the past 24 months of data.

Indeed, the Other specialization class is scattered among all of GAME's groups. The comparisons in Table 3 show the potential of the proposed method. The distribution of funds' memberships is easy to interpret in terms of Assogestioni's scheme, which is based on more information than ours and on keen monitoring procedures. The agreement between the two schemes suggests that GAME succeeds in recognizing fundamental differences among funds.

As a second check, the three-step procedure was applied to 24-month-long rolling sub-periods of the data set, and it was computed how much of the cross-sectional variability of the returns over the subsequent year is accounted for by the resulting classification. Table 4 displays these diagnostics for both the VRC and the MC; the explanatory power of the Assogestioni classification is also shown at the top of the table. For all years, the coefficients of determination from GAME's classifications are higher than those based on Assogestioni's classes. In all cases the R-squared values are fairly high, ranging from 53.8% to 87.6%, pointing out that all schemes provide relevant information on the risk and return profiles of the sampled funds.
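The second check just described amounts to regressing one year of fund returns on class-membership indicators; with a full set of dummies the fitted values are simply the class means, so the coefficient of determination can be computed directly. A sketch with synthetic data and illustrative names, not the paper's data.

import numpy as np

def cross_sectional_r2(annual_returns, class_labels):
    """R-squared of the regression of annual fund returns on class dummies,
    i.e. the share of cross-sectional variance explained by class means."""
    y = np.asarray(annual_returns, dtype=float)
    labels = np.asarray(class_labels)
    fitted = np.empty_like(y)
    for c in np.unique(labels):
        fitted[labels == c] = y[labels == c].mean()   # dummy regression fit = class mean
    ss_res = ((y - fitted) ** 2).sum()
    ss_tot = ((y - y.mean()) ** 2).sum()
    return 1.0 - ss_res / ss_tot

rng = np.random.default_rng(11)
labels = rng.integers(0, 5, size=186)                 # five hypothetical classes
y = np.array([0.02, 0.05, 0.08, 0.11, 0.15])[labels] + rng.normal(0, 0.03, 186)
print(round(cross_sectional_r2(y, labels), 3))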


6. Conclusions

Financial classification can help investors and financial practitioners to distinguish and compare different financial products and to support their investment decisions. However, reliable and objective classifications are not always freely available. A classification algorithm for mutual funds style analysis has been proposed, which is based on a combination of statistical tools and uses the time series of returns to identify clusters of funds characterized by the same style. Funds are partitioned into different classes through an evolutionary clustering algorithm that runs on pre-selected principal components; then, the characterizing styles of each class are identified by using Sharpe's constrained regression model.

The resulting classification of a sample of Italian equity funds is compared with Assogestioni's, which is based on close monitoring of funds' portfolio holdings. The new classification matches Assogestioni's to a significant extent, but is also more parsimonious and accounts for a larger proportion of out-of-sample cross-sectional variation in returns. The new method succeeds in extracting relevant information from historical data, and can therefore be considered a low-cost alternative to institutional classifications. Furthermore, since the analysis of past returns cannot be influenced by the misreporting of portfolio holdings, it may be used as a tool to monitor managers' investment behaviour.

Acknowledgements

A research grant from the University of Modena and Reggio Emilia is gratefully acknowledged. The authors wish to thank two anonymous referees for useful comments and suggestions.

Appendix A

Table 5 shows the 20 largest eigenvalues, sorted in decreasing order, of the monthly returns time series data matrix (186 funds by 60 monthly observations). The first column reports the rank, the second the eigenvalues, and the third and the fourth, respectively, the simple and cumulative variability shares. Fig. 3 displays the scree plot.

Table 5. The 20 largest eigenvalues, sorted in decreasing order, of the monthly returns time series data matrix (186 funds by 60 monthly observations).

Rank   Eigenvalue   Share (%)   Cumulated (%)
1      0.0357417    51.88       51.88
2      0.0087134    12.65       64.53
3      0.0050342     7.31       71.84
4      0.0037016     5.37       77.21
5      0.0027646     4.01       81.22
6      0.0013502     1.96       83.18
7      0.0011261     1.63       84.82
8      0.0008972     1.30       86.12
9      0.0007472     1.08       87.20
10     0.0006878     1.00       88.20
11     0.0006108     0.89       89.09
12     0.0005487     0.80       89.88
13     0.0005206     0.76       90.64
14     0.0004377     0.64       91.28
15     0.0004279     0.62       91.90
16     0.0003870     0.56       92.46
17     0.0003355     0.49       92.95
18     0.0003291     0.48       93.42
19     0.0003135     0.46       93.88
20     0.0002969     0.43       94.31


Fig. 3. Scree plot of the eigenvalues (vertical axis) against their rank, 1-20 (horizontal axis).

Appendix B

Table 6 shows the scatter matrices of the GAME classification scheme against Assogestioni's for the optimal partitions identified by GAME (VRC) for g = 2, ..., 10, when the profile data matrix is composed of the first 10 principal components of the return time series of the 186 Italian mutual funds. The numbers in brackets in the original table indicate the differences with respect to the optimal partitions identified by GAME (VRC) when the profile data matrix is composed of the raw return time series. For each partition the table reports the progressive cluster numbers, the number of funds allocated to each cluster by GAME, the mean values of the returns of the cluster centroids and their standard errors; clusters are ordered by increasing mean returns.

Fig. 4 displays the VRC optimal values when the number of clusters, g, varies from 2 to 10 and the profile data matrix corresponds to the first 10 principal components; no optimal number of clusters, g*, can be identified by the iterative procedure.

Fig. 4. VRC optimal values (vertical axis) when the number of clusters, g, varies from 2 to 10 and the profile data matrix corresponds to the first 10 principal components.


Table 6. Scatter matrices of the GAME classification scheme against Assogestioni's: optimal partitions identified by GAME (VRC) on the first 10 principal components for g = 2, ..., 10. Each partition is reported as cluster: number of funds, mean value (%), S.E. (%) of the cluster-centroid returns, with clusters ordered by increasing mean returns.

g = 2:  1: 129, 1.12, 5.15;  2: 57, 2.08, 6.84
g = 3:  1: 32, 0.37, 6.26;  2: 101, 1.39, 4.82;  3: 53, 2.10, 6.92
g = 4:  1: 20, 0.32, 5.68;  2: 16, 0.75, 7.29;  3: 96, 1.38, 4.69;  4: 54, 2.09, 6.91
g = 5:  1: 20, 0.32, 5.68;  2: 13, 0.50, 7.16;  3: 58, 1.30, 4.54;  4: 42, 1.52, 5.17;  5: 53, 2.10, 6.92
g = 6:  1: 20, 0.32, 5.68;  2: 13, 0.50, 7.16;  3: 60, 1.30, 4.59;  4: 35, 1.52, 4.83;  5: 6, 1.60, 6.96;  6: 52, 2.11, 6.94
g = 7:  1: 20, 0.32, 5.68;  2: 13, 0.50, 7.16;  3: 60, 1.31, 4.56;  4: 33, 1.47, 4.89;  5: 11, 1.68, 5.89;  6: 2, 2.11, 8.77;  7: 47, 2.14, 7.01
g = 8:  1: 19, 0.32, 5.62;  2: 13, 0.44, 7.20;  3: 18, 1.30, 5.28;  4: 51, 1.30, 4.51;  5: 25, 1.54, 4.64;  6: 11, 1.68, 5.89;  7: 2, 2.11, 8.77;  8: 47, 2.14, 7.01
g = 9:  1: 3, 0.31, 6.79;  2: 19, 0.32, 5.62;  3: 11, 0.54, 7.21;  4: 42, 1.20, 4.50;  5: 23, 1.48, 4.79;  6: 25, 1.54, 4.64;  7: 15, 1.77, 6.09;  8: 6, 1.60, 6.96;  9: 42, 2.17, 7.11
g = 10: 1: 3, 0.31, 6.79;  2: 19, 0.32, 5.62;  3: 11, 0.54, 7.21;  4: 39, 1.21, 4.57;  5: 10, 1.30, 4.17;  6: 17, 1.53, 4.93;  7: 27, 1.52, 4.87;  8: 11, 1.68, 5.89;  9: 2, 2.11, 8.77;  10: 47, 2.14, 7.01

Legend of Assogestioni investment styles: Euro Area (AZAE), America (AZAM), European (AZEU), International NT&T (AZINNT), International all sectors (AZINTS), Italy (AZIT), Pacific (AZPA), Emerging (AZPE), Other equity styles (AZAS).


References

Bandyopadhyay, S., Maulik, U., 2002. Genetic clustering for automatic evolution of clusters and application to image classification. Pattern Recognition 35, 1197-1208.


Bandyopadhyay, S., Murthy, C.A., Pal, S.K., 1995. Pattern classification with genetic algorithm. Pattern Recognition Lett. 16, 801-808.
Bandyopadhyay, S., Murthy, C.A., Pal, S.K., 1998. Pattern classification using genetic algorithm: determination of H. Pattern Recognition Lett. 19, 1171-1181.
Banfield, J.D., Raftery, A.E., 1993. Model-based Gaussian and non-Gaussian clustering. Biometrics 49, 803-821.
Bohte, Z., Cepar, D., Kosmelj, K., 1988. Clustering of time series. In: COMPSTAT. Physica-Verlag, Wurzburg, pp. 587-593.
Brown, S.J., Goetzmann, W.N., 1997. Mutual fund styles. J. Finan. Econom. 43, 373-399.
Brucker, P., 1978. On the complexity of clustering problems. In: Beckmann, M., Kunzi, H.P. (Eds.), Optimization and Operations Research, Lecture Notes in Economics and Mathematical Systems, Vol. 157. Springer, Berlin, pp. 45-54.
Calinski, T., Harabasz, J., 1974. A dendrite method for cluster analysis. Comm. Statist. 3 (1), 1-27.
Chatterjee, S., Laudato, M., Lynch, L.A., 1996. Genetic algorithms and their statistical applications: an introduction. Comput. Statist. Data Anal. 22, 633-651.
Chiou, Y.-C., Lan, L.W., 2001. Theory and methodology: genetic clustering algorithms. European J. Oper. Res. 135, 413-427.
Connor, G., Korajczyk, R., 1986. Performance measurement with the arbitrage pricing theory: a new framework for analysis. J. Finan. Econom. 15, 373-394.
Cowgill, M.C., Harvey, R.J., Watson, L.T., 1999. A genetic algorithm approach to cluster analysis. Comput. Math. Appl. 37, 99-108.
Cucurachi, P.A., 2000. I peer groups nella valutazione della performance dei fondi comuni di investimento azionari [Peer groups in the performance evaluation of equity mutual funds]. Econom. Manage. 1, 103-112.
Everitt, B.S., 1993. Cluster Analysis, 3rd Edition. Halsted Press, London, UK.
Fabbris, L., 1997. Statistica Multivariata [Multivariate Statistics]. McGraw-Hill, Milano, pp. 337-338.
Falkenauer, E., 1998. Genetic Algorithms and Grouping Problems. Wiley, Chichester.
Flury, B., 1988. Common Principal Components and Related Multivariate Models. Wiley, New York.
Friedman, H.P., Rubin, J., 1967. On some invariant criteria for grouping data. J. Amer. Statist. Assoc. 63, 1159-1178.
Goldberg, D.E., 1989. Genetic Algorithms in Search, Optimization, and Machine Learning. Addison-Wesley, Reading, MA.
Hall, L.O., Ozyurt, B., Bezdek, J.C., 1998. Clustering with a genetically optimized approach. IEEE Trans. Evol. Comput. 3 (2), 103-112.
Holland, J.H., 1975. Adaptation in Natural and Artificial Systems. University of Michigan Press, Ann Arbor.
Kim, T., Stone, D., White, H., 2000. Asymptotic and Bayesian confidence intervals for Sharpe's style weights. Discussion Papers, University of Nottingham.
Krishna, K., Murthy, M.N., 1999. Genetic k-means algorithm. IEEE Trans. Systems Man Cybernet. 29 (3), 433-439.
Lucas, L., Riepe, M.W., 1996. The role of returns-based style analysis: understanding, implementing and interpreting the technique. Working Paper, Ibbotson Associates.
Marriott, F.H.C., 1971. Practical problems in a method of cluster analysis. Biometrics 27, 501-514.
Marriott, F.H.C., 1982. Optimization methods of cluster analysis. Biometrics 69 (2), 417-422.
Maulik, U., Bandyopadhyay, S., 2000. Genetic algorithm-based clustering technique. Pattern Recognition 33, 1455-1465.
Paterlini, S., Minerva, T., 2003. Evolutionary approaches for cluster analysis. In: Bonarini, A., Masulli, F., Pasi, G. (Eds.), Soft Computing Applications. Springer, Berlin, pp. 167-178.
Raghavan, V.V., Birchand, K., 1979. A clustering strategy based on a formalism of the reproductive process in a natural system. In: Proceedings of the Second Annual International ACM SIGIR Conference on Information Storage and Retrieval: Information Implications into the Eighties, Dallas, TX, 27-28 September 1979. ACM Press, New York, pp. 10-22.
Sharpe, W.F., 1992. Asset allocation: management style and performance measurement. J. Portfolio Manage. 18 (Winter), 7-19.
Srinkanth, R., George, R., Warsi, N., Prabhu, D., Petri, F.E., Buckles, B.P., 1995. A variable-length genetic algorithm for clustering and classification. Pattern Recognition Lett. 16, 789-800.
