Statistica Neerlandica (2016) Vol. 70, nr. 3, pp. 229–259. doi:10.1111/stan.12086
Combining homogeneous groups of preclassified observations with application to international trade

Andrea Cerasa*
European Commission, Joint Research Centre (JRC), Institute for the Protection and Security of the Citizens, Ispra (VA)
This article proposes three methods for merging homogeneous clusters of observations that are grouped according to a pre-existing (known) classification. This clusterwise regression problem is particularly compelling in the analysis of international trade data, where transaction prices can be grouped according to the corresponding origin–destination combination. A proper merging of these prices could simplify the analysis of the market without affecting the representativeness of the data, and could highlight commercial anomalies that may hide frauds. The three algorithms proposed are based on an iterative application of the F-test and have the advantage of being extremely flexible: they do not require the final number of clusters to be fixed in advance, and their output depends only on a single tuning parameter. Monte Carlo results show very good performance for all the procedures, whereas the application to two empirical data sets proves the practical utility of the proposed methods for reducing the dimension of the market and isolating suspicious commercial behaviors.

Keywords and Phrases: clusterwise regression, statistically homogeneous merging, recursive F-test, permutation test, international trade, groups-adjusted Rand index.
1 Introduction
The issue of statistical merging has been extensively debated in the literature, and many clustering methods have been proposed (see, for example, Kaufman and Rousseeuw, 2009). In this work, we deal with a classification problem different from the typical context of cluster analysis. Here, the final aim is not to form homogeneous clusters of individual observations, but rather to form homogeneous clusters of groups of observations, where the groups are defined on the basis of a pre-existing classification. In particular, we assume that we observe two variables, Y and X, in a sample of N units, with Y linearly dependent on X. Observations belong to m known distinct
[email protected] This is an open access article under the terms of the Creative Commons Attribution-NonCommercial-NoDerivs License, which permits use and distribution in any medium, provided the original work is properly cited, the use is non-commercial and no modifications or adaptations are made. © 2016 The Authors Statistica Neerlandica published by John Wiley & Sons Ltd on behalf of Netherlands Society for Statistics and Operations Research Published by Wiley Publishing, 9600 Garsington Road, Oxford OX4 2DQ, UK and 350 Main Street, Malden, MA 02148, USA.
groups, so that the whole sample can be represented as $\{Y_{i,g}, X_{i,g}\}$, $i = 1, \ldots, N$, where g is the label of the group containing observation i. For ease of notation, we can label the m groups with the integers $1, \ldots, m$. Within each group, the linear relation between Y and X can be expressed as

$$Y_{i,g} = \beta_g X_{i,g} + \epsilon_{i,g}, \qquad \epsilon_{i,g} \sim N(0, \sigma_g), \tag{1}$$
where the constant term is suppressed, in line with the application field considered in this article. Its presence, however, would be likely to affect only the execution time of the proposed procedure, not its statistical properties, as will be clarified later. In expression (1) we have a total of m linear relations, and our final target is to investigate the possibility of reducing this number without affecting the statistical information embedded in the original m groups. In other words, we assume that the true data generating process (DGP) involves only k linear relations (with k < m), representing a partition of the original m populations. Therefore, we need a method for combining the m labels into k clusters of labels that are homogeneous with respect to the regression coefficients β. At the end of the process, the result will be an estimate $\hat{k}$ of the final number of clusters, consistent with a predetermined confidence level, and a $\hat{k}$-element partition G of the set of labels $\{1, 2, \ldots, m\}$, that is,

$$G = \{G_1, \ldots, G_{\hat{k}}\}, \qquad G_j \subseteq \{1, \ldots, m\} \;\; \forall j \in \{1, \ldots, \hat{k}\},$$
$$G_j \cap G_l = \emptyset \;\; \forall j \neq l, \qquad \bigcup_{j=1}^{\hat{k}} G_j = \{1, \ldots, m\}.$$
As a consequence, model (1) will reduce to

$$Y_{i,g} = \beta_{G_j} X_{i,g} + \epsilon_{i,G_j}, \qquad \epsilon_{i,G_j} \sim N(0, \sigma_{G_j}), \tag{2}$$
with $G_j$ representing the partition element that includes the group label g. After this formal statement of the problem, it is clear that typical clustering methods are not suitable for finding a solution, because they provide a partition of the observations and not of groups of observations. At the same time, previous studies on hierarchical cluster merging do not fit our context. Their starting point is a finite mixture density, so the assumption is that the $\mathbb{R}^p$-valued observations $\mathbf{x}_1, \ldots, \mathbf{x}_N$ are i.i.d. and distributed according to the probability density function

$$f(\mathbf{x}; \pi) = \sum_{j=1}^{m} \pi_j f_j(\mathbf{x}),$$
where $\pi_j$ is the probability that an observation belongs to the jth sub-population, $f_j(\mathbf{x})$ is generally the p-dimensional Gaussian density, and the number m of clusters is usually determined through the Bayesian information criterion. Hierarchical cluster merging methods were proposed in order to reduce the number of clusters m appropriately. For example, Hennig (2010) provides a method based on concepts of unimodality and misclassification, whereas Baudry et al. (2010) suggest a second stage of analysis based on the integrated completed likelihood criterion. Despite the flexibility and variety of these methods, they are not suitable for our clustering problem. Firstly, we know both the value of m and the group each observation belongs to: we start from a pre-existing classification and do not need to estimate it through a finite mixture density model. Secondly, hierarchical cluster merging algorithms look for groups that are homogeneous with respect to the p-dimensional distribution of the observations. Here, instead, the objective is to form homogeneous groups of linear relations, that is, to merge the groups g with the same $\beta_g$ in expression (1).

We provide a statistical method that starts from the most general case, where each of the m linear relations is considered as a single group, iteratively applies the well-known F-test to reduce the number of groups, and stops when no further merging is statistically justifiable. Three different algorithms are presented. We avoid the potentially excessive computational burden due to the recursive structure of the algorithms by considering only groups with contiguous β estimates; this, as proven in Appendix A, has no effect on the intermediate search for the optimal sub-models. In addition to their innovative character, the proposed methods have the advantage of not requiring the final number of clusters to be fixed in advance, as is typically done in clustering techniques, where determining the number of clusters is still an open issue. Finally, the results of our procedure rely only on the choice of a single tuning parameter, namely the significance level of the iterative F-test, which defines the stopping rule and whose interpretation is explained in section 3.4.

The performance and accuracy of the algorithms were measured through a Monte Carlo experiment. For evaluating the accuracy of the final clustering, a variant of the Rand index (RI) was developed (Appendix B), tailored to the particular nature of the problem. Experimental results showed very satisfactory performance for all three algorithms, with an accuracy above 90% in most cases.

Despite the generality of the statistical problem solved by the proposed method, we present it with a particular focus on international trade data applications, where this group merging task is particularly important, as extensively motivated in section 2. Moreover, the algorithms are applied to two practical cases. These empirical examples show how the proposed method can be an effective and efficient working tool for customs and member states (MS) of the European Union (EU), and how it can help in detecting anomalous commercial behaviors. The application of the procedures to two real data sets confirmed their utility for a summary representation of the market and for highlighting suspicious prices.

The paper proceeds as follows. The operational context and the motivating application are described in the next section. In the third section, three different algorithms for the homogeneous aggregation of groups are presented, together with some considerations on the interpretation of the significance level and on their computation times. In section 4, the performance of the algorithms is tested and evaluated through a Monte Carlo experiment. Finally, in section 5, the algorithms are applied to some practical cases. Section 6 concludes and proposes possible developments.
2 Operational context and motivating application

International transactions of goods coming from non-EU countries are registered daily by national customs offices. The relevant data are sent to national institutes of statistics and then to Eurostat, which aggregates them to form monthly quantities and values for each trade flow. Aggregates are collected in the ComExt database and freely distributed on the Eurostat website (http://epp.eurostat.ec.europa.eu/nextweb/). Products are classified according to the combined nomenclature (CN): a classification system based on the harmonized system nomenclature and enriched with further Union subdivisions. At the finest level of detail, the CN classifies more than 10,000 different products.

The statistical analysis of import and export aggregates can shed light on several important questions pertaining to international trade. The detailed information provided by the ComExt database allows not only studying the European international economy from a global perspective but also deepening some aspects concerning particular markets, as for example in Pula and Santabárbara (2012). Moreover, it can help in detecting anomalies and market distortions that threaten the equilibrium of the European economy. European markets are regulated and monitored by the EU institutions with the aim of ensuring perfect competition and equal opportunities for all EU MS. These targets are pursued not only through the establishment and acceptance of a set of common rules and commercial agreements for the internal market but also with a regulatory framework for transactions involving third countries. In particular, imports into EU countries from non-EU partners are subject to tariff rates, import duties, and protection measures such as anti-dumping duties and quotas, the infringement of which can cause market distortions and unfair competition, and can have a negative impact on the EU budget.

In this context, the Joint Research Centre of the European Commission and the European Anti-Fraud Office have been collaborating for more than 10 years on the development of statistical tools for the analysis of international trade data, with particular attention to the detection of potential irregularities in import data. A typical example of such irregularities is the under-declaration of the value of the goods imported by an EU trader, resulting in a loss of import duties or VAT and a consequent economic advantage over other traders. The same result would be obtained by a trader who intentionally substitutes the code of the imported good with the code of a product free from import duty. A market disequilibrium could also be generated by a non-EU exporter benefiting from state aid, who can then reduce the price of the exported good below its fair market level. Finally, misdeclarations in the
import invoices could also hide a money laundering scheme, with characteristics and effects on the market extensively discussed in the Financial Action Task Force reports (FATF, 2006; 2008).

When systematically repeated, these mispricing practices generally result in groups of outliers or linear structures in the scatter plots of traded values versus traded quantities. Figure 1 is a classical example of such patterns. The product considered is aluminum hydroxide, and each point in the scatter plot represents a monthly sum of quantity and value for the transactions between a non-EU country of origin and an EU country of destination. The plot clearly shows several linear structures corresponding to different product prices. They may correspond to different quality levels of the same good or to product codes that are naturally heterogeneous, but they could also be due to the systematic reiteration of the unfair behaviors previously introduced.

Two approaches are mainly followed to analyze such patterns. The first consists in applying robust clustering methods to all the data in order to classify the different linear structures (García-Escudero et al., 2010a). TCLUST-REG (García-Escudero et al., 2010b), RLGA (García-Escudero et al., 2009), and their thinned versions proposed by Cerioli and Perrotta (2014) are examples of robust clustering techniques that we have experimented with at the Joint Research Centre. These methods, however, have three important drawbacks: (i) they require a predetermined value for the number of linear relations; (ii) their output is usually very sensitive to small variations in the tuning constants required for their implementation (e.g., the trimming level and the restriction factor for the variances); and (iii) they ignore the information concerning the origin and the destination of each record. These points are particularly relevant when the analysis is not limited to a single product but involves all the 10,000 products included in the database, as they imply a preliminary analysis of the data that is practically impossible to perform. In addition, the effect on classical clustering procedures of the researcher's perception of what the true clusters are (i.e., what Hennig and Liao, 2013
Fig. 1. Product 28183000 (aluminum hydroxide), 2011 monthly transactions.
called the 'natural human intuition') may lead to problems in the interpretation of the groups. In the second approach, the transactions for each origin–destination (OD) combination are analyzed separately, and an 'OD robust price' is estimated using methods that take into account the possible presence of outliers. Examples of these techniques are the forward search proposed by Riani et al. (2008) and Atkinson et al. (2010) or the least trimmed squares developed by Rousseeuw and Leroy (2005). In practice, once the product and the time span are fixed, it is possible to extract from the ComExt database all the pairs $\{Q_{t,OD_g}; V_{t,OD_g}\}$, where $g = 1, \ldots, m$ and $t = 1, \ldots, T$, representing respectively the aggregated quantity and value traded for the OD combination $OD_g$ in month t. Therefore, for each g, it is possible to calculate an $OD_g$ price by estimating the model

$$V_{t,OD_g} = P_{OD_g} Q_{t,OD_g} + \epsilon_t. \tag{3}$$
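To fix ideas, the per-OD estimation of model (3) can be sketched in a few lines of Python. This is a minimal illustration only: it uses plain OLS through the origin in place of the robust estimators cited above (forward search, least trimmed squares), and the column names quantity, value, and od are hypothetical, not the ComExt schema.

```python
import pandas as pd

def od_prices(df: pd.DataFrame) -> pd.Series:
    """One price P_ODg per OD combination, from V = P * Q + eps (model (3))."""
    def slope(group: pd.DataFrame) -> float:
        q, v = group["quantity"].to_numpy(), group["value"].to_numpy()
        return float(q @ v / (q @ q))  # OLS slope through the origin
    return df.groupby("od").apply(slope)
```

In practice, a robust fit would replace the closed-form slope, so that isolated anomalous transactions do not distort the estimated OD price.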
The robust estimation of model (3) for all g produces a list of m OD prices $P_{OD_g}$, one for each OD combination. For the product represented in Figure 1, m = 29, corresponding to 29 linear relations. However, considering again the mass of data involved in the analysis, this list can be long; it is not unusual to encounter in ComExt files products with more than 300 OD combinations. It is evident that the consequent large number of OD prices resulting from the estimation of expression (3) for all products makes the analysis of the market quite difficult, and possible commercial distortions may elude even a careful examination. Most importantly, the price estimated through model (3) does not consider the effect of time. If the time span considered is 3 or 4 years long, a duration that in economics is usually defined as the medium run, the assumption of constant prices is too strong and may not represent the real situation of some markets.

Finding a statistical method for reducing the number of prices without affecting the representativeness of the market is therefore a fundamental task in simplifying the analysis of international trade and, above all, in improving the detection of anomalous OD prices. Looking again at Figure 1, where m = 29, it is clear that four or five regression lines, that is, four or five prices, could be enough to fully represent the data set. The large number of products requires a statistical procedure that is flexible and as free as possible from tuning constants, so that it can represent an effective and efficient working tool for customs in MS, to be periodically and automatically applied to all the 10,000 CN products. The algorithms introduced in the next section match these requirements: they are extremely flexible, and their results rely only on the significance level for the stopping rule. The final clusters obtained represent groups with a clear interpretation, inherited from the pre-existing classification; in this sense, the approach avoids the possible problems due to natural human intuition. Moreover, even if the methods are presented assuming prices constant in time, they can be easily adapted to include a time effect on prices, as shown in section 5. A final simplification adopted at this stage of the analysis concerns the possible presence of outliers, which, without loss of generality, is not taken into account here. Considerations on this
issue, whose importance in international trade is extensively discussed in Perrotta and Torti (2010), are left to the conclusions.
3 Grouping algorithms
In this section, three different algorithms for merging homogeneous groups are presented. All of them make use of the F-test, require only a significance level α for the stopping rule, and return two outputs: (i) an estimate $\hat{k}$ of the number of clusters and (ii) a statistically acceptable partition of the m groups into $\hat{k}$ homogeneous clusters. The evaluation of the three algorithms should take into consideration not only the accuracy of the resulting classification but also its execution time. Because of their iterative nature, large values of m could represent a serious barrier and make a slow algorithm unsuitable for practical applications.

3.1 Algorithm I (A.I)

The definition of this merging strategy follows the approach suggested by Hennig (2010). Its steps can be summarized as:

1. Start by considering each of the m groups as a separate group;
2. Find the pair of groups most likely to merge (i.e., the most homogeneous pair at the current stage);
3. Verify with an F-test whether the restricted model corresponding to merging the pair is statistically acceptable (i.e., it cannot be rejected according to the chosen α); and
4. If merged, repeat from step 2; otherwise stop and return the current classification.

Therefore, we start by considering each of the m groups as an element of the initial partition, that is, $\hat{k} = m$ and

$$G_1 = \{1\}; \;\; G_2 = \{2\}; \;\; \ldots; \;\; G_{\hat{k}} = \{\hat{k}\} = \{m\}. \tag{4}$$
Assuming furthermore that

$$\sigma_{G_j} = \sigma \quad \forall j, \tag{5}$$
expression (2) can be generalized as

$$Y_{i,g} = [\beta_{G_1} I_{G_1}(g) + \ldots + \beta_{G_{\hat{k}}} I_{G_{\hat{k}}}(g)]\, X_{i,g} + \epsilon_i, \qquad \epsilon_i \sim N(0, \sigma), \tag{6}$$
where $I_{G_j}(g)$ is the indicator function, with $I_{G_j}(g) = 1 \iff g \in G_j$. This equation can be easily estimated by ordinary least squares (OLS) and represents the unrestricted model, that is, the model with m linear relations. The first step consists in verifying whether it is possible to reduce its dimension by one. This can be done by testing the null $H_0: \beta_{G_j} = \beta_{G_l}$ ($j \neq l$) against the alternative $H_1: \beta_{G_j} \neq \beta_{G_l}$ with an F-test. According to step 2 of the merging strategy, the pair (j, l) to be tested should be the
most likely to merge, which in this case corresponds to the pair with the lowest value of the F statistic (and consequently the highest p-value). This involves the calculation of $\hat{k}(\hat{k}-1)/2$ values of the test at each step, one for each possible combination (j, l) with $j \neq l$. Alternatively, we can use the fact that the restricted model with the lowest value of the F-test corresponds to the one with the lowest sum of squared residuals (SSR).† If we denote by $SSR(\beta_{G_j} = \beta_{G_l})$ the SSR of the restricted model that assumes $\beta_{G_j} = \beta_{G_l}$, the minimization problem can be expressed as

$$(j^*, l^*) \;\big|\; SSR(\beta_{G_{j^*}} = \beta_{G_{l^*}}) = \min_{(j,l):\, j \neq l} \{SSR(\beta_{G_j} = \beta_{G_l})\}.$$
Suppose now, without loss of generality, that $\hat{\beta}_{G_1} \leq \hat{\beta}_{G_2} \leq \ldots \leq \hat{\beta}_{G_{\hat{k}}}$, where the $\hat{\beta}_{G_j}$ are the OLS estimates of model (6). It is possible to prove (Appendix A) that, in this situation, the best pair is necessarily formed by two contiguous indices ($j^*$ and $j^*+1$) corresponding to two contiguous OLS estimates ($\hat{\beta}_{G_{j^*}}$ and $\hat{\beta}_{G_{j^*+1}}$). In other words, the minimization problem reduces to

$$j^* \;\big|\; SSR(\beta_{G_{j^*}} = \beta_{G_{j^*+1}}) = \min_{j = 1, \ldots, \hat{k}-1} SSR(\beta_{G_j} = \beta_{G_{j+1}}), \tag{7}$$
which involves the estimation and comparison of only $\hat{k} - 1$ restricted models.‡ After finding the best model with $\hat{k} - 1$ linear relations, the corresponding value of the F-test determines whether it can be rejected or not. If not, the value of $\hat{k}$ is updated to $\hat{k} - 1$, and the procedure continues by looking for the best restricted model in the same way. The procedure stops when the F-test rejects the restricted model according to the chosen significance level, and it returns an estimate $\hat{k}$ of the number of groups and a partition of the m initial groups into $\hat{k}$ sets.

Table 1 presents an example of A.I in a generic case with m = 6. In the first iteration, there are five possible restricted models with $\hat{k} = 5$. Among them, we select the one that minimizes the constrained SSR (in the example, model i). Then we test it against the unrestricted model using an F-test with (1, N − 6) degrees of freedom. If it is not rejected, the restricted model is used in the second iteration to build the four restricted models with $\hat{k} = 4$. The best of them (in this case, model iii) is tested again against the unrestricted one using an F-test with (2, N − 6) degrees of freedom, and so on. Therefore, in each iteration s, the procedure looks for the best model nested in the optimal model obtained (and not rejected) in the previous iteration s − 1. The final result of A.I may then be affected by 'path dependence', because the model in each step depends on the optimal model found in the previous one. For example, if two groups (j, l) not belonging to the same partition set are merged at some iteration of the algorithm, this selection-bias mistake will persist in all subsequent iterations. The next algorithm is introduced to evaluate the effect of path dependence on the results of A.I. A code sketch of A.I is given after Table 1.

† The restricted model with the lowest SSR is also the one that maximizes the likelihood.
‡ This simplification cannot be adopted if the model also includes an intercept, unless the intercept is assumed to be the same in each group.
Table 1. An example of the application of A.I to the simple case m = 6

                       β1 ≤ β2 ≤ β3 ≤ β4 ≤ β5 ≤ β6
  Unrestricted model   β1   β2   β3   β4   β5   β6

  Iteration 1
  → i     β12    β3     β4     β5     β6        not rej.
    ii    β1     β23    β4     β5     β6
    iii   β1     β2     β34    β5     β6
    iv    β1     β2     β3     β45    β6
    v     β1     β2     β3     β4     β56

  Iteration 2
    i     β123   β4     β5     β6
    ii    β12    β34    β5     β6
  → iii   β12    β3     β45    β6               not rej.
    iv    β12    β3     β4     β56
  …

Note: The right arrows point out the best restricted models in each iteration.
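As anticipated above, A.I can be sketched compactly in code. The following Python rendering is illustrative only (the paper's routines were written in Matlab): it assumes regressions through the origin with a common error variance, restricts candidate merges to contiguous slope estimates as justified by Appendix A, and tests each best merge against the original m-slope unrestricted model, as in Table 1. All function names are ours.

```python
import numpy as np
from scipy import stats

def fit_ssr(x, y, labels, clusters):
    """Sum of squared residuals of the model with one through-origin
    slope per cluster; clusters is a list of sets of group labels."""
    ssr = 0.0
    for c in clusters:
        mask = np.isin(labels, list(c))
        q, v = x[mask], y[mask]
        beta = q @ v / (q @ q)
        ssr += ((v - beta * q) ** 2).sum()
    return ssr

def algorithm_I(x, y, labels, alpha=0.01):
    """Iterative F-test merging (A.I). Returns the final partition."""
    groups = np.unique(labels)
    n, m = len(y), len(groups)
    clusters = [{g} for g in groups]
    ussr = fit_ssr(x, y, labels, clusters)        # unrestricted model, m slopes

    def slope(c):
        mask = np.isin(labels, list(c))
        return x[mask] @ y[mask] / (x[mask] @ x[mask])

    while len(clusters) > 1:
        clusters.sort(key=slope)                  # only contiguous pairs matter
        candidates = [clusters[:j] + [clusters[j] | clusters[j + 1]]
                      + clusters[j + 2:] for j in range(len(clusters) - 1)]
        ssrs = [fit_ssr(x, y, labels, c) for c in candidates]
        best = int(np.argmin(ssrs))               # lowest restricted SSR
        q = m - (len(clusters) - 1)               # restrictions vs m-slope model
        f = ((ssrs[best] - ussr) / q) / (ussr / (n - m))
        if stats.f.sf(f, q, n - m) < alpha:       # best merge rejected: stop
            break
        clusters = candidates[best]
    return clusters
```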
3.2 Algorithm II (A.II)

The starting model is again the unrestricted model (6) with m linear relations, but now the steps of the merging strategy are defined as:

1. Start by considering each of the m groups as a separate group and set $\hat{k} = 1$;
2. Find the best $\hat{k}$-partition of the m groups (i.e., the one that minimizes the SSR);
3. Verify with an F-test whether the restricted model corresponding to the optimal $\hat{k}$-partition is statistically acceptable; and
4. If rejected, set $\hat{k} = \hat{k} + 1$ and repeat from step 2; otherwise stop and return the current classification.

In this case, instead of gradually reducing the dimension of the unrestricted model and stopping when a reduced model is rejected, we start with the smallest value of $\hat{k}$ and increment it by one in each iteration, until a reduced model cannot be rejected. In other words, starting with $\hat{k} = 1$, the restricted model with a single group is estimated and tested against the unrestricted one, again through an F-test. If it is rejected according to the chosen α, the value of $\hat{k}$ is incremented by one. Considering all the possible partitions of the m groups into $\hat{k} = 2$ sets, we look for the one that minimizes the SSR and test it against the unrestricted model. If rejected, the algorithm proceeds with a larger value of $\hat{k}$; otherwise it stops. As before, the output consists of an estimate $\hat{k}$ of the number of final groups and a classification of the m initial groups into $\hat{k}$ sets.

The combinatorial approach of A.II could pose a problem, because it can be extremely time-consuming, especially when the value of $\hat{k}$ increases. Consider, for example, the case of m = 20: when $\hat{k} = 1$, we obviously have only one possible partition to test. If the restricted model is rejected, the number of all possible partitions to test with $\hat{k} = 2$ is
$$\frac{1}{\hat{k}!} \sum_{j=0}^{\hat{k}-1} (-1)^j \binom{\hat{k}}{j} (\hat{k} - j)^m = 524{,}287,$$

and, in case of a rejection, this number increases further. It is obvious that under these conditions the algorithm is viable neither in a simulation experiment nor in practical applications. Again, the number of restricted models to test can be strongly reduced by considering that, in each iteration, only a partition formed by groups with contiguous estimates can be the optimal one.§ So, assuming again m = 20, with $\hat{k} = 2$ the restricted models to estimate are just 19, whereas with $\hat{k} = 3$ they are 171.¶ Even though this number also increases quickly with $\hat{k}$ and m, in an experimental design these values can be controlled in advance to avoid excessively time-consuming routines.

Considering again the simple case of m = 6, Table 2 provides an example of the application of A.II. Supposing that the trivial sub-model with $\hat{k} = 1$ has been rejected, the table starts by considering the case $\hat{k} = 2$. The number of possible partitions of six contiguous elements into two sets is 5, and among them only the best one is tested against the unrestricted model, using an F-test with (4, N − 6) degrees of freedom. If it is rejected, the 10 possible partitions of six contiguous elements into three sets are considered, and the best of them is tested against the unrestricted model using an F-test with (3, N − 6) degrees of freedom, and so on. Therefore, this algorithm is not divisive, that is, the best model obtained in the generic iteration s is not necessarily related to the model obtained in iteration s − 1. In this sense, the results of A.II can be extremely useful to evaluate the effect of path dependence on A.I. Both algorithms are based on the same assumptions, on the same starting model and on the F-test, but A.I reaches the final classification gradually, whereas A.II directly compares all possible alternatives for a fixed $\hat{k}$. If both methods lead to the same classifications in the experiments, then A.I should be preferred, as it is expected to be faster. Otherwise, it is clear that the agglomerative hierarchy of A.I leads to some sub-optimal classification.

§ Also in this case, the simplification cannot be adopted if the model includes a fixed term, unless it is supposed to be the same for all the groups.
¶ The number of all possible restricted models in this case is given by $\binom{m-1}{\hat{k}-1}$.
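The gain from the contiguity restriction is easy to verify numerically: partitions into $\hat{k}$ contiguous blocks correspond to choices of $\hat{k} - 1$ split points among the m − 1 gaps between the ordered estimates. The short sketch below (ours, for illustration) reproduces the counts just quoted.

```python
from itertools import combinations
from math import comb

def contiguous_partitions(seq, k):
    """Partitions of an ordered sequence into k contiguous blocks."""
    n = len(seq)
    for cuts in combinations(range(1, n), k - 1):
        bounds = (0, *cuts, n)
        yield [seq[a:b] for a, b in zip(bounds, bounds[1:])]

m = 20
assert sum(1 for _ in contiguous_partitions(range(m), 2)) == comb(m - 1, 1)  # 19
assert sum(1 for _ in contiguous_partitions(range(m), 3)) == comb(m - 1, 2)  # 171
assert 2 ** (m - 1) - 1 == 524_287  # unrestricted 2-set partitions, S(20, 2)
```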
Table 2. An example of the application of A.II to the simple case m = 6

                       β1 ≤ β2 ≤ β3 ≤ β4 ≤ β5 ≤ β6
  Unrestricted model   β1   β2   β3   β4   β5   β6

  k̂ = 2
    i     β1       β23456
    ii    β12      β3456
  → iii   β123     β456                         rejected
    iv    β1234    β56
    v     β12345   β6

  k̂ = 3
    i     β1       β2       β3456
    ii    β1       β23      β456
    iii   β1       β234     β56
    iv    β1       β2345    β6
  → v     β12      β3       β456                not rejected
    vi    β12      β34      β56
    vii   β12      β345     β6
    viii  β123     β4       β56
    ix    β123     β45      β6
    x     β1234    β5       β6
  …

Note: The right arrows point out the best restricted models in each iteration.
3.3 Algorithm III (A.III)

Classical Gaussian assumptions are not always reliable in international trade. Furthermore, assuming homoscedasticity between groups, as in expression (6), could be too restrictive in practical applications. For this reason, A.III provides a non-parametric alternative to the merging algorithm, free from distributional assumptions. It differs from A.I in that, in each step, the p-values of the F-test are not calculated with respect to the theoretical distribution, but rather according to an empirical distribution obtained through a permutation approach.‖

‖ For a detailed description of permutation methods and tests, see Pesarin and Salmaso (2010).
In other words, in step 3 of A.I, instead of using the asymptotic critical value to verify whether the merging is statistically acceptable, we use a permutation-based critical value. It is important to remark that, if the homoscedasticity assumption does not hold, the restricted model with the minimum SSR selected in step 2 is not necessarily the one that maximizes the likelihood, nor the one with the highest p-value. Consider again the example in Table 1. In the first step, we need to test the null $H_0: \beta_1 = \beta_2$ through the value of the F statistic
$$F_{12} = \frac{SSR(\beta_1 = \beta_2) - USSR}{USSR} \cdot \frac{n - m}{1},$$
where USSR is the sum of the squared residuals of the unrestricted model. Now, if the null is true, then each observation belonging to group 1 could also be part of group 2, and vice versa. This means that the empirical distribution of the $F_{12}$ statistic under the null can be obtained through a number of permutations, say P, of the observations belonging to groups 1 and 2, provided that the sizes of the two groups remain unchanged in each permutation. So, for a generic permutation p, we have
$$F_{12}^{p} = \frac{SSR(\beta_1 = \beta_2) - USSR^{p}}{USSR^{p}} \cdot \frac{n - m}{1},$$
and the null is rejected if
$$\frac{1}{P} \sum_{p=1}^{P} I\left(F_{12}^{p} \geq F_{12}\right) < \alpha,$$
where α is again the chosen significance level. In the second step of Table 1, the null is $H_0: \beta_1 = \beta_2 \,\wedge\, \beta_4 = \beta_5$. The procedure is the same, but now the P permutations involve four groups: observations of groups 1 and 2 are permuted as before and, at the same time, observations of groups 4 and 5 are also permuted. This approach is remarkably slow, because it requires P permutations in each iteration in order to obtain the empirical distribution of the F-test for each pair (j, l) being tested. However, it is also extremely flexible, as it does not require any distributional assumption. For this reason, it has been included in the experimental design, where its robustness to departures from homoscedasticity can be evaluated and compared with that of its parametric counterparts.
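A sketch of this permutation step in Python follows. For brevity it recomputes the statistic from the two groups under test only, whereas the statistic above uses the USSR of the full m-group model; since the restricted SSR and the residuals of the other groups are invariant under the permutation, the simplified ratio is a monotone transform of $F_{12}$ and yields the same empirical p-value. All names are ours.

```python
import numpy as np

def permutation_pvalue(x, y, labels, g1, g2, n_perm=1000, seed=None):
    """Empirical p-value for H0: beta_g1 = beta_g2, permuting observations
    between the two groups while keeping both group sizes fixed."""
    rng = np.random.default_rng(seed)
    idx = np.flatnonzero(np.isin(labels, [g1, g2]))
    n1 = int((labels == g1).sum())

    def stat(lab):
        ussr = 0.0                                # two separate slopes
        for g in (g1, g2):
            q, v = x[lab == g], y[lab == g]
            b = q @ v / (q @ q)
            ussr += ((v - b * q) ** 2).sum()
        q, v = x[idx], y[idx]                     # one common slope
        b = q @ v / (q @ q)
        rssr = ((v - b * q) ** 2).sum()           # invariant to the permutation
        return (rssr - ussr) / ussr

    observed = stat(labels)
    exceed = 0
    for _ in range(n_perm):
        perm = labels.copy()
        shuffled = rng.permutation(idx)
        perm[shuffled[:n1]] = g1                  # reassign the labels at random
        perm[shuffled[n1:]] = g2
        exceed += stat(perm) >= observed
    return exceed / n_perm
```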
3.4 Considerations on the significance level for the stopping rule

The stopping rule of all algorithms is based on a significance level α for the iterative F-test. In A.I and A.III, we stop when a merge is rejected, whereas in A.II, we stop when a restricted model is not rejected. As pointed out by one of the referees, the standard interpretation of the significance level does not hold here, because of the hierarchical nature of the procedures and the data-dependent selection of the models to test in each iteration. For this reason, α is rather to be intended as a tuning parameter of the algorithms. On the other hand, its value can be related to the probability of not ending up with a classification corresponding to the true DGP. The aim of this section is to explain this relation and to clarify how the chosen significance level has to be interpreted.

Consider the application of A.I** provided in Table 1 and suppose that the partition corresponding to the actual DGP is G = {{1,2}, {3}, {4,5,6}}. The value chosen for α corresponds to the probability of rejecting the true null, as in the standard theory of statistical tests, only in the third iteration and only if the restricted model to test is exactly G. But two problems may arise in the previous iterations: (i) a true restricted model may be rejected, ending up with an over-parameterized model, and (ii) a wrong restricted model may be selected (selection bias). Moreover, supposing that the true restricted model in iteration 3 is not rejected, the best restricted model in iteration 4 might also not be rejected, ending up with a wrong (under-parameterized) final classification. It is obvious, then, that α cannot be interpreted as the standard overall significance level of the procedure.

** The same arguments apply also to A.III.

Suppose that selection bias is not an issue, that is, that in each iteration we are sure to select one of the true sub-models. In the first iteration, we estimate five restricted models that are mutually independent and correspond to five specific null hypotheses: model i tests the null $H_0^{i}: \beta_1 = \beta_2$, model ii tests the null $H_0^{ii}: \beta_2 = \beta_3$, and so on. Because selection bias is ruled out, only three of them are of interest, namely i, iv, and v, which correspond to the three true null hypotheses given the actual DGP. If we denote by $F_*^{-1}$ the p-value of the F statistic for testing the restricted model ∗, the probability of rejecting the best among i, iv, and v is

$$\text{Prob}(\max\{F_{i}^{-1}, F_{iv}^{-1}, F_{v}^{-1}\} < \alpha) = \text{Prob}(F_{i}^{-1} < \alpha \,\wedge\, F_{iv}^{-1} < \alpha \,\wedge\, F_{v}^{-1} < \alpha) = \alpha^3.$$

As a consequence, the probability of not rejecting the best true restricted model in iteration 1 is $(1 - \alpha^3)$. Whichever of the three sub-models is chosen and not rejected in iteration 1, we know that in iteration 2 we will have four sub-models: two of them true and two false. Using similar arguments, it is possible to show that the probability of not rejecting the best of the two true sub-models is $(1 - \alpha^2)$. Similarly, the probability of not rejecting the unique true model in iteration 3 is $(1 - \alpha)$. Therefore, the overall probability of not rejecting the true model up to iteration 3 is given by $(1 - \alpha^3)(1 - \alpha^2)(1 - \alpha) \approx 1 - \alpha$, which implies that the probability of ending up with an over-parameterized classification is approximately equal to α.†† However, α cannot be fixed at 0 so as to make this probability exactly 0, because in iteration 4 we need to reject the two (false) restricted models in order to avoid an under-parameterized classification. In other words, the F-test in the fourth iteration should be powerful enough to reject the best of the two restricted models; fixing α = 0 would instead imply no power at all, and we would always end up with $\hat{k} = 1$. Summing up, if the effects of selection bias are negligible, we expect the overall probability of ending up with the correct classification to equal

$$(1 - \alpha^3)(1 - \alpha^2)(1 - \alpha)\,\rho_\alpha^{(4)} \approx (1 - \alpha)\,\rho_\alpha^{(4)},$$
where $\rho_\alpha^{(j)}$ (with a slight abuse of notation, because the letter β has been used for the model parameters) represents the power of the test in iteration j as a function of the chosen significance level α.

The output of A.II could also be influenced by its hierarchical structure, but in a different way with respect to A.I. In this case, the best model in each iteration is not related to the previous ones. Looking at the example of Table 2 and assuming that the actual DGP is again G, this sub-model is estimated only in iteration 3. So, we need to reject the false best sub-models in iteration 1 ($\hat{k} = 1$) and in iteration 2 ($\hat{k} = 2$).
†† To have an idea of the approximation: if we choose α = 0.01, the overall probability of an over-parameterized classification would be equal to 0.0101, whereas with α = 0.05 this probability would be 0.0525. These values remain quite stable as k or m change, because they can be calculated through the following expression:

$$1 - \prod_{i=1}^{m-k+1} (1 - \alpha^i).$$
A. Cerasa
This means that the test should have enough power to avoid an under-parameterized classification. Once in iteration 3 (k̂ = 3), we cannot be sure that the best restricted model to test is exactly v, unless selection bias is not an issue. If this is the case, the probability of rejecting the best restricted model in iteration 3 is Prob(Fv−1 < 𝛼) = 𝛼. Therefore, the overall probability of ending up with the true model is given by (2) 𝜌(1) 𝛼 𝜌𝛼 (1 − 𝛼).
If the true model is rejected, the procedure continues with k̂ = 4, and the algorithm will end up with an over-parameterized classification. A final consideration is on the application of corrections for the significance level, as for example, the Bonferroni or Šidák corrections (Šidák, 1967), which are not necessary in this context. In fact, the previous expressions showed that there are no problems of simultaneous inference and/or multiple testing that can lead to an over-rejection of the true null. 3.5 Computation time In the operational context described in section 2, we stressed that the execution time of the classification algorithm is of primary importance. The large number of products and the potentially large number of different OD combinations can make the algorithm excessively time-consuming and unfeasible in practice. For this reason, we studied the computation times of the proposed algorithms with the following experiment. Starting from the baseline case N = 1000 and m = 20, a sample of N pairs (Yi , Xi ) was randomly extracted from a standard bivariate normal distribution and randomly assigned to one of the m groups. Then the algorithms were applied on the data without fixing a significance level, that is, without stopping rule. It means that A.I and A.III will stop only when k̂ = 1. For A.II, in order to avoid unfeasible calculations, ̂ namely, 5 and 7. Finally, the effect of N and m on we fixed two maximum values for k, computation times is evaluated by monitoring the elapsed time when either of them is doubled while keeping the other at its baseline value. For A.III, also the effect of doubling the number of permutations was evaluated, its benchmark value being 1000. All routines were coded in Matlab (R2014a) and executed on an Intel(R) Core(TM) i3-3110M 2.40 GHz with 16 GB of RAM. Execution times are evaluated through the Matlab function timeit.m. The results are listed in Table 3. As expected, A.I is always the quickest algorithm, with short execution times also for increasing values of N and m. The effect on all algorithms of doubling N is approximately linear, and for A.III, it is very similar to doubling the number of permutations. On the contrary, the effect of doubling m changes depending on the procedure. Whereas A.III seems to be only marginally affected, the execution time for A.I is almost four times as long. However, the effect of doubling m is particularly evident for A.II, especially when k̂ max = 7. This is not surprising given the exponential increase of estimated sub-models discussed in section 3.2 and restates that the interest in studying its performances can be only theoretical. © 2016 The Authors Statistica Neerlandica published by John Wiley & Sons Ltd on behalf of Netherlands Society for Statistics and Operations Research
Table 3. Execution times (in seconds) for the three algorithms

                     Baseline settings                N = 2000   m = 40      n.perm = 2000
                     (N = 1000, m = 20,
                      n.perm = 1000)
  A.I                0.0349                           0.0530     0.1335      —
  A.II (k̂max = 5)    1.3401                           2.5306     25.9690     —
  A.II (k̂max = 7)    13.1257                          21.6005    1968.8118   —
  A.III              1.2963                           2.5308     1.9953      2.4418

Note: The effect of N, m, and the number of permutations is evaluated by doubling each of them while keeping the others at their baseline value.
4 Monte Carlo experiment
We assess the performance of the algorithms through a Monte Carlo experiment in which data are simulated with different regression mixture components. For this purpose, we use a data simulation routine that is part of the FSDA Matlab toolbox (Riani et al., 2012). More precisely, we use a generalization to the regression context of the FSDA MixSim.m function, which implements the multivariate MixSim approach of Maitra and Melnykov (2010). The original R package and its FSDA Matlab version are described respectively in Melnykov et al. (2012) and Riani et al. (2015). We have chosen this data simulation framework because samples are generated from mixture distributions according to a prespecified overlap, defined as the sum of the misclassification probabilities. Therefore, there is no need to specify the desired model parameters, but only some (up to two) constraints on the overlap (its maximum, average, or standard deviation). This is carried out using the AS 155 algorithm of Davies (1980), on the basis of the cumulative distribution function of linear combinations of independent non-central chi-squared and normal random variables. In particular, for our Monte Carlo experiment, we use the following settings:

• N: the number of simulated observations, fixed at 1000 in all experiments.
• k: the number of sets in the partition of the labels (i.e., the number of linear structures in the data). Consistently with most of the international trade data analyzed in practical applications, we select k = 2, 3, and 4.
• β: the parameter vector of the k linear relations. The parameters are randomly drawn from a uniform distribution on [10, 80], with the only constraint that, in each simulation, the corresponding angles must be at least 10° apart. These conditions seem realistic from our experience with international trade data, and the flexibility of the FSDA implementation of MixSim offers this possibility.
• ω: this is a key option in MixSim, because it offers the possibility to choose a given misclassification probability or, equivalently, the average overlap between the different linear components. We choose two values for ω, namely 0.01 and 0.15, corresponding to a low and a high level of overlap.
• hom: this parameter decides whether the errors of the k linear components are homoscedastic (hom = true) or heteroscedastic (hom = false). The value (homoscedastic case) or the k values (heteroscedastic case) of the standard error of the residuals are automatically calculated by a function in the MixSim library on the basis of the chosen misclassification probability ω and the parameter vector β. Both values of hom are considered in the experiment, so as to verify the robustness of the algorithms to the homoscedasticity assumption.
• X: the distribution of the independent variable. Among all the possible choices offered by MixSim, we choose the half-normal distribution, that is, X ∼ |N(0, 1)|. It guarantees that X ≥ 0, and its high density close to zero mimics the so-called small trade area (Cerioli and Perrotta, 2014), a high-density region of small-sized transactions close to the origin of the V and Q axes that is typical of international trade.

Fig. 2. Some examples of simulated populations.

Figure 2 shows examples of simulated populations for two particular combinations of settings. The vector of parameters β is the same, so they only differ in the degree of overlap. As expected, a higher overlap leads to a more confused situation, where the true number of linear patterns can be very difficult to predict. Once the k linear structures are generated, they must be split into m groups. This step is also randomized, with the constraint that each of the m groups must contain at least five observations. Two values of m were selected for the experiments: 20 and 40. The combinations of these parameters produce 24 different experiments, covering a wide range of situations that can emerge in real data. For each experiment, the number of simulated populations is 1000, and the significance level for the stopping rule was fixed at 0.01 for all algorithms. Finally, for A.III, the number of permutations in each iteration is 1000. A simplified simulation sketch is given below.
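The sketch below is a deliberately simplified Python stand-in for the MixSim-based generator (which is part of the FSDA Matlab toolbox): it reproduces the half-normal regressor, the slopes drawn in [10, 80], and the random splitting of the k components into m pre-classified groups, but it replaces the overlap calibration with a fixed noise level and omits the 10° angle constraint and the minimum group size. All names and default values are ours.

```python
import numpy as np

def simulate_population(n=1000, k=3, m=20, sigma=5.0, seed=None):
    """Simplified DGP: k through-origin lines, half-normal X,
    observations carrying one of m pre-existing group labels."""
    rng = np.random.default_rng(seed)
    beta = np.sort(rng.uniform(10, 80, size=k))        # slopes of the k lines
    comp = rng.integers(0, k, size=n)                  # latent line membership
    x = np.abs(rng.standard_normal(n))                 # X ~ |N(0, 1)|
    y = beta[comp] * x + sigma * rng.standard_normal(n)
    # map the m group labels onto the k components (each component gets >= 1)
    group_of = np.concatenate([np.arange(k), rng.integers(0, k, size=m - k)])
    labels = np.empty(n, dtype=int)
    for c in range(k):
        rows = np.flatnonzero(comp == c)
        gs = np.flatnonzero(group_of == c)             # groups tied to line c
        labels[rows] = rng.choice(gs, size=len(rows))
    return x, y, labels
```

Each simulated population can then be fed to the grouping algorithms, whose output is compared with the true k-partition of the labels.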
4.1 Evaluation of the grouping algorithms

The application of the algorithms to each simulated population generates two results: an estimate $\hat{k}$ of the number of linear relations and a $\hat{k}$-element partition of the original groups.

The accuracy of the former can be evaluated through $\text{Prob}(\hat{k} \neq k)$, that is, the probability of obtaining a biased estimate of the number of elements of the partition. A direct estimate of this probability is provided by the proportion of simulations with misspecified k, denoted $(\hat{k} \neq k)$. For the latter, the starting point is the Rand index (RI; Rand, 1971) or its corrected-for-chance version derived by Hubert and Arabie (1985), one of the most widely known measures for assessing the performance of a classification method. They are suited to canonical cluster analysis, where the objective is to obtain a suitable partition of the observations. Here, the partition involves the m groups of N observations, and two different values of the RI can be calculated: one with respect to the $\hat{k}$-partition of the m groups and another with respect to the resulting $\hat{k}$-partition of the N observations. Both may be inappropriate. The first gives the same importance to each group, independently of the number of observations it contains; misclassifying a group of 100 observations is obviously more serious than misclassifying one of five. The second does not take into account that observations belonging to the same group g are necessarily included in the same partition element, resulting in an overvaluation of the index. Therefore, an adjusted version of the RI, the groups-adjusted Rand index (GARI), is proposed and detailed in Appendix B, in order to eliminate the boosting effect due to the pre-existing classification of the observations. The mean of this index over the 1000 simulations gives an estimate of its expected value and measures the average accuracy of each algorithm. In order to evaluate the effect of the estimate $\hat{k}$ on the accuracy of the classification, two further conditional expected values of GARI are estimated: one conditioned on $(\hat{k} = k)$, that is, limited to the simulations where the number of partition elements is correctly estimated, and the other conditioned on $(\hat{k} \neq k)$, which refers to the simulations where k is wrongly estimated.

4.2 Simulation results

Results of the simulations for the homoscedastic populations are collected in Table 4, whereas Table 5 presents the heteroscedastic cases. As previously mentioned, the significance level for the stopping rule was fixed at 0.01, but further experiments showed that the main results and conclusions remain unchanged with α = 0.05.

At first glance, the tables show that the performance of the algorithms is in general very good. The overall average of GARI is never lower than 99%, and it usually equals 100% when $\hat{k} = k$. If we also consider that the proportion of misspecified k is close to the chosen α, we can conclude that the algorithms end up with a final classification corresponding to the true DGP in about (1 − α) · 100 per cent of the cases. On the basis of what we stated in section 3.4, this is coherent with what we expect when the effects of selection bias are negligible and the test is powerful enough to avoid under-parameterized classifications. Even when k is wrongly estimated, the accuracy of the partition is satisfactory, with values of GARI higher than 0.850.
Table 4. Monte Carlo results (homoscedastic cases)

                          Low overlap (ω = 0.01)                      High overlap (ω = 0.15)
Exp.  k  m   Alg.    (k̂≠k)  Ê(GARI)  Ê(GARI|k̂=k)  Ê(GARI|k̂≠k)   (k̂≠k)  Ê(GARI)  Ê(GARI|k̂=k)  Ê(GARI|k̂≠k)
1     2  20  A.I     0.017  0.998    1.000        0.894          0.015  0.998    1.000        0.891
             A.II    0.017  0.998    1.000        0.891          0.015  0.998    1.000        0.888
             A.III   0.013  0.999    1.000        0.897          0.013  0.999    1.000        0.894
2     3  20  A.I     0.006  1.000    1.000        0.959          0.006  0.999    1.000        0.950
             A.II    0.005  1.000    1.000        0.953          0.005  0.999    1.000        0.949
             A.III   0.009  1.000    1.000        0.951          0.008  0.999    1.000        0.958
3     4  20  A.I     0.006  1.000    1.000        0.974          0.007  0.999    0.999        0.969
             A.II    0.006  1.000    1.000        0.970          0.005  0.999    0.999        0.970
             A.III   0.008  1.000    1.000        0.969          0.004  0.999    0.999        0.979
4     2  40  A.I     0.010  0.999    1.000        0.898          0.013  0.998    1.000        0.899
             A.II    0.010  0.999    1.000        0.878          0.012  0.998    1.000        0.887
             A.III   0.010  0.999    1.000        0.884          0.013  0.998    1.000        0.892
5     3  40  A.I     0.006  1.000    1.000        0.956          0.013  0.998    0.999        0.949
             A.II    0.006  1.000    1.000        0.955          0.012  0.998    0.999        0.948
             A.III   0.007  1.000    1.000        0.954          0.013  0.998    0.999        0.944
6     4  40  A.I     0.011  1.000    1.000        0.972          0.002  0.996    0.996        0.957
             A.II    0.011  1.000    1.000        0.972          0.000  0.996    0.996        —
             A.III   0.007  1.000    1.000        0.973          0.004  0.997    0.997        0.963

Note: GARI, groups-adjusted Rand index.

Table 5. Monte Carlo results (heteroscedastic cases)

                          Low overlap (ω = 0.01)                      High overlap (ω = 0.15)
Exp.  k  m   Alg.    (k̂≠k)  Ê(GARI)  Ê(GARI|k̂=k)  Ê(GARI|k̂≠k)   (k̂≠k)  Ê(GARI)  Ê(GARI|k̂=k)  Ê(GARI|k̂≠k)
7     2  20  A.I     0.024  0.997    1.000        0.891          0.035  0.996    1.000        0.889
             A.II    0.024  0.997    1.000        0.884          0.035  0.996    1.000        0.886
             A.III   0.008  0.999    1.000        0.893          0.006  0.999    1.000        0.911
8     3  20  A.I     0.042  0.998    1.000        0.952          0.047  0.997    0.999        0.948
             A.II    0.042  0.998    1.000        0.950          0.045  0.997    0.999        0.947
             A.III   0.012  0.999    1.000        0.949          0.011  0.999    0.999        0.934
9     4  20  A.I     0.038  0.999    1.000        0.976          0.041  0.996    0.998        0.962
             A.II    0.038  0.999    1.000        0.975          0.039  0.996    0.997        0.960
             A.III   0.010  1.000    1.000        0.977          0.022  0.997    0.998        0.948
10    2  40  A.I     0.080  0.992    1.000        0.894          0.059  0.992    0.999        0.880
             A.II    0.080  0.991    1.000        0.882          0.058  0.992    0.999        0.872
             A.III   0.017  0.998    1.000        0.899          0.009  0.998    0.999        0.881
11    3  40  A.I     0.068  0.997    1.000        0.952          0.050  0.994    0.998        0.930
             A.II    0.068  0.996    1.000        0.948          0.047  0.994    0.997        0.926
             A.III   0.007  1.000    1.000        0.954          0.012  0.996    0.998        0.868
12    4  40  A.I     0.044  0.999    1.000        0.973          0.041  0.992    0.994        0.940
             A.II    0.044  0.999    1.000        0.971          0.039  0.991    0.993        0.939
             A.III   0.008  1.000    1.000        0.973          0.026  0.992    0.994        0.889

Note: GARI, groups-adjusted Rand index.
This means that, even if the procedures end with a wrong estimate of the number of linear patterns, the resulting classification is accurate, in the sense that only a very small percentage of observations is improperly classified.

Comparing the results of A.I and A.II, their proportions of misspecified k are very similar, both in the homoscedastic and in the heteroscedastic simulations. Even if in some cases A.II performs slightly better in terms of $(\hat{k} \neq k)$, the overall estimated expected values of GARI are almost the same, and A.I shows higher values of the index when $\hat{k} \neq k$. This is particularly true in the high overlap section of experiments 4, 5, 9, and 11. In summary, their overall performances are comparable.

The good accuracy of all algorithms does not seem to be affected by the number of groups m or by the number of linear relations k. It is interesting to note that even the higher overlap does not seem to remarkably affect the properties of the algorithms. There are pairs of experiments, for example 4 or 5, where, all other settings being equal, the introduction of a higher degree of overlap leads to an increase in $(\hat{k} \neq k)$ and to a lower GARI. However, there are also experiments, for example 10, where the opposite happens. Generally, a high level of overlap does not systematically worsen the performance of the algorithms, even though the different linear patterns become very difficult to distinguish even by visual inspection.

As expected, the presence of heteroscedasticity is an issue to take into account in the choice of the algorithm. Whereas the values of $(\hat{k} \neq k)$ are very close to 0.01 in all homoscedastic experiments, the same cannot be said of their heteroscedastic counterparts. A.I and A.II, both based on the homoscedasticity assumption, tend to return a wrong estimate of k in a higher percentage of simulations. The most evident cases are experiment 10 and the low overlap version of experiment 11, where both algorithms give a wrong estimate of k in more than 5% of the simulations. Their classifications, however, remain accurate enough, with an average value of GARI almost always higher than 90%. The non-parametric version of the procedure, namely A.III, does not seem to be affected by heteroscedasticity, as its overall performance remains unchanged.

Concluding, all the presented algorithms show very satisfactory performance, both in the estimation of the number of linear structures and in the accuracy of the classification. The possible presence of heteroscedasticity negatively affects the capacity of A.I and A.II to estimate k. So, when the value of $\hat{k}$ is essential to the analysis and the homoscedasticity assumption is not trustworthy, A.III is to be preferred, because it guarantees a probability of wrongly estimating the number of linear relations that is always very close to the chosen significance level. Otherwise, A.I is more advisable, because it offers good classification accuracy also when k is wrongly estimated and has the advantage of being much faster than A.II and A.III.
5 Empirical applications

The properties of the algorithms and their behavior in real applications are now evaluated by considering two ComExt products. The first is aluminum hydroxide, shown in Figure 1, whereas the second is a fishery product that has been the subject of investigations that revealed a fraud.
As previously mentioned, starting the algorithms from the unrestricted model (6) means ignoring the possible effect of time on prices. In real economic data, this assumption could produce misleading results. For this reason, the empirical applications that follow also consider the starting model

$$Y_{i,g}^{t} = [\beta_1 I_1(t) + \ldots + \beta_T I_T(t) + \beta_{G_1} I_{G_1}(g) + \ldots + \beta_{G_{\hat{k}}} I_{G_{\hat{k}}}(g)]\, X_{i,g}^{t} + \epsilon_i^{t}, \tag{8}$$
where $I_s(t)$ is the indicator function equal to one only for transactions that occurred in month s. The model therefore assumes that the unitary price of a generic OD combination at month t is given by the sum of two components: one concerning the OD group and the other related to the period. Actually, Equation (8) cannot be directly estimated because of multicollinearity problems. As usual, the model needs to be slightly modified and re-parameterized in order to guarantee the identification of all the parameters. In particular, the starting unrestricted model to estimate is

$$Y_{i,g}^{t} = [\tilde{\beta}_0 + \tilde{\beta}_2 I_2(t) + \ldots + \tilde{\beta}_T I_T(t) + \tilde{\beta}_{G_2} I_{G_2}(g) + \ldots + \tilde{\beta}_{G_{\hat{k}}} I_{G_{\hat{k}}}(g)]\, X_{i,g}^{t} + \epsilon_i^{t}, \tag{9}$$
where 𝛽̃0 = 𝛽1,G1. This new parameter structure implies some minor adjustments in the iterative procedures. In fact, assuming that the OD combinations are ordered according to 0 ≤ 𝛽̃G2 ≤ … ≤ 𝛽̃Gk̂, the null H0 ∶ 𝛽G1 = 𝛽G2 in Equation (8) corresponds to the null H0 ∶ 𝛽̃G2 = 0 in model (9). For the remaining OD combinations, instead, the tests do not change. As in the Monte Carlo experiment, the value of 𝛼 is fixed at 0.01; in only one case does the final classification change with a value of 0.05. (An illustrative sketch of how the design matrix of model (9) can be assembled is given after the footnote below.)

5.1 Aluminum hydroxide

The 2011 data set contains 340 observations, each one corresponding to a monthly aggregate of quantity and value. The number of possible OD combinations is 78, but those with fewer than five transactions were removed from the analysis, in order to preserve the efficiency of the results.‡‡ This reduces the starting number of OD groups to 29 and the number of observations to 254. The outputs of the algorithms are presented in Figure 3. In the three panels on the left, the effect of time is ignored (i.e., the starting unrestricted model is expression (6)), whereas in the three panels on the right, time is part of the regressors (i.e., the starting unrestricted model is expression (9)) and the transaction points are represented after removing the time effect. In both cases, the algorithms start by considering each of the 29 OD combinations as a separate group and try to iteratively merge pairs of them.
‡‡
Since the data cover a whole year, the maximum number of monthly transactions for each OD combination is 12.
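To make the re-parameterization in model (9) concrete, the following minimal sketch shows one way to assemble the unrestricted design matrix from monthly trade records and estimate it by OLS. The column names ('value', 'quantity', 'month', 'od'), the pandas layout, and the assumption that groups with fewer than five transactions have already been filtered out are our own illustrative choices, not the paper's implementation.

```python
import numpy as np
import pandas as pd

def design_matrix(df, clusters):
    """Regressors of model (9): quantity interacted with month dummies
    (months 2..T) and cluster dummies (clusters G2..Gk), plus the
    baseline quantity column, whose coefficient is beta_tilde_0.

    df has columns 'value' (Y), 'quantity' (X), 'month' (1..T) and
    'od' (origin-destination label); `clusters` maps od -> cluster id.
    """
    g = df['od'].map(clusters)
    months = sorted(df['month'].unique())
    labels = sorted(g.unique())
    cols = {'base': df['quantity'].to_numpy(float)}
    for t in months[1:]:                      # month 1 absorbed in base
        cols[f'm{t}'] = np.where(df['month'] == t, df['quantity'], 0.0)
    for c in labels[1:]:                      # cluster G1 absorbed in base
        cols[f'g{c}'] = np.where(g == c, df['quantity'], 0.0)
    return pd.DataFrame(cols, index=df.index)

# OLS without intercept, as in model (9):
# X = design_matrix(df, clusters)
# beta, *_ = np.linalg.lstsq(X.to_numpy(), df['value'].to_numpy(), rcond=None)
```

Dropping the first month and the first cluster column implements exactly the identification constraint discussed above: their effects are absorbed into the baseline coefficient 𝛽̃0.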
Fig. 3. Empirical results for aluminum hydroxide. In each scatter plot, the x axis represents quantity (tons) and the y axis represents values (thousands of euros).
The results for A.I and A.II are almost exactly equal (Figure 3a–d). Except for a few OD combinations pertaining to the so-called small trade area, they always provide the same classification, whether or not the time factor is considered. The final
partition includes five different linear relations/prices that reasonably represent all the linear trends of the market. Results for A.III are similar but more parsimonious. In fact, A.III ends with a four-cluster partition when time is not included in the starting model (Figure 3e) and with only three clusters when time is taken into account (Figure 3f). In the former case, the two lines corresponding to the highest prices in the output of A.I and A.II are merged into a single set. In the latter, the two clusters corresponding to the lowest prices are merged as well. Interestingly, this is the only case in which the significance level has an effect on the final partition: the three-price model in Figure 3f would have been rejected at the 0.05 significance level, in favor of the same four clusters of groups as in Figure 3e. Obviously, as in canonical cluster analysis, there are no objective criteria for determining which classification is correct and which is wrong. However, it is important to remark that all of them seem reasonable, in the sense that they provide a synthetic and informative representation of the market structure of the product considered.

5.2 Fishery product

This second example helps us understand how the suggested tools can be useful in detecting suspicious OD prices. The product under consideration is a fishery product that has been the object of official investigations, which found a fraudulent systematic under-pricing strategy involving a particular OD pair. The same product was first considered in Riani et al. (2008), and the data are included in the FSDA toolbox (Riani et al., 2012). Data in this case refer to the years 2004 and 2005 and include 804 transactions grouped into 84 different OD groups. After removing the OD combinations with fewer than five transactions, we still have 718 observations and 37 OD groups in the sample, including the fraudulent trades. Figure 4 summarizes the output of the algorithms, again applied with and without the time effect. Unlike the previous case, the introduction of the time factor now affects the classification provided by A.I and A.II and leaves unchanged the one provided by A.III. In particular, the number of final clusters increases from two to five for A.I and A.II (Figure 4a–d), whereas for A.III it remains at two (Figure 4e,f). However, the most remarkable result is that all classifications contain an isolated cluster whose price is notably lower than the rest. In four out of six cases (namely, Figure 4a,b,e,f), this ‘lower price’ element of the partition is composed only of the fraudulent OD group, whereas in the remaining two cases (Figure 4c,d), the lowest-price cluster is still composed of the fraudulent OD combination, merged with some other ‘small trade area’ components. Obviously, this tool does not per se prove that a fraud has occurred; deeper and more detailed analyses are left to the investigators or subject matter experts. But this example shows how the tool is able to highlight suspicious patterns.
Fig. 4. Empirical results for fishery product. In each scatter plot, the x axis represents quantity (tons) and the y axis represents values (thousands of euros).
In this case, it detects an OD group that is not part of the ‘small trade area’ and is associated with a price noticeably lower than the price paid for the same product in all the other groups (around €6.90 vs. €12.80 per kilogram, respectively).
6
Conclusions
A new method has been introduced in this article for dealing with an unconventional clustering problem that is particularly compelling in international trade analysis, even if it can be adapted to many more general contexts. The statistical method proposed is based on a recursive application of the F-test, and its final aim is to form homogeneous clusters of groups of observations. The performance and characteristics of the suggested approach were tested both in a simulated environment and on real data.

In the Monte Carlo experiment, the simulated populations mimic as closely as possible the main features present in real international trade data, such as high overlap between linear components and heteroscedasticity. Moreover, a fair evaluation of the performances of the algorithms required the definition of a new adjusted version of the RI, proposed here in order to take into account the peculiarities of the classification problem. The simulation results show in general very satisfactory performances for all the algorithms, both in terms of the estimated number of final clusters and in terms of classification accuracy. The results remain satisfactory even when the level of overlap between linear patterns is high and do not seem to be affected by the values of m and k. In heteroscedastic simulations, the algorithms based on the homoscedastic assumption (namely, A.I and A.II) tend to return a biased estimate of the number of clusters in a slightly higher percentage of cases than the permutation-based version of the algorithm (namely, A.III). However, the accuracy of their classifications remains excellent (higher than 87%) even when the number of final clusters is wrongly estimated.

Finally, the algorithms were applied to some empirical cases, where the objective was to form homogeneous clusters of OD prices in order to give a synthetic yet representative description of the market. In particular, two products were selected: aluminum hydroxide and a fishery product that was the subject of a proven fraud. The empirical results show that

1. all the algorithms provide very similar (when not exactly equal) classifications;
2. the classifications give a synthetic but accurate representation of the markets, through a consistent reduction of the number of OD prices;
3. the permutation-based algorithm A.III tends in general to be more parsimonious than the other two, returning a lower number of final clusters;
4. the algorithms are useful for detecting anomalous OD prices and hence for isolating potentially fraudulent commercial strategies.

In conclusion, the proposed methodology works properly on real data and shows good statistical properties (assessed on simulated populations). It can therefore represent an important tool of analysis that offers a comprehensive description of the market through its main linear patterns and allows a clear detection of anomalous OD prices. It is also extremely flexible, so that it can be applied to all the 10,000 products included in the ComExt database without any particular adjustment.
Finally, unlike the clustering procedures typically used in this context, it does not require the number of final groups to be specified in advance: the only input required is the desired 𝛼 for the stopping rule.

The satisfactory results shown on real data are obtained even though we have not taken into account the presence of occasional outliers, that is, single or episodic anomalous transactions. Such observations are likely to affect the estimates, producing misleading, if not wrong, results. Furthermore, in an international trade context, these occasional outliers may point to frauds that do not occur systematically and should therefore be identified. For this reason, before being made available to end users, the model estimates will be robustified through the already mentioned least trimmed squares or forward search. This further step can only improve the already good results seen in this article.
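As a rough illustration of this planned robustification step, the sketch below implements a crude least-trimmed-squares fit for the no-intercept regression used throughout the article. The coverage choice and the elemental-candidate search are our simplifying assumptions; production implementations, such as those in the FSDA toolbox, are considerably more refined.

```python
import numpy as np

def lts_slope(y, x, coverage=0.75):
    """Crude least trimmed squares for y = beta * x + eps (no intercept).

    Minimizes the sum of the h smallest squared residuals over the
    elemental candidate slopes y_i / x_i; a rough stand-in for the
    robust fits (LTS, forward search) mentioned in the text.
    """
    y = np.asarray(y, float)
    x = np.asarray(x, float)
    h = int(coverage * len(y))                # residuals kept in the sum
    candidates = y / x                        # one elemental fit per point
    trimmed = [np.sort((y - b * x) ** 2)[:h].sum() for b in candidates]
    return candidates[int(np.argmin(trimmed))]
```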
Acknowledgements
The work presented in this article was conducted with support from the Work Programme 2014–2020 of the Joint Research Centre of the European Commission. The author is grateful to Domenico Perrotta, Spyros Arsenis, Andrea Cerioli, and two anonymous referees for helpful discussions on previous drafts of this article.
Appendix A: proof of expression (7)

Given the ordered OLS estimates of Equation (6), 𝛽̂G1 ≤ 𝛽̂G2 ≤ … ≤ 𝛽̂Gk, we want to prove that the pair (j, l) that minimizes SSR(𝛽Gj = 𝛽Gl) is necessarily formed by two contiguous values (j, j + 1). This is true only if, for each j = 1, … , k − 2, we have that

SSR(𝛽Gj = 𝛽Gj+2) ≥ min{SSR(𝛽Gj = 𝛽Gj+1), SSR(𝛽Gj+1 = 𝛽Gj+2)},

because this would also imply that, for each j = 1, … , k − 3,

SSR(𝛽Gj = 𝛽Gj+3) ≥ min{SSR(𝛽Gj = 𝛽Gj+1), SSR(𝛽Gj+1 = 𝛽Gj+2), SSR(𝛽Gj+2 = 𝛽Gj+3)},

and so on. So, without loss of generality, we can assume to have only three groups labeled as {1, 2, 3} and represent as 𝐲g and 𝐱g the vectors of values of, respectively, the dependent and the independent variable belonging to group g. Equation (6) can then be expressed as

⎡𝐲1⎤   ⎡𝐱1  0   0 ⎤ ⎡𝛽1⎤   ⎡𝜖1⎤
⎢𝐲2⎥ = ⎢ 0  𝐱2  0 ⎥ ⎢𝛽2⎥ + ⎢𝜖2⎥,
⎣𝐲3⎦   ⎣ 0   0  𝐱3⎦ ⎣𝛽3⎦   ⎣𝜖3⎦

whereas the OLS estimates of the parameters are given by 𝛽̂g = (𝐱g′𝐲g)∕(𝐱g′𝐱g), with 𝛽̂1 ≤ 𝛽̂2 ≤ 𝛽̂3. The SSR of this model can be expressed as

SSR = TSS − ESS = (𝐲1′𝐲1 + 𝐲2′𝐲2 + 𝐲3′𝐲3) − (𝐲1′𝐱1𝛽̂1 + 𝐲2′𝐱2𝛽̂2 + 𝐲3′𝐱3𝛽̂3),

where, as usual, TSS means ‘total sum of squares’ and ESS means ‘explained sum of squares’. The OLS estimates minimize the SSR, that is, they maximize the value of the ESS. The restricted model assuming 𝛽1 = 𝛽2 can be represented as

⎡𝐲1⎤   ⎡𝐱1  0 ⎤ ⎡𝛽12⎤   ⎡𝜖12⎤
⎢𝐲2⎥ = ⎢𝐱2  0 ⎥ ⎣𝛽3 ⎦ + ⎢𝜖12⎥,
⎣𝐲3⎦   ⎣ 0  𝐱3⎦         ⎣𝜖3 ⎦

and consequently the parameter estimates and the value of the SSR are given by

𝛽̂12 = (𝐱1′𝐲1 + 𝐱2′𝐲2)∕(𝐱1′𝐱1 + 𝐱2′𝐱2),
𝛽̂3 = (𝐱3′𝐲3)∕(𝐱3′𝐱3),
SSR12 = (𝐲1′𝐲1 + 𝐲2′𝐲2 + 𝐲3′𝐲3) − (𝐲1′𝐱1𝛽̂12 + 𝐲2′𝐱2𝛽̂12 + 𝐲3′𝐱3𝛽̂3).

Similarly, for the other restrictions, the values of the SSR are given by

SSR23 = (𝐲1′𝐲1 + 𝐲2′𝐲2 + 𝐲3′𝐲3) − (𝐲1′𝐱1𝛽̂1 + 𝐲2′𝐱2𝛽̂23 + 𝐲3′𝐱3𝛽̂23),
SSR13 = (𝐲1′𝐲1 + 𝐲2′𝐲2 + 𝐲3′𝐲3) − (𝐲1′𝐱1𝛽̂13 + 𝐲2′𝐱2𝛽̂2 + 𝐲3′𝐱3𝛽̂13),

and what we need to prove is that SSR13 ≥ min{SSR12, SSR23}. Supposing that this is not the case, we would have that

SSR13 < SSR12 ⇒ 𝐲1′𝐱1(𝛽̂13 − 𝛽̂12) + 𝐲2′𝐱2(𝛽̂2 − 𝛽̂12) + 𝐲3′𝐱3(𝛽̂13 − 𝛽̂3) > 0,    (A.1)
SSR13 < SSR23 ⇒ 𝐲1′𝐱1(𝛽̂13 − 𝛽̂1) + 𝐲2′𝐱2(𝛽̂2 − 𝛽̂23) + 𝐲3′𝐱3(𝛽̂13 − 𝛽̂23) > 0.    (A.2)

Now, if we define pij = 𝐱i′𝐱i∕(𝐱i′𝐱i + 𝐱j′𝐱j), with 0 < pij < 1, we will have that

𝛽̂12 = p12𝛽̂1 + (1 − p12)𝛽̂2;  𝛽̂2 − 𝛽̂12 = p12(𝛽̂2 − 𝛽̂1);  𝛽̂12 − 𝛽̂1 = (1 − p12)(𝛽̂2 − 𝛽̂1);
𝛽̂23 = p23𝛽̂2 + (1 − p23)𝛽̂3;  𝛽̂3 − 𝛽̂23 = p23(𝛽̂3 − 𝛽̂2);  𝛽̂23 − 𝛽̂2 = (1 − p23)(𝛽̂3 − 𝛽̂2);
𝛽̂13 = p13𝛽̂1 + (1 − p13)𝛽̂3;  𝛽̂3 − 𝛽̂13 = p13(𝛽̂3 − 𝛽̂1);  𝛽̂13 − 𝛽̂1 = (1 − p13)(𝛽̂3 − 𝛽̂1).

So, using the fact that 𝛽̂1 ≤ 𝛽̂12 ≤ 𝛽̂2 ≤ 𝛽̂23 ≤ 𝛽̂3 and also that 𝛽̂1 ≤ 𝛽̂13 ≤ 𝛽̂3, the left-hand side of expression (A.1) can be rewritten as

𝐲1′𝐱1(𝛽̂13 − 𝛽̂12) + 𝐲2′𝐱2(𝛽̂2 − 𝛽̂12) + 𝐲3′𝐱3(𝛽̂13 − 𝛽̂3)
= 𝐲1′𝐱1(𝛽̂13 − 𝛽̂1) + 𝐲1′𝐱1(𝛽̂1 − 𝛽̂12) + 𝐲2′𝐱2(𝛽̂2 − 𝛽̂12) + 𝐲3′𝐱3(𝛽̂13 − 𝛽̂3)
= 𝐲1′𝐱1(1 − p13)(𝛽̂3 − 𝛽̂1) + 𝐲1′𝐱1(p12 − 1)(𝛽̂2 − 𝛽̂1) + 𝐲2′𝐱2 p12(𝛽̂2 − 𝛽̂1) − 𝐲3′𝐱3 p13(𝛽̂3 − 𝛽̂1)
= (𝛽̂3 − 𝛽̂1)[𝐲1′𝐱1(1 − p13) − 𝐲3′𝐱3 p13] + (𝛽̂2 − 𝛽̂1)[𝐲1′𝐱1(p12 − 1) + 𝐲2′𝐱2 p12]
= (𝛽̂3 − 𝛽̂1)[𝐲1′𝐱1 − p13(𝐲1′𝐱1 + 𝐲3′𝐱3)] + (𝛽̂2 − 𝛽̂1)[−𝐲1′𝐱1 + p12(𝐲1′𝐱1 + 𝐲2′𝐱2)]
= (𝛽̂3 − 𝛽̂1)(𝐱1′𝐱1)(𝛽̂1 − 𝛽̂13) − (𝛽̂2 − 𝛽̂1)(𝐱1′𝐱1)(𝛽̂1 − 𝛽̂12) > 0,

where the last equality uses 𝐲g′𝐱g = (𝐱g′𝐱g)𝛽̂g. Therefore

(𝛽̂3 − 𝛽̂1)(𝛽̂1 − 𝛽̂13) > (𝛽̂2 − 𝛽̂1)(𝛽̂1 − 𝛽̂12)
⇓
(𝛽̂3 − 𝛽̂1) < (𝛽̂2 − 𝛽̂1)(𝛽̂1 − 𝛽̂12)∕(𝛽̂1 − 𝛽̂13) = (𝛽̂2 − 𝛽̂1)[(𝛽̂2 − 𝛽̂1)∕(𝛽̂3 − 𝛽̂1)][(p12 − 1)∕(p13 − 1)]
⇓
(1 − p13)(𝛽̂3 − 𝛽̂1)² < (1 − p12)(𝛽̂2 − 𝛽̂1)²
⇓
p13(𝛽̂3 − 𝛽̂1)² > (𝛽̂3 − 𝛽̂1)² − (1 − p12)(𝛽̂2 − 𝛽̂1)².

Similarly, the left-hand side of expression (A.2) can be rewritten as

𝐲1′𝐱1(𝛽̂13 − 𝛽̂1) + 𝐲2′𝐱2(𝛽̂2 − 𝛽̂23) + 𝐲3′𝐱3(𝛽̂13 − 𝛽̂23)
= 𝐲1′𝐱1(𝛽̂13 − 𝛽̂1) + 𝐲2′𝐱2(𝛽̂2 − 𝛽̂23) + 𝐲3′𝐱3(𝛽̂13 − 𝛽̂3) + 𝐲3′𝐱3(𝛽̂3 − 𝛽̂23)
= 𝐲1′𝐱1(1 − p13)(𝛽̂3 − 𝛽̂1) + 𝐲2′𝐱2(p23 − 1)(𝛽̂3 − 𝛽̂2) − 𝐲3′𝐱3 p13(𝛽̂3 − 𝛽̂1) + 𝐲3′𝐱3 p23(𝛽̂3 − 𝛽̂2)
= (𝛽̂3 − 𝛽̂1)[𝐲1′𝐱1(1 − p13) − 𝐲3′𝐱3 p13] + (𝛽̂3 − 𝛽̂2)[𝐲2′𝐱2(p23 − 1) + 𝐲3′𝐱3 p23]
= (𝛽̂3 − 𝛽̂1)[𝐲1′𝐱1 − p13(𝐲1′𝐱1 + 𝐲3′𝐱3)] + (𝛽̂3 − 𝛽̂2)[−𝐲2′𝐱2 + p23(𝐲2′𝐱2 + 𝐲3′𝐱3)]
= (𝛽̂3 − 𝛽̂1)(𝐱1′𝐱1)(𝛽̂1 − 𝛽̂13) − (𝛽̂3 − 𝛽̂2)(𝐱2′𝐱2)(𝛽̂2 − 𝛽̂23) > 0
⇓
(𝛽̂3 − 𝛽̂1) < (𝛽̂3 − 𝛽̂2)(𝐱2′𝐱2∕𝐱1′𝐱1)(𝛽̂2 − 𝛽̂23)∕(𝛽̂1 − 𝛽̂13).

Noting that (𝐱1′𝐱1)(𝛽̂1 − 𝛽̂13) = −p13𝐱3′𝐱3(𝛽̂3 − 𝛽̂1) and (𝐱2′𝐱2)(𝛽̂2 − 𝛽̂23) = −p23𝐱3′𝐱3(𝛽̂3 − 𝛽̂2), this becomes

p13(𝛽̂3 − 𝛽̂1)² < p23(𝛽̂3 − 𝛽̂2)².

Combining this with the inequality obtained from (A.1),

p23(𝛽̂3 − 𝛽̂2)² > (𝛽̂3 − 𝛽̂1)² − (1 − p12)(𝛽̂2 − 𝛽̂1)²
⇓
p23 > (1 + (𝛽̂2 − 𝛽̂1)∕(𝛽̂3 − 𝛽̂2))² − ((𝛽̂2 − 𝛽̂1)∕(𝛽̂3 − 𝛽̂2))² + ((𝛽̂2 − 𝛽̂1)∕(𝛽̂3 − 𝛽̂2))² p12 ≥ 1,

which is not possible because 0 < p23 < 1. So expressions (A.1) and (A.2) cannot hold true at the same time, proving that SSR13 ≥ min{SSR12, SSR23}.
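The inequality just proven can also be checked numerically. The following sketch (variable names are ours) simulates three groups, relabels them so that the slope estimates are ordered as in the proof, and verifies that SSR13 ≥ min{SSR12, SSR23}:

```python
import numpy as np

rng = np.random.default_rng(0)

def ssr(y, x):
    """SSR of the no-intercept fit y = beta * x + eps."""
    beta = (x @ y) / (x @ x)
    r = y - beta * x
    return r @ r

def check_once():
    # three groups with different slopes, relabeled so that the OLS
    # estimates satisfy beta1_hat <= beta2_hat <= beta3_hat
    x = [rng.uniform(1, 10, 50) for _ in range(3)]
    y = [b * xi + rng.normal(0, 1, 50) for b, xi in zip((1.0, 1.5, 2.0), x)]
    order = np.argsort([(xi @ yi) / (xi @ xi) for xi, yi in zip(x, y)])
    x, y = [x[i] for i in order], [y[i] for i in order]

    def merged(i, j):                       # SSR under beta_i = beta_j
        k = ({0, 1, 2} - {i, j}).pop()      # the untouched group
        return ssr(np.concatenate([y[i], y[j]]),
                   np.concatenate([x[i], x[j]])) + ssr(y[k], x[k])

    return merged(0, 2) >= min(merged(0, 1), merged(1, 2))

assert all(check_once() for _ in range(1000))
```

This is also why the iterative procedures only need to test contiguous pairs of the ordered slopes at each merging step, which keeps the computational cost of the algorithms low.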
Appendix B: groups-adjusted Rand index

Given a set of N elements Z = {z1, … , zN} and two partitions of it, P = {P1, … , PR} and Q = {Q1, … , QS}, the RI measures the similarity between the two data clusterings. In a simulation experiment, where one of the two partitions represents the true DGP, the value of the RI measures the accuracy of the other partition in reproducing the original cluster structure. The calculation of the index involves the cardinalities of the following sets:

1. Z11 = {(zi1, zi2) | zi1, zi2 ∈ Pr; zi1, zi2 ∈ Qs}, that is, the set of pairs of elements of Z that are in the same set of P and in the same set of Q;
2. Z00 = {(zi1, zi2) | zi1 ∈ Pr1; zi2 ∈ Pr2; zi1 ∈ Qs1; zi2 ∈ Qs2}, that is, the set of pairs of elements of Z that are in different sets of P and in different sets of Q;
3. Z10 = {(zi1, zi2) | zi1, zi2 ∈ Pr; zi1 ∈ Qs1; zi2 ∈ Qs2}, that is, the set of pairs of elements of Z that are in the same set of P and in different sets of Q;
4. Z01 = {(zi1, zi2) | zi1 ∈ Pr1; zi2 ∈ Pr2; zi1, zi2 ∈ Qs}, that is, the set of pairs of elements of Z that are in different sets of P and in the same set of Q.

Indicating with n⋅⋅ the number of elements in the set Z⋅⋅, the value of the RI is given by

RI = (n11 + n00)∕(n11 + n00 + n10 + n01) = (n11 + n00)∕[N(N − 1)∕2].

In the statistical problem formalized in this article, each observation zᵍi belongs to a group g (g = 1, … , m), and the final aim is to obtain a suitable k-partition of these groups of observations. Suppose now to consider two observations belonging to the same group g, namely, zᵍi1 and zᵍi2. Because the clustering procedure considers each group g as a whole, it is obvious that, whatever the final partition will be, zᵍi1 and zᵍi2 will necessarily belong to the same partition set, that is,

(zᵍi1, zᵍi2) ∈ Z11   ∀g = 1, … , m.

Therefore, each pair of observations coming from the same group will be an element of set Z11, spuriously increasing the value of n11 and consequently the value of the RI. It then seems appropriate to remove such pairs from the calculation of the index. This leads to the formulation of the following adjusted version of the RI:

GARI = (n11 + n00 − C)∕[N(N − 1)∕2 − C],   where C = ∑ g=1,…,m ng(ng − 1)∕2,

with ng indicating the number of elements in group g. It is important to point out that the correction term C does not depend on k, so it is not affected by the characteristics of the partition but depends only on the structure of the data. Finally, because the correction term C is positive, we have that 0 ≤ GARI ≤ RI ≤ 1.
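A direct implementation may clarify the correction. The following sketch (our own, with illustrative argument names) counts concordant pairs and subtracts the within-group pairs that are concordant by construction:

```python
from itertools import combinations
from math import comb

def gari(group, true_cluster, est_cluster):
    """Groups-adjusted Rand index.

    group[i]        : pre-existing group label of observation i
    true_cluster[i] : cluster of observation i under the true partition
    est_cluster[i]  : cluster of observation i under the estimated partition
    Pairs from the same group are removed, since any group-wise
    procedure classifies them together by construction.
    """
    n = len(group)
    # correction C: number of within-group pairs
    sizes = {}
    for g in group:
        sizes[g] = sizes.get(g, 0) + 1
    C = sum(comb(s, 2) for s in sizes.values())
    concordant = 0                       # contributes to n11 + n00
    for i, j in combinations(range(n), 2):
        same_true = true_cluster[i] == true_cluster[j]
        same_est = est_cluster[i] == est_cluster[j]
        if same_true == same_est:
            concordant += 1
    return (concordant - C) / (comb(n, 2) - C)
```

Since both partitions are constant within each group, every within-group pair is concordant by construction and is removed by the correction term C, exactly as in the formula above.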
References

Atkinson, A. C., M. Riani and A. Cerioli (2010), The forward search: theory and data analysis, Journal of the Korean Statistical Society 39(2), 117–134.
Baudry, J.-P., A. E. Raftery, G. Celeux, K. Lo and R. Gottardo (2010), Combining mixture components for clustering, Journal of Computational and Graphical Statistics 19(2).
Cerioli, A. and D. Perrotta (2014), Robust clustering around regression lines with high density regions, Advances in Data Analysis and Classification 8(1), 5–26.
Davies, R. B. (1980), Algorithm AS 155: the distribution of a linear combination of 𝜒² random variables, Applied Statistics 29(3), 323–333.
FATF (2006), Trade based money laundering, Technical Report, Organization for Economic Co-operation and Development (OECD), Paris.
FATF (2008), Best practises on trade based money laundering, Technical Report, Organization for Economic Co-operation and Development (OECD), Paris.
García-Escudero, L. A., A. Gordaliza, R. San Martín, S. Van Aelst and R. Zamar (2009), Robust linear clustering, Journal of the Royal Statistical Society: Series B (Statistical Methodology) 71(1), 301–318.
García-Escudero, L. A., A. Gordaliza, C. Matrán and A. Mayo-Iscar (2010a), A review of robust clustering methods, Advances in Data Analysis and Classification 4(2–3), 89–109.
García-Escudero, L. A., A. Gordaliza, A. Mayo-Iscar and R. San Martín (2010b), Robust clusterwise linear regression through trimming, Computational Statistics & Data Analysis 54(12), 3057–3069.
Hennig, C. (2010), Methods for merging Gaussian mixture components, Advances in Data Analysis and Classification 4(1), 3–34.
Hennig, C. and T. F. Liao (2013), How to find an appropriate clustering for mixed-type variables with application to socio-economic stratification, Journal of the Royal Statistical Society: Series C (Applied Statistics) 62(3), 309–369.
Hubert, L. and P. Arabie (1985), Comparing partitions, Journal of Classification 2(1), 193–218.
Kaufman, L. and P. J. Rousseeuw (2009), Finding groups in data: an introduction to cluster analysis, vol. 344, John Wiley & Sons, Hoboken, New Jersey.
Maitra, R. and V. Melnykov (2010), Simulating data to study performance of finite mixture modeling and clustering algorithms, Journal of Computational and Graphical Statistics 19(2), 354–376.
Melnykov, V., W. C. Chen and R. Maitra (2012), MixSim: an R package for simulating data to study performance of clustering algorithms, Journal of Statistical Software 51(12), 1–25.
Perrotta, D. and F. Torti (2010), Detecting price outliers in European trade data with the forward search, in Data Analysis and Classification, Springer, Berlin Heidelberg, 415–423.
Pesarin, F. and L. Salmaso (2010), Permutation tests for complex data: theory, applications and software, John Wiley & Sons, Chichester, West Sussex.
Pula, G. and D. Santabárbara (2012), Is China climbing up the quality ladder? Documentos de Trabajo 1209, 1–30.
Rand, W. M. (1971), Objective criteria for the evaluation of clustering methods, Journal of the American Statistical Association 66(336), 846–850.
Riani, M., A. Cerioli, A. C. Atkinson, D. Perrotta and F. Torti (2008), Fitting robust mixtures of regression lines to European trade data, in Mining Massive Datasets for Security Applications, IOS Press, Amsterdam, 271–286.
Riani, M., D. Perrotta and F. Torti (2012), FSDA: a MATLAB toolbox for robust analysis and interactive data exploration, Chemometrics and Intelligent Laboratory Systems 116, 17–32.
Riani, M., A. Cerioli, D. Perrotta and F. Torti (2015), Simulating mixtures of multivariate data with fixed cluster overlap in FSDA library, Advances in Data Analysis and Classification 9(4), 461–481.
Rousseeuw, P. J. and A. M. Leroy (2005), Robust regression and outlier detection, vol. 589, John Wiley & Sons, Hoboken, New Jersey.
Šidák, Z. (1967), Rectangular confidence regions for the means of multivariate normal distributions, Journal of the American Statistical Association 62(318), 626–633.

Received: 13 March 2015. Revised: 7 December 2015.