Evaluating the Value of Probability Forecasts in the Sense of Merton
Kajal Lahiri, Huaming Peng, Yongchen Zhao
Department of Economics, University at Albany, SUNY
January 2012
Abstract
In this paper, we propose to directly test the value of probability forecasts in the framework of Merton (1981), without converting the probability forecasts to binary forecasts before testing. We address the issues of serial correlation and skewness in the probability forecast series with an appropriate correction of the test statistic and a circular bootstrap procedure, resulting in a more powerful test. The test is applied to forecasts of the probability of real GDP decline from the U.S. Survey of Professional Forecasters during 1968-2011. We find that, among the forecasters who forecast frequently, the number of forecasters making valuable forecasts decreases sharply as the horizon increases, with only about half of the forecasters making valuable forecasts for the current quarter.
1. Introduction
Probability forecasts are necessary and informative when the target event is binary or categorical, and they have been used extensively in business, economics, and other scientific disciplines such as meteorology. One commonly used procedure to examine the performance
of probability forecasts for binary events with a direct link to the squared loss function is the quadratic probability score (QPS). Calibration tests can be constructed to examine the validity of probability forecasts based on the deviation of the score from its desired value. However, this approach has several restrictions. Diebold and Mariano (1995) propose and evaluate tests without assuming quadratic, or even symmetric, loss. Granger and Pesaran (2000) stress the necessity of linking decision making and forecast evaluation, raising concerns about the choice of loss functions in evaluation. Merton (1981), in studying the market timing skills of financial market participants, proposes an equilibrium theory to evaluate forecasts of binary events. He suggests that a series of probability forecasts has no value if it is independent of the realization series of the target binary events. Even though this theory has been applied to testing the value of forecasts by several authors (for example, Easaw, Garratt and Heravi 2005, and Ashiya 2006), most of the work is done under the assumption of serial independence. Pesaran and Timmermann (2009) propose a test for serially dependent forecasts of categorical variables, but their simulation results show a considerable size distortion when the sample size is small. In this paper, we propose to directly test the value of probability forecasts of binary events in the sense of Merton (1981). Utilizing the probability forecasts rather than binary forecasts converted from them, we obtain asymptotic robustness. The issues of serial correlation and possible skewness in the forecasts are addressed, and to robustify the tests when the sample is small, a circular bootstrap procedure is used. We apply our proposed test procedures to a long panel of probability forecasts of real
GDP decline from the U.S. Survey of Professional Forecasters and show that the procedures we propose perform well in practice. The structure of this paper is as follows. In Section 2, we discuss existing tests that are useful for our problem and the necessary modifications to them. We also discuss the benefit of using a circular bootstrap procedure in implementing the tests. In Section 3, we conduct simulation studies to examine the small sample properties of the proposed test procedures and discuss the overall performance of the tests. Section 4 applies the tests to the probability forecasts of decline in real GDP from the U.S. Survey of Professional Forecasters. We report the results from our proposed test procedures along with other diagnostic measures to show the good performance of our tests in practice. Section 5 summarizes our main results.
2. The Tests
In economics and finance, decision or policy makers are often interested in directional movements of variables of interest. For example, finance practitioners would short (or go long on) a particular stock if its price is expected to fall (or rise), and a central bank such as the Fed would raise interest rates if inflation is predicted. Thus, for business decision or policy makers it is of particular importance to assess the quality of directional forecasts. There are various approaches to evaluating probability forecasts of binary events. Among them, one increasingly important measure in the economics and finance literature is due to Merton (1981), who not only derives an equilibrium theory for the value of market-timing skills for future binary events but also provides a sufficient statistic to evaluate that value. More specifically, let the realized values of a variable of interest and its forecasts be defined as

$$y_t = \begin{cases} 1 & \text{if the event occurs at time } t, \\ 0 & \text{otherwise}, \end{cases}$$

and

$$x_t = \begin{cases} 1 & \text{if the event is forecast to occur at time } t, \\ 0 & \text{otherwise}, \end{cases}$$

respectively. Merton argues that the binary forecast $x_t$ has no economic value if and only if

$$P(x_t = 1 \mid y_t = 1) + P(x_t = 0 \mid y_t = 0) = 1, \tag{1}$$

which is equivalent to

$$P(x_t = 1 \mid y_t = 1) = P(x_t = 1 \mid y_t = 0). \tag{2}$$

For instance, a forecaster who always predicts the same outcome satisfies (1) and therefore supplies no value. By contrast, a binary forecast has positive value in the sense of Merton if and only if

$$P(x_t = 1 \mid y_t = 1) + P(x_t = 0 \mid y_t = 0) > 1. \tag{3}$$

Since $P(x_t = 1 \mid y_t = 1) = E(x_t \mid y_t = 1)$ and $P(x_t = 1 \mid y_t = 0) = E(x_t \mid y_t = 0)$,
testing the null hypothesis defined in (2) against the alternative (3) is basically testing for the equality of means across two different populations, provided the conditional indicator variables $\{x_t : y_t = 1\}$ and $\{x_t : y_t = 0\}$ are available and some limit theorems can be applied. With Survey of Professional Forecasters data, instead of observing the variables $x_t \mid y_t = 1$ and $x_t \mid y_t = 0$, we observe calibrated probability assessments $p_t$ over time. These observed probabilities may be best regarded as the ensemble average of the conditional indicator variables. Our goal in this paper is to construct a test for (2) using these calibrated probabilities.

For notational simplicity, define the two conditional series $p_{1t} = p_t \mid y_t = 1$, $t = 1, \ldots, n_1$, and $p_{0t} = p_t \mid y_t = 0$, $t = 1, \ldots, n_0$; also let $\bar{p}_1 = n_1^{-1}\sum_{t=1}^{n_1} p_{1t}$ and $\bar{p}_0 = n_0^{-1}\sum_{t=1}^{n_0} p_{0t}$, where $\{p_{1t}\}$ is independent of $\{p_{0t}\}$. Now assume that, for each individual forecaster, both $\{p_{1t}\}$ and $\{p_{0t}\}$ are stationary up to order three and ergodic with $\mu_1 = E(p_{1t})$ and $\mu_0 = E(p_{0t})$, and that $\bar{p}_1$ and $\bar{p}_0$ are asymptotically normal. Then a simple test for the null hypothesis in (2) is the well-known Welch's t-test,

$$t_W = \frac{\bar{p}_1 - \bar{p}_0}{\sqrt{\hat{q}_1 + \hat{q}_0}}, \qquad \hat{q}_j = \frac{\hat{\sigma}_j^2}{n_j}, \quad j = 0, 1, \tag{4}$$

where $\hat{\sigma}_j^2$ is a serial-correlation-robust (long-run) variance estimate,

$$\hat{\sigma}_j^2 = \hat{\gamma}_j(0) + 2\sum_{k=1}^{m_j}\left(1 - \frac{k}{m_j + 1}\right)\hat{\gamma}_j(k),$$

with $\hat{\gamma}_j(k)$ the sample autocovariance of $\{p_{jt}\}$ at lag $k$ and truncation lag $m_j = \lfloor n_j^{1/3} \rfloor$, where $\lfloor n_1^{1/3} \rfloor$ denotes the nearest integer that $n_1^{1/3}$ rounds down to, and similarly for $\lfloor n_0^{1/3} \rfloor$. Under $H_0$, the Welch's t test statistic is approximately $t$-distributed with the approximate degrees of freedom $\hat{\nu}$ being defined by

$$\hat{\nu} = \frac{(\hat{q}_1 + \hat{q}_0)^2}{\hat{q}_1^2/(n_1 - 1) + \hat{q}_0^2/(n_0 - 1)}.$$
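For concreteness, the following is a minimal sketch of the statistic in (4), assuming the Bartlett-kernel long-run variance and the $\lfloor n^{1/3} \rfloor$ truncation lag given above; the function names are ours.

```python
import numpy as np
from scipy import stats

def long_run_variance(x):
    """Bartlett-kernel (Newey-West type) estimate of the long-run variance."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    m = int(np.floor(n ** (1.0 / 3.0)))        # truncation lag floor(n^(1/3))
    xc = x - x.mean()
    lrv = xc @ xc / n                          # gamma(0)
    for k in range(1, m + 1):
        gamma_k = xc[k:] @ xc[:-k] / n         # sample autocovariance at lag k
        lrv += 2 * (1 - k / (m + 1)) * gamma_k # Bartlett weight
    return lrv

def welch_hac(p1, p0):
    """Welch's t-test of equal means for two serially correlated series,
    one-sided against the 'positive value' alternative mu1 > mu0."""
    p1, p0 = np.asarray(p1, float), np.asarray(p0, float)
    n1, n0 = len(p1), len(p0)
    q1, q0 = long_run_variance(p1) / n1, long_run_variance(p0) / n0
    t = (p1.mean() - p0.mean()) / np.sqrt(q1 + q0)
    # Welch-Satterthwaite approximate degrees of freedom
    df = (q1 + q0) ** 2 / (q1 ** 2 / (n1 - 1) + q0 ** 2 / (n0 - 1))
    pval = 1 - stats.t.cdf(t, df)
    return t, df, pval
```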
However, the Welch's t test is less robust when the underlying true distribution is highly skewed. To handle this problem of skewness and improve Type I error coverage, Luh and Guo (1999) and Guo and Luh (2000) demonstrate that better Type I error coverage can be achieved by utilizing Hall's (1992) or Johnson's (1978) transformation method. In our context, Johnson's transformation of Welch's t test is

$$t_J = \frac{(\bar{p}_1 - \bar{p}_0) + \dfrac{\hat{\theta}}{6\hat{q}} + \dfrac{\hat{\theta}}{3\hat{q}^2}(\bar{p}_1 - \bar{p}_0)^2}{\sqrt{\hat{q}}}, \tag{5}$$

and Hall's transformation of Welch's t test is

$$t_H = t_W + \hat{\lambda}\, t_W^2 + \frac{\hat{\lambda}^2}{3}\, t_W^3 + \hat{\lambda}, \qquad \hat{\lambda} = \frac{\hat{\theta}}{6\hat{q}^{3/2}}, \tag{6}$$

where $\hat{q} = \hat{q}_1 + \hat{q}_0$ and

$$\hat{\theta} = \frac{\hat{\mu}_{3,1}}{n_1^2} - \frac{\hat{\mu}_{3,0}}{n_0^2},$$

with $\hat{\mu}_{3,1}$ the sample third central moment of $\{p_{1t}\}$, and $\hat{\mu}_{3,0}$ defined similarly. It is well known that both $t_J$ and $t_H$ are distributed approximately as $t$ variables with degrees of freedom given by $\hat{\nu}$ above.
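A sketch of (5) and (6) follows, with the two-sample $\hat{\theta}$ and $\hat{\lambda}$ as reconstructed above. For brevity it uses ordinary sample variances; the long-run variances from the previous sketch would be substituted when serial correlation matters.

```python
import numpy as np

def transformed_stats(p1, p0):
    """Johnson- and Hall-transformed Welch statistics for skewed samples."""
    p1, p0 = np.asarray(p1, float), np.asarray(p0, float)
    n1, n0 = len(p1), len(p0)
    q = np.var(p1, ddof=1) / n1 + np.var(p0, ddof=1) / n0
    mu3 = lambda x: np.mean((x - x.mean()) ** 3)      # third central moment
    theta = mu3(p1) / n1 ** 2 - mu3(p0) / n0 ** 2     # skewness adjustment
    d = p1.mean() - p0.mean()
    t = d / np.sqrt(q)                                 # plain Welch statistic
    # Johnson (1978): additive correction for skewness, as in (5)
    t_j = (d + theta / (6 * q) + theta * d ** 2 / (3 * q ** 2)) / np.sqrt(q)
    # Hall (1992): invertible cubic transformation, as in (6)
    lam = theta / (6 * q ** 1.5)
    t_h = t + lam * t ** 2 + (lam ** 2 / 3) * t ** 3 + lam
    return t_j, t_h
```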
But the transformed Welch's t test may still lack power when the sample size is small. To address this issue, Keselman et al. (2004) proposed using a simple bootstrap method for random samples to obtain the critical values. They found that a transformation for skewness combined with a bootstrap method improves Type I error control and probability coverage even when the sample sizes are small. When trimmed means and Winsorized variances replace the ordinary means and variances in (4), the resulting statistic corresponds to Yuen's (1974) trimmed t-test; together with Welch's, Johnson's, and Hall's statistics, this gives the four versions of the test we examine. However, the simple bootstrap does not apply in our case because the calibrated probability assessments are correlated over time in each group. The block bootstrap has been shown to be an appropriate tool for obtaining asymptotically valid procedures to approximate the distributions of a large class of statistics of weakly dependent processes (see, for example, Kunsch, 1989; Lahiri, 1991; Liu and Singh, 1992; Politis and Romano, 1992). Consistency of the block bootstrap for the mean and asymptotic refinements over the classical normal approximation of the error in rejection probability can also be achieved if the block bootstrap is properly implemented (Radulovic, 1996; Gotze and Kunsch, 1996; Lahiri, 1996; Horowitz, 1996; Andrews, 2002; Inoue and Shintani, 2006). Among the various block bootstraps, the circular block bootstrap (CBB) and the overlapping moving block bootstrap (OMBB) have been shown to be more efficient than the non-overlapping and stationary bootstraps; moreover, the CBB is preferred to the OMBB in the sense that the bootstrap sample mean under the CBB has an expectation equal to the sample mean of the observed series, which is not true for the OMBB (Lahiri, 1999). Thus, in this paper we apply the CBB procedure to both the $t_J$ and $t_H$ statistics.
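A minimal sketch of the CBB as applied here: each group's series is wrapped into a circle, blocks are drawn with uniformly distributed starting points, and the null is imposed by centering each group at its own mean before resampling. The block-length rule and the centering scheme are our assumptions for illustration; `statistic` can be any function returning one of the scalar t statistics above.

```python
import numpy as np

def cbb_resample(x, block_len, rng):
    """One circular block bootstrap resample of the series x."""
    n = len(x)
    wrapped = np.concatenate([x, x[:block_len - 1]])     # wrap into a circle
    starts = rng.integers(0, n, size=int(np.ceil(n / block_len)))
    blocks = [wrapped[s:s + block_len] for s in starts]  # uniform start points
    return np.concatenate(blocks)[:n]

def cbb_pvalue(p1, p0, statistic, reps=1000, seed=0):
    """One-sided bootstrap p-value for H0: equal means, resampling each
    mean-centered group separately (centering imposes the null)."""
    rng = np.random.default_rng(seed)
    p1, p0 = np.asarray(p1, float), np.asarray(p0, float)
    t_obs = statistic(p1, p0)
    b1 = max(1, int(len(p1) ** (1 / 3)))                 # assumed block rule
    b0 = max(1, int(len(p0) ** (1 / 3)))
    c1, c0 = p1 - p1.mean(), p0 - p0.mean()
    t_boot = [statistic(cbb_resample(c1, b1, rng), cbb_resample(c0, b0, rng))
              for _ in range(reps)]
    return float(np.mean(np.asarray(t_boot) >= t_obs))
```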
3. Some Simulation Results
In this section, we present results from a simulation study of the properties of all four versions of the test. The purpose of the simulation study is to examine the size and power properties of the tests applied to samples with different sizes, variances, and autocorrelations, with different amounts of trimming, and with and without bootstrapping.
For each outcome, $y_t = 1$ and $y_t = 0$, we consider simulated probability forecast series that are dependent, with beta marginal distributions. The two series of conditional forecasts $\{p_{1t}\}$ and $\{p_{0t}\}$ are generated independently according to McKenzie (1985). An initial part of each series is discarded for stability. The outcome series is generated from a Bernoulli distribution. The proportion of positive outcomes (decline in real GDP) is set to about 16.5% of the sample size, roughly matching the observed data for the US between 1968 and 2011. All simulation results are based on 500 Monte Carlo replications. The simulated forecast series can be generated as sketched after Table 1.

We first examine the performance of the tests using samples of varying sizes. Samples are generated under the null hypothesis with mean 0.5, variance 0.03, and first order autocorrelation 0.1 for both series of conditional forecasts. Sample sizes under consideration are $n \in \{20, 40, 60, 100, 150, 200\}$. No trimming is applied. Table 1 shows empirical sizes of the tests at the 5% nominal level. All four versions of the test are notably over-sized when the sample size is small, with Welch's test having an empirical size of 14.6% at sample size 20, the maximum of all. As the sample size increases, the empirical sizes start to approach the 5% nominal size. With a sample size of 200, Hall's test and Yuen's test are properly sized, while Johnson's test and Welch's test are still relatively over-sized. In addition, size distortions may not decrease monotonically as the sample size increases when the sample size is small, say, smaller than 200. This brings additional uncertainty to the empirical size when the tests are used with observational data.
Table 1: Empirical sizes at 5% nominal level with varying sample size

n    | Hall  | Johnson | Welch | Yuen
20   | 0.072 | 0.086   | 0.146 | 0.074
40   | 0.076 | 0.054   | 0.090 | 0.064
60   | 0.074 | 0.078   | 0.068 | 0.078
100  | 0.066 | 0.070   | 0.090 | 0.074
150  | 0.074 | 0.064   | 0.064 | 0.088
200  | 0.048 | 0.070   | 0.072 | 0.044
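The simulation design described above can be generated as in the following sketch. We substitute a Gaussian-copula AR(1) mapped to beta marginals as a simple stand-in for McKenzie's (1985) exact construction; the copula stand-in and the burn-in length are our assumptions.

```python
import numpy as np
from scipy import stats

def beta_params(mean, var):
    """Solve beta(a, b) parameters from a target mean and variance."""
    k = mean * (1 - mean) / var - 1
    return mean * k, (1 - mean) * k

def beta_ar1(n, mean, var, rho, rng, burn=100):
    """AR(1)-dependent series with beta marginals via a Gaussian copula
    (a stand-in for McKenzie's exact beta autoregression)."""
    a, b = beta_params(mean, var)
    z = np.empty(n + burn)
    z[0] = rng.standard_normal()
    eps = rng.standard_normal(n + burn)
    for t in range(1, n + burn):
        z[t] = rho * z[t - 1] + np.sqrt(1 - rho ** 2) * eps[t]
    u = stats.norm.cdf(z[burn:])        # uniform marginals, AR dependence
    return stats.beta.ppf(u, a, b)      # map to beta marginals

rng = np.random.default_rng(42)
n, p_event = 100, 0.165                 # ~16.5% GDP-decline quarters
y = rng.random(n) < p_event             # Bernoulli outcome series
p1 = beta_ar1(int(y.sum()), 0.5, 0.03, 0.1, rng)      # forecasts when y = 1
p0 = beta_ar1(n - int(y.sum()), 0.5, 0.03, 0.1, rng)  # forecasts when y = 0
```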
To study the performance of the tests when the two conditional probability forecast series have different variances, we use simulation samples with variance pairs $(\sigma_1^2, \sigma_0^2) \in \{(.001, .03), (.01, .001), (.03, .01), (.1, .03)\}$. The two conditional series of forecasts are otherwise the same, with mean 0.5 and autocorrelation 0.1. Two different sample sizes are considered here: $n = 40$ and $n = 150$. Table 2 shows
empirical sizes of the tests at the 5% nominal level. Size distortions are clearly seen across all four versions and all variance pairs. However, no clear pattern emerges relating the variance of the forecasts to the performance of the tests. In addition, increasing the sample size from 40 to 150 seems to make little difference, especially for Welch's test and Yuen's test. With sample size 150, Hall's test shows the least size distortion.

Table 2: Empirical sizes at 5% nominal level with varying variance and sample size

             |             n=40              |             n=150
variances    | Hall  | Johnson | Welch | Yuen | Hall  | Johnson | Welch | Yuen
(.001, .03)  | 0.062 | 0.066   | 0.070 | 0.064 | 0.076 | 0.060   | 0.080 | 0.048
(.01, .001)  | 0.078 | 0.056   | 0.088 | 0.070 | 0.054 | 0.042   | 0.082 | 0.078
(.03, .01)   | 0.078 | 0.078   | 0.090 | 0.076 | 0.062 | 0.064   | 0.086 | 0.080
(.1, .03)    | 0.062 | 0.072   | 0.062 | 0.094 | 0.052 | 0.066   | 0.054 | 0.064
To understand the effect of different autocorrelations of the two conditional forecast series on the performance of the tests, we implement the tests using samples with autocorrelation pairs $(\rho_1, \rho_0) \in \{(-.1, -.3), (-.1, .1), (-.1, .3), (-.3, -.65), (-.3, .65), (-.65, .65), (.1, .3), (.3, .65)\}$. For both conditional forecast series, the mean and variance are set to 0.5 and 0.03 respectively. The sample size is fixed at 100. Table 3 shows the empirical sizes of the four tests at the 5% nominal level. All four tests are significantly over-sized when the autocorrelations of the conditional forecast series are large, especially when they have opposite signs. This is particularly true of Welch's and Yuen's tests. When the autocorrelation pair is $(-.65, .65)$, the empirical size for Welch's test at the 5% nominal level reaches 22% and that of Yuen's test 24%. On the other hand, when both conditional forecast series have relatively low autocorrelation, say, lower than 0.3 in absolute value, the presence of autocorrelation does not seem to bring any additional size distortion, compared with the results in Tables 1 and 2. Hall's test seems to be the most insensitive to autocorrelations, but only slightly better than the other three tests.

Table 3: Empirical sizes at 5% nominal level with varying autocorrelation (n = 100)
(ρ1, ρ0)    | Hall  | Johnson | Welch | Yuen
(-.1, -.3)  | 0.078 | 0.080   | 0.072 | 0.084
(-.1, .1)   | 0.070 | 0.084   | 0.066 | 0.054
(-.1, .3)   | 0.072 | 0.082   | 0.098 | 0.076
(-.3, -.65) | 0.118 | 0.146   | 0.170 | 0.156
(-.3, .65)  | 0.128 | 0.122   | 0.132 | 0.122
(-.65, .65) | 0.186 | 0.196   | 0.216 | 0.242
(.1, .3)    | 0.066 | 0.062   | 0.078 | 0.076
(.3, .65)   | 0.152 | 0.148   | 0.130 | 0.138
Observational data may contain outliers, making trimming a tempting procedure to apply to the data before testing. We examine the effect of 5% trimming on test performance using simulated samples with different variances and sample sizes. Table 4 compares the empirical sizes of Yuen's, Johnson's, and Hall's tests at the 5% nominal level in different situations with and without trimming. The results show that with small variances, the tests are better sized without trimming. But when the variance becomes large, the effect of trimming becomes mixed. Since we generated the samples without explicitly planting outliers, it is reasonable to expect the effect of a small amount of trimming to be small. In practice, when observational data are used, whether to trim or not would without doubt depend heavily on the nature and amount of outliers. (A sketch of the trimmed statistic follows Table 4.)

Table 4: Empirical sizes at 5% nominal level with varying variances and trimming size

                           |             n=40              |             n=150
             variances     | Hall  | Johnson | Welch | Yuen | Hall  | Johnson | Welch | Yuen
5% trimming  (.001, .001)  | 0.070 | 0.076   | 0.084 | 0.072 | 0.076 | 0.064   | 0.072 | 0.078
             (.01, .01)    | 0.082 | 0.074   | 0.094 | 0.074 | 0.062 | 0.070   | 0.092 | 0.084
             (.03, .03)    | 0.074 | 0.068   | 0.090 | 0.092 | 0.070 | 0.064   | 0.102 | 0.064
No trimming  (.001, .001)  | 0.080 | 0.088   | 0.068 | 0.072 | 0.060 | 0.068   | 0.050 | 0.090
             (.01, .01)    | 0.070 | 0.072   | 0.072 | 0.056 | 0.064 | 0.068   | 0.094 | 0.054
             (.03, .03)    | 0.108 | 0.068   | 0.088 | 0.084 | 0.060 | 0.052   | 0.076 | 0.078
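For reference, here is a sketch of Yuen's (1974) trimmed statistic used in the trimming experiments, assuming symmetric trimming and Winsorized variances; the 5% default matches Table 4.

```python
import numpy as np

def yuen_t(x, y, trim=0.05):
    """Yuen's two-sample trimmed t-statistic with Welch-type df."""
    def pieces(v):
        v = np.sort(np.asarray(v, dtype=float))
        n = len(v)
        g = int(np.floor(trim * n))              # observations trimmed per tail
        tmean = v[g:n - g].mean()                # trimmed mean
        w = np.clip(v, v[g], v[n - g - 1])       # Winsorized sample
        ssd_w = (n - 1) * w.var(ddof=1)          # Winsorized sum of squares
        h = n - 2 * g                            # effective sample size
        return tmean, ssd_w / (h * (h - 1)), h
    m1, q1, h1 = pieces(x)
    m2, q2, h2 = pieces(y)
    t = (m1 - m2) / np.sqrt(q1 + q2)
    df = (q1 + q2) ** 2 / (q1 ** 2 / (h1 - 1) + q2 ** 2 / (h2 - 1))
    return t, df
```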
Empirical powers are shown in Table 5 below for different samples. Here we set the variance and autocorrelation to 0.03 and 0.1 respectively. We can see that for Welch's and Yuen's tests, the power increases with the sample size. For Hall's and Johnson's tests, it is not clear whether a larger sample size actually helps. We also see that for each given sample, Welch's test and Yuen's test are slightly more powerful than Hall's test or Johnson's test. Except for the last sample, where the difference in means is only 0.1, in almost all cases all four versions of the tests are satisfactorily powerful.
Table 5: Empirical power at 5% nominal level

             |             n=40               |             n=150
(μ1, μ0)     | Hall  | Johnson | Welch | Yuen  | Hall  | Johnson | Welch | Yuen
(0.85, 0.15) | 0.944 | 0.950   | 1     | 1     | 0.706 | 0.696   | 1     | 1
(0.75, 0.25) | 0.986 | 0.990   | 1     | 1     | 0.816 | 0.838   | 1     | 1
(0.65, 0.35) | 0.962 | 0.958   | 0.972 | 0.978 | 0.924 | 0.934   | 1     | 1
(0.55, 0.45) | 0.398 | 0.364   | 0.468 | 0.364 | 0.764 | 0.754   | 0.826 | 0.808
Overall, we do not find that any version of the test dominates the others in all aspects examined above. However, we do note that Welch's and Yuen's tests are slightly more powerful but are more likely to be ill-sized when variances or autocorrelations are extreme. Johnson's and Hall's tests, while less powerful, especially when the difference in means is small, do exhibit empirical sizes that are more robust to changes in the variances and autocorrelations of the sample.
4. Empirical Application
In this section, we apply the four tests to probability forecasts of decline in real GDP from the US Survey of Professional Forecasters. In addition to checking the value of these forecasts, we look at whether forecasts recognized by the tests as having value actually perform better in recession forecasting.

The Survey of Professional Forecasters is a quarterly survey containing questions on respondents' forecasts of major US macroeconomic variables, among them the probability of a decline in real GDP. Respondents are asked to provide one probability forecast for the current quarter, for which the forecasters will not have data until the next quarter, plus one for each of the following four quarters. The current quarter forecasts are used to construct the
“anxiety index”. We use individual data on this question from 1968 quarter 4 to 2011 quarter 4 – a total of 25,477 forecasts provided by 425 different respondents during the 172-quarter period. Due to the presence of a large amount of missing data, when applying the tests we impose a participation requirement that at least 10 forecasts are available under each condition from a forecaster. Forecasters who do not meet this requirement are left out of the sample.

A simple measure for evaluating probability forecasts is the quadratic probability score (QPS), which compares the probability forecasts with the realizations of the target event. We can assess the validity of the probability forecasts using scoring rules. Given the realizations of the target binary events, the asymptotic average score is minimized if the forecasts are properly calibrated. Following Dawid (1986), we can test whether the forecasts are perfect, in the sense of whether the QPS, $\mathrm{QPS} = n^{-1}\sum_{t=1}^{n} 2(p_t - y_t)^2$, is unreasonably larger than its expected value of $n^{-1}\sum_{t=1}^{n} 2p_t(1 - p_t)$, using the following test statistic:

$$Z = \frac{\sum_{t=1}^{n}\left[(p_t - y_t)^2 - p_t(1 - p_t)\right]}{\sqrt{\sum_{t=1}^{n} p_t(1 - p_t)(1 - 2p_t)^2}},$$

which is approximately standard normal under the null of calibration.
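A sketch of this calibration check, assuming the factor-2 QPS definition and the variance term reconstructed above:

```python
import numpy as np
from scipy import stats

def qps_test(p, y):
    """QPS and standardized calibration statistic for probability forecasts p
    and binary outcomes y."""
    p, y = np.asarray(p, float), np.asarray(y, float)
    qps = np.mean(2 * (p - y) ** 2)                 # quadratic probability score
    num = np.sum((p - y) ** 2 - p * (1 - p))        # excess over calibrated mean
    den = np.sqrt(np.sum(p * (1 - p) * (1 - 2 * p) ** 2))
    z = num / den                                    # approx N(0,1) if calibrated
    return qps, z, 1 - stats.norm.cdf(z)             # one-sided p-value
```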
Table 1 shows the mean, standard deviation, maximum, and minimum of the QPS for all five horizons, plus the number of forecasters who fail the test at the 5%, 10%, and 20% levels. From the table we can see that the mean QPS increases from 0.124 to 0.250 as the horizon increases from 1 to 5 quarters, as we expect the value of forecasts to become smaller as the horizon becomes longer. In the meantime, the standard deviation of the QPS increases substantially from 0.042 to 0.057, indicating increased heterogeneity in the value of individual forecasts. In addition, no one in the sample, except one forecaster at the five quarter horizon, performs significantly worse than expected at the 5% level. No one performs superior, i.e.,
produces a QPS that is smaller than the expectation above, at any reasonable level of significance.

Table 1: Statistics of individual QPS and number of forecasters failing the QPS test for all horizons

h | #  | mean  | sd    | max   | min   | Rej 5% | Rej 10% | Rej 20%
1 | 22 | 0.124 | 0.042 | 0.210 | 0.059 | 0      | 0       | 1
2 | 24 | 0.169 | 0.035 | 0.229 | 0.082 | 0      | 0       | 1
3 | 22 | 0.191 | 0.041 | 0.251 | 0.084 | 0      | 0       | 4
4 | 27 | 0.222 | 0.049 | 0.309 | 0.115 | 0      | 0       | 7
5 | 27 | 0.250 | 0.057 | 0.368 | 0.134 | 1      | 3       | 10
For each of the forecasters who meets our participation requirement, we fit a two-parameter beta distribution¹ to his forecasts, horizon by horizon (a sketch of this fit follows the footnote below), so that we can verify that our tests should show reasonable performance given the characteristics of the data. Table 2 shows the maximum, minimum, and mean values for different aspects of the fitted distributions. Conditional on no decline in real GDP, the means of the forecast probabilities are mostly within the range of 10% to 45%, usually around 20% to 25%. The variance ranges from 0.002 at long horizons to 0.108 at short horizons. The first order autocorrelation of the forecasts, while usually around 0.3, may fall anywhere between -0.07 and 0.65. Conditional on a decline in real GDP, the mean probability decreases from about 55% to 20% as the horizon increases, with the maximum about 40 percentage points higher than the minimum. Variances of the forecasts go from 0.001 to 0.119, with means decreasing from 0.07 to about 0.02 as the horizon increases. Means of the first order autocorrelations increase from 0.04 to 0.11, with a maximum as high as 0.65 and a minimum as low as 0.001.
¹ Alternative parameterizations exist, including a four-parameter version that specifies the minimum and maximum. Since we are working with probabilities here, the two-parameter beta distribution suffices.
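A sketch of the individual-level fit, assuming maximum likelihood with the support fixed to [0, 1]; the clipping of boundary values is our addition to handle reported probabilities of exactly 0 or 1.

```python
import numpy as np
from scipy import stats

def fit_beta(forecasts, eps=1e-4):
    """MLE of beta(a, b) for a forecaster's probability series."""
    p = np.clip(np.asarray(forecasts, float), eps, 1 - eps)  # avoid 0 and 1
    a, b, _, _ = stats.beta.fit(p, floc=0, fscale=1)          # fix support [0, 1]
    mean = a / (a + b)
    var = a * b / ((a + b) ** 2 * (a + b + 1))
    return a, b, mean, var
```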
Despite the large range within which the characteristics vary, on an individual basis the range of movement is much smaller. In addition to the individual parameter values, we present Figure 1, showing the fitted beta distribution and histograms for the two conditional series of one-quarter-ahead forecasts, pooling all forecasters and quarters.

Table 2: Characteristics of individually fitted beta distributions by horizon

h | stat | mean1 | var1  | ar1    | mean0 | var0  | ar0
1 | mean | 0.564 | 0.070 | 0.043  | 0.263 | 0.050 | 0.296
1 | max  | 0.741 | 0.119 | 0.248  | 0.443 | 0.108 | 0.528
1 | min  | 0.319 | 0.025 | -0.159 | 0.071 | 0.008 | 0.027
2 | mean | 0.391 | 0.044 | 0.093  | 0.232 | 0.041 | 0.304
2 | max  | 0.589 | 0.077 | 0.376  | 0.373 | 0.095 | 0.640
2 | min  | 0.212 | 0.005 | -0.249 | 0.103 | 0.011 | 0.019
3 | mean | 0.314 | 0.032 | 0.077  | 0.224 | 0.028 | 0.332
3 | max  | 0.486 | 0.083 | 0.566  | 0.350 | 0.065 | 0.596
3 | min  | 0.145 | 0.008 | -0.407 | 0.102 | 0.008 | 0.129
4 | mean | 0.220 | 0.017 | 0.142  | 0.202 | 0.020 | 0.314
4 | max  | 0.385 | 0.051 | 0.648  | 0.344 | 0.069 | 0.658
4 | min  | 0.067 | 0.001 | -0.229 | 0.092 | 0.005 | -0.052
5 | mean | 0.210 | 0.019 | 0.111  | 0.204 | 0.018 | 0.269
5 | max  | 0.408 | 0.089 | 0.415  | 0.369 | 0.054 | 0.658
5 | min  | 0.058 | 0.001 | -0.296 | 0.078 | 0.002 | -0.068

Note: mean1, var1, and ar1 (mean0, var0, ar0) are the mean, variance, and first order autocorrelation of the forecasts conditional on a decline (no decline) in real GDP.
We apply all four versions of the test to the individual series of forecasts and examine the value of their forecasts as the horizon increases. Table 3 reports the test results². In testing, we choose not to use trimming: since probability forecasts are naturally bounded between 0 and 1, it is unlikely that one outlying forecast can affect the result much. The circular block bootstrap procedure described before is used to derive the p-values. The number of bootstrap replications is set to 1000.

² The full set of results for each individual is available from the authors upon request.

Table 3: Number of rejections at 5% nominal level for each horizon
        |    |        Asymptotic Test       |        Bootstrap Test
Horizon | N  | Hall | Johnson | Welch | Yuen | Hall | Johnson | Welch | Yuen
1       | 22 | 21   | 21      | 22    | 22   | 12   | 12      | 12    | 13
2       | 24 | 19   | 19      | 21    | 21   | 8    | 8       | 10    | 10
3       | 22 | 8    | 8       | 14    | 13   | 5    | 5       | 5     | 5
4       | 27 | 4    | 4       | 3     | 3    | 2    | 2       | 2     | 2
5       | 27 | 2    | 2       | 2     | 2    | 2    | 2       | 1     | 2
As can be seen from Table 3, in general, as the horizon becomes longer, fewer and fewer forecasters are able to produce forecasts that pass the test, i.e., for which the test rejects the null hypothesis that the forecasts are of no value. In forecasting for the current quarter (h=1), the asymptotic tests indicate that almost all forecasters are making sensible forecasts. This number decreases from 22 to 2 as the horizon increases to 5 quarters, even though the total number of forecasters increases from 22 to 27. The bootstrap tests show that significantly fewer forecasters are producing sensible forecasts at short horizons. Only about 12 out of 22 forecasters pass the bootstrap test when h=1. More forecasters pass Welch's and Yuen's asymptotic tests than Hall's test or Johnson's test, as is clearly seen at horizons 1 to 3. But almost the same number of forecasters pass the bootstrap test regardless of which version of the test we use. Interestingly, forecasters passing the asymptotic test are quite often not those who pass the bootstrap test. This is particularly true at long horizons. When h=5, no one who passes the asymptotic tests passes the bootstrap test, and no one who passes the bootstrap tests passes the asymptotic ones. To see if the forecasters identified as making forecasts with value actually outperform the forecasters who fail the tests, we compare the mean forecasts of the two groups against the actual values. Figure 1 shows the mean forecasts of the two groups
defined using the results of Hall's test for one-quarter-ahead forecasts from 1968 quarter 4 to 1984 quarter 3³. We can see from the figure that the mean forecasts from the group whose members pass the test are higher for most of the shaded periods, such as 1974 quarter 3 to 1975 quarter 2 and 1981 quarter 2 to 1982 quarter 2. For other quarters, the two forecasts are rather close. In addition, for almost all major recession periods, the forecast probabilities produced by the group whose members pass the test increase significantly several quarters before those from the other group do, indicating better foresight from the passing group. So the visual inspection does give us results consistent with the test results.

Figure 1: Comparison of mean forecasts from two groups
Shaded area indicates a drop in real GDP. Solid line is the mean forecast from the group whose members pass Hall’s test. Dashed line is the mean forecast from the group whose members fail Hall’s test.
³ After 1984 quarter 4, all forecasters who meet the participation requirement pass the test, so there is no comparison group.

5. Summary
In this paper, we propose to test the value of probability forecasts in the framework of Merton (1981). Four versions of the test are constructed based on Welch's (1947), Yuen's (1974), Johnson's (1978), and Hall's (1992) improved t-tests. Small sample properties of the tests are examined in a set of Monte Carlo experiments, and then the tests are applied to data containing probability forecasts of decline in real GDP from the US Survey of Professional Forecasters. Our simulation results show that all four tests are in general robust to varying sample sizes, variances, autocorrelations, and amounts of trimming applied in testing. Except for simulated samples with extreme characteristics, all tests show satisfactory power but are slightly over-sized. Johnson's test and Hall's test are often better sized, but Welch's test and Yuen's test are slightly more powerful. Our empirical analysis using SPF data shows that the number of forecasters who are able to make sensible forecasts of the probability of real GDP decline decreases as the horizon increases. Asymptotic tests show that almost all forecasters are making good forecasts about the current quarter; bootstrap tests show only half of them do. At horizons as long as 4 to 5 quarters, only 2 or 3 forecasters are able to make good forecasts, according to both the asymptotic and bootstrap tests.
References
Agresti, A. (1992). A survey of exact inference for contingency tables. Statistical Science, 7, 131-153.
Clements, M. P., & Harvey, D. I. (2010). Forecast encompassing tests and probability forecasts. Journal of Applied Econometrics, 25(6), 1028-1062.
Clements, M. P., & Harvey, D. I. (2011). Combining probability forecasts. International Journal of Forecasting, 27(2), 208-223.
Diebold, F. X., & Mariano, R. (1995). Comparing predictive accuracy. Journal of Business & Economic Statistics, 13, 253-263.
Easaw, J. Z., Garratt, D., & Heravi, S. M. (2005). Does consumer sentiment accurately forecast UK household consumption? Are there any comparisons to be made with the US? Journal of Macroeconomics, 27, 517-532.
Granger, C. W. J., & Pesaran, M. H. (2000). Economic and statistical measures of forecast accuracy. Journal of Forecasting, 19, 537-560.
Hall, P. (1992). On the removal of skewness by transformation. Journal of the Royal Statistical Society, Series B, 54(1), 221-228.
Henriksson, R. D., & Merton, R. C. (1981). On market timing and investment performance. II. Statistical procedures for evaluating forecasting skills. Journal of Business, 54(4), 513-533.
Johnson, N. J. (1978). Modified t tests and confidence intervals for asymmetrical populations. Journal of the American Statistical Association, 73(363), 536-544.
Keselman, H. J., Othman, A. R., Wilcox, R. R., & Fradette, K. (2004). The new and improved two-sample t test. Psychological Science, 15(1), 47-51.
Luh, W. M., & Guo, J. H. (2002). Using Johnson's transformation with approximate test statistics for the simple regression slope homogeneity. The Journal of Experimental Education, 71(1), 69-81.
McKenzie, E. (1985). An autoregressive process for beta random variables. Management Science, 31(8), 988-997.
Merton, R. C. (1981). On market timing and investment performance. I. An equilibrium theory of value for market forecasts. Journal of Business, 54(3), 363-406.
Ranjan, R., & Gneiting, T. (2010). Combining probability forecasts. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72(1), 71-91.
Pesaran, M. H., & Timmermann, A. G. (1992). A simple nonparametric test of predictive performance. Journal of Business & Economic Statistics, 10, 461-465.
Pesaran, M. H., & Timmermann, A. G. (2009). Testing dependence among serially correlated multi-category variables. Journal of the American Statistical Association, 104, 325-337.
Seillier-Moiseiwitsch, F., & Dawid, A. P. (1993). On testing the validity of sequential probability forecasts. Journal of the American Statistical Association, 88(421), 355-359.
Welch, B. L. (1947). The generalization of 'Student's' problem when several different population variances are involved. Biometrika, 34(1/2), 28-35.
Yuen, K. K. (1974). The two-sample trimmed t for unequal population variances. Biometrika, 61(1), 165-170.