arXiv:1502.05367v1 [stat.ME] 18 Feb 2015
Detecting weak signals with record statistics

Damien Challet,1,2∗
1 Laboratoire de mathématiques appliquées aux systèmes, CentraleSupélec, Grande Voie des Vignes, 92295 Châtenay-Malabry, France
2 Encelade Capital SA, EPFL Innovation Park, Building C, 1015 Lausanne, Switzerland
∗ To whom correspondence should be addressed; e-mail: [email protected].
Counting the number of local extrema of the cumulative sum of data points yields the R-test, a new single-sample non-parametric test. Numerical simulations indicate that the R-test is more powerful than Student's t-test for semi-heavy- and heavy-tailed distributions, equivalent for Gaussian variables, and slightly worse for uniformly distributed variables. Finally, the R-test has greater power than the single-sample Wilcoxon signed-rank test in most cases.

Detecting weak signals in noisy data is of utmost importance in all fields of science and technology. Accordingly, many methods have been devised to filter noisy data or to test for the presence of a signal in noisy data. If the data have a regular temporal or spatial structure, ad-hoc methods are able to detect very weak signals, see e.g. (1, 2). A more difficult task is to detect weak signals in data that do not have any structure. When the data are purely Gaussian, Student's t-test is the uniformly most powerful test (3). Most types of data do not follow a Gaussian distribution, but because of the vast domain of validity of the Central Limit Theorem, the t-test performs relatively well as long as the tails of the data distribution are not too heavy. Signed-rank Wilcoxon tests (4) are powerful alternatives to the t-test in some cases.
Let us write down the definition of the t-test in order to fix the notations. Take N samples of a quantity of interest, denoted by $x_n$, $n = 1, \cdots, N$, assumed to be independently identically distributed (iid). Denoting an estimate with a hat, the t-statistic of the sample is $\hat t = \hat\theta \sqrt{N}$, where $\hat\theta = \hat\mu/\hat\sigma$ is its estimated signal-to-noise ratio (SNR thereafter), $\hat\mu$ its estimated average and $\hat\sigma$ its estimated standard deviation. The t-test stops being optimal as soon as the data are not strictly Gaussian. This is due for example to the computation of moments of the data, which becomes problematic if their distribution has heavy tails. Here I propose instead a new test, the R-test, based on a measure that only indirectly depends on the exact values of $\{x_n\}$. Let me define the integrated signal $X_N = \sum_{n=1}^{N} x_n$, which is nothing else than a random walk. A remarkable result about unbiased
random walks, based on the Sparre Andersen theorem (5), is that the distribution of the number of upper records (local maxima) in N steps, denoted by $R_+$, does not depend on the distribution of $x_n$ as long as it is symmetric (i.e. $x$ and $-x$ are equiprobable) and continuous. This distribution is known exactly and is given by

$$P(R_+, N) = \binom{2N - R_+ + 1}{N}\, 2^{-(2N - R_+ + 1)}, \qquad (1)$$

which tends to a Gaussian distribution $\mathcal{N}\left(\sqrt{4N/\pi},\,(4-\pi)N\right)$ for large N (6). For symmetry reasons, the number of lower records (local minima), denoted by $R_-$, has the same distribution. Up to this point, these quantities have a major flaw: they offer limited precision, hence detection accuracy, because the number of records is by definition an integer. The key idea is to note that for iid data $x_n$, the integrated signal of any permutation of $\{x_n\}$ is as valid a representation as $X_N$. Thus one should compute the average number of records $R_\pm$ over $P \gg 1$ permutations, denoted by $\bar R_\pm$. One can make the test more symmetric by measuring $\bar R_0 = \bar R_+ - \bar R_-$. As we shall see, the statistics of both $\bar R_+$ and $\bar R_-$ provide a test with the same power.
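As a quick numerical check (an illustrative Python sketch, not the paper's released R/C++ code), one can count the upper records of the cumulative sum and verify that their distribution follows Eq. (1) regardless of the symmetric continuous step distribution. The convention below, which counts the starting point $X_0 = 0$ as the first record, is an assumption chosen to match the support $1, \ldots, N+1$ of Eq. (1):

```python
import numpy as np
from math import comb

def n_upper_records(x):
    """Number of upper records of the walk X_0 = 0, X_n = x_1 + ... + x_n.
    The starting point counts as the first record; ties have probability
    zero for continuous step distributions."""
    walk = np.concatenate(([0.0], np.cumsum(x)))
    return int(np.sum(walk == np.maximum.accumulate(walk)))

def p_record(r, n):
    """Eq. (1): P(R+ = r) for an n-step symmetric continuous random walk."""
    return comb(2 * n - r + 1, n) / 2 ** (2 * n - r + 1)

N = 50
# exact mean record number from Eq. (1); close to sqrt(4N/pi) for large N
exact_mean = sum(r * p_record(r, N) for r in range(1, N + 2))

rng = np.random.default_rng(1)
empirical_means = []
# two very different symmetric continuous step distributions,
# one of them (Cauchy) without even a finite mean
for draw in (rng.standard_normal, rng.standard_cauchy):
    counts = [n_upper_records(draw(N)) for _ in range(10000)]
    empirical_means.append(float(np.mean(counts)))

print("exact mean:", exact_mean, "empirical means:", empirical_means)
```

Both empirical averages agree with the exact mean within Monte Carlo error, illustrating the distribution-free character of the record count.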
There is no theoretical result regarding the distributions of $\bar R_+$, $\bar R_-$ or $\bar R_0$. For the R-test to qualify as non-parametric, the distribution of its statistics must be independent of that of the data, which is likely because of the universality of Eq. (1). This is confirmed, at least in a first approximation, in Fig. 1.

One can nevertheless perform extensive numerical simulations to determine the statistics of $\bar R_+$, $\bar R_-$ or $\bar R_0$ for $\theta = 0$. Then, computing them again for a given value of $\theta \neq 0$, one can characterize the discrimination power of the R-statistics by computing their Receiver Operating Characteristic (ROC) curves, in order to compare their power with that of the t-test and of the single-sample Wilcoxon signed-rank test (7). All the measures of records have indistinguishable ROC curves and are consistently either better (i.e. above) than or, at least in a first approximation, equivalent to the t-test in the case of Gaussian variables; the R-test is slightly worse than the t-test for uniform distributions (Fig. 2). The Area Under Curve (AUC) is a good scalar summary (the larger, the better) that allows one to compare two ROC curves provided that one is constantly above the other, which is the case between the R-test and the t-test. Results are reported in Figs 3 and 4, which display the AUCs of various distributions for the five tests versus $\theta$ or versus distribution parameters. The R-test is consistently equivalent or superior to the t-test, except for uniform distributions; I have checked that this is true over a wide range of values of $\theta$. The relative power of the single-sample signed-rank Wilcoxon test depends much on the distribution and on its parameters; this test outperforms the R-test only if the data come from a distribution with heavy-enough tails, such as Student's t-distribution with 2.5 degrees of freedom, and only for some values of the signal-to-noise ratio (Fig. 4). While promising, the R-test needs additional attention.
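The simulation protocol above can be sketched in a few lines of Python (an illustrative fragment, not the paper's code; the sample, permutation and repetition counts are reduced from the paper's settings for speed, and the AUC is computed directly as the probability that the statistic under signal exceeds the statistic under the null):

```python
import numpy as np

rng = np.random.default_rng(0)
N, n_samples, n_perm = 100, 300, 100  # smaller than the paper's 10000/10000

def upper_records(x):
    # upper records of the walk X_0 = 0, X_n = x_1 + ... + x_n
    walk = np.concatenate(([0.0], np.cumsum(x)))
    return int(np.sum(walk == np.maximum.accumulate(walk)))

def r_bar_plus(x):
    # average record count over random permutations of the sample
    return np.mean([upper_records(rng.permutation(x)) for _ in range(n_perm)])

def t_stat(x):
    return np.sqrt(len(x)) * x.mean() / x.std(ddof=1)

def auc(null, alt):
    # P(statistic under signal > statistic under null) = area under the ROC curve
    return float(np.mean(alt[:, None] > null[None, :]))

def simulate(theta):
    # standardized Student-t (nu = 3) noise shifted by the SNR theta
    t_vals, r_vals = [], []
    for _ in range(n_samples):
        x = theta + rng.standard_t(3, N) / np.sqrt(3)
        t_vals.append(t_stat(x))
        r_vals.append(r_bar_plus(x))
    return np.array(t_vals), np.array(r_vals)

t0, r0 = simulate(0.0)    # null: theta = 0
t1, r1 = simulate(0.11)   # signal: theta = 0.11, as in Fig. 2
auc_t, auc_r = auc(t0, t1), auc(r0, r1)
print(f"AUC t-test: {auc_t:.3f}  AUC R-test: {auc_r:.3f}")
```

With these reduced settings the estimated AUCs are noisier than those in Figs 3 and 4, so small differences between the two tests should not be over-interpreted.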
The first important open question is to assess rigorously whether the test is truly non-parametric. Secondly, one should determine the asymptotic power of the R-test relative to the t-test and show under what conditions the R-test is more powerful. Thirdly, one should generalize it to the case of correlated measurements. Practically, instead of considering all permutations, one should permute blocks of data instead of single values, much like block bootstraps (e.g. (8–10)). On the theoretical side, one should try to use the recent generalization of the Sparre Andersen theorem to correlated variables (11). Finally, one can design two-sample tests based on record statistics. Theoretical results (12) show indeed that the number of co-records is also universal for distributions with finite variance, which suggests a straightforward way to build a non-parametric two-sample test based on record statistics and permutations.

The author thanks Gilles Faÿ for his suggestions and declares no conflict of interest. Full source code (R and C++) is available at https://github.com/damienchallet/rtest
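The block-permutation idea for correlated data can be sketched as follows (an illustrative Python fragment, not the paper's code; the block length is a tuning parameter, as in block bootstraps, and the trailing remainder shorter than one block is simply dropped here):

```python
import numpy as np

def block_permute(x, block_len, rng):
    """Permute contiguous non-overlapping blocks of x, so that short-range
    dependence within each block is preserved while block order is randomized.
    Any trailing remainder shorter than block_len is dropped for simplicity."""
    n = len(x) // block_len * block_len
    blocks = x[:n].reshape(-1, block_len)
    return blocks[rng.permutation(len(blocks))].reshape(-1)

rng = np.random.default_rng(2)
x = np.arange(10.0)
y = block_permute(x, 3, rng)  # 3 blocks of length 3; the last value is dropped
print(y)
```

The record statistics would then be averaged over such block permutations instead of full permutations, exactly as in the iid case.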
References and Notes

1. D. L. Donoho, IEEE Transactions on Information Theory 52, 1289 (2006).
2. C. Maccone, Deep Space Flight and Communications: Exploiting the Sun as a Gravitational Lens (Springer Science & Business Media, 2009).
3. J. Neyman, E. Pearson, Philosophical Transactions of the Royal Society of London (1933).
4. F. Wilcoxon, Biometrics Bulletin 1, 80 (1945).
5. E. S. Andersen, Mathematica Scandinavica 1, 263 (1953).
6. S. N. Majumdar, R. M. Ziff, Phys. Rev. Lett. 101, 050601 (2008).
7. T. Hastie, et al., The Elements of Statistical Learning (Springer, 2009).
8. E. Carlstein, The Annals of Statistics 14, 1171 (1986).
9. H. R. Künsch, The Annals of Statistics 17, 1217 (1989).
10. R. LePage, L. Billard, eds., Exploring the Limits of Bootstrap, vol. 270 of Wiley Series in Probability and Statistics (Wiley, 1992), pp. 263–270.
11. R. Artuso, G. Cristadoro, M. Degli Esposti, G. Knight, Phys. Rev. E 89, 052111 (2014).
12. G. Wergen, S. N. Majumdar, G. Schehr, Phys. Rev. E 86, 011119 (2012).
[Figure: densities of $\bar R_+$ (left panel) and $\bar R_0$ (right panel) for Student ν = 2.5, Gaussian, exponential and uniform data.]
Fig. 1. Statistics of $\bar R_+$ and $\bar R_0$ for random variables from various distributions with zero signal (θ = 0). Both $\bar R_+$ and $\bar R_-$ have the same statistics for symmetry reasons. Random variables from the exponential distribution are multiplied by a random sign. N = 100, 10000 samples per point, 10000 permutations per sample.
[Figure: ROC curves (TPR vs FPR) for the t-test, R+-test, R−-test, R0-test and Wilcoxon test; panels: Gauss, Exponential, Uniform, Student ν = 3.]
Fig. 2. True Positive Rate (TPR) versus False Positive Rate (FPR) (ROC plot) for various distributions of $\{x_n\}$ with μ = 0.11 and σ = 1. 10000 samples of length N = 100, 10000 permutations per sample.
[Figure: AUC vs SNR for the R+, R−, t, Wilcoxon and R0 statistics; panels: Gauss, Exponential, Uniform.]
Fig. 3. Area under curve (AUC) vs the signal-to-noise ratio θ = μ/σ for various types of distributions of $\{x_n\}$. N = 100, 10000 samples per point, 10000 permutations per sample. Error bars set at two standard deviations. Continuous lines are only meant for eye guidance.
[Figure: AUC vs ν for the R+, R−, t, Wilcoxon and R0 statistics; panels: SNR = 0.01, 0.11, 0.21.]
Fig. 4. Student t-distributed $\{x_n\}$. Area under curve (AUC) vs number of degrees of freedom ν for various values of θ. N = 100, 10000 samples per point, 10000 permutations per sample. Error bars set at two standard deviations. Continuous lines are only meant for eye guidance.