broad class of time series data using the Latent Variable Model popular in matrix estimation problems. This broad class
Time Series Forecasting via Matrix Estimation Anish Agarwal, Muhammad Jehangir Amjad, Devavrat Shah, and Dennis Shen ∗ LIDS, ORC, SDSC at Massachusetts Institute of Technology
Abstract In this paper, we address the problem of forecasting a time series. We begin by introducing a unified model for a broad class of time series data, subsuming popular models such as stationary auto-regressive, periodic, and trend, and any additive mixture of these processes. As a main contribution, we cast this family of time series as a matrix, which enables us to view both interpolation and extrapolation from a matrix estimation point of view. Surprisingly, despite the generality of our proposed model, we find that these stochastic processes admit an approximately low rank structure. This motivates us to translate the ageold problem of time series forecasting into a simple linear regression problem. Additionally, the matrix completion lens allows us to handle missing and/or noisy observations via spectral methods. As a result, we propose a principled, robust algorithm with provable guarantees that produces forecasts in the presence of missing/noisy data. Our algorithm first “de-noises” the data via singular value thresholding and then learns the a relationship between the rows of the estimated low rank matrix through linear regression. Theoretically, we show the de-noising procedure accurately imputes and filters missing and noisy observations, respectively. It also produces a consistent estimator of the time series if p = Ω(T −1+ζ ) for some ζ > 0; here p is the proportion √ of observed data from T . We also prove that our forecasting RMSE scales as O(σ/ p) plus a polynomially decaying term, where σ is standard deviation of the inherent noise. Using a pre-processing method, we show that our estimator becomes consistent. Experimentally, we showcase the efficacy of our algorithm both on real-world and synthetic datasets. In the presence of missing data and high levels of noise, we find that our algorithm vastly outperforms existing methods.
1
Introduction
Time series data is of enormous interest across all walks of life but time series analysis and forecasting are hard problems, albeit well-studied. Time-dependence is modeled in several ways: periodicity or seasonality, trends, stationary auto-regressive, and their mixtures. The standard approach to forecasting these mixture models is to iteratively “peel off” trend, periodic, and auto-regressive model structures to produce a final estimate. This is both painstaking and computationally inefficient. Unfortunately, there exists no “universal” time series model or estimation algorithm. In addition, there are no standard prescriptions to combat issues of missing data. Primary contributions. As our main contribution, we present a “universal” approach to modeling a broad class of time series data using the Latent Variable Model popular in matrix estimation problems. This broad class includes mixtures of autoregressive-moving average (ARMA), periodic, and trend processes as a special case. In addition, our “universal” approach can accommodate the case of multiple time series as well. Under the suggested model class, we transform the time series data into an appropriate matrix which can be represented as as instance of the Latent Variable Model. Somewhat surprisingly, show that the mean matrix of interest is (approximately) low rank, allowing us to view time series imputation and forecasting via the lens of matrix completion. This leads us to a two-step algorithm: firstly, using singular value thresholding, we “de-noise” the observation matrix to produce an estimate of the low-rank matrix of interest and secondly, we solve a least squares problem on the estimated low-rank matrix to uncover a linear forecasting relationship. The first step naturally imputes missing entries, leaving us with a single imputation and forecasting algorithm applicable to a rich class of time series data. As a theoretical contribution, we provide a bound on the average root-mean-square error (RMSE) of our estimates of the true means; we also *Author emails are {anish90, mamjad, devavrat, deshen}@mit.edu
establish asymptotic consistency, under certain conditions. Finally, we produce results from various experiments to establish the universality of our approach vis-a-vis mixture time series models and its robustness to missing data for both imputation and forecasting. Related works. Refer to Appendix B.
2
Time Series as a Matrix Model
Ee start by formally defining a class of time series families that we will consider, and then demonstrating how to represent all such time series in matrix form. We argue that the resulting matrix structure turns out to be (approximately) low rank (Appendix E), which motivates the algorithm presented in the following section. A unified model. We restrict our attention to discrete-time time series data: let X(t) ∈ R denote the value of the time series at time t ∈ Z. We consider the following class of time series: over an infinite interval T = {. . . , −2, −1, 0, 1, . . . , } (with notation X[s: t] = (X(s), . . . , X(t))) ∞ n o X d F T = X[1: T ] : X(t) = η(t; γ) + Ψi ε(t − i), ∀t ∈ [T ] ,
(1)
i=0
where η(t; α) is any deterministic Lipschitz function of time, t, and γ is a finite-dimensional latent parameter vector; ε(t) are i.i.d random P∞ variables representing noise terms with E[ε(t)] = 0 and finite variance Var(ε(t)) ≤ σ 2 ; and i=1 |ψi |≤ C for some positive constant C, i.e. the sequence ψi is absolutely summable with ψ0 = 1. Although the time series is generated over the infinite interval T , we only have access to a finite amount of data from X[1: T ]. The expectation of X(t) per F T is, E[X(t)] = η(t; γ). We claim that popular time-series models, e.g. ARMA, periodic, trend, and their additive mixtures belong to the model class F T . Refer to Appendix D for the derivations. Time series as a matrix. Given time series observations over T time steps, X[1 : T ], we construct an L × N matrix X = [Xij ] with N = bT /Lc, where column i ∈ [N ] of X is X[(i − 1)L + 1: iL]. In words, we divide the time series into N non-overlapping, contiguous intervals each of length L, and employ each interval as a distinct column within the matrix to construct the L × N dimension matrix X. By definition, each element of the matrix corresponds to a distinct element within the time series, i.e. Xij = X(i + (j − 1)L). A latent variable model. The following set of characterizations apply to a matrix X that is constructed from any time series in the model class F T and via the process described above. From the definition of F T , such an X can be decomposed as X=M+
∞ X
ψh E(h) ,
(2)
h=0
where the L × N matrices M
= [Mij ] and Eh
Mij = E[X(t)] = η(t; γ) where t = i + (j − 1)L; (b) ∀h, (h) E[ij ]
(h) Var(ij )
(h)
= [ij ] obey the following: (h) ij
(a)
are i.i.d random variables
(h) ij ,
2
with = 0 and ≤ σ . Specifically, is the noise at time t = i + (j − 1)L − h. P∞ Similarly, we can express the conditional expectation of X as Mcon = M + h=1 ψh E(h) , where P∞ 0 0 Mcon = [Mij ] obeys: Mij = E[X(t)|It−1 ] = η(t; γ) + i=1 ψi ε(t − i) with t = i + (j − 1)L. Under this setting, we show that each entry in M can be expressed through a latent variable model representation, i.e. each Mij is a function of its latent row and feature vectors. This key modeling result is summarized as Proposition E.1 in Appendix E. Low-Rank Approximation. Given the above it can be shown that the mean matrix, M , can be well-approximated by a low-rank matrix M(r) . This allows us to motivate a forecasting algorithm which relies on learning a linear relationship between the rows of M(r) . For details, refer to the comprehensive development in Appendix E.
3
The Algorithm
We describe an algorithm that takes as input the L × N matrix of observations, X(k) , obtained by transforming the time series X[1: T ] as described in Section 2, and a parameter µ whose choice 2
is discussed in Appendix C. For simplicity, we assume the entries of X(k) and M(k) for all k are (k) (k) bounded by one in absolute value, i.e. Xij ≤ 1 and Mij ≤ 1. Refer to Appendix C for intuition and a detailed discussion on the finer print on the algorithm details, e.g. how to choose L, how to pick the threshold µ and more. Algorithm 1 Time Series Forecasting via Matrix Completion Step 1. De-noising the data: singular value thresholding (inspired by [11]). (k)
(k)
− 1. For k ∈ [L], define the (L − 1) × N matrix Y(k) = [Yij ]i 0. Forecasting analysis. In order to achieve a desirable forecasting performance, we aim to learn the optimal “model” that can be used to predict future values of the time series. In our setting, this ˆ − )T βˆ(k) is “close” to M (k) , the last row of the ˆ (k) = (M translates to learning βˆ(k) so that M L
(k)
L
unknown mean matrix M(k) for all k. That is, our performance metric of interest is h 1/2 i ˆ (k) ) := √1 E PN (M ˆ (k) − M (k) )2 RMSE(M . L Li Li i=1 N
3
Theorem 4.2. Let M be defined as above with δ = L− for any > 0. Suppose p = Ω(N −1+ζ ) for some ζ > 0 and C is known. Then for any k ∈ [L] and using µ as defined in (5) √ C1 2 2 ˆ (k) ) ≤ √ RMSE(M (C σ + (1 − p))1/2 + O(1/L/2 ) + O(1/ N ), L p where C1 is a universal positive constant.
(4)
Using our prescribed universal threshold, which adroitly captures the signal and discards misleading randomness, the resulting inequality is a function solely of the inherent noise, σ 2 , and the error derived from missing data. In the event of plentiful data, one would hope to overcome this error term. This motivates a useful preprocessing step which can help achieve asymptotic consistency, as described in Appendix G.
5
Experiments (Real Datasets)
We conduct several experiments, both on real world (this section) and synthetic datasets (Appendix A), to highlight our algorithm’s ability to accurately forecast a time series, with and without missing data, and impute missing values in comparison to the state-of-the-art. For all experiments, we compare our algorithm to the state-of-the-art time series forecasting library for R which decomposes stationary auto-regressive, seasonal and trend components. Bitcoin. Figures 1a and 1b show the forecasts for Bitcoin prices (in Yuans) in March which demonstrates classical auto-regressive properties. We provide a week’s data to learn and forecast over two days, with a regular frequency of 30s. Figure 1a shows that our algorithm and the R library appear to do an excellent job of predicting the future even with 50% data missing. Figure 1b shows the RMSE of the predictions as a function of p which clearly underscores our algorithm’s strength, especially with missing data. Google Flue Trends (Peru). Figures 2a and 2b show the forecasts for Google flu search-trends in Peru which shows significant seasonality. We provide weekly data from 2003-2012 to learn and then forecast for each week in the next three years. Figure 2a shows that our algorithm clearly outperforms R when predicting the future with 30% data missing. Figure 2b shows the RMSE of the predictions as a function of p which once again confirm that our algorithm outperforms the R library. Our experiments done with synthetically generated datasets are summarized in Appendix A.
(a) Price predictions for Bitcoin.
(b) RMSE for Bitcoin predictions.
Figure 1: Bitcoin price forecasts and RMSE as a function of p.
(a) Flu trends predictions (Peru).
(b) RMSE flu trend predictions.
Figure 2: Peru’s Google flu trends forecasts and RMSE as a function of p.
4
References [1] E. Abbe and C. Sandon. Community detection in general stochastic block models: Fundamental limits and efficient algorithms for recovery. In Foundations of Computer Science (FOCS), 2015 IEEE 56th Annual Symposium on, pages 670–688. IEEE, 2015. [2] E. Abbe and C. Sandon. Detection in the stochastic block model with multiple clusters: proof of the achievability conjectures, acyclic bp, and the information-computation gap. Advances in neural information processing systems, 2016. [3] E. M. Airoldi, T. B. Costa, and S. H. Chan. Stochastic blockmodel approximation of a graphon: Theory and consistent estimation. In Advances in Neural Information Processing Systems, pages 692–700, 2013. [4] A. Anandkumar, R. Ge, D. Hsu, and S. Kakade. A tensor spectral approach to learning mixed membership community models. In Conference on Learning Theory, pages 867–881, 2013. [5] O. Anava, E. Hazan, and A. Zeevi. Online time series prediction with missing data. In D. Blei and F. Bach, editors, Proceedings of the 32nd International Conference on Machine Learning (ICML-15), pages 2191–2199. JMLR Workshop and Conference Proceedings, 2015. [6] L. E. Baum and T. Petrie. Statistical inference for probabilistic functions of finite state markov chains. The annals of mathematical statistics, 37(6):1554–1563, 1966. [7] C. Borgs, J. T. Chayes, H. Cohn, and S. Ganguly. Consistent nonparametric estimation for heavy-tailed sparse graphs. arXiv preprint arXiv:1508.06675, 2015. [8] J. Box and Reinsel. Time Series Analysis, Forecasting and Control. Prentice Hall, Englewood Clifs, NJ, 3rd edition, 1994. [9] P. J. Brockwell and R. A. Davis. Time series: theory and methods. Springer Science & Business Media, 2013. [10] E. J. Candès and T. Tao. The power of convex relaxation: Near-optimal matrix completion. IEEE Transactions on Information Theory, 56(5):2053–2080, 2010. [11] S. Chatterjee. Matrix estimation by universal singular value thresholding. Ann. Statist., 43:177– 214, 2015. [12] S. Chatterjee. Matrix estimation by universal singular value thresholding. The Annals of Statistics, 43(1):177–214, 2015. [13] Y. Chen and M. J. Wainwright. Fast low-rank estimation by projected gradient descent: General statistical and algorithmic guarantees. arXiv preprint arXiv:1509.03025, 2015. [14] T. M. Cover. Behavior of sequential predictors of binary sequences. Technical report, DTIC Document, 1966. [15] M. A. Davenport, Y. Plan, E. van den Berg, and M. Wootters. 1-bit matrix completion. Information and Inference, 3(3):189–223, 2014. [16] W. Dunsmuir and P. Robinson. Estimation of time series models in the presence of missing data. Journal of the American Statistical Association, 76(375):560–568, 1981. [17] J. Durbin and S. J. Koopman. Time series analysis by state space methods, volume 38. OUP Oxford, 2012. [18] M. Feder, N. Merhav, and M. Gutman. Universal prediction of individual sequences. Information Theory, IEEE Transactions on, 38(4):1258–1270, 1992. [19] J. D. Hamilton. Time series analysis, volume 2. Princeton university press Princeton, 1994. [20] R. E. Kalman et al. A new approach to linear filtering and prediction problems. Journal of basic Engineering, 82(1):35–45, 1960. 5
[21] R. H. Keshavan, A. Montanari, and S. Oh. Matrix completion from a few entries. IEEE Transactions on Information Theory, 56(6):2980–2998, 2010. [22] R. H. Keshavan, A. Montanari, and S. Oh. Matrix completion from noisy entries. Journal of Machine Learning Research, 11(Jul):2057–2078, 2010. [23] C. E. Lee, Y. Li, D. Shah, and D. Song. Blind regression: Nonparametric regression for latent variable models via collaborative filtering. In Advances in Neural Information Processing Systems 29, pages 2155–2163, 2016. [24] S. Negahban and M. J. Wainwright. Estimation of (near) low-rank matrices with noise and high-dimensional scaling. The Annals of Statistics, pages 1069–1097, 2011. [25] B. Recht. A simpler approach to matrix completion. Journal of Machine Learning Research, 12(Dec):3413–3430, 2011. [26] J. Rissanen. Universal coding, information, prediction, and estimation. Information Theory, IEEE Transactions on, 30(4):629–636, 1984. [27] D. S. S. Robert H. Shumway. Time Series Analysis and It’s Applications. Blue Printing, 3rd edition, 2015. [28] J. Schmidhuber. Learning complex, extended sequences using the principle of history compression. Neural Computation, 4(2):234–242, 1992. [29] P. C. Shields. The interactions between ergodic theory and information theory. In IEEE Transactions on Information Theory. Citeseer, 1998. [30] R. H. Shumway and D. S. Stoffer. An approach to time series smoothing and forecasting using the em algorithm. Journal of time series analysis, 3(4):253–264, 1982. [31] D. Steurer and S. Hopkins. Efficient estimation from few samples: Stochastic block models and related problems. personal communications, 2017. [32] H.-F. Yu, N. Rao, and I. S. Dhillon. Temporal regularized matrix factorization for highdimensional time series prediction. pages 847–855, 2016. [33] Y. Zhang, E. Levina, and J. Zhu. Estimating network edge probabilities by neighborhood smoothing. arXiv preprint arXiv:1509.08588, 2015.
6
A
Synthetic Experiments.
In this section we discuss experiments performed with synthetic data to illustrate the imputation and forecasting prowess of our algorithm in comparison with the state-of-the-art. All experiments discussed next are for various mixture time series models containing auto-regressive, periodic and trend components with additive Gaussian noise. All observations are normalized to the range [−1, 1]. Forecasts. Figures 3a-3c visually depict the predictions from our algorithm when compared to the state-of-the-art time series forecasting library in R which decomposes stationary auto-regressive, seasonal and trend components. We provide the R library the number of lags to search over, in effect making its job easier. It is noticeable that the forecasts from the R library always experience higher variance. As p becomes smaller, the R library’s forecasts also contain an apparent bias. Under all levels of missing data, our algorithm appears to produce smooth estimates which appear preferable to the predictions produced by the R library. These visual findings are confirmed in Figure 4b which shows that our algorithm’s prediction RMSE is always better than that of the R library’s (for the mixture of periodic and trend components only).
(a) 70% data missing.
(b) 50% data missing.
(c) No data missing. Figure 3: Plots for three levels of missing data (p ∈ {0.3, 0.5, 1})showing the original time series (means) and forecasts produced by the R-library (Baseline) and our algorithm.
Imputation. Figure 4a shows that our algorithm outperforms the state-of-the-art AMELIA library for multiple time series imputation at all levels of missing data. AMELIA is better than the baseline which imputes all missing values with the mean of the observations, but performs much worse than our algorithm. Note that this experiment involved multiple time series where the outcome variable of interest and the log of its squared power were also included. The additional time series components were included to help AMELIA impute missing values.
(a) Imputation RMSE (mixture AR, periodic, trend).
(b) Prediction RMSE (mixture periodic, trend).
Figure 4: Plots showing the Imputation and Prediction RMSE as a function of p.
7
B
Related Literature.
The problem of matrix estimation / completion deals with trying to estimate the entries of the matrix based on noisy, partial observations. This has become of great interest due to its connection to recommendation systems (cf. [21, 22, 24, 13, 12, 23, 10, 21, 25, 13, 15]), social network analysis (cf. [1, 2, 1, 4, 31]) and graph learning (graphon estimation) (cf. [3, 33, 7]). The key realization of this rich literature is that one can do estimation of the true underlying matrix from noisy, partial observations by simply taking a low-rank approximation of the observed data. We refer an interested reader to recent work by Chatterjee [12] and references there in. The question of time series forecasting is, potentially as old as civilization in some form. Few textbook style references include [9, 8, 19, 27]. At the highest level, the time series modeling primarily involves viewing a given time series as a function indexed by time (integer or real values) and the goal of model learning is simply to identify this function from observations (over finite intervals). Given that the space of such functions is complex, the task is to utilize function form (i.e. “basis functions”) so that for the given setting, the time series observation can fit a sparse representation. For example, in communication and signal processing, the harmonic or Fourier representation of time series has been popularly utilized due to the fact that signals communicated are periodic in nature. Other popular forms are “trend” modeled through polynomial basis and “stationary” or “filters” modeled through ARMA and its variants e.g. ’Autoregressive Conditional Heteroskedasticity (ARCH). There are rich connections to theory of stochastic processes and information theory cf. [14, 29, 26, 18]. A very popular form of time series model with latent structure is the Hidden Markov Model (HMM) in probabilistic form, cf. [20, 6] and Recurrent Neural Network (RNN) its deterministic form, cf. [28]. The question of learning time series models with missing data has received comparatively less attention. A common approach to dealing with such a setting is to utilize HMMs or general StateSpace-Models (including AR) and using them to learn the with missing data, cf. [16, 30]. To the best of the authors’ knowledge, most work in this literature is restricted to such class of models, cf. [17]. Recently, building on the literature in online learning, sequential approaches have been proposed to address prediction with missing data, cf. [5]. A related work, cf. [32], also applies matrix methods to time series forecasting. First a matrix is created using multiple time series, and then a low-rank approximation of it is computed. As the second step, it learns the model relationship between singular vectors associated with columns. A key difference is that our method is agnostic to the underlying data generating process, for a broad class of time series dynamics. In comparison, the regression step in cf. [32] differs greatly based on the belief of what the underlying generating process is. In addition, our method can easily accommodate multiple time series by by simply appending other time series (and their lagged versions) as additional rows. We cannot see a way the algorithm in [32] can deal with missing entries with a single time series unless our matrix transformation is utilized.
8
C C.1
The Algorithm Intuition and Fine Print. Intuition.
Let the matrix X be obtained through the transformation described in Appendix E. The definition of F T suggests the model learning process into two segments: (a) estimating η(t; γ) and (b) Pdividing ∞ estimating i=0 Ψi ε(t − i). Amalgamating the estimates from (a) and (b) provides an estimate for Xt |It−1 . The propositions in Appendix E suggest a simple two-step algorithm for estimating η(t; γ): (i) ˆ ; (ii) find a set “de-noise" the matrix of observations, X, to obtain an estimate of the mean matrix, M ˆ such that the bottom-most row of the matrix M ˆ can be written as a linear combination of weights, β, of the other L − 1 rows. βˆ provides an approximate linear relationship between η(t; γ) and its L − 1 ˆ t − 1, . . . , t − (L − 1)). For notational brevity, we will past values. We call this linear estimator, η(t; ˆ := η(t; ˆ t − 1, . . . , t − (L − 1)). denote η(t) ˆ Using the estimator η(t), we propose a standard two-step process to estimate the moving average component of the time series: (iii) estimate the moving average residuals using the following ˆ such that ˆ = [Wij ], where Wij = Xij − η(i ˆ + (j − 1)L); (iv) find a set of weights, θ, estimator, W ˆ the bottom-most row of the matrix W can be written as a linear combination of the other L − 1 rows. Note that if the focus is to produce an estimate for the underlying deterministic process, η(t), ˆ and the past L − 1 observed values to then we can stop after step (ii) and use our linear estimator η(t) predict future values of η(t). Algorithms for step (iv) have a rich history in the time series community, and can be estimated through a variety of standard algorithms. Indeed, the main challenge is normally ˆ. in producing reliable estimates for the residuals, which step (iii) above gives access to through W Hence we focus the remainder of this paper on the algorithm and performance guarantees for our ˆ linear estimator, η(t). An additional subtlety that is important to note is that if the aim is forecasting, and not simply imputation, then a single matrix X will not suffice; any forecasting model we produce will only be able to produce estimates for Xt |It−1 , where t is an integer multiple of L. This can be seen ˆ by noting that the linear estimator, η(t), is fit by regressing the bottom-most row of the matrix ˆ ˆ only contains entries M against the other L − 1 rows. However the bottom most row of M that are integer multiples of L, and so indeed our estimator is only valid for such multiples. To workaround this we simply create L different models: for each k ∈ [L], define the matrix, X(k) , (k) where X = X(i + (j − 1)L + (k − 1)), and produce βˆ(k) as described above. We then have an ij
ˆ estimator, η(t), for all t ∈ {L, . . . , T }. C.2
Choosing the number of rows, L.
Proposition E.4 suggests that we should pick L to be as large as possible, to ensure that the linear relationship in the matrix M(k) holds. At the same time, Theorem G.1 requires that L is in o(N ). Thus it suffices to let N = L1+ for any > 0: N = L2 being a choice. C.3
Bounded entries transformation.
Several of our results, as well as the algorithm we propose, assume that the observation matrix is (k) bounded such that Xij ≤ 1. For any data matrix, we can achieve this by using the following pre-processing transformation: suppose the entries of X(k) belong to an interval [a, b]. Then, one can first pre-process the matrix X(k) by subtracting (a + b)/2 from each entry, and dividing by (b − a)/2 to enforce that the entries lie in the range [−1, 1]. The reverse transformation, which can be applied at the end of the algorithm description above, returns a matrix with values contained in the original range. Specifically, the reverse transformation equates to multiplying the end result by (b − a)/2 and adding by (a + b)/2. 9
C.4
Choosing the threshold, µ.
Here, we discuss several approaches to choosing the parameter µ for the singular value thresholding. We propose a data driven train/validate approach. Precisely, we reserve a portion of the learning period matrix for validation, and use the rest of the pre-intervention data to produce an estimate βˆ(k) for each of the finitely many choices of µ (s1 , . . . , sL−1 ). Using each estimate βˆ(k) , produce the corresponding validation set predictions. Then, select the µ that achieves the minimum MSE with respect to the observed data in the validation set. On the other hand, if the practitioner would prefer smoother estimates which remain robust to missing data, then one recommended approach is to estimate the mean matrix using all rows of the matrix Y(k) and then using the last row of the estimated means for validation. In our experiments, we use this latter approach to produce robust estimates. Moreover, one can also employ a universal threshold that adaptively adjusts itself depending on the structure of M(k) . [11] provides one universal threshold. We use that to motivate another universal threshold which works well in our setting: A universal threshold. Since the singular values associated with the data matrix Y(k) contain both signal and noise, we aim to distinguish the useful information from the unwanted randomness so as to avoid overfitting our model to the idiosyncrasies of the dataset, preferably through a data-driven process. In that spirit, we prescribe a universal threshold that adaptively discovers the “correct” level of information within the singular values, depending on the particular structure of the unknown M(k) . For any choice of η ∈ (0.1, 1), we claim that setting q µ = (2 + η) N (Cˆ 2 σ ˆ 2 pˆ + pˆ(1 − pˆ)) (5) produces an estimator with strong theoretical properties for both imputation and forecasting. Here, ˆ pˆ, and σ C, ˆ 2 are estimates of the underlying values C, p, and σ 2 , respectively. For our purposes, pˆ and σ ˆ 2 denote the (unbiased) maximum likelihood estimates of their respective latent parameters. Furthermore, if the underlying stochastic process is a mixture of harmonic and trend time series, then one should set Cˆ = 1 (in which case C = ψ0 = 1).
10
D D.1
Popular Time Series as Instances of our Model. Stationary Autoregressive Moving Average process (ARMA).
The ARMA(p, q) model with coefficients α = (α1 , . . . , αp ), β = (β1 , . . . , βq ) obeys p q X X X(t) = µ + Xt−i αi + ε(t − j)βj + ε(t). i=1
(6)
j=1
Re-stating the ARMA(p, q) model in lag operator form, we have (1 − α1 L − α2 L2 − · · · − αp Lp )X(t) = µ + (1 + β1 L + β2 L2 + · · · + βq Lq )ε(t),
(7)
k
where L is the lag operator, i.e. L X(t) = X(t − k). By restricting our attention to stationary ARMA(p, q) processes, we know that all the roots of the polynomial (1 − α1 L − α2 L2 − · · · − αp Lp ) = 0 lie outside the unit circle, and hence by [? ] there exists a polynomial, φ(L) such that φ(L) = (1 − α1 L − α2 L2 − · · · − αp Lp )−1 (1 + β1 L + β2 L2 + · · · + βq Lq ) 2
= φ0 + φ1 L + φ2 L + . . .
(8) (9)
P∞
where the coefficients of φ(L) are absolutely summable, j=0 |φj |< ∞. We thus have µ + φ(L)ε(t). (10) X(t) = (1 − α1 − α2 − · · · − αp ) By setting γ = (1−α1 −αµ2 −···−αp ) , µ η(t; γ) = , ∀t. (11) (1 − ψ1 − ψ2 − · · · − ψp ) In addition, by setting ψi = φi , and observing that η is trivially a Lipschitz function, we have that a stationary ARMA(p, q) belongs to F T . Note that it is a standard result in the literature that a stationary ARMA process has an absolutely summable M A(∞) representation [19], the exposition above is provided for completeness. D.2
Periodic.
A periodic or mixture of finite frequency model obeys K X X(t) = θk sin(2πωk t) + ρk cos(2πωk t) + ε(t).
(12)
k=1
By setting γ = ((θk , ρk , ωk ))1≤k≤K η(t; γ) =
K X
θk sin(2πωk t) + ρk cos(2πωk t).
(13)
k=1
In addition, by setting ψ0 = 1, ψi = 0 for i > 0, and verifying that η is a Lipschitz function, we have that this model class belongs to F T . D.3
Trend.
This class of time series obeys X(t) =
L X
γl t−` +
Q X
log(γq t) + ε(t),
(14)
q=1
l=0
where ` > 0. By setting γ = (γl )1≤l≤L+Q η(t; γ) =
L X
γl t
−`
+
Q X
log(γq t).
(15)
q=1
l=0
Just as in the periodic model case, we set ψ0 = 1, ψi = 0 for i > 0. By imposing that η(t; γ) is an additive mixture of sub-linear functions, it can be verified to be a Lipschitz function, implying that the model class belongs to F T . 11
D.4
Mixture models.
An additive mixture of a stationary ARMA, periodic and trend model obeys X(t) = µ +
p X
Xt−i αi +
i=1
q X
ε(t − j)βj +
j=1
K X
θk sin(2πωk t) + ρk cos(2πωk t) +
k=1
L X
γl t−` +
l=0
Q X
log(γq t) + ε(t).
q=1
(16) By the exposition above, this model can be re-written as X(t) = c +
K X
θk sin(2πωk t) + ρk cos(2πωk t) +
k=1
where c = η(t; γ) =
µ (1−ψ1 −ψ2 −···−ψp ) .
L X l=0
−`
γl t
+
Q X
log(γq t) + φ(L)ε(t),
(17)
q=1
Setting γ = [c, (θk , ρk , ωk )1≤k≤K , (γk )1≤k≤L+Q ]
Q K L X X X µ + θk sin(2πωk t) + ρk cos(2πωk t) + γl t−` + log(γq t) (1 − ψ1 − ψ2 − · · · − ψp ) q=1 k=1
l=0
(18) η(t; γ) remains Lipschitz, as it can be verified that Lipschitz functions form a vector space. Note that the latent parameter vector, γ, of the additive mixture is formed by simply appending the latent parameters for each component of the mixture. Since φ(L) is absolutely summable and γ remains finite, the mixture model is verified to belong to F T .
12
E
Modeling: Low-Rank Approximation (Continued from Section 2)
E.1
A latent variable model.
Proposition E.1. Given a time series model, X[1: T ], per F T , let M be defined as described above. Then there exists: (a) latent parameters θi ∈ Rd1 , ρj ∈ Rd2 such that d1 and d2 are both finite; (b) a function f : Rd1 × Rd2 → R such that Mij = f (θi , ρj ) for all i ∈ [L], j ∈ [N ]; and (c) f is Lipschitz in θ. E.1.1
Time series matrix is approximately low rank.
The structure of M is determined by the Lipschitz function f . By studying particular smoothness properties of f , we argue that M is a low rank or an approximately low rank matrix in those cases. We begin by showing if η(t; γ) is an additive mixture of sublinear trends, we arrive at the following Proposition. Proposition E.2. Given a time series model, X[1: T ], per F T , let M be defined as in Proposition E.1 with L < N . If the following condition holds: (a) η(t; γ) as defined in F T has the following −α property dη(t) for some α > 0 and C1 > 0, then there exists a L × N matrix M(r) = dt = C1 t (r)
[Mij ] = [f (p(θi ), ρj )] where the following two conditions hold: (b) P(θ) := {p(θi ) : i ∈ [L]} ⊂ {θ1 , . . . , θL } where |P(θ)| = r = o(L); (c) M , M(r) have the following relationship,
M − M(r) ≤ C2 L−ε/2 , (19) max where C1 , C2 are positive constants and for some > 0. If η(t; γ) is an additive mixture of harmonics, we have, Proposition E.3. Given a time series model, X[1: T ], per F T , let M be defined as in Proposition PK E.1 with L < N . Let the following condition on η(t; γ) holds: (a) η(t; γ) = k=1 θk sin(2πωk t) + (r) ρk cos(2πωk t). Then, there exists a L × N matrix M(r) = [Mij ] = [f (p(θi ), ρj )] where the following two conditions hold: (b) P(θ) := {p(θi ) : i ∈ [L]} ⊂ {θ1 , . . . , θL } where |P(θ)| = r = Lε for some > 0; (c) M , M(r) have the following relationship,
M − M(r) ≤ C1 L−ε , (20) max where C1 is a positive constant. It is not hard to see that Propositions E.2 and E.3 suggests that as we collect more data, M(r) becomes an increasingly good entry-wise approximation of M . Corollary E.1. Given a time series model, X[1: T ], per F T , let M and M(r) be defined as above. Then for some > 0
1
M − M(r) 2 = O 1 . (21) F LN L We can now motivate the development of a unified forecasting algorithm through the following key proposition. Proposition E.4. Given a time series model, X[1: T ], per F T , let M be defined as in Proposition E.1 with L < N , i.e. there exists latent parameters θi = i and ρj = [(j − 1)L, γ] such that (r) Mij = f (θi , ρj ) for all i ∈ [L], j ∈ [N ]. Suppose there exists a M(r) , where [Mij ] = [f (p(θi ), ρj )] such that the following conditions hold: (a) |P(θ)| = r = o(L)
where P(θ)
:= {p(θi ) : i ∈ [L]} ⊂ {θ1 , . . . , θL }. (b) M , M(r) have the following relationship, M − M(r) max ≤ δ for some δ > 0. Then, there exists a row ` where 1 ≤ ` < L, such that √ kML − M` k ≤ 2N δ. 13
Observe that the M(r) induced by M in Propositions E.2 and E.3, satisfy the conditions on M(r) in Proposition E.4 (the validity of these Propositions remains intact even when ARMA is added to the processes under consideration). Therefore, given a sufficient amount of data, the preceding discussion motivates the development of a robust, linear forecasting algorithm for a large sub-class of time series models within F T , once we represent them in the prescribed matrix form X. Specifically, since M is shown to be well approximated by a low rank matrix, M(r) , this prompts us to develop an algorithm centered around singular value thresholding–which finds a good low rank approximation of our data matrix while simultaneously de-noising corrupted observations and imputing missing entries–followed by linear regression. This leads us to the following algorithmic description.
F F.1
Modeling Analysis (Proofs) Proof of Proposition E.1
Proposition (E.1). Given a time series model, X[1: T ], per F T , let M be defined as described above. Then there exists: (a) latent parameters θi ∈ Rd1 , ρj ∈ Rd2 such that d1 and d2 are both finite; (b) a function f : Rd1 × Rd2 → R such that Mij = f (θi , ρj ) for all i ∈ [L], j ∈ [N ]; and (c) f is Lipschitz in θ. Proof. Recall that η(t; γ) is a deterministic function, which depends only on the time argument t and the fixed parameters γ. From Equation (2), we have Mij = η(t; γ), where t = i + (j − 1)L. As a result Mij = η(i + (j − 1)L; γ) = f (θi , ρj ) where the latent row parameter, θi = (i), and the latent column parameter, ρj = [(j − 1)L, γ]. Since η(t) is Lipschitz in t, and θi simply encodes an index in time, we have that f is Lipschitz in θ. Let |γ| be the dimension of the vector γ. d1 and d2 are both finite since d1 = 1 and d2 = |γ|+1. Effectively, the latent parameters encode the amount of shift in the time argument to η(·; γ), so as to obtain the appropriate entry in the matrix. F.2
Proof of Proposition E.2
Proposition (E.2). Given a time series model, X[1: T ], per F T , let M be defined as in Proposition E.1 with L < N . If the following condition holds: (a) η(t; γ) as defined in F T has the following −α property dη(t) for some α > 0 and C1 > 0, then there exists a L × N matrix M(r) = dt = C1 t (r)
[Mij ] = [f (p(θi ), ρj )] where the following two conditions hold: (b) P(θ) := {p(θi ) : i ∈ [L]} ⊂ {θ1 , . . . , θL } where |P(θ)| = r = o(L); (c) M , M(r) have the following relationship,
M − M(r) ≤ C2 L−ε/2 , (22) max where C1 , C2 are positive constants and for some > 0. Proof. We construct M(r) in two steps: Step 1: For j < Lε/α , where ε ∈ (0, α), let the j-th row of M(r) be equal to the j-th row of M . Step 2: For rows j ≥ Lε/α , the sub-matrix we construct is an adaptation of the argument in [12]. Let f refer to the latent variable model of M , and let R and D refer to the set of row and column parameters of the sub-matrix of M corresponding to its last L − j + 1 rows, {θLε/α · · · θL } and {ρ1 · · · ρN } respectively. Through a straightforward application of the Mean Value Theorem, observing that the derivative is decreasing in t, and by condition (a), we have that for all θ1 , θ2 ∈ R (with appropriately defined 14
constants) |f (θ1 , ρj ) − f (θ2 , ρj )| ≤ f 0 (θ3 ) · |θ1 − θ2 | 0
ε/α
≤ η (L
≤ (C1 L
−ε
≤ C2 L
) · |θ1 − θ2 |
ε/α −α
)
· |θ1 − θ2 |
· |θ1 − θ2 |,
(23) (24) (25) (26)
where f 0 and η 0 are the derivatives with respect to θ and t, respectively, and θ3 ∈ (θ1 , θ2 ). Define a partition, P (ε), of R into continuous intervals of length Lε/2 . Observe that since 0 0 θi = i, for any A ∈ P (ε), whenever θ, θ ∈ A , we have |θ − θ |≤ Lε/2 . It follows that |P (ε)|= 1 1 (L − Lε/α )/Lε/2 = L1−ε/2 − Lε( α − 2 ) . Let T be a subset of R that is constructed by selecting exactly one element from each partition in P (ε). That is, |T |= |P (ε)|. Therefore, it follows that for each θ ∈ R, we can find p(θ) ∈ T so that θ and p(θ) belong to the same partition of P (ε). Hence, we can define the (i, j)-th element of M(r) in the following way: (1) for all j < Lε/α , let (r)
(r)
p(θi ) = θi such that Mij = f (θi , ρj ); (2) for j ≥ Lε/α , let Mij = f (p(θi ), ρj ). Consequently,
M − M(r) ≤ max |f (θi , ρj ) − f (p(θi ), ρj )| max i∈[L]
≤
max i∈[j≥Lε/α ]
≤
max i∈[j≥Lε/α ]
|f (θi , ρj ) − f (p(θi ), ρj )| |θi − p(θi )|L−ε C2
≤ C2 L−ε/2 . Now, if θi and θj belong to the same element of P (ε), then p(θi ) and p(θj ) are identical. Therefore, there are at most |P (ε)| distinct rows in the last L − Lε/α rows of M(r) where |P (ε)|= L1−ε/2 − 1 1 Lε( α − 2 ) . Let P(θ) := {p(θi ) : i ∈ [L]} ⊂ {θ1 , . . . , θL }. By construction, since ε ∈ (0, α), we have that |P(θ)| = Lε/α + |P (ε)| = o(L). This completes the proof for when K = 1.
In general, F.3
Proof of Proposition E.3
Proposition (E.3). Given a time series model, X[1: T ], per F T , let M be defined as in Proposition PK E.1 with L < N . Let the following condition on η(t; γ) holds: (a) η(t; γ) = k=1 θk sin(2πωk t) + (r) ρk cos(2πωk t). Then, there exists a L × N matrix M(r) = [Mij ] = [f (p(θi ), ρj )] where the following two conditions hold: (b) P(θ) := {p(θi ) : i ∈ [L]} ⊂ {θ1 , . . . , θL } where |P(θ)| = r = Lε for some > 0; (c) M , M(r) have the following relationship,
M − M(r) ≤ C1 L−ε , (27) max where C1 is a positive constant. Proof. Let x := {x1 , . . . , xK } denote the set of periods associated with each harmonic in η(t; γ). Let xlcm denote the least common multiple for the set x. Then, we have that η(t; γ) = η(g(t); γ) where g(t) = t mod xlcm . Let R and D refer to the set of row and column parameters, {θ1 · · · θL } and {ρ1 · · · ρN }, respectively. Since η(t; γ) = η(g(t); γ), rather than considering the set of time steps {1, . . . , T }, it suffices to consider instead the set {g(1), . . . , g(T )} ∈ [0, xlcm ]. Because θi is simply an index to a time step, ¯ where θi ∈ R ¯ ⊂ [0, xlcm ]; it is sufficient to consider an alternate, compact set of row parameters, R, ¯ crucially, we highlight that R is independent of L. From here onwards, the arguments we make follow directly from arguments in [12]. We provide it ¯ where the following holds: here for completeness. For any δ > 0, we first define a partition P (δ) of R 15
0
for any A ∈ P (δ), whenever θ, θ are two points such that θ, θ0 ∈ A, we have |f (θ, ρj ) − f (θ0 , ρj )|≤ δ for all j ∈ [N ]. Due to the Lipschitzness property of the function f (with Lipschitz constant L) and the compactness of [0, xlcm ], it can be shown that |P (δ)|≤ C([0, xlcm ], L)δ −1 , where C([0, xlcm ], L) is a constant that depends only on [0, xlcm ] and L. ¯ that is constructed by selecting exactly one element from each partition Let T be a subset of R ¯ we can find p(θ) ∈ T so that in P (δ), i.e. |T |= |P (δ)|. Therefore, it follows that for each θ ∈ R, θ and p(θ) belong to the same partition of P (δ). Let M(r) be the matrix whose (i, j)th element is f (p(θi ), ρj ). Then by construction,
M − M(r) ≤ δ. max
Now, if θi and θj belong to the same element of P (δ), then p(θi ) and p(θj ) are identical. Therefore, there are at most |P (δ)| distinct rows in M(r) . Let P(θ) := {p(θi ) : i ∈ [L]} ⊂ {θ1 , . . . , θL }. By construction, we have that |P(θ)| = |P (δ)|. Choosing δ = C([0, xlcm ], L)L−ε , then |P(θ)| = Lε . This completes the proof. F.4
Proof of Proposition E.4
Proposition (E.4). Given a time series model, X[1: T ], per F T , let M be defined as in Proposition E.1 with L < N , i.e. there exists latent parameters θi = i and ρj = [(j − 1)L, γ] such that (r) Mij = f (θi , ρj ) for all i ∈ [L], j ∈ [N ]. Suppose there exists a M(r) , where [Mij ] = [f (p(θi ), ρj )] such that the following conditions hold: (a) |P(θ)| = r = o(L)
where P(θ)
:= {p(θi ) : i ∈ [L]} ⊂ {θ1 , . . . , θL }. (b) M , M(r) have the following relationship, M − M(r) max ≤ δ for some δ > 0. Then, there exists a row ` where 1 ≤ ` < L, such that √ kML − M` k ≤ 2N δ. Proof. Assume we have access to data from X[1: T + r − 1]. Let us first construct a matrix with ˜ = [M ˜ ij ] = [η(i+j −1; γ)], of dimension L×(T +r−1). Applying identical overlapping entries, M ˜ ij = f (θ˜i , ρ˜j ) with θ˜i = i and ρ˜j = [j − 1, γ]. By arguments made in Proposition E.1, we have M ˜ are constant, i.e. construction, the skew-diagonal entries from left to right of M ˜ k−j,i+j : 1 ≤ k − j ≤ L, 1 ≤ i + j ≤ T + r − 1}. ˜ ki := {M M (28) ˜ . Specifically, for Under this setting, we note that the columns of M are subsets of the columns of M all 0 ≤ j < N and k ≤ L, ˜ k,jL+1 = Mk,j+1 . M (29) ˜ exists within M . Hence, M ˜ ij = Mi0 ,j 0 for some By construction, observe that every entry within M 0 0 i , j , and ˜ f (θi , ρ˜j ) − f (p(θ˜i ), ρ˜j ) = |f (θi0 , ρj 0 ) − f (p(θi0 ), ρj 0 )| ≤ δ, where the inequality follows from condition (b). In light of this, just as we defined M(r) with respect ˜ (r) from M ˜ . Specifically, the (i, j)th element of M ˜ (r) is to M , we define M ˜ (r) = f (p(θ˜i ), ρ˜j ). M ij
(30)
By condition (a) and applying the Pigeonhole Principle, we observe that within the last r + 1 rows of M(r) , at least two rows are identical. Without loss of generality, let these two rows be denoted (r)
(r)
(r)
(r)
as ML−r1 = [ML−r1 ,i ]i≤N and ML−r2 = [ML−r2 ,i ]i≤N , respectively, where r1 ∈ {1, . . . , r − 1}, r2 ∈ {2, . . . , r}, and r1 < r2 . Since θ˜i = i = θi , we trivially have that p(θ˜i ) = p(θi ). Consequently, ˜ (r) are also identical, i.e. for all i ≤ T + r − 1, it must be the case that the same two rows in M ˜ (r) ˜ (r) M L−r1 ,i = ML−r2 ,i . 16
(31)
Using this fact, we have that for all i ≤ T + r − 1, ˜ ˜ (r) ˜ (r) ˜ L−r ,i − M ˜ (r) + M ˜ L−r ,i − M ˜ (r) + M ˜ L−r ,i ≤ M ML−r1 ,i − M 2 1 2 L−r1 ,i − ML−r1 ,i L−r2 ,i L−r1 ,i ≤ 2δ,
(32)
˜ (r) . Additionally, by the where the last inequality follows from (31) and the construction of M ˜ as described above by (28), we necessarily have the following two skew-diagonal property of M equalities: ˜ Li = M ˜ L−r ,r +i M 1 1 ˜ L−∆ ,i = M ˜ L−r ,r +i , M r
2
1
(33) (34)
where ∆r = r2 − r1 . Thus, by (32), (33), and (34), we obtain for all i ≤ T , ˜ ˜ L−∆ ,i = M ˜ L−r ,r +i − M ˜ L−r ,r +i MLi − M r 1 1 2 1 ≤ 2δ.
(35)
Thus, applying (29) and (35), we reach our desired result, i.e. for all i ≤ N , |MLi − ML−∆r ,i | ≤ 2δ.
(36)
This completes the proof.
17
G
Main Results (Addendum to Section 4)
ˆ − and M ˆ (k) . In particular, In this section, we present the theoretical properties of our estimators, M L (k) we demonstrate that Step 1 of our algorithm (detailed in Section 3) accurately imputes missing entries while Step 2 produces a reliable forecasting estimator. We begin, however, with a brief reminder of our key operating assumptions and frequently used notations. To begin, we assume that the noise is drawn from a zero-mean distribution with bounded second moments, i.e. for all i, j and h, (h)
E[ij ] = 0,
(h)
Var(ij ) ≤ σ 2 .
and (k)
− Further, we allow each entry within X(k) = [Xij ]i 0, 2t2 P P(|S − E[S]| ≥ t) ≤ 2 exp − n . 2 i=1 (bi − ai ) Theorem H.4. Theorem 3.4 of [11] Take any two numbers m and n such that 1 ≤ m ≤ n. Suppose that A = [Aij ]1≤i≤m,1≤j≤n is a matrix whose entries are independent random variables that satisfy, for some δ 2 ∈ [0, 1], E[Aij ] = 0,
E[A2ij ] ≤ δ 2 ,
and
|Aij | ≤ 1
a.s.
Suppose that δ 2 ≥ n−1+ζ for some ζ > 0. Then, for any η ∈ (0, 1), √ 2 P(kAk ≥ (2 + η)δ n) ≤ C(ζ)e−cδ n , where C(ζ) depends only on η and ζ, and c depends only on η. The same result is true when m = n and A is symmetric or skew-symmetric, with independent entries on and above the diagonal, all other assumptions remaining the same. Lastly, all results remain true if the assumption δ 2 ≥ n−1+ζ is changed to δ 2 ≥ n−1 (log n)6+ζ . Remark H.4.1. The proof of Theorem H.4 can be found in [11] under Theorem 3.4. 20
H.2
Useful lemmas.
Pm Lemma H.1. Let A = i=1 σi xi yiT be the singular value decomposition of A. Fix any π > 0 such that µ = (1 + π)kA − Bk, and let S = {i : σi ≥ µ}. Define X ˆ= B σi xi yiT . i∈S
Then
ˆ
B − B ≤ (2 + π)kA − Bk. Proof. By the definition of µ and hence the set of singular values S, we have that
ˆ
ˆ
− A + kA − Bk
B − B ≤ B = max σi + kA − Bk i∈S /
≤ (1 + π)kA − Bk + kA − Bk = (2 + π)kA − Bk. Lemma H.2. Lemma 3.5 of [11] Let A = Fix any π > 0 and define
Pm
i=1
σi xi yiT be the singular value decomposition of A.
X
ˆ= B
σi xi yiT .
i:σi >(1+π)kA−Bk
Then
ˆ
B − B ≤ K(π)(kA − BkkBk∗ )1/2 , F
p √ where K(π) = (4 + 2π) 2/π + 2 + π. Proof. The proof can be found under Lemma 3.5 of [11]. H.3
Preliminaries
To simplify the following exposition, we assume that |Mij | ≤ 1 and |Xij | ≤ 1. Recall that all entries P∞ (h) of the last row are observed such that YL = XL = ML + h=0 ψh L . On the other hand, all other entries are observed independently with some arbitrary probability p. Specifically, for all i ∈ [L − 1] and j ∈ [N ], we define Yij = Xij if Xij is observed and Yij = 0 otherwise. Under this model, observe that for all i ∈ [L − 1] and j ∈ [N ], E[Yij ] = pMij . Additionally, ∞ X
(h)
Var(ψh ij ) ≤ C 2 σ 2 < ∞,
h=0
since the sequence ψh is assumed to be absolutely summable. Moreover, by the zero-mean assumption of the noise, ∞ X
(h)
E[ψh ij ] = 0.
h=0
21
Consequently, given the above two results, we are able to interchange the variance operator, Var(·), and the infinite sum such that ∞ X (h) Var(Xij ) = Var ψi ij h=0
=
∞ X
(h)
Var(ψh ij )
h=0
≤ C 2 σ2 . As a result, Var(Yij ) = E[Yij2 ] − (E[Yij ])2 2 = pE[Xij ] − (pMij )2 2 ≤ p(C 2 σ 2 + Mij ) − (pMij )2 2 = C 2 pσ 2 + pMij (1 − p)
≤ C 2 pσ 2 + p(1 − p). Recall that pˆ denotes the proportion of observed entries in Y − and σ ˆ 2 represents the (unbiased) sample variance computed from the last row. Given the information above, we define three events E1 , E2 , and E3 as E1 := {|ˆ p − p| ≤ ωp/z}, 2 E2 := { σ ˆ − σ 2 ≤ ωσ 2 /z}, p
E3 := { Y − − pM − ≤ (2 + ω/2) N q}, 2
where q = C 2 σ 2 p + p(1 − p); for reasons that will be made clear later, we choose z = 60( σ σ+1 2 ). By Bernstein’s Inequality, we have that P(E1 ) ≥ 1 − 2e−c1 (L−1)N p , for appropriately defined constant c1 . By Hoeffding’s Inequality, we obtain P(E2 ) ≥ 1 − 2e−c2 N σ
2
for some positive constant c2 . Moreover, by Theorem H.4, P(E3 ) ≥ 1 − C4 e−c3 N q as long as q = C 2 σ 2 p + p(1 − p) ≥ N −1+ζ for some ζ > 0. In other words, p(C 2 σ 2 + 1) ≥ p(C 2 σ 2 + (1 − p)) ≥ N −1+ζ . Consequently, assuming the event E3 occurs, we require that p ≥ H.4
N −1+ζ C 2 σ 2 +1
for some ζ > 0.
Imputation analysis.
Here, we prove that Step 1 of our algorithm accurately imputes missing entries. Lemma H.3. Let M be defined as before and let M − = [Mij ]i 0. Proof. The following derivation is an adaptation for the proof of Theorems 1.1 and 2.7 of [11]. However, we provide it here for completeness. Let π > 0 be defined by the relation p
(1 + π) Y − − pM − = (2 + η) N qˆ, where qˆ = C σ ˆ 2 pˆ + pˆ(1 − pˆ). Observe that if E1 , E2 , and E3 happen, then p (2 + η) T (C 2 σ ˆ 2 pˆ + pˆ(1 − pˆ)) p 1+π ≥ (2 + η/2) T (C 2 σ 2 p + p(1 − p)) p p (2 + η) 1 − η/z (1 − η/z)C 2 σ 2 p + p(1 − p − ηp/z) p ≥ (2 + η/2) C 2 σ 2 p + p(1 − p) s p (2 + η) 1 − η/z η C 2 σ2 + p = 1− 2 + η/2 z C 2 σ2 + 1 − p r p (2 + η) 1 − η/z η C 2 σ2 + 1 ≥ 1− 2 + η/2 z C 2 σ2 p r (2 + η) 1 − η/z η = 1− 2 + η/2 60 2+η η ≥ 1− 2 + η/2 60 η η ≥ 1+ 1− 5 60 η 1 ≥1+ − . 5 50 Let K(π) be the constant defined in Lemma H.2. Since η ∈ (0.1, 1), π ≥ 10η−1 > 0 and 50 p √ K(π) = (4 + 2π) 2/π + 2 + π p p p √ ≤ 4 1 + π 2/π + 2 2(1 + π) + 2(1 + π) p √ √ = (4 2/π + 3 2) 1 + π √ ≤ C1 1 + π 23
where C1 is a constant that depends only on the choice of η. By Lemma H.2, if E1 , E2 and E3 occur, then
2
ˆ−
pM − pM − ≤ C2 (1 + π) Y − − pM −
pM − ∗
ˆ F p
≤ C3 N qˆ pM − ∗ p
≤ C4 N q pM − ∗ for an appropriately defined constant C4 . Therefore,
2
2
ˆ−
ˆ−
− M − p2 M − M − ≤ C5 pˆ2 M F F
2
2
ˆ− − p − p)2 M − F pM − pM + C5 (ˆ ≤ C5 ˆ F p
≤ C6 N q pM − ∗ + C5 (ˆ p − p)2 (L − 1)N. In general, since |Mij | and |Yij | ≤ 1,
ˆ −
ˆ−
+ M − F
M − M − ≤ M F F
p
ˆ − ≤ |S| M + M − F p
|S|
Y − + M − = F pˆ
≤ (L − 1)3/2 N Y − + M − F p p ≤ (L − 1)3/2 N (L − 1)N + (L − 1)N ≤ 2(L − 1)2 N 3/2 . Let E := E1 ∩ E2 ∩ E3 . Applying DeMorgan’s Law and the Union Bound, P(E c ) = P(E1c ∪ E2c ∪ Eec ) ≤ P(E1c ) + P(E2c ) + P(E3c ) ≤ C7 e−c8 N (p(L−1)+σ
2
+q)
= C7 e−c8 φN ,
(42)
2
where we define φ := p(L − 1) + σ + q and C7 , c8 are appropriately defined. Observe that p(1−p) E(ˆ p−p)2 = (L−1)N . Thus, by the law of total probability and noting that P(E) ≤ 1 (for appropriately defined constants),
2
2
2 h i h i
ˆ−
ˆ−
ˆ−
E M − M − ≤ E M − M − | E + E M − M − | E c P(E c ) F F p −F −1 −1 4 3 −c8 φN
≤ C6 p N q M ∗ + C5 p + C9 (L − 1) N e
−1/2 1/2 = C6 p N (Cσ 2 + (1 − p))1/2 M − + C5 p−1 + C9 (L − 1)4 N 3 e−c8 φN . ∗
Normalizing by (L − 1)N , we obtain √ 0 − k C5 ˆ − ) ≤ C6 q kM √ ∗+ MSE(M + C10 e−c11 φN , (L − 1)N p (L − 1) N p where q 0 = C 2 σ 2 + 1 − p. We invoke Lemma H.3 to bound kM − k∗ to obtain p √ C7 rN (L − 1) C7 (L − 1) N δ 1 − ˆ √ √ MSE(M ) ≤ + +O (L − 1)N (L − 1) N p (L − 1) N p √ C7 δ C7 r 1 = √ +p +O p (L − 1)N (L − 1)p C7 L−1 C7 L−2 1 ≤ √ + √ +O , p p (L − 1)N for some 1 , 2 > 0. 24
H.6
Forecasting analysis.
We now analyze our estimator’s ability to forecast and measure its performance with respect to the (normalized) `2 error. P∞ (h) Lemma H.4. Let M be defined as before. Let L be defined as L = YL − ML = h=0 ψh L . Let βˆ be the minimizer of Step 2 in the Algorithm. Then for any µ ≥ 0, ∞
√
X
(h) ˆ − ˆ L |ψh | L + M − − M
.
ML − M
≤ 2N δ + 2 h=0
Proof. Recall M − = [Mij ]i 0. Clearly, we can express M` = (M − )T β ∗ ,
(43)
where β ∗ ∈ RL−1 is a 1-sparse vector
with a single nonzero component of value 1 in the `th index.
− T ˆ ˆ Since β minimizes YL − (M ) v for any v ∈ RL−1 , we subsequently have
∞
X
(h) ˆ L = (YL − ˆ L ψh L ) − M
ML − M
h=0
∞
X
(h) ˆ − )T βˆ |ψh | L ≤ YL − (M
+ h=0 ∞
X
(h)
ˆ − )T β ∗ ≤ YL − (M |ψh | L
+ h=0
∞ ∞
X X
(h) ˆ − )T β ∗ = (ML + ψh L ) − (M |ψh |kL k
+
h=0
h=0
∞
X
(h) ˆ − )T β ∗ ≤ ML − (M |ψh | L .
+2 h=0
By Proposition E.4 and (43), we obtain
ˆ − )T β ∗ ˆ − )T β ∗
ML − (M
≤ kML − M` k + M` − (M
√
ˆ − )T β ∗ ≤ 2N δ + M` − (M
√
ˆ − )T β ∗ = 2N δ + (M − − M
√
ˆ − ≤ 2N δ + M − − M
, where the last inequality follows from the fact that kβ ∗ k2 = 1. Combining the above results completes the proof. H.7
Proof of Theorem 4.2
Theorem (4.2). Let M be defined as before with δ = L− for any > 0. Suppose p = Ω(N −1+ζ ) for some ζ > 0 and C is known. Then for any k ∈ [L] and using µ as defined in (5) √ C1 2 2 ˆ (k) ) ≤ √ RMSE(M (C σ + (1 − p))1/2 + O(1/L/2 ) + O(1/ N ), L p where C1 is a universal positive constant. 25
Proof. For notational simplicity, let us temporarily drop the superscript k for this derivation. Applying Lemma H.4 and taking expectations, we obtain ∞
(a) √
X
ˆ−
(h) ˆ L E ML − M − M − + 2 |ψh |E L
≤ 2N δ + E M h=0 (b)
≤
√
√
ˆ−
2N δ + 2C N σ + E M − M − .
(44)
We are able to interchange the countably infinite sum and the expectation operator in (a) via Fubini’s (h) (h) Theorem; and (b) follows since for each h, εij are i.i.d random variables with Var(εij ) ≤ σ 2 . Let π be defined by the relation p
(1 + π) Y − − pM − = (2 + η) N qˆ, where qˆ = C 2 σ ˆ 2 pˆ + pˆ(1 − pˆ); recall that q = C 2 σ 2 p + p(1 − p). Assuming E1 , E2 , and E3 happens, Lemma H.1 states that
ˆ−
pM − pM − ≤ (2 + π) Y − − pM −
ˆ
≤ 2(1 + π) Y − − pM − p = (4 + 2η) N qˆ p (45) ≤ C1 N q for an appropriately defined constant C1 . Therefore,
ˆ−
ˆ−
p M − M − ≤ C2 pˆ M − M −
ˆ−
p − p)M − ≤ C2 ˆ pM − pM − + (ˆ p
≤ C2 C1 N q + (ˆ p − p)M −
(46)
In general, since |Mij | and |Yij | ≤ 1,
ˆ−
ˆ −
M − M − ≤ M
+ M −
1
Y − + M − pˆ
≤ (L − 1)N Y − + M − p p ≤ (L − 1)N (L − 1)N + (L − 1)N =
≤ 2((L − 1)N )3/2 . Let E := E1 ∩ E2 ∩ E3 . Using the same argument that led to (42), we have P(E c ) ≤ C3 e−c4 φN where we define φ := p(N − 1) + σ 2 + q and C3 , c4 are appropriately defined. Thus, by the law of total probability and observing that P(E) ≤ 1,
h i h i
ˆ−
ˆ−
ˆ−
E M − M − ≤ E M − M − | E + E M − M − | E c P(E c )
C 5 p ≤ E N q + (ˆ p − p)M − | E + C6 ((L − 1)N )3/2 e−c4 φN . (47) p We note that for any non-negative random variable X ≥ 0 and event D such that P(D) ≥ 1/2, E[X | D] ≤
E[X] ≤ 2E[X]. P(D)
Using the fact that P(E) ≥ 1/2 for large enough L and N ,
C p
ˆ−
7 E M − M − ≤ E N q + (ˆ p − p)M − + C6 ((L − 1)N )3/2 e−c4 φN . p 26
(48)
(49)
By Jensen’s Inequality, E|ˆ p − p| ≤
p
Var(ˆ p), thus
p
p − p)M − ≤ p(1 − p). E (ˆ
Putting everything together, r √ C p(1 − p) 7 ˆ L ) ≤ 2δ + 2Cσ + RMSE(M + C8 e−c9 pN q+ p N √ C7 = √ (C 2 σ 2 + (1 − p))1/2 + 2Cσ + O(1/ N ) + O(1/L/2 ). p √
The proof is complete assuming we absorb the second term, 2Cσ, into the first term (since C > 1) and appropriately redefining the corresponding constant (and re-labeling it as C1 ). H.8
Proof of Theorem G.1
Theorem (G.1). Let M be defined as before with δ = L−qfor any > 0. Fix any γ ∈ (0, 1/2) 1 and η ∈ (0.1, 1) such that ∆ = N 2 +γ and µ = (2 + η) N 2γ (Cˆ 2 σ ˆ 2 pˆ + pˆ(1 − pˆ)). Suppose p = Ω(N −2γ ) and C are known. Then for any k ∈ [L], ˆ ¯ (k) ) = O(N − 14 + γ2 ) + O(L− 2 ). RMSE(M L Proof. For notational simplicity, let us temporarily drop the superscript k for this derivation. To establish Theorem G.1, we shall follow the proof of Theorem 4.2, using the block partitioned matrices instead. Recall that τ = N/∆ where ∆ = N 1/2+γ . For analytical simplicity, we define the random variable 1 w.p. p, Dit = 0 otherwise, whose definition will soon prove to be useful. As previously described in Section G.2, for all i ∈ [L − 1] and j ∈ [∆], we define X (k) ¯ (k) = 1 X Xit · Dit ij τ t∈Bj
and X (k) ¯ (k) = p Mit . M ij τ t∈Bj
From here on, to avoid notational fatigue, and without loss of generality, we set k = 1. Define (h) ¯ − = [¯ E ]i