The M4 Competition in Progress: Forecast. Compete. Excel.
Evangelos Spiliotis (National Technical University of Athens, Forecasting & Strategy Unit), Spyros Makridakis (University of Nicosia, Institute for the Future), Vassilios Assimakopoulos (National Technical University of Athens, Forecasting & Strategy Unit)
38th International Symposium on Forecasting, Boulder, Colorado, USA, June 2018
The quest for the holy grail: What do we forecast?
The performance of forecasting methods strongly depends on the
o Domain
o Frequency
o Length
o Characteristics
o ???
of the time series being examined, as well as on various strategic decisions, such as the forecasting horizon, the computation time allowed (complexity) and the relevant information available
The quest for the holy grail: What kind of method should we use?
Too many types of methods and alternatives
o Statistical
o Machine Learning
o Combination
o Judgmental
with contradictory results in the literature
Even if we knew which method is best for the examined application in general, considerable work would still be needed to properly select and parameterize the forecasting model, as well as to pre-process the data
The quest for the holy grail: Is there a golden rule or some best practices?
“ignorance of research findings, bias, sophisticated statistical procedures, and the proliferation of big data, have led forecasters to violate the Golden Rule. As a result, …, forecasting practice in many fields has failed to improve over the past half-century”.
Golden rule of forecasting: Be conservative (Armstrong et al., 2015)
“identify the main determinants of forecasting accuracy considering seven time series features and the forecasting horizon”
‘Horses for Courses’ in demand forecasting (Petropoulos et al., 2014)
“investigate which individual model selection is beneficial and when this approach should be preferred to aggregate selection or combination”
Simple versus complex selection rules for forecasting many time series (Fildes & Petropoulos, 2015)
Evaluating Forecasting Performance: We need benchmarks...
New methods and forecasting approaches must perform well on well-known, diverse and representative data sets
This is exactly the scope of forecasting competitions: learn how to improve forecasting accuracy, and how such learning can be applied to advance the theory and practice of forecasting
✓ Encourage researchers and practitioners to develop new and more accurate forecasting methods
✓ Compare popular forecasting methods with new alternatives
✓ Document state-of-the-art methods and forecasting techniques used in academia and industry
✓ Identify best practices
✓ Set new research questions and try to provide proper answers
Evaluating Forecasting Performance: Competitions will always be helpful...
➢ There will always be features of time series forecasting not previously studied under competition conditions
➢ There will always be new methods to be evaluated and validated
➢ As new performance metrics and statistical tests come to light, the results of previous competitions will always be put into question
➢ Technological advances affect the way forecasting is performed and enable more advanced, complex and computationally intensive approaches that were previously inapplicable
➢ Exploding data volumes influence forecasting and its applications (more data to learn from, unstructured data sources, abnormal time series, new forecasting needs)
The history of time series forecasting competitions Establishing the idea of forecasting competitions
Makridakis and Hibon (1979) • No participants • 111 time series (yearly, quarterly & monthly) • 22 methods
Major findings • Simple methods do as well or better than sophisticated ones • Combining forecasts may improve forecasting accuracy • Special events have a negative impact on forecasting performance
The history of time series forecasting competitions: Establishing the idea of forecasting competitions
Automatic forecasting may be useless and less accurate than humans, while combining forecasts is quite risky
G. Jenkins
No one wants that accurate forecasts, nor has enough data to estimate them
G.J.A. Stern
A model (simple data generation process) can perfectly describe and extrapolate your time series if identified and applied correctly
M. B. Priestley
The history of time series forecasting competitions – M1: The first forecasting competition
Makridakis et al. (1982)
• Seven participants
• 1001 time series (yearly, quarterly & monthly)
• 15 methods (plus 9 variations)
• Not real-time
What's new?
• Real participants
• Many accuracy measures
Major findings
• Statistically sophisticated or complex methods do not necessarily provide more accurate forecasts than simpler ones.
• The relative ranking of the performance of the various methods varies according to the accuracy measure being used.
• The accuracy when various methods are combined outperforms, on average, the individual methods being combined and does very well in comparison to other methods.
• The accuracy of the various methods depends on the length of the forecasting horizon involved.
The history of time series forecasting competitions M2: Incorporating judgment
What's new?
• Combine statistical methods with judgment
• Ask questions to the companies involved
• Learn from previous errors and revise next forecasts accordingly
Makridakis et al. (1993)
• 29 time series
• 16 methods (human forecasters, automatic methods and combinations)
• Real time
Major findings • In most cases, forecasters failed to improve statistical forecasts based on their judgment • Simple methods perform better in most of the cases, with the results being in agreement with previous studies
The history of time series forecasting competitions M3: The forecasting benchmark “The M3 series have become the de facto standard test base in forecasting research. When any new univariate forecasting method is proposed, if it does not perform well on the M3 data compared to the results on other published algorithms, it is unlikely to receive any further attention or adoption.” (Kang, Hyndman & Smith-Miles, 2017)
What's new?
• More methods (NNs and FSSs)
• More series
Makridakis and Hibon (2000) • 3003 time series • 24 methods • Not real time
Major findings
• The results of the previous studies and competitions were largely confirmed.
• New methods, such as the Theta of Assimakopoulos & Nikolopoulos (2000), and FSSs, such as ForecastPro, proved their forecasting capabilities
• ANNs were relatively inaccurate
The history of time series forecasting competitions: Modern forecasting competitions
Neural network competitions (NN3, 2006) – Crone, Hibon & Nikolopoulos (2011): 111 monthly M3 series & 59 submissions
✓ No CI (computational intelligence) method outperformed the original M3 contestants
✓ NNs may be inadequate for time series forecasting, especially for short series
✓ No “best practices” identified for utilizing CI methods
Kaggle Competitions – Tourism Forecasting Competition: Athanasopoulos & Hyndman (2010); Web traffic (Wikipedia) competition: Anava & Kuznetsov (2017)
✓ Feedback significantly improves forecasting accuracy by providing motivation and fruitful feedback
✓ Fast results and conclusions
Status quo and next steps: So, what did we learn?
✓ Forecasting and time series analysis are two different things
✓ Models that produce more accurate forecasts should be preferred over those with better statistical properties
✓ Simple models work – especially for short series
✓ Out-of-sample and in-sample accuracy may differ significantly (avoid over-fitting)
✓ Automatic forecasting algorithms work rather well – especially for long time series
✓ Combining methods helps us deal with uncertainty
Status quo and next steps: What would also be useful to learn (or verify) through M4?
✓ What are the “best practices” nowadays?
✓ How have advances in technology and algorithms affected forecasting?
✓ Are there any new methods that could really make a difference?
✓ How about prediction intervals?
✓ What are the similarities and differences between the various forecasting methods, including ML ones?
✓ Are the data of the forecasting competitions representative? Do other, larger datasets support previous findings?
The M4 Competition: The dates
• Competition announced: Nov 1, 2017
• Competition starts: Jan 1, 2018
• Competition ends: May 31, 2018
• Preliminary results: Jun 18, 2018 (today)
• Final results and winners: Sep 28, 2018
• There was also a deadline extension (1 week) to encourage more participation
• Late submissions are not eligible for any prize
The M4 Competition: The dataset (1/2)

| Frequency | Micro | Industry | Macro | Finance | Demographic | Other | Total |
| Yearly | 6,538 | 3,716 | 3,903 | 6,519 | 1,088 | 1,236 | 23,000 |
| Quarterly | 6,020 | 4,637 | 5,315 | 5,305 | 1,858 | 865 | 24,000 |
| Monthly | 10,975 | 10,017 | 10,016 | 10,987 | 5,728 | 277 | 48,000 |
| Weekly | 112 | 6 | 41 | 164 | 24 | 12 | 359 |
| Daily | 1,476 | 422 | 127 | 1,559 | 10 | 633 | 4,227 |
| Hourly | - | - | - | - | - | 414 | 414 |
| Total | 25,121 | 18,798 | 19,402 | 24,534 | 8,708 | 3,437 | 100,000 |
✓ The largest forecasting competition, involving 100,000 business time series, to provide conclusions of statistical significance
✓ High frequency data, including Weekly, Daily and Hourly series
✓ Diverse time series collected from 23 reliable data sources & classified in 6 domains
*Data available at https://www.m4.unic.ac.cy/the-dataset/ or through the M4comp2018 R package
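As a rough illustration of how the published data files can be read, the sketch below assumes a CSV layout like the one distributed for M4 (one row per series, first column the series identifier, the remaining columns the observations, shorter series padded with blanks); the file path and helper name are assumptions to adapt to the files you actually download.

```python
# Minimal sketch (Python/pandas) for reading one frequency of the M4 training data.
# The CSV layout below is an assumption based on the files distributed for M4;
# adjust the path and parsing to your local copy of the dataset.
import pandas as pd

def load_m4_frequency(path: str) -> dict:
    """Return {series_id: list of observations} for one M4 training file."""
    raw = pd.read_csv(path)
    series = {}
    for _, row in raw.iterrows():
        sid = row.iloc[0]                        # e.g. "Y1", "Q123", "M4000"
        values = row.iloc[1:].dropna().tolist()  # drop the padding cells
        series[sid] = values
    return series

# Example (hypothetical local path):
# yearly = load_m4_frequency("Dataset/Train/Yearly-train.csv")
# print(len(yearly), "series loaded")
```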
The M4 Competition: The dataset (2/2)
[Figure: 2D visualization of the time series in the Feature Space of Kang et al. (2017), with separate panels for Yearly, Quarterly, Monthly and Hourly series; features used: Frequency, Seasonality, Trend, Randomness, ACF1 & Box-Cox λ]
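For readers who want to reproduce a similar picture, the sketch below computes a handful of simplified features per series and projects them to two dimensions with PCA. The feature definitions are stand-ins in the spirit of Kang et al. (2017), not the exact ones used for the slide, and the calls assume statsmodels, scipy and scikit-learn are available.

```python
# Rough sketch of a feature-space projection in the spirit of Kang et al. (2017):
# compute simple features per series and project them to 2D with PCA.
# Requires seasonal data (period >= 2) with at least two full periods per series.
import numpy as np
from scipy.stats import boxcox
from statsmodels.tsa.stattools import acf
from statsmodels.tsa.seasonal import STL
from sklearn.decomposition import PCA

def simple_features(y, period):
    y = np.asarray(y, dtype=float)
    stl = STL(y, period=period).fit()                      # trend / seasonal / remainder split
    var_r = np.var(stl.resid)
    trend = max(0.0, 1 - var_r / np.var(stl.trend + stl.resid))      # strength of trend
    season = max(0.0, 1 - var_r / np.var(stl.seasonal + stl.resid))  # strength of seasonality
    acf1 = acf(y, nlags=1, fft=False)[1]                   # first autocorrelation
    lam = boxcox(y - y.min() + 1)[1]                       # Box-Cox lambda (series shifted to be positive)
    randomness = var_r / np.var(y)
    return [trend, season, acf1, lam, randomness, len(y)]

def project_to_2d(series_list, period):
    X = np.array([simple_features(y, period) for y in series_list])
    X = (X - X.mean(axis=0)) / X.std(axis=0)               # standardize before PCA
    return PCA(n_components=2).fit_transform(X)            # coordinates for a scatter plot
```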
The M4 Competition: The rules
✓ Produce point forecasts for the whole dataset – mandatory. Forecasting horizons as follows:
• 6 for yearly
• 8 for quarterly (2 years)
• 18 for monthly (1.5 years)
• 13 for weekly (3 months)
• 14 for daily (2 weeks)
• 48 for hourly data (2 days)
✓ Estimate prediction intervals (95% confidence) for the whole dataset – optional
✓ Submit before the deadline through the M4 site using a pre-defined file format
✓ Submit the code used to generate the forecasts, as well as a detailed method description, for reasons of reproducibility – optional but highly recommended. The supplementary material must be uploaded to the M4 GitHub* repo no later than the 10th of June, 2018
* https://github.com/M4Competition
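The horizons above translate directly into a small lookup table; the sketch below encodes them and performs a basic shape check on a submission. The helper name and the submission structure are illustrative assumptions, not the official M4 validator.

```python
# Horizons per frequency, as listed in the rules above, plus a simple sanity check
# on a {series_id: forecasts} dictionary (illustrative only, not the official format check).
M4_HORIZONS = {
    "Yearly": 6,     # 6 years
    "Quarterly": 8,  # 2 years
    "Monthly": 18,   # 1.5 years
    "Weekly": 13,    # ~3 months
    "Daily": 14,     # 2 weeks
    "Hourly": 48,    # 2 days
}

def check_submission(forecasts: dict, frequency: str) -> None:
    """forecasts: {series_id: list of point forecasts} for one frequency."""
    h = M4_HORIZONS[frequency]
    for sid, f in forecasts.items():
        assert len(f) == h, f"{sid}: expected {h} forecasts, got {len(f)}"
```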
The M4 Competition – Evaluation: Point Forecasts
Overall Weighted Average (OWA) of two accuracy measures:
• Mean Absolute Scaled Error (MASE)
• symmetric Mean Absolute Percentage Error (sMAPE)

$$\mathrm{sMAPE}=\frac{1}{h}\sum_{t=1}^{h}\frac{2\left|Y_t-\hat{Y}_t\right|}{\left|Y_t\right|+\left|\hat{Y}_t\right|}$$

$$\mathrm{MASE}=\frac{1}{h}\,\frac{\sum_{t=1}^{h}\left|Y_t-\hat{Y}_t\right|}{\frac{1}{n-m}\sum_{t=m+1}^{n}\left|Y_t-Y_{t-m}\right|}$$

where $Y_t$ is the post-sample value of the time series at point t, $\hat{Y}_t$ the estimated forecast, h the forecasting horizon, n the number of in-sample observations and m the frequency of the data
➢ Estimate MASE and sMAPE per series by averaging the errors computed at each forecasting horizon
➢ Divide the errors by those of Naïve 2 (Relative MASE and Relative sMAPE)
➢ Compute the OWA by averaging the Relative MASE and the Relative sMAPE (a short computational sketch follows)
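A minimal sketch of these steps is given below, assuming the Naïve 2 forecasts are supplied as an input (their seasonal-adjustment step is not re-implemented here). For simplicity a per-series OWA is computed; in the official evaluation the sMAPE and MASE values are aggregated across all series before the ratios are taken.

```python
# Sketch of sMAPE, MASE and OWA for a single series, following the formulas above.
import numpy as np

def smape(actual, forecast):
    actual, forecast = np.asarray(actual, float), np.asarray(forecast, float)
    # multiply by 100 to obtain the percentage values reported in the result tables
    return np.mean(2 * np.abs(actual - forecast) / (np.abs(actual) + np.abs(forecast)))

def mase(insample, actual, forecast, m):
    insample = np.asarray(insample, float)
    scale = np.mean(np.abs(insample[m:] - insample[:-m]))   # mean absolute seasonal difference
    return np.mean(np.abs(np.asarray(actual, float) - np.asarray(forecast, float))) / scale

def owa(insample, actual, forecast, naive2_forecast, m):
    rel_smape = smape(actual, forecast) / smape(actual, naive2_forecast)
    rel_mase = mase(insample, actual, forecast, m) / mase(insample, actual, naive2_forecast, m)
    return 0.5 * (rel_smape + rel_mase)
```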
The M4 Competition – Evaluation: Prediction Intervals
Mean Scaled Interval Score (MSIS):

$$\mathrm{MSIS}=\frac{\frac{1}{h}\sum_{t=1}^{h}\left[(U_t-L_t)+\frac{2}{a}(L_t-Y_t)\,\mathbf{1}\{Y_t<L_t\}+\frac{2}{a}(Y_t-U_t)\,\mathbf{1}\{Y_t>U_t\}\right]}{\frac{1}{n-m}\sum_{t=m+1}^{n}\left|Y_t-Y_{t-m}\right|}$$

where $L_t$ and $U_t$ are the Lower and Upper bounds of the prediction intervals, $Y_t$ the future observations of the series, $a$ the significance level (0.05) and $\mathbf{1}$ the indicator function (equal to 1 when the real value falls outside the postulated interval and 0 otherwise).
➢ A penalty is calculated at the points where the real values fall outside the specified bounds
➢ The width of the prediction interval is added to the penalty, if any, to get the IS
➢ The IS values estimated at the individual points are averaged to get the MIS value
➢ MIS is scaled by dividing it by the mean absolute seasonal difference of the series
➢ The MSIS of all series is averaged to evaluate the overall performance of the method (a short sketch follows)
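A minimal single-series sketch of this computation, following the formula above with a = 0.05 for 95% intervals:

```python
# Sketch of the MSIS computation for a single series.
import numpy as np

def msis(insample, actual, lower, upper, m, a=0.05):
    insample = np.asarray(insample, float)
    actual, lower, upper = (np.asarray(x, float) for x in (actual, lower, upper))
    width = upper - lower
    penalty_low = (2.0 / a) * (lower - actual) * (actual < lower)   # real value below the interval
    penalty_up = (2.0 / a) * (actual - upper) * (actual > upper)    # real value above the interval
    mis = np.mean(width + penalty_low + penalty_up)
    scale = np.mean(np.abs(insample[m:] - insample[:-m]))           # mean absolute seasonal difference
    return mis / scale
```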
The M4 Competition: The benchmarks
10 benchmarks were used to facilitate comparisons with the participating methods: 7 classic Statistical methods, 1 Combination and 2 simplified Machine Learning ones
1. Naïve 1 (S) – used to compare all methods (Prediction Intervals)
2. Seasonal Naïve (S)
3. Naïve 2 (S) – reference for estimating OWA
4. Simple Exponential Smoothing (S)
5. Holt's Exponential Smoothing (S)
6. Damped Exponential Smoothing (S)
7. Combination of 4, 5 and 6 (C) – used to compare all methods (Point Forecasts)*
8. Theta (S)
9. MLP (ML)
10. RNN (ML)
*Accurate, robust, simple & easy to understand (a sketch of the Comb benchmark follows)
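The sketch below illustrates the idea behind the Comb benchmark: the arithmetic average of Simple, Holt's and Damped exponential smoothing. Hand-rolled recursions with fixed smoothing parameters are used for brevity; the official benchmark implementations (available in the M4 GitHub repo) optimize the parameters and handle seasonality, so treat this only as a sketch of the concept.

```python
# Illustrative sketch of the "Comb" benchmark: average of SES, Holt and Damped ES.
import numpy as np

def ses(y, h, alpha=0.3):
    level = y[0]
    for obs in y[1:]:
        level = alpha * obs + (1 - alpha) * level
    return np.full(h, level)

def holt(y, h, alpha=0.3, beta=0.1, phi=1.0):
    level, trend = y[0], y[1] - y[0]
    for obs in y[1:]:
        prev_level = level
        level = alpha * obs + (1 - alpha) * (prev_level + phi * trend)
        trend = beta * (level - prev_level) + (1 - beta) * phi * trend
    steps = np.arange(1, h + 1)
    damping = np.cumsum(phi ** steps) if phi < 1 else steps   # damped vs linear trend
    return level + damping * trend

def damped(y, h, alpha=0.3, beta=0.1, phi=0.9):
    return holt(y, h, alpha, beta, phi)

def comb_benchmark(y, h):
    y = np.asarray(y, dtype=float)
    return (ses(y, h) + holt(y, h) + damped(y, h)) / 3.0
```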
The M4 Competition: The prizes
Six prizes, standing in total at 27,000 €

| Prize | Description | Amount |
| 1st Prize | Best performing method according to OWA | 9,000 € |
| 2nd Prize | Second-best performing method according to OWA | 4,000 € |
| 3rd Prize | Third-best performing method according to OWA | 2,000 € |
| Prediction Intervals Prize | Best performing method according to MSIS | 5,000 € |
| The UBER Student Prize | Best performing method according to OWA | 5,000 € |
| The Amazon Prize | Best reproducible forecasting method according to OWA | 2,000 € |
The M4 Competition: The participants (1/2)
[Figure: bar chart of the number of participants]
✓ 50 submissions (20 with PIs)
✓ 17 countries
The M4 Competition: The participants (2/2)
[Figure: number of participants per method type (Combination, Statistical, Machine Learning, Other) and per affiliation type (University, Company-Organization, Individual)]
✓ The majority utilized statistical methods or combinations (of both Statistical and ML models), and only a few used pure ML methods*
✓ More than half of the participants were affiliated with academia; the rest were either companies or individuals
*These are rough classifications – more work is needed to verify them
Evaluation of submissions – Point Forecasts: Rankings (1/5)

| Rank | Team | Affiliation | Method | sMAPE | MASE | OWA | Diff from Comb (%) |
| 1 | Smyl | Uber Technologies | Hybrid | 11.37 | 1.54 | 0.821 | -8.52 |
| 2 | Montero-Manso et al. | University of A Coruña & Monash University | Comb (S & ML) | 11.72 | 1.55 | 0.838 | -6.65 |
| 3 | Pawlikowski et al. | ProLogistica Soft | Comb (S) | 11.84 | 1.55 | 0.841 | -6.25 |
| 4 | Jaganathan & Prakash | Individual | Comb (S & ML) | 11.70 | 1.57 | 0.842 | -6.17 |
| 5 | Fiorucci, J. A. & Louzada | University of Brasilia & University of São Paulo | Comb (S) | 11.84 | 1.55 | 0.843 | -6.10 |
| 6 | Petropoulos & Svetunkov | University of Bath & Lancaster University | Comb (S) | 11.89 | 1.57 | 0.848 | -5.55 |
| 7 | Shaub | Harvard Extension School | Comb (S) | 12.02 | 1.60 | 0.860 | -4.13 |
| 8 | Legaki & Koutsouri | National Technical University of Athens | Statistical | 11.99 | 1.60 | 0.861 | -4.11 |
| 9 | Doornik et al. | University of Oxford | Comb (S) | 11.92 | 1.63 | 0.865 | -3.62 |
| 10 | Pedregal et al. | University of Castilla-La Mancha | Comb (S) | 12.11 | 1.61 | 0.869 | -3.19 |
| 11 | 4Theta (Benchmark) | - | Statistical | 12.15 | 1.63 | 0.874 | -2.65 |
| 12 | Roubinchtein | Washington State Employment Security Department | Comb (S) | 12.18 | 1.63 | 0.876 | -2.38 |
| 13 | Ibrahim | Georgia Institute of Technology | Statistical | 12.20 | 1.64 | 0.880 | -1.97 |
| 14 | Tartu M4 seminar | University of Tartu | Comb (S & ML) | 12.50 | 1.63 | 0.888 | -1.09 |
| 15 | Waheeb | Universiti Tun Hussein Onn Malaysia | Comb (S) | 12.15 | 1.71 | 0.894 | -0.40 |
Evaluation of submissions – Point Forecasts: Rankings (2/5)

| Rank | Team | Affiliation | Method | sMAPE | MASE | OWA | Diff from Comb (%) |
| 16 | Darin & Stellwagen | Business Forecast Systems (Forecast Pro) | Statistical | 12.28 | 1.69 | 0.895 | 0.25 |
| 17 | Dantas & Cyrino Oliveira | Pontifical Catholic University of Rio de Janeiro | Comb (S) | 12.55 | 1.66 | 0.896 | 0.19 |
| 18 | Theta (Benchmark) | - | Statistical | 12.31 | 1.70 | 0.897 | 0.03 |
| 19 | Comb (Benchmark) | - | Comb (S) | 12.55 | 1.66 | 0.898 | 0.00 |
| 20 | Nikzad, A. | Scarsin (i2e) | Comb (S) | 12.37 | 1.72 | 0.907 | -1.01 |
| 21 | Damped (Benchmark) | - | Statistical | 12.66 | 1.68 | 0.907 | -1.02 |
| 22 | Segura-Heras et al. | Universidad Miguel Hernández & Universitat de Valencia | Comb (S) | 12.51 | 1.72 | 0.910 | -1.38 |
| 23 | Trotta | Individual | Machine Learning | 12.89 | 1.68 | 0.915 | -1.94 |
| 24 | Chen & Francis | Fordham University | Comb (S) | 12.55 | 1.73 | 0.915 | -1.96 |
| 25 | Svetunkov et al. | Lancaster University & University of Newcastle | Comb (S) | 12.46 | 1.74 | 0.916 | -2.01 |
| 26 | Talagala et al. | Monash University | Statistical | 12.90 | 1.69 | 0.917 | -2.12 |
| 27 | Sui & Rengifo | Fordham University | Comb (S) | 12.85 | 1.74 | 0.930 | -3.56 |
| 28 | Kharaghani | Individual | Comb (S) | 13.06 | 1.72 | 0.930 | -3.63 |
| 29 | Smart Forecast | Smart Cube | Comb (S) | 13.21 | 1.79 | 0.955 | -6.34 |
| 30 | Wainwright et al. | Oracle Corporation (Crystal Ball) | Statistical | 13.34 | 1.80 | 0.962 | -7.15 |
Evaluation of submissions – Point Forecasts Rankings (3/5)
Top 6 performing methods
Smyl, S.
• Hybrid model mixing Exponential Smoothing with an LSTM – estimated concurrently
• Hierarchical modeling – parameters estimated using information both from the whole dataset and from the individual series | combinations are also considered
Montero-Manso, P., Talagala, T., Hyndman, R. J. & Athanasopoulos, G.
• Weighted average of ARIMA, ETS, TBATS, Theta, naïve, seasonal naïve, NN and LSTM
• Weights estimated through a gradient boosting tree (xgboost) using holdout tests
Pawlikowski, M., Chorowska, A. & Yanchuk, O.
• Weighted average of several statistical methods using holdout tests
• Pool defined based on time series characteristics / manual selection
Jaganathan, S. & Prakash, P.
• Combination of statistical methods as described in Armstrong, J. S. (2001)
Fiorucci, J. A. & Louzada, F.
• Weighted average of ARIMA, ETS & Theta
• Weights estimated using cross-validation
Petropoulos, F. & Svetunkov, I.
• Median of ETS, CES, ARIMA & Theta
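Most of these submissions are variations of forecast combination: average, weight or take the median of several base forecasts. The sketch below illustrates only that general idea with simple placeholder forecasters; it is not the code of any submission (which combined ETS, CES, ARIMA, Theta, LSTMs and more, mostly in R).

```python
# Minimal sketch of a median (or weighted-average) forecast combination.
# The base forecasters here are simple placeholders, not the models used by the teams.
import numpy as np

def naive_forecast(y, h):
    return np.full(h, y[-1])

def drift_forecast(y, h):
    slope = (y[-1] - y[0]) / (len(y) - 1)
    return y[-1] + slope * np.arange(1, h + 1)

def mean_forecast(y, h):
    return np.full(h, np.mean(y))

def combine(y, h, weights=None):
    """Stack the individual forecasts and combine them point by point."""
    y = np.asarray(y, float)
    forecasts = np.vstack([f(y, h) for f in (naive_forecast, drift_forecast, mean_forecast)])
    if weights is None:
        return np.median(forecasts, axis=0)          # robust, equal-weight combination
    w = np.asarray(weights, float)
    return (w[:, None] * forecasts).sum(axis=0) / w.sum()
```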
Evaluation of submissions – Point Forecasts: Rankings (4/5)
Spearman's correlation coefficient of the rankings

| Correlation | sMAPE | MASE | OWA |
| sMAPE | - | - | - |
| MASE | 0.88 | - | - |
| OWA | 0.94 | 0.98 | - |
The final ranks, according to both MASE and sMAPE, are highly correlated with those based on OWA, meaning that either measure can be used as a proxy for the relative performance of the individual methods
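As an illustration of how such rank correlations can be computed, the short sketch below uses scipy's spearmanr; the score vectors simply reuse the first five rows of the ranking table above.

```python
# Spearman rank correlation between two accuracy measures over the same methods.
from scipy.stats import spearmanr

owa_scores   = [0.821, 0.838, 0.841, 0.842, 0.843]   # first five OWA values from the table above
smape_scores = [11.37, 11.72, 11.84, 11.70, 11.84]   # corresponding sMAPE values

rho, p_value = spearmanr(owa_scores, smape_scores)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
```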
Evaluation of submissions – Point Forecasts: Rankings (5/5)
Multiple Comparisons with the Best (MCB)
[Figure: MCB plot of the average OWA ranks; participants shown: Montero-Manso (#2), Fiorucci (#5), Pawlikowski (#3), Jaganathan (#4), Smyl (#1) and Petropoulos (#6), together with the benchmarks RNN, Comb, Damped, Theta, Holt, MLP, SES, Naive2, sNaive and Naive]
✓ The forecasts of the first six methods did not differ statistically
✓ Apart from these methods, the improvements of the rest over the benchmarks were minor
Evaluation of submissions – Point Forecasts: What about complexity? (future work)
Does sub-optimality matter? (Nikolopoulos & Petropoulos, 2017)
Forecasting performance (sMAPE) versus computational complexity (Makridakis et al., 2018)
Comparing different types of methods: Median performance per Frequency & Domain
✓ In general, Combinations produced more accurate forecasts than the rest of the methods, regardless of the frequency and the domain of the data
✓ Out of the 17 methods that did better than the benchmarks, 12 were Combinations, 4 were Statistical and 1 was Hybrid
✓ Only 1 pure ML method performed better than Naive2

| Type of Method | Yearly | Quarterly | Monthly | Weekly | Daily | Hourly | Total |
| Statistical | 0.93 | 0.93 | 0.95 | 0.97 | 1.00 | 1.00 | 0.97 |
| Machine Learning | 1.27 | 1.16 | 1.20 | 1.00 | 1.93 | 0.92 | 1.48 |
| Combination | 0.87 | 0.90 | 0.92 | 0.90 | 1.02 | 0.65 | 0.91 |
| Other | 0.99 | 1.92 | 1.77 | 8.88 | 9.16 | 2.79 | 1.80 |

| Type of Method | Macro | Micro | Demographic | Industry | Finance | Other | Total |
| Statistical | 0.95 | 0.98 | 0.95 | 0.99 | 0.97 | 0.97 | 0.98 |
| Machine Learning | 1.20 | 1.16 | 1.44 | 1.43 | 1.41 | 1.56 | 1.48 |
| Combination | 0.90 | 0.89 | 0.90 | 0.93 | 0.92 | 0.91 | 0.91 |
| Other | 1.64 | 1.81 | 1.93 | 1.55 | 2.04 | 1.76 | 1.80 |
Comparing different types of methods: Top 3 per Frequency & Domain
[Table: the three best-performing submissions per frequency (Yearly, Quarterly, Monthly, Weekly, Daily, Hourly) and per domain (Macro, Micro, Demographic, Industry, Finance, Other); the entries include Smyl, S. (#1), Montero-Manso, P. (#2), Pawlikowski, M. (#3), Jaganathan, S. (#4), Fiorucci, J. A. (#5), Petropoulos, F. (#6), Legaki, N. Z. (#8), Doornik, J. (#9), Tartu M4 seminar (#14) and Darin, S. (#16), colour-coded as Statistical or Combination]
➢ Although the best performing methods for the whole dataset were also very accurate for the individual subsets, in many cases they were outperformed by other methods with a much lower rank – no single method fits them all
Impact of forecasting horizon
[Figure: average sMAPE per forecasting horizon across 60 methods (benchmarks & submissions), by frequency]

| Frequency | Deterioration per period (%) |
| Yearly | 20 |
| Quarterly | 13 |
| Monthly | 6 |
| Weekly | 7 |
| Daily | 14 |
| Hourly | 1 |

✓ The length of the forecasting horizon has a great impact on forecasting accuracy
✓ Only for hourly data did ML methods become competitive
Impact of time series characteristics*
Average impact on forecasting accuracy (regression coefficient) per time series characteristic – k methods per type × 100,000 observations

| Type of Method | Randomness | Trend | Seasonality | Linearity | Stability | Length |
| Machine Learning | 0.20 | -0.10 | -0.04 | 0.14 | -0.05 | -0.08 |
| Statistical | 0.18 | -0.08 | -0.02 | 0.09 | -0.04 | 0.15 |
| Combination | 0.17 | -0.09 | -0.02 | 0.10 | -0.03 | -0.02 |
| Total | 0.18 | -0.08 | -0.02 | 0.10 | -0.04 | 0.06 |

*$\mathrm{sMAPE} = a \cdot \mathrm{Randomness} + b \cdot \mathrm{Trend} + \dots + f \cdot \mathrm{Length}$

Machine Learning: more data, better forecasts; not robust for noisy and linear series; good for seasonal series
Combinations: robust for noisy data; bad at capturing seasonality
Statistical: bad for trended & seasonal series; good at modeling linear patterns; the less the data the better (use only the most recent observations)
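The coefficients above come from regressing accuracy on time-series characteristics. The sketch below shows the kind of ordinary-least-squares fit involved, using random placeholder data in place of the real features; the slide does not specify details such as feature scaling, so this is illustrative only.

```python
# Illustrative OLS regression of per-series sMAPE on time-series characteristics.
# Random placeholder data stand in for the real features and errors.
import numpy as np

rng = np.random.default_rng(0)
n_series = 1000
features = rng.normal(size=(n_series, 6))            # Randomness, Trend, Seasonality, Linearity, Stability, Length
smape_values = rng.normal(loc=12, scale=3, size=n_series)

X = features                                          # the slide's formula has no intercept term
coefs, *_ = np.linalg.lstsq(X, smape_values, rcond=None)
names = ["Randomness", "Trend", "Seasonality", "Linearity", "Stability", "Length"]
for name, c in zip(names, coefs):
    print(f"{name:12s} {c:+.3f}")
```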
Evaluation of submissions – Prediction Intervals: Rankings

| Rank | Team | Affiliation | Method | MSIS | Coverage | Diff from Naive (%) |
| 1 | Smyl | Uber Technologies | Hybrid | 12.23 | 94.78% | 49.2% |
| 2 | Montero-Manso et al. | University of A Coruña & Monash University | Comb (S & ML) | 14.33 | 95.96% | 40.4% |
| 3 | Doornik et al. | University of Oxford | Comb (S) | 15.18 | 90.70% | 36.9% |
| 4 | ETS (benchmark) | - | Statistical | 15.68 | 91.27% | 34.8% |
| 5 | Fiorucci & Louzada | University of Brasilia & University of São Paulo | Comb (S) | 15.69 | 88.52% | 34.8% |
| 6 | Petropoulos & Svetunkov | University of Bath & Lancaster University | Comb (S) | 15.98 | 87.81% | 33.6% |
| 7 | Roubinchtein | Washington State Employment Security Department | Comb (S) | 16.50 | 88.93% | 31.4% |
| 8 | Talagala et al. | Monash University | Statistical | 18.43 | 86.48% | 23.4% |
| 9 | ARIMA (benchmark) | - | Statistical | 18.68 | 85.80% | 22.3% |
| 10 | Ibrahim | Georgia Institute of Technology | Statistical | 20.20 | 85.62% | 16.0% |
| 11 | Iqbal et al. | Wells Fargo Securities | Statistical | 22.00 | 86.41% | 8.5% |
| 12 | Reilly | Automatic Forecasting Systems, Inc. (AutoBox) | Statistical | 22.37 | 82.87% | 7.0% |
| 13 | Wainwright et al. | Oracle Corporation (Crystal Ball) | Statistical | 22.67 | 82.99% | 5.7% |
| 14 | Segura-Heras et al. | Universidad Miguel Hernández & Universitat de Valencia | Comb (S) | 22.72 | 90.10% | 5.6% |
| 15 | Naïve (benchmark) | - | Statistical | 24.05 | 86.40% | 0.0% |
Evaluation of submissions – Prediction Intervals: Median performance per Frequency
✓ Apart from the first two methods, the rest underestimated reality considerably
✓ On average, the coverage of the methods was only 86.4% (the target is 95%)
✓ Estimating uncertainty was more difficult for low frequency data, especially for the yearly series – limited sample & longer forecasting horizon
Evaluation of submissions – Prediction Intervals: Median performance per Domain
✓ Demographic and Industry data were easier to predict – slower changes and fewer fluctuations
✓ Micro & Finance data are characterized by the highest levels of uncertainty – a challenge for business forecasting
Impact of forecasting horizon
[Figure: average coverage per forecasting horizon across 23 methods (benchmarks & submissions)]
✓ The length of the forecasting horizon has a great impact on estimating the PIs correctly, especially for yearly, quarterly & monthly data
Conclusions: Five major findings
✓ Hybrid methods, utilizing basic principles of statistical models together with ML components, have great potential
✓ Combining forecasts of different methods significantly improves forecasting accuracy
✓ Pure ML methods are inadequate for time series forecasting
✓ Prediction intervals underestimate reality considerably
✓ The accuracy of individual statistical or ML methods is low; hybrid approaches and combinations of methods are the way forward to improve forecasting accuracy and make forecasting more valuable
Conclusions: …and some minor, yet important ones
✓ Complex methods did better than simple ones, but the improvements were not exceptional. Given the computational resources required, one can question whether they are also practical.
✓ The forecasting horizon has a negative effect on forecasting accuracy – both for point forecasts and PIs
✓ When using large samples, the variations reported between different error measures were insignificant
✓ Different methods should be used per series according to their characteristics, as well as their frequency and domain. Yet, learning from the masses seems mandatory.
✓ The majority of the forecasters exploited traditional forecasting approaches and mostly experimented with how to combine them
Next Steps
➢ Understand why hybrid methods work better in order to advance them further and improve their forecasting performance
➢ Figure out how combinations should be performed and where the emphasis should be given – pool or weights?
➢ Study the elements of the top performing methods in terms of PIs and learn how to exploit and advance their features to better capture uncertainty
➢ Accept the drawbacks of ML methods and reveal ways to utilize their advantages in time series forecasting
➢ Experiment and discover new, more accurate forecasting approaches
Thank you for your attention Questions? If you would like to learn more about M4 visit
https://www.m4.unic.ac.cy/
or contact me at
[email protected]
References
• Armstrong, J. S., Green, K. C. & Graefe, A. (2015). Golden rule of forecasting: Be conservative. Journal of Business Research, 68(8), 1717-1731
• Armstrong, J. S. (2001). Combining forecasts. Retrieved from https://repository.upenn.edu/marketing_papers/34
• Athanasopoulos, G., Hyndman, R.J., Song, H. & Wu, D.C. (2011). The tourism forecasting competition. International Journal of Forecasting, 27(3), 822-844
• Athanasopoulos, G. & Hyndman, R.J. (2011). The value of feedback in forecasting competitions. International Journal of Forecasting, 27(3), 845-849
• Crone, S. F., Hibon, M. & Nikolopoulos, K. (2011). Advances in forecasting with neural networks? Empirical evidence from the NN3 competition on time series prediction. International Journal of Forecasting, 27(3), 635-660
• Fildes, R. & Petropoulos, F. (2015). Simple versus complex selection rules for forecasting many time series. Journal of Business Research, 68(8), 1692-1701
• Gneiting, T. & Raftery, A. E. (2007). Strictly Proper Scoring Rules, Prediction, and Estimation. Journal of the American Statistical Association, 102(477), 359-378
• Hyndman, R. J. & Koehler, A. B. (2006). Another look at measures of forecast accuracy. International Journal of Forecasting, 22(4), 679-688
• Kang, Y., Hyndman, R.J. & Smith-Miles, K. (2017). Visualising forecasting algorithm performance using time series instance spaces. International Journal of Forecasting, 33(2), 345-358
• Makridakis, S., Hibon, M. & Moser, C. (1979). Accuracy of Forecasting: An Empirical Investigation. Journal of the Royal Statistical Society, Series A (General), 142(2), 97-145
• Makridakis, S., Andersen, A., Carbone, R., Fildes, R., Hibon, M. et al. (1982). The accuracy of extrapolation (time series) methods: Results of a forecasting competition. Journal of Forecasting, 1, 111-153
• Makridakis, S., Chatfield, C., Hibon, M., Lawrence, M., Mills, T. et al. (1993). The M2-competition: A real-time judgmentally based forecasting study. International Journal of Forecasting, 9(1), 5-22
• Makridakis, S. & Hibon, M. (2000). The M3-Competition: Results, conclusions and implications. International Journal of Forecasting, 16(4), 451-476
• Makridakis, S., Spiliotis, E. & Assimakopoulos, V. (2018). Statistical and Machine Learning forecasting methods: Concerns and ways forward. PLOS ONE, 13(3), 1-26
• Montero-Manso, P., Netto, C. & Talagala, T. (2018). M4comp2018: Data from the M4-Competition. R package version 0.1.0
• Newbold, P. & Granger, C. (1974). Experience with Forecasting Univariate Time Series and the Combination of Forecasts. Journal of the Royal Statistical Society, Series A (General), 137(2), 131-165
• Nikolopoulos, K. & Petropoulos, F. (2017). Forecasting for big data: Does suboptimality matter? Computers & Operations Research (in press)
• Petropoulos, F., Makridakis, S., Assimakopoulos, V. & Nikolopoulos, K. (2014). 'Horses for Courses' in demand forecasting. European Journal of Operational Research, 237(1), 152-163
• Spiliotis, E., Patikos, A., Assimakopoulos, V. & Kouloumos, A. (2017). Data as a service: Providing new datasets to the forecasting community for time series analysis. 37th International Symposium on Forecasting, Cairns, Australia