The M4 Competition in Progress: Forecast. Compete. Excel.
Evangelos Spiliotis (National Technical University of Athens, Forecasting & Strategy Unit), Spyros Makridakis (University of Nicosia, Institute for the Future), Vassilios Assimakopoulos (National Technical University of Athens, Forecasting & Strategy Unit)
38th International Symposium on Forecasting, Boulder, Colorado, USA, June 2018
The quest for the holy grail: What do we forecast?
The performance of forecasting methods strongly depends on the
o Domain
o Frequency
o Length
o Characteristics
o ???
of the time series being examined, as well as on various strategic decisions, such as the forecasting horizon, the computation time allowed (complexity) and the relevant information available
The quest for the holy grail: What kind of method should we use?
Too many types of methods and alternatives
o Statistical
o Machine Learning
o Combination
o Judgmental
with contradictory results in the literature
Even if we knew which method is best for the examined application in general, considerable work would still be needed to properly select and parameterize the forecasting model, as well as to pre-process the data
The quest for the holy grail: Is there a golden rule or some best practices?
“ignorance of research findings, bias, sophisticated statistical procedures, and the proliferation of big data, have led forecasters to violate the Golden Rule. As a result, …, forecasting practice in many fields has failed to improve over the past half-century”.
Golden rule of forecasting: Be conservative (Armstrong et al., 2015)
“identify the main determinants of forecasting accuracy considering seven time series features and the forecasting horizon”
‘Horses for Courses’ in demand forecasting (Petropoulos et al., 2014)
“investigate which individual model selection is beneficial and when this approach should be preferred to aggregate selection or combination”
Simple versus complex selection rules for forecasting many time series (Fildes & Petropoulos, 2015)
Evaluating Forecasting Performance: We need benchmarks...
New methods and forecasting approaches must perform well on well-known, diverse and representative data sets
This is exactly the scope of forecasting competitions: learn how to improve forecasting accuracy, and how such learning can be applied to advance the theory and practice of forecasting
✓ Encourage researchers and practitioners to develop new and more accurate forecasting methods
✓ Compare popular forecasting methods with new alternatives
✓ Document state-of-the-art methods and forecasting techniques used in academia and industry
✓ Identify best practices
✓ Set new research questions and try to provide proper answers
Evaluating Forecasting Performance: Competitions will always be helpful...
➢ There will always be features of time series forecasting not previously studied under competition conditions
➢ There will always be new methods to be evaluated and validated
➢ As new performance metrics and statistical tests come to light, the results of previous competitions will always be put into question
➢ Technological advances affect the way forecasting is performed and enable more advanced, complex and computationally intensive approaches that were previously inapplicable
➢ Exploding data volumes influence forecasting and its applications (more data to learn from, unstructured data sources, abnormal time series, new forecasting needs)
The history of time series forecasting competitions Establishing the idea of forecasting competitions
Makridakis and Hibon (1979) • No participants • 111 time series (yearly, quarterly & monthly) • 22 methods
Major findings • Simple methods do as well or better than sophisticated ones • Combining forecasts may improve forecasting accuracy • Special events have a negative impact on forecasting performance
The history of time series forecasting competitions: Establishing the idea of forecasting competitions
Automatic forecasting may be useless and less accurate than humans, while combining forecasts is quite risky
G. Jenkins
No one wants that accurate forecasts, nor has enough data to estimate them
G.J.A. Stern
A model (simple data generation process) can perfectly describe and extrapolate your time series if identified and applied correctly
M. B. Priestley
The history of time series forecasting competitions – M1: The first forecasting competition
Makridakis et al. (1982)
• Seven participants
• 1001 time series (yearly, quarterly & monthly)
• 15 methods (plus 9 variations)
• Not real-time
What's new?
• Real participants
• Many accuracy measures
Major findings
• Statistically sophisticated or complex methods do not necessarily provide more accurate forecasts than simpler ones.
• The relative ranking of the performance of the various methods varies according to the accuracy measure being used.
• The accuracy when various methods are combined outperforms, on average, the individual methods being combined and does very well in comparison to other methods.
• The accuracy of the various methods depends on the length of the forecasting horizon involved.
The history of time series forecasting competitions M2: Incorporating judgment
What's new?
• Combine statistical methods with judgment
• Ask questions to the companies involved
• Learn from previous errors and revise next forecasts accordingly
Makridakis et al. (1993)
• 29 time series
• 16 methods (human forecasters, automatic methods and combinations)
• Real time
Major findings • In most cases, forecasters failed to improve statistical forecasts based on their judgment • Simple methods perform better in most of the cases, with the results being in agreement with previous studies
The history of time series forecasting competitions M3: The forecasting benchmark “The M3 series have become the de facto standard test base in forecasting research. When any new univariate forecasting method is proposed, if it does not perform well on the M3 data compared to the results on other published algorithms, it is unlikely to receive any further attention or adoption.” (Kang, Hyndman & Smith-Miles, 2017)
What's new?
• More methods (NNs and FSSs)
• More series
Makridakis and Hibon (2000) • 3003 time series • 24 methods • Not real time
Major findings
• The results of the previous studies and competitions were largely confirmed.
• New methods, such as the Theta of Assimakopoulos & Nikolopoulos (2000), and FSSs, such as ForecastPro, proved their forecasting capabilities
• ANNs were relatively inaccurate
The history of time series forecasting competitions: Modern forecasting competitions
Neural network competitions (NN3, 2006) – Crone, Hibon & Nikolopoulos (2011): 111 monthly M3 series & 59 submissions
✓ No CI (computational intelligence) method outperformed the original M3 contestants
✓ NNs may be inadequate for time series forecasting, especially for short series
✓ No “best practices” identified for utilizing CI methods
Kaggle Competitions – Tourism Forecasting Competition: Athanasopoulos & Hyndman (2010); Web traffic (Wikipedia) competition: Anava & Kuznetsov (2017)
✓ Feedback significantly improves forecasting accuracy by providing motivation and fruitful feedback
✓ Fast results and conclusions
Status quo and next steps: So, what did we learn?
✓ Forecasting and time series analysis are two different things
✓ Models that produce more accurate forecasts should be preferred over those with better statistical properties
✓ Simple models work – especially for short series
✓ Out-of-sample and in-sample accuracy may differ significantly (avoid over-fitting)
✓ Automatic forecasting algorithms work rather well – especially for long time series
✓ Combining methods helps us deal with uncertainty
Status quo and next steps: What would also be useful to learn (or verify) through M4?
✓ What are the “best practices” nowadays?
✓ How have advances in technology and algorithms affected forecasting?
✓ Are there any new methods that could really make a difference?
✓ How about prediction intervals?
✓ What are the similarities and differences between the various forecasting methods, including ML ones?
✓ Are the data of the forecasting competitions representative? Do other, larger datasets support previous findings?
The M4 Competition: The dates
• Competition announced: Nov 1, 2017
• Competition starts: Jan 1, 2018
• Competition ends: May 31, 2018
• Preliminary results: Jun 18, 2018 (today)
• Final results and winners: Sep 28, 2018
• There was also a deadline extension (1 week) to encourage more participation
• Late submissions are not eligible for any prize
The M4 Competition: The dataset (1/2)

| Frequency | Micro | Industry | Macro | Finance | Demographic | Other | Total |
| Yearly | 6,538 | 3,716 | 3,903 | 6,519 | 1,088 | 1,236 | 23,000 |
| Quarterly | 6,020 | 4,637 | 5,315 | 5,305 | 1,858 | 865 | 24,000 |
| Monthly | 10,975 | 10,017 | 10,016 | 10,987 | 5,728 | 277 | 48,000 |
| Weekly | 112 | 6 | 41 | 164 | 24 | 12 | 359 |
| Daily | 1,476 | 422 | 127 | 1,559 | 10 | 633 | 4,227 |
| Hourly | - | - | - | - | - | 414 | 414 |
| Total | 25,121 | 18,798 | 19,402 | 24,534 | 8,708 | 3,437 | 100,000 |
✓ The largest forecasting competition, involving 100,000 business time series, to provide conclusions of statistical significance
✓ High frequency data, including Weekly, Daily and Hourly series
✓ Diverse time series collected from 23 reliable data sources & classified in 6 domains
*Data available at https://www.m4.unic.ac.cy/the-dataset/ or through the M4comp2018 R package
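As a rough illustration of how the published data files can be read, the sketch below assumes a CSV layout like the one distributed for M4 (one row per series, first column the series identifier, the remaining columns the observations, shorter series padded with blanks); the file path and helper name are assumptions to adapt to the files you actually download.

```python
# Minimal sketch (Python/pandas) for reading one frequency of the M4 training data.
# The CSV layout below is an assumption based on the files distributed for M4;
# adjust the path and parsing to your local copy of the dataset.
import pandas as pd

def load_m4_frequency(path: str) -> dict:
    """Return {series_id: list of observations} for one M4 training file."""
    raw = pd.read_csv(path)
    series = {}
    for _, row in raw.iterrows():
        sid = row.iloc[0]                        # e.g. "Y1", "Q123", "M4000"
        values = row.iloc[1:].dropna().tolist()  # drop the padding cells
        series[sid] = values
    return series

# Example (hypothetical local path):
# yearly = load_m4_frequency("Dataset/Train/Yearly-train.csv")
# print(len(yearly), "series loaded")
```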
The M4 Competition: The dataset (2/2)
[Figure: 2D visualization of the time series in the Feature Space of Kang et al. (2017), with separate panels for Yearly, Quarterly, Monthly and Hourly series; features used: Frequency, Seasonality, Trend, Randomness, ACF1 & Box-Cox λ]
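For readers who want to reproduce a similar picture, the sketch below computes a handful of simplified features per series and projects them to two dimensions with PCA. The feature definitions are stand-ins in the spirit of Kang et al. (2017), not the exact ones used for the slide, and the calls assume statsmodels, scipy and scikit-learn are available.

```python
# Rough sketch of a feature-space projection in the spirit of Kang et al. (2017):
# compute simple features per series and project them to 2D with PCA.
# Requires seasonal data (period >= 2) with at least two full periods per series.
import numpy as np
from scipy.stats import boxcox
from statsmodels.tsa.stattools import acf
from statsmodels.tsa.seasonal import STL
from sklearn.decomposition import PCA

def simple_features(y, period):
    y = np.asarray(y, dtype=float)
    stl = STL(y, period=period).fit()                      # trend / seasonal / remainder split
    var_r = np.var(stl.resid)
    trend = max(0.0, 1 - var_r / np.var(stl.trend + stl.resid))      # strength of trend
    season = max(0.0, 1 - var_r / np.var(stl.seasonal + stl.resid))  # strength of seasonality
    acf1 = acf(y, nlags=1, fft=False)[1]                   # first autocorrelation
    lam = boxcox(y - y.min() + 1)[1]                       # Box-Cox lambda (series shifted to be positive)
    randomness = var_r / np.var(y)
    return [trend, season, acf1, lam, randomness, len(y)]

def project_to_2d(series_list, period):
    X = np.array([simple_features(y, period) for y in series_list])
    X = (X - X.mean(axis=0)) / X.std(axis=0)               # standardize before PCA
    return PCA(n_components=2).fit_transform(X)            # coordinates for a scatter plot
```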
The M4 Competition: The rules
✓ Produce point forecasts for the whole dataset – mandatory. Forecasting horizons as follows:
• 6 for yearly
• 8 for quarterly (2 years)
• 18 for monthly (1.5 years)
• 13 for weekly (3 months)
• 14 for daily (2 weeks)
• 48 for hourly data (2 days)
✓ Estimate prediction intervals (95% confidence) for the whole dataset – optional
✓ Submit before the deadline through the M4 site using a pre-defined file format
✓ Submit the code used to generate the forecasts, as well as a detailed method description, for reasons of reproducibility – optional but highly recommended. The supplementary material must be uploaded to the M4 GitHub* repo no later than the 10th of June, 2018
* https://github.com/M4Competition
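The horizons above translate directly into a small lookup table; the sketch below encodes them and performs a basic shape check on a submission. The helper name and the submission structure are illustrative assumptions, not the official M4 validator.

```python
# Horizons per frequency, as listed in the rules above, plus a simple sanity check
# on a {series_id: forecasts} dictionary (illustrative only, not the official format check).
M4_HORIZONS = {
    "Yearly": 6,     # 6 years
    "Quarterly": 8,  # 2 years
    "Monthly": 18,   # 1.5 years
    "Weekly": 13,    # ~3 months
    "Daily": 14,     # 2 weeks
    "Hourly": 48,    # 2 days
}

def check_submission(forecasts: dict, frequency: str) -> None:
    """forecasts: {series_id: list of point forecasts} for one frequency."""
    h = M4_HORIZONS[frequency]
    for sid, f in forecasts.items():
        assert len(f) == h, f"{sid}: expected {h} forecasts, got {len(f)}"
```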
The M4 Competition – Evaluation: Point Forecasts
Overall Weighted Average (OWA) of two accuracy measures:
• Mean Absolute Scaled Error (MASE)
• symmetric Mean Absolute Percentage Error (sMAPE)

$$\mathrm{sMAPE}=\frac{1}{h}\sum_{t=1}^{h}\frac{2\left|Y_t-\hat{Y}_t\right|}{\left|Y_t\right|+\left|\hat{Y}_t\right|}$$

$$\mathrm{MASE}=\frac{1}{h}\,\frac{\sum_{t=1}^{h}\left|Y_t-\hat{Y}_t\right|}{\frac{1}{n-m}\sum_{t=m+1}^{n}\left|Y_t-Y_{t-m}\right|}$$

where $Y_t$ is the post-sample value of the time series at point t, $\hat{Y}_t$ the estimated forecast, h the forecasting horizon, n the number of in-sample observations and m the frequency of the data
➢ Estimate MASE and sMAPE per series by averaging the errors computed at each forecasting horizon
➢ Divide the errors by those of Naïve 2 (Relative MASE and Relative sMAPE)
➢ Compute the OWA by averaging the Relative MASE and the Relative sMAPE (a short computational sketch follows)
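A minimal sketch of these steps is given below, assuming the Naïve 2 forecasts are supplied as an input (their seasonal-adjustment step is not re-implemented here). For simplicity a per-series OWA is computed; in the official evaluation the sMAPE and MASE values are aggregated across all series before the ratios are taken.

```python
# Sketch of sMAPE, MASE and OWA for a single series, following the formulas above.
import numpy as np

def smape(actual, forecast):
    actual, forecast = np.asarray(actual, float), np.asarray(forecast, float)
    # multiply by 100 to obtain the percentage values reported in the result tables
    return np.mean(2 * np.abs(actual - forecast) / (np.abs(actual) + np.abs(forecast)))

def mase(insample, actual, forecast, m):
    insample = np.asarray(insample, float)
    scale = np.mean(np.abs(insample[m:] - insample[:-m]))   # mean absolute seasonal difference
    return np.mean(np.abs(np.asarray(actual, float) - np.asarray(forecast, float))) / scale

def owa(insample, actual, forecast, naive2_forecast, m):
    rel_smape = smape(actual, forecast) / smape(actual, naive2_forecast)
    rel_mase = mase(insample, actual, forecast, m) / mase(insample, actual, naive2_forecast, m)
    return 0.5 * (rel_smape + rel_mase)
```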
The M4 Competition – Evaluation: Prediction Intervals
Mean Scaled Interval Score (MSIS):

$$\mathrm{MSIS}=\frac{\frac{1}{h}\sum_{t=1}^{h}\left[(U_t-L_t)+\frac{2}{a}(L_t-Y_t)\,\mathbf{1}\{Y_t<L_t\}+\frac{2}{a}(Y_t-U_t)\,\mathbf{1}\{Y_t>U_t\}\right]}{\frac{1}{n-m}\sum_{t=m+1}^{n}\left|Y_t-Y_{t-m}\right|}$$

where $L_t$ and $U_t$ are the Lower and Upper bounds of the prediction intervals, $Y_t$ the future observations of the series, $a$ the significance level (0.05) and $\mathbf{1}$ the indicator function (equal to 1 when the real value falls outside the postulated interval and 0 otherwise).
➢ A penalty is calculated at the points where the real values fall outside the specified bounds
➢ The width of the prediction interval is added to the penalty, if any, to get the IS
➢ The IS values estimated at the individual points are averaged to get the MIS value
➢ MIS is scaled by dividing it by the mean absolute seasonal difference of the series
➢ The MSIS of all series is averaged to evaluate the overall performance of the method (a short sketch follows)
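A minimal single-series sketch of this computation, following the formula above with a = 0.05 for 95% intervals:

```python
# Sketch of the MSIS computation for a single series.
import numpy as np

def msis(insample, actual, lower, upper, m, a=0.05):
    insample = np.asarray(insample, float)
    actual, lower, upper = (np.asarray(x, float) for x in (actual, lower, upper))
    width = upper - lower
    penalty_low = (2.0 / a) * (lower - actual) * (actual < lower)   # real value below the interval
    penalty_up = (2.0 / a) * (actual - upper) * (actual > upper)    # real value above the interval
    mis = np.mean(width + penalty_low + penalty_up)
    scale = np.mean(np.abs(insample[m:] - insample[:-m]))           # mean absolute seasonal difference
    return mis / scale
```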
The M4 Competition: The benchmarks
10 benchmarks were used to facilitate comparisons with the participating methods: 7 classic Statistical methods, 1 Combination and 2 simplified Machine Learning ones
1. Naïve 1 (S) – used to compare all methods (Prediction Intervals)
2. Seasonal Naïve (S)
3. Naïve 2 (S) – reference for estimating OWA
4. Simple Exponential Smoothing (S)
5. Holt's Exponential Smoothing (S)
6. Damped Exponential Smoothing (S)
7. Combination of 4, 5 and 6 (C) – used to compare all methods (Point Forecasts)*
8. Theta (S)
9. MLP (ML)
10. RNN (ML)
*Accurate, robust, simple & easy to understand (a sketch of the Comb benchmark follows)
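The sketch below illustrates the idea behind the Comb benchmark: the arithmetic average of Simple, Holt's and Damped exponential smoothing. Hand-rolled recursions with fixed smoothing parameters are used for brevity; the official benchmark implementations (available in the M4 GitHub repo) optimize the parameters and handle seasonality, so treat this only as a sketch of the concept.

```python
# Illustrative sketch of the "Comb" benchmark: average of SES, Holt and Damped ES.
import numpy as np

def ses(y, h, alpha=0.3):
    level = y[0]
    for obs in y[1:]:
        level = alpha * obs + (1 - alpha) * level
    return np.full(h, level)

def holt(y, h, alpha=0.3, beta=0.1, phi=1.0):
    level, trend = y[0], y[1] - y[0]
    for obs in y[1:]:
        prev_level = level
        level = alpha * obs + (1 - alpha) * (prev_level + phi * trend)
        trend = beta * (level - prev_level) + (1 - beta) * phi * trend
    steps = np.arange(1, h + 1)
    damping = np.cumsum(phi ** steps) if phi < 1 else steps   # damped vs linear trend
    return level + damping * trend

def damped(y, h, alpha=0.3, beta=0.1, phi=0.9):
    return holt(y, h, alpha, beta, phi)

def comb_benchmark(y, h):
    y = np.asarray(y, dtype=float)
    return (ses(y, h) + holt(y, h) + damped(y, h)) / 3.0
```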
The M4 Competition: The prizes
Six prizes, standing in total at 27,000 €

| Prize | Description | Amount |
| 1st Prize | Best performing method according to OWA | 9,000 € |
| 2nd Prize | Second-best performing method according to OWA | 4,000 € |
| 3rd Prize | Third-best performing method according to OWA | 2,000 € |
| Prediction Intervals Prize | Best performing method according to MSIS | 5,000 € |
| The UBER Student Prize | Best performing method according to OWA | 5,000 € |
| The Amazon Prize | Best reproducible forecasting method according to OWA | 2,000 € |
The M4 Competition: The participants (1/2)
[Figure: bar chart of the number of participants]
✓ 50 submissions (20 with PIs)
✓ 17 countries
The M4 Competition: The participants (2/2)
[Figure: number of participants per method type (Combination, Statistical, Machine Learning, Other) and per affiliation type (University, Company-Organization, Individual)]
✓ The majority utilized statistical methods or combinations (of both Statistical and ML models), and only a few used pure ML methods*
✓ More than half of the participants were affiliated with academia; the rest were either companies or individuals
*These are rough classifications – more work is needed to verify them
Evaluation of submissions – Point Forecasts: Rankings (1/5)

| Rank | Team | Affiliation | Method | sMAPE | MASE | OWA | Diff from Comb (%) |
| 1 | Smyl | Uber Technologies | Hybrid | 11.37 | 1.54 | 0.821 | -8.52 |
| 2 | Montero-Manso et al. | University of A Coruña & Monash University | Comb (S & ML) | 11.72 | 1.55 | 0.838 | -6.65 |
| 3 | Pawlikowski et al. | ProLogistica Soft | Comb (S) | 11.84 | 1.55 | 0.841 | -6.25 |
| 4 | Jaganathan & Prakash | Individual | Comb (S & ML) | 11.70 | 1.57 | 0.842 | -6.17 |
| 5 | Fiorucci, J. A. & Louzada | University of Brasilia & University of São Paulo | Comb (S) | 11.84 | 1.55 | 0.843 | -6.10 |
| 6 | Petropoulos & Svetunkov | University of Bath & Lancaster University | Comb (S) | 11.89 | 1.57 | 0.848 | -5.55 |
| 7 | Shaub | Harvard Extension School | Comb (S) | 12.02 | 1.60 | 0.860 | -4.13 |
| 8 | Legaki & Koutsouri | National Technical University of Athens | Statistical | 11.99 | 1.60 | 0.861 | -4.11 |
| 9 | Doornik et al. | University of Oxford | Comb (S) | 11.92 | 1.63 | 0.865 | -3.62 |
| 10 | Pedregal et al. | University of Castilla-La Mancha | Comb (S) | 12.11 | 1.61 | 0.869 | -3.19 |
| 11 | 4Theta (Benchmark) | - | Statistical | 12.15 | 1.63 | 0.874 | -2.65 |
| 12 | Roubinchtein | Washington State Employment Security Department | Comb (S) | 12.18 | 1.63 | 0.876 | -2.38 |
| 13 | Ibrahim | Georgia Institute of Technology | Statistical | 12.20 | 1.64 | 0.880 | -1.97 |
| 14 | Tartu M4 seminar | University of Tartu | Comb (S & ML) | 12.50 | 1.63 | 0.888 | -1.09 |
| 15 | Waheeb | Universiti Tun Hussein Onn Malaysia | Comb (S) | 12.15 | 1.71 | 0.894 | -0.40 |
Evaluation of submissions – Point Forecasts: Rankings (2/5)

| Rank | Team | Affiliation | Method | sMAPE | MASE | OWA | Diff from Comb (%) |
| 16 | Darin & Stellwagen | Business Forecast Systems (Forecast Pro) | Statistical | 12.28 | 1.69 | 0.895 | 0.25 |
| 17 | Dantas & Cyrino Oliveira | Pontifical Catholic University of Rio de Janeiro | Comb (S) | 12.55 | 1.66 | 0.896 | 0.19 |
| 18 | Theta (Benchmark) | - | Statistical | 12.31 | 1.70 | 0.897 | 0.03 |
| 19 | Comb (Benchmark) | - | Comb (S) | 12.55 | 1.66 | 0.898 | 0.00 |
| 20 | Nikzad, A. | Scarsin (i2e) | Comb (S) | 12.37 | 1.72 | 0.907 | -1.01 |
| 21 | Damped (Benchmark) | - | Statistical | 12.66 | 1.68 | 0.907 | -1.02 |
| 22 | Segura-Heras et al. | Universidad Miguel Hernández & Universitat de Valencia | Comb (S) | 12.51 | 1.72 | 0.910 | -1.38 |
| 23 | Trotta | Individual | Machine Learning | 12.89 | 1.68 | 0.915 | -1.94 |
| 24 | Chen & Francis | Fordham University | Comb (S) | 12.55 | 1.73 | 0.915 | -1.96 |
| 25 | Svetunkov et al. | Lancaster University & University of Newcastle | Comb (S) | 12.46 | 1.74 | 0.916 | -2.01 |
| 26 | Talagala et al. | Monash University | Statistical | 12.90 | 1.69 | 0.917 | -2.12 |
| 27 | Sui & Rengifo | Fordham University | Comb (S) | 12.85 | 1.74 | 0.930 | -3.56 |
| 28 | Kharaghani | Individual | Comb (S) | 13.06 | 1.72 | 0.930 | -3.63 |
| 29 | Smart Forecast | Smart Cube | Comb (S) | 13.21 | 1.79 | 0.955 | -6.34 |
| 30 | Wainwright et al. | Oracle Corporation (Crystal Ball) | Statistical | 13.34 | 1.80 | 0.962 | -7.15 |
Evaluation of submissions – Point Forecasts Rankings (3/5)
Top 6 performing methods
Smyl, S.
• Hybrid model mixing Exponential Smoothing with an LSTM – estimated concurrently
• Hierarchical modeling – parameters estimated using information both from the whole dataset and from the individual series | combinations are also considered
Montero-Manso, P., Talagala, T., Hyndman, R. J. & Athanasopoulos, G.
• Weighted average of ARIMA, ETS, TBATS, Theta, naïve, seasonal naïve, NN and LSTM
• Weights estimated through a gradient boosting tree (xgboost) using holdout tests
Pawlikowski, M., Chorowska, A. & Yanchuk, O.
• Weighted average of several statistical methods using holdout tests
• Pool defined based on time series characteristics / manual selection
Jaganathan, S. & Prakash, P.
• Combination of statistical methods as described in Armstrong, J. S. (2001)
Fiorucci, J. A. & Louzada, F.
• Weighted average of ARIMA, ETS & Theta
• Weights estimated using cross-validation
Petropoulos, F. & Svetunkov, I.
• Median of ETS, CES, ARIMA & Theta
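Most of these submissions are variations of forecast combination: average, weight or take the median of several base forecasts. The sketch below illustrates only that general idea with simple placeholder forecasters; it is not the code of any submission (which combined ETS, CES, ARIMA, Theta, LSTMs and more, mostly in R).

```python
# Minimal sketch of a median (or weighted-average) forecast combination.
# The base forecasters here are simple placeholders, not the models used by the teams.
import numpy as np

def naive_forecast(y, h):
    return np.full(h, y[-1])

def drift_forecast(y, h):
    slope = (y[-1] - y[0]) / (len(y) - 1)
    return y[-1] + slope * np.arange(1, h + 1)

def mean_forecast(y, h):
    return np.full(h, np.mean(y))

def combine(y, h, weights=None):
    """Stack the individual forecasts and combine them point by point."""
    y = np.asarray(y, float)
    forecasts = np.vstack([f(y, h) for f in (naive_forecast, drift_forecast, mean_forecast)])
    if weights is None:
        return np.median(forecasts, axis=0)          # robust, equal-weight combination
    w = np.asarray(weights, float)
    return (w[:, None] * forecasts).sum(axis=0) / w.sum()
```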
Evaluation of submissions – Point Forecasts: Rankings (4/5)
Spearman's correlation coefficient of the rankings

| Correlation | sMAPE | MASE | OWA |
| sMAPE | - | - | - |
| MASE | 0.88 | - | - |
| OWA | 0.94 | 0.98 | - |
The final ranks, according to both MASE and sMAPE, are highly correlated with those based on OWA, meaning that either measure can be used as a proxy for the relative performance of the individual methods
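As an illustration of how such rank correlations can be computed, the short sketch below uses scipy's spearmanr; the score vectors simply reuse the first five rows of the ranking table above.

```python
# Spearman rank correlation between two accuracy measures over the same methods.
from scipy.stats import spearmanr

owa_scores   = [0.821, 0.838, 0.841, 0.842, 0.843]   # first five OWA values from the table above
smape_scores = [11.37, 11.72, 11.84, 11.70, 11.84]   # corresponding sMAPE values

rho, p_value = spearmanr(owa_scores, smape_scores)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
```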
Evaluation of submissions – Point Forecasts: Rankings (5/5)
Multiple Comparisons with the Best (MCB)
[Figure: MCB plot of the average OWA ranks; participants shown: Montero-Manso (#2), Fiorucci (#5), Pawlikowski (#3), Jaganathan (#4), Smyl (#1) and Petropoulos (#6), together with the benchmarks RNN, Comb, Damped, Theta, Holt, MLP, SES, Naive2, sNaive and Naive]
✓ The forecasts of the first six methods did not differ statistically
✓ Apart from these methods, the improvements of the rest over the benchmarks were minor
Evaluation of submissions – Point Forecasts: What about complexity? (future work)
Does sub-optimality matter? (Nikolopoulos & Petropoulos, 2017)
Forecasting performance (sMAPE) versus computational complexity (Makridakis et al., 2018)
Comparing different types of methods: Median performance per Frequency & Domain
✓ In general, Combinations produced more accurate forecasts than the rest of the methods, regardless of the frequency and the domain of the data
✓ Out of the 17 methods that did better than the benchmarks, 12 were Combinations, 4 were Statistical and 1 was Hybrid
✓ Only 1 pure ML method performed better than Naive2

| Type of Method | Yearly | Quarterly | Monthly | Weekly | Daily | Hourly | Total |
| Statistical | 0.93 | 0.93 | 0.95 | 0.97 | 1.00 | 1.00 | 0.97 |
| Machine Learning | 1.27 | 1.16 | 1.20 | 1.00 | 1.93 | 0.92 | 1.48 |
| Combination | 0.87 | 0.90 | 0.92 | 0.90 | 1.02 | 0.65 | 0.91 |
| Other | 0.99 | 1.92 | 1.77 | 8.88 | 9.16 | 2.79 | 1.80 |

| Type of Method | Macro | Micro | Demographic | Industry | Finance | Other | Total |
| Statistical | 0.95 | 0.98 | 0.95 | 0.99 | 0.97 | 0.97 | 0.98 |
| Machine Learning | 1.20 | 1.16 | 1.44 | 1.43 | 1.41 | 1.56 | 1.48 |
| Combination | 0.90 | 0.89 | 0.90 | 0.93 | 0.92 | 0.91 | 0.91 |
| Other | 1.64 | 1.81 | 1.93 | 1.55 | 2.04 | 1.76 | 1.80 |
Comparing different types of methods: Top 3 per Frequency & Domain
[Table: the three best-performing submissions per frequency (Yearly, Quarterly, Monthly, Weekly, Daily, Hourly) and per domain (Macro, Micro, Demographic, Industry, Finance, Other); the entries include Smyl, S. (#1), Montero-Manso, P. (#2), Pawlikowski, M. (#3), Jaganathan, S. (#4), Fiorucci, J. A. (#5), Petropoulos, F. (#6), Legaki, N. Z. (#8), Doornik, J. (#9), Tartu M4 seminar (#14) and Darin, S. (#16), colour-coded as Statistical or Combination]
➢ Although the best performing methods for the whole dataset were also very accurate for the individual subsets, in many cases they were outperformed by other methods with a much lower rank – no single method fits them all
Impact of forecasting horizon
[Figure: average sMAPE per forecasting horizon across 60 methods (benchmarks & submissions), by frequency]

| Frequency | Deterioration per period (%) |
| Yearly | 20 |
| Quarterly | 13 |
| Monthly | 6 |
| Weekly | 7 |
| Daily | 14 |
| Hourly | 1 |

✓ The length of the forecasting horizon has a great impact on forecasting accuracy
✓ Only for hourly data did ML methods become competitive
Impact of time series characteristics*
Average impact on forecasting accuracy (regression coefficient) per time series characteristic – k methods per type × 100,000 observations

| Type of Method | Randomness | Trend | Seasonality | Linearity | Stability | Length |
| Machine Learning | 0.20 | -0.10 | -0.04 | 0.14 | -0.05 | -0.08 |
| Statistical | 0.18 | -0.08 | -0.02 | 0.09 | -0.04 | 0.15 |
| Combination | 0.17 | -0.09 | -0.02 | 0.10 | -0.03 | -0.02 |
| Total | 0.18 | -0.08 | -0.02 | 0.10 | -0.04 | 0.06 |

*$\mathrm{sMAPE} = a \cdot \mathrm{Randomness} + b \cdot \mathrm{Trend} + \dots + f \cdot \mathrm{Length}$

Machine Learning: more data, better forecasts; not robust for noisy and linear series; good for seasonal series
Combinations: robust for noisy data; bad at capturing seasonality
Statistical: bad for trended & seasonal series; good at modeling linear patterns; the less the data the better (use only the most recent observations)
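The coefficients above come from regressing accuracy on time-series characteristics. The sketch below shows the kind of ordinary-least-squares fit involved, using random placeholder data in place of the real features; the slide does not specify details such as feature scaling, so this is illustrative only.

```python
# Illustrative OLS regression of per-series sMAPE on time-series characteristics.
# Random placeholder data stand in for the real features and errors.
import numpy as np

rng = np.random.default_rng(0)
n_series = 1000
features = rng.normal(size=(n_series, 6))            # Randomness, Trend, Seasonality, Linearity, Stability, Length
smape_values = rng.normal(loc=12, scale=3, size=n_series)

X = features                                          # the slide's formula has no intercept term
coefs, *_ = np.linalg.lstsq(X, smape_values, rcond=None)
names = ["Randomness", "Trend", "Seasonality", "Linearity", "Stability", "Length"]
for name, c in zip(names, coefs):
    print(f"{name:12s} {c:+.3f}")
```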
Evaluation of submissions – Prediction Intervals: Rankings

| Rank | Team | Affiliation | Method | MSIS | Coverage | Diff from Naive (%) |
| 1 | Smyl | Uber Technologies | Hybrid | 12.23 | 94.78% | 49.2% |
| 2 | Montero-Manso et al. | University of A Coruña & Monash University | Comb (S & ML) | 14.33 | 95.96% | 40.4% |
| 3 | Doornik et al. | University of Oxford | Comb (S) | 15.18 | 90.70% | 36.9% |
| 4 | ETS (benchmark) | - | Statistical | 15.68 | 91.27% | 34.8% |
| 5 | Fiorucci & Louzada | University of Brasilia & University of São Paulo | Comb (S) | 15.69 | 88.52% | 34.8% |
| 6 | Petropoulos & Svetunkov | University of Bath & Lancaster University | Comb (S) | 15.98 | 87.81% | 33.6% |
| 7 | Roubinchtein | Washington State Employment Security Department | Comb (S) | 16.50 | 88.93% | 31.4% |
| 8 | Talagala et al. | Monash University | Statistical | 18.43 | 86.48% | 23.4% |
| 9 | ARIMA (benchmark) | - | Statistical | 18.68 | 85.80% | 22.3% |
| 10 | Ibrahim | Georgia Institute of Technology | Statistical | 20.20 | 85.62% | 16.0% |
| 11 | Iqbal et al. | Wells Fargo Securities | Statistical | 22.00 | 86.41% | 8.5% |
| 12 | Reilly | Automatic Forecasting Systems, Inc. (AutoBox) | Statistical | 22.37 | 82.87% | 7.0% |
| 13 | Wainwright et al. | Oracle Corporation (Crystal Ball) | Statistical | 22.67 | 82.99% | 5.7% |
| 14 | Segura-Heras et al. | Universidad Miguel Hernández & Universitat de Valencia | Comb (S) | 22.72 | 90.10% | 5.6% |
| 15 | Naïve (benchmark) | - | Statistical | 24.05 | 86.40% | 0.0% |
Evaluation of submissions – Prediction Intervals: Median performance per Frequency
✓ Apart from the first two methods, the rest underestimated reality considerably
✓ On average, the coverage of the methods was only 86.4% (the target is 95%)
✓ Estimating uncertainty was more difficult for low frequency data, especially for the yearly series – limited sample & longer forecasting horizon
Evaluation of submissions – Prediction Intervals: Median performance per Domain
✓ Demographic and Industry data were easier to predict – slower changes and fewer fluctuations
✓ Micro & Finance data are characterized by the highest levels of uncertainty – a challenge for business forecasting
Impact of forecasting horizon
[Figure: average coverage per forecasting horizon across 23 methods (benchmarks & submissions)]
✓ The length of the forecasting horizon has a great impact on estimating the PIs correctly, especially for yearly, quarterly & monthly data
Conclusions: Five major findings
✓ Hybrid methods, utilizing basic principles of statistical models together with ML components, have great potential
✓ Combining forecasts of different methods significantly improves forecasting accuracy
✓ Pure ML methods are inadequate for time series forecasting
✓ Prediction intervals underestimate reality considerably
✓ The accuracy of individual statistical or ML methods is low; hybrid approaches and combinations of methods are the way forward to improve forecasting accuracy and make forecasting more valuable
Conclusions: …and some minor, yet important ones
✓ Complex methods did better than simple ones, but the improvements were not exceptional. Given the computational resources required, one can question whether they are also practical.
✓ The forecasting horizon has a negative effect on forecasting accuracy – both for point forecasts and PIs
✓ When using large samples, the variations reported between different error measures were insignificant
✓ Different methods should be used per series according to their characteristics, as well as their frequency and domain. Yet, learning from the masses seems mandatory.
✓ The majority of the forecasters exploited traditional forecasting approaches and mostly experimented with how to combine them
Next Steps
➢ Understand why hybrid methods work better in order to advance them further and improve their forecasting performance
➢ Figure out how combinations should be performed and where the emphasis should be given – pool or weights?
➢ Study the elements of the top performing methods in terms of PIs and learn how to exploit and advance their features to better capture uncertainty
➢ Accept the drawbacks of ML methods and reveal ways to utilize their advantages in time series forecasting
➢ Experiment and discover new, more accurate forecasting approaches
Thank you for your attention Questions? If you would like to learn more about M4 visit
https://www.m4.unic.ac.cy/
or contact me at
[email protected]
References
• Armstrong, J. S., Green, K. C. & Graefe, A. (2015). Golden rule of forecasting: Be conservative. Journal of Business Research, 68(8), 1717-1731
• Armstrong, J. S. (2001). Combining forecasts. Retrieved from https://repository.upenn.edu/marketing_papers/34
• Athanasopoulos, G., Hyndman, R.J., Song, H. & Wu, D.C. (2011). The tourism forecasting competition. International Journal of Forecasting, 27(3), 822-844
• Athanasopoulos, G. & Hyndman, R.J. (2011). The value of feedback in forecasting competitions. International Journal of Forecasting, 27(3), 845-849
• Crone, S. F., Hibon, M. & Nikolopoulos, K. (2011). Advances in forecasting with neural networks? Empirical evidence from the NN3 competition on time series prediction. International Journal of Forecasting, 27(3), 635-660
• Fildes, R. & Petropoulos, F. (2015). Simple versus complex selection rules for forecasting many time series. Journal of Business Research, 68(8), 1692-1701
• Gneiting, T. & Raftery, A. E. (2007). Strictly Proper Scoring Rules, Prediction, and Estimation. Journal of the American Statistical Association, 102(477), 359-378
• Hyndman, R. J. & Koehler, A. B. (2006). Another look at measures of forecast accuracy. International Journal of Forecasting, 22(4), 679-688
• Kang, Y., Hyndman, R.J. & Smith-Miles, K. (2017). Visualising forecasting algorithm performance using time series instance spaces. International Journal of Forecasting, 33(2), 345-358
• Makridakis, S., Hibon, M. & Moser, C. (1979). Accuracy of Forecasting: An Empirical Investigation. Journal of the Royal Statistical Society, Series A (General), 142(2), 97-145
• Makridakis, S., Andersen, A., Carbone, R., Fildes, R., Hibon, M. et al. (1982). The accuracy of extrapolation (time series) methods: Results of a forecasting competition. Journal of Forecasting, 1, 111-153
• Makridakis, S., Chatfield, C., Hibon, M., Lawrence, M., Mills, T. et al. (1993). The M2-competition: A real-time judgmentally based forecasting study. International Journal of Forecasting, 9(1), 5-22
• Makridakis, S. & Hibon, M. (2000). The M3-Competition: Results, conclusions and implications. International Journal of Forecasting, 16(4), 451-476
• Makridakis, S., Spiliotis, E. & Assimakopoulos, V. (2018). Statistical and Machine Learning forecasting methods: Concerns and ways forward. PLOS ONE, 13(3), 1-26
• Montero-Manso, P., Netto, C. & Talagala, T. (2018). M4comp2018: Data from the M4-Competition. R package version 0.1.0
• Newbold, P. & Granger, C. (1974). Experience with Forecasting Univariate Time Series and the Combination of Forecasts. Journal of the Royal Statistical Society, Series A (General), 137(2), 131-165
• Nikolopoulos, K. & Petropoulos, F. (2017). Forecasting for big data: Does suboptimality matter? Computers & Operations Research (in press)
• Petropoulos, F., Makridakis, S., Assimakopoulos, V. & Nikolopoulos, K. (2014). 'Horses for Courses' in demand forecasting. European Journal of Operational Research, 237(1), 152-163
• Spiliotis, E., Patikos, A., Assimakopoulos, V. & Kouloumos, A. (2017). Data as a service: Providing new datasets to the forecasting community for time series analysis. 37th International Symposium on Forecasting, Cairns, Australia