Pooling Information Across SKUs for Demand Forecasting with Data Mining

Özden Gür Ali (1), Serpil Sayın (1), Tom van Woensel (2) and Jan Fransoo (2)
[email protected], [email protected], [email protected], [email protected]

October 30, 2007.

(1) Koc University, College of Administrative Sciences and Economics, Rumeli Feneri Yolu, Sariyer, 34450 Istanbul, Turkey

(2) Technische Universiteit Eindhoven, Department of Technology Management, P.O. Box 513, NL-5600 MB Eindhoven, Netherlands

Abstract

Forecasting demand periodically is one of the critical tasks of retail operations. We evaluate methods that differ in pooling scope, prediction technique and input variables on actual sales and promotion data from a medium-sized full service grocery retailer in Europe. The techniques range from complex machine learning methods, namely support vector regression with different kernels and regression trees, to the simpler traditional statistical technique of stepwise regression. With respect to the complexity of the input data, the methods span a spectrum ranging from the raw sales and promotion data to a hundred derived explicit features reflecting the recent category dynamics. We observe that simple time series techniques perform very well for periods without promotions. However, for periods with promotions, a substantial improvement of 65% can be reached using more sophisticated methods. We identify non-dominated methods of increasing complexity and data preparation cost that yield increasing improvements in forecasting accuracy. It turns out that using more sophisticated input variables does not help unless we use more complex prediction techniques that can apply the appropriate features to appropriate subsets of the data; pooling information, however, does increase accuracy.

Keywords. Demand forecasting, time series, machine learning, data mining, pooling


Introduction

Retailers face increasing assortments, as customers demand increasing variety. With the exception of discount retailers, assortments have been increasing and sales per item have been decreasing. Moreover, assortments change rapidly: in grocery retail, which is the focus of our study, product life cycles have been decreasing. See, e.g., Bayus and Putsis (1999) for an analysis of some of these trends. As a consequence, it is increasingly difficult to forecast sales for an individual item, as time series for those items tend to be short. Moreover, retail sales are subject to extensive promotion activities. Products are typically on promotion for a limited period of time, e.g. one week, during which demand is usually substantially higher than during periods without promotions (see, e.g., Cooper et al., 1999). In the marketing literature, extensive studies have been conducted on the effect of promotions on sales, usually within the context of consumer choice models (see, e.g., Fader and Hardie, 1996). Some of these studies have been successful in predicting which promotional efforts lead to increasing sales and have resulted in systems used for promotion planning, such as PromoCast (Cooper et al., 1999). It is, however, not straightforward to apply those models for the reverse purpose: given the promotional measures, forecasting the actual sales during the promotional period remains a challenging task.

While information on individual stock-keeping units (SKUs) has become more difficult to obtain because time series are shorter, overall data collection on sales has become easier and cheaper, since point-of-sales data are collected in most full range grocery retailers in the developed world. Point-of-sales data are typically collected by scanning product bar codes at the cash register and are stored (usually aggregated per day or even per week) in the retailer's information systems. These data have been the subject of many analyses, primarily using statistical techniques based on either time series analysis (moving average, exponential smoothing, etc.) or regression. Over the past ten years, however, considerable progress has been made using data analysis techniques that are commonly denoted as machine learning. These techniques generally do not impose a specific structure between the independent and dependent variables.

In this paper, we apply a number of machine learning techniques as well as traditional time series and statistical approaches to the problem of forecasting demand at a grocery retailer, both during "regular" periods and during promotional periods within a particular product category. Data are used from a medium-sized full service grocery retailer in Europe. Our results are very interesting, in the sense that for periods without promotions, it is next to impossible to beat simple time series techniques. At best, the machine learning techniques
match the performance of the simple models. However, for periods with promotions, a substantial and significant improvement of up to 65% can be reached using more sophisticated modeling techniques. The implications of our study follow directly from these results: machine learning techniques should be targeted at forecasting relatively complex time series, which in grocery retailing are the time series involving promotions. Traditional time series techniques suffice for periods without promotions. Note that most products in grocery retailing are not on promotion most of the time; however, many of the stockouts occur during promotions. Obviously, the use of more advanced techniques comes at a cost, namely the use of more extensive data (resulting in data preparation cost) and the maintenance of more complicated models. However, since the improvement is very substantial, the benefits are likely to outweigh the costs involved.

Our paper is organized as follows. In the following section, we review the relevant literature. Next, we describe our research environment, introduce the problem setting and briefly characterize the available data. We then describe the methods that we evaluate and compare, followed by the experimental setup. We present and discuss the results of our experiments in the penultimate section and conclude in the last section.

Literature review

The literature on forecasting is vast and cannot be covered completely in the limited space available. In line with the focus of our paper, we limit our literature review to traditional sales forecasting techniques, forecasting techniques considering promotion sales, the aggregation level, and the data mining techniques used.

Many different techniques are available for traditional sales forecasting. They can roughly be grouped into three broad categories: judgmental forecasting, extrapolation methods, and econometric models. Silver et al. (1998) argue that when limited or no data is available, a qualitative forecasting approach purely based on judgment and experience makes sense. Parackal et al. (2007) note that all forecasting involves judgment, from the method selection to the adaptation of the forecasts; moreover, judgment of the model output is always necessary, regardless of the forecasting technique used. Econometric or causal forecasting involves building explicit quantitative models identifying the different relationships with the forecasted variable; the relationships between the different variables and the dependent variable are then quantified using econometric regression techniques (Cordo and Pindyck, 1979). Extrapolation methods mainly use only the time series data of the activity itself to generate the forecast. This approach is typically less expensive to apply, requires far less data and is useful for short to medium-term forecasting (Newbold, 1979). Since its introduction in the 1970s, the Box-Jenkins approach has
become a popular method for time series forecasting. A potential disadvantage of the time series approach is that, in principle, no other independent factors are taken into account (Alon et al., 2001).

When looking at forecasting techniques in the presence of promotions, one enters the marketing arena. Several factors mentioned in this literature stream are found to be related to the magnitude of promotional sales. Without claiming completeness, the following factors are found to be significant in the literature: the size of the price decrease for the promoted item (e.g. Blattberg et al., 1995; Christen et al., 1997; Lattin and Bucklin, 1989; Mulhern and Leone, 1991; Cooper et al., 1999); the frequency of promotions for similar products (Christen et al., 1997); the advertising mode used, e.g. newspapers, TV, radio, etc. (see e.g. Sethuraman and Tellis, 2002; Walters et al., 2003); category or product group characteristics (see e.g. Baltas, 2005; Narasimhan et al., 1996; Hughes, 1980; Blattberg et al., 1995); the weather and bank holidays (Bunn and Vassilopoulos, 1999; Hughes, 1980); promotions of competitors (Struse, 1987); radio support of promotions (Sethuraman and Tellis, 2002); previous promotions of the same product; and the number of variants of the product in promotion. The last two aspects are closely related to the cannibalization of demand for the promoted product by the same organization (see e.g. Neslin, 1990). Clearly, in order to obtain accurate forecasts in the presence of promotions, it is important to systematically collect data about the characteristics of promotions. Finally, Lee et al. (2007) show that statistical forecasts are often adjusted based on management judgment in order to take promotions into account. Their experiments suggest that forecast accuracy can be improved if memory support is available for noting down the characteristics of previous promotions.

Cooper et al. (1999) describe and evaluate a promotion-event forecasting system named PromoCast. The system uses data on the performance of each SKU in each store under a variety of promotion conditions, including both the store's adeptness to various promotions and chain-wide historical performance data for the SKU. It is shown that combining individual store-SKU information and chain-SKU information improves forecasting performance (Cooper et al., 1999). A commercial package, ACNielsen's SCAN*PRO tool, is also frequently found in retailing environments (Van Heerde et al., 2002). SCAN*PRO motivates evolutionary model building. In general, simplicity is a desirable aspect of a model, so that managers understand it. As managers become more knowledgeable about a decision aid, they may be ready to implement more sophisticated variations; the model can then be expanded, leading to an increase in complexity. The starting point is basically an econometric model which quantifies the effects of promotions characterized by, e.g., a price cut, feature advertising, special displays in the store, etc. Van Heerde et al. (2002) end their paper by listing the main findings on how promotions work (e.g. the effect of the price cut size, the frequency of the promotions, etc.). Note that the primary objective of these marketing studies is not to forecast a specific quantity sold, but to determine the main effects that contribute to increased sales.

Paralleling research efforts that use statistical methods for forecasting, progress in machine learning / data mining research has been reflected in many business applications. The inductive tree
algorithm by Quinlan (1986) originally addressed the classification task, i.e. the determination of an output attribute that takes categorical values. Quinlan (1992) modified the decision tree approach for an output attribute of continuous type, leading to the model tree idea. The approach calls for partitioning the input space within a tree structure and fitting a (linear) regression model at every leaf, hence the name model trees, or regression trees. Meanwhile, the Classification and Regression Trees (CART) algorithm (Breiman, Friedman, Olshen and Stone, 1983) has been implemented in commercial software. These and alternative model tree approaches differ from each other mainly in the choice of the input variable to be branched on, the split criteria used, and the models constructed at every leaf of the tree.

In business forecasting, perhaps the most commonly used machine learning methodology during the 1990s was Artificial Neural Networks (see, e.g., Zhang, 2003). Modeled after the biological neural network, artificial neural networks are nonlinear adaptive tools that are trained by modifying the parameters of functional relationships defined on a mathematical network of nodes. Alon et al. (2001) compare artificial neural networks with some traditional forecasting methods, including Winters exponential smoothing, the Box-Jenkins ARIMA model, and multivariate regression, on aggregate US retail sales data. The authors conclude that the Winters exponential smoothing model is a viable method under relatively stable economic conditions; when facing dynamic nonlinear trend and seasonality patterns, the advanced neural network models perform much better (Alon et al., 2001). In a later study, Hansen and Nelson (2003) evaluate the performance of neural networks as opposed to ARIMA models on different types of macroeconomic time series data. They conclude that transformations and decomposing the original time series into different components yield better results when such transformed attributes are used as inputs in a neural network. Similarly, Zhang and Qi (2005) argue that neural networks may not exhibit impressive behavior in time series forecasting when the series exhibit seasonality and trend. They report that the performance of the neural network improves significantly when the time series data is deseasonalized and detrended.

As another promising machine learning tool, by the end of the 1990s the Support Vector Machine (SVM) methodology had been applied successfully in many different scientific disciplines (Shawe-Taylor and Cristianini, 2000). Based on the main idea of maximizing a margin of separation and minimizing total empirical error in a balanced way, and mainly due to the implicit mapping induced by kernel functions, SVMs have performed well where complex relationships between the input attributes and the output attribute exist. Applying SVMs to different business problems has become increasingly common. Cui and Curry (2005) discuss using SVMs for predictive purposes in the marketing area. Hansen, McDonald and Nelson (2006) compare SVMs to classical methods on different economic time series data and find the methodology promising.

To our knowledge, the work by Cooper and Giuffrida (2000) is the only study that addresses SKU demand predictions at the store level with information on promotions, and incorporates data
mining techniques. The authors use PromoCast (Cooper et al., 1999) to predict demand at the SKU level of a 95-store retail chain, and apply data mining on the residuals. The study reports that the rule-based data mining approach improves the forecasting errors of PromoCast by about 9%, using attributes that define the SKU, such as the manufacturer, the category/subcategory it belongs to, the promotion event and the store.

Problem environment and data description

At the grocery chain considered, the majority of non-perishables were ordered via an Automated Store Ordering (ASO) system, while most perishables were ordered manually, since they either required additional intelligence (such as a judgment on the quality of the inventory for vegetables) or had to be ordered via a separate ordering system (proprietary to a particular supplier). Regular sales are forecasted using the ASO system, which follows a basic simple exponential smoothing logic. Promotional sales volumes are predicted with judgmental forecasts spanning a period of six weeks for a single SKU in promotion (i.e. a promotion effectuated in week t has been decided and organized in week t - 6). No analytical tool or other detailed information is used for promotion forecasting.

The sales data was available electronically from the Point Of Sales (POS) system in place. As not all the necessary data on promotion characteristics was readily available in electronic format at the retailer, considerable time was devoted to compiling the promotions database. In the end, an extensive database covering 76 sales weeks was obtained, including promotions and their specific characteristics (price discounts, TV, radio and window sheet advertising information). This was augmented with other product characteristics such as the regular sales price and the product subcategory (each item belongs to a related "family" of products, which is denoted as the product subcategory).

Our final data set focuses on one product category comprising four product subcategories of non-perishable items (i.e. sauces, noodles, and ready-made meals) from four different stores. The stores are selected such that there are small, medium and large-sized stores in the dataset. The data spans a period of 76 weeks, which is split into training (51 weeks) and test periods (the remaining 25 weeks) as will be explained later; the test period contains 4170 store-SKU-week combinations. In total, we have 168 store-SKU combinations in the dataset (store 1 carries 41 SKUs, store 2 carries 38 SKUs, store 3 carries 44 SKUs and store 4 carries 45 SKUs). Table 1 gives some descriptive statistics on the variables in the test dataset. The table clearly shows the amount of variability in the dataset: regular sales range between 0 and 127 consumer units per week (with a mean of 7), while promotional sales range between 1 and 303 consumer units per week
(with a mean of 32). The number of promotions per store-SKU in the test set ranges from 1 to 6 over the 25 weeks included.

TABLE 1 ABOUT HERE

In addition, we observed in the dataset that 8 of the promotions were accompanied by TV advertising, 28 promotions had a window sheet in the stores, and 44 of the promotions were announced on national radio in the country where the supermarkets are located.

Development of candidate methods

Our objective is to develop a demand prediction procedure that learns from multiple SKU time series of sales, discounts and promotions. We consider three design elements to construct candidate methods: the model scope, the input variables (or features), and the modeling technique.

Scope

Traditional forecasting models work on the individual SKU time series that is specific to the store. To incorporate the ability to learn from multiple SKU-store combinations, we increase the model scope. In the most general scope, one model is used to predict the demand for all SKUs in the category across all stores. Between these two extremes there are two alternative scope options: store level models and subcategory level models. Store level models consider all SKUs in the category in one store across subcategories, whereas subcategory level models predict demand for all SKUs in one subcategory across stores. In contrast to the aggregation of SKU demand across stores or across sizes of the same brand flavor cited in the literature (e.g. Zotteri et al., 2005), the scope increase does not amount to aggregation. The sales for different SKUs from different stores in different weeks form the data points, where dummy variables for SKU, store, and subcategory, as well as the week number, are used to identify the particular SKU-store combination and the week. Notice that we are not "aggregating", but simply including observations in the same model. There are reasons to expect the accuracy of the forecasting procedure to be affected by the increased scope in both a positive and a negative manner: as the model scope increases, there is more opportunity to learn from similar situations, while as it becomes more specific, it should be able to focus on "local" issues better. From a cost and convenience perspective, having fewer models results in easier maintenance of the forecasting system, other things being equal.
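To make the pooled-scope construction concrete, the following sketch (Python with pandas, used for illustration throughout; all column names are hypothetical, not the retailer's actual schema) builds a single design matrix in which store, SKU and subcategory enter as dummy variables rather than as separate models:

```python
import pandas as pd

# Hypothetical weekly panel: one row per store-SKU-week observation.
panel = pd.DataFrame({
    "store":        ["s1", "s1", "s2", "s2"],
    "sku":          ["a",  "b",  "a",  "b"],
    "subcategory":  ["sauces", "noodles", "sauces", "noodles"],
    "week":         [1, 1, 1, 1],
    "discount_pct": [0.0, 20.0, 0.0, 25.0],
    "sales":        [7, 41, 5, 38],
})

# Pooling: all observations enter one model; identity is carried by dummy
# variables instead of fitting a separate model per SKU-store time series.
X = pd.get_dummies(panel[["store", "sku", "subcategory"]], drop_first=True)
X["week"] = panel["week"]
X["discount_pct"] = panel["discount_pct"]
y = panel["sales"]
```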
Input variables

We consider three sets of input variables, spanning a spectrum of data preparation cost.

Raw data

The simplest data set contains the promotion (TV, radio, window sheet), price and discount (dummy, absolute and percentage discount) variables for that week for the SKU-store combination, in addition to the store, SKU and subcategory dummies, the week number, and the actual sales amount for the current and last four weeks.
Raw data with smoothed non-promotion sales

Our initial experimentation with the benchmark method, consistent with the literature, has shown that the exponential smoothing estimate is a very good estimator of non-promotion week sales. Therefore, we considered the aforementioned dataset with the addition of the smoothed sales of the non-promoted weeks. We used a fixed smoothing constant of 0.2 without resorting to optimization.

Explicit features

At the high end of data preparation cost, we used contextual knowledge to produce explicit features that are likely to affect the sales of an item. We have used variables similar to those in Cooper et al. (1999), and added other variables to capture category, store, SKU, and promotion characteristics. These variables are calculated using only the same information as in the first data set, i.e. sales and promotion information over time. We do not leverage external information such as which SKUs share a brand, their size, or assessments of substitutability or complementarity. The variables are mainly historical statistics (average, sum, trend, standard deviation, etc.) and stocks (Nerlove and Arrow, 1962) calculated from the raw variables, for different SKU-store sets. The history included in a statistic varies from 4 to 12 weeks. Producing these 100 explicit features involves coding, computation and updates; thus we attach a high data preparation cost to methods that use explicit features.
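As an illustration of the kind of feature construction involved (a sketch under our own naming, not the authors' actual feature code), rolling historical statistics over 4- to 12-week windows could be derived per store-SKU as follows:

```python
import pandas as pd

def add_history_features(panel, windows=(4, 8, 12)):
    """Sketch: rolling statistics of past sales per store-SKU series.
    Each feature for week t is shifted by one week, so it only uses
    information from weeks strictly before t."""
    panel = panel.sort_values(["store", "sku", "week"]).copy()
    for w in windows:
        for stat in ("mean", "sum", "std"):
            panel[f"sales_{stat}_{w}w"] = (
                panel.groupby(["store", "sku"])["sales"]
                     .transform(lambda s: s.shift(1)
                                           .rolling(w, min_periods=1)
                                           .agg(stat))
            )
    return panel
```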
The modeling technique

We consider two groups of techniques. Stepwise regression is a well-known statistical method, and we use it to represent the traditional group of methods. As an alternative, we have the newer machine learning techniques that produce more complex models. Within the machine learning family, we use the support vector regression (SVR) method with three different kernel functions, and the regression tree technique. The next section gives a further description of the data mining techniques and the particular implementations we have used in our analysis.

The candidate methods

We formulated the following alternative methods, where each method refers to a technique and data set combination. Note that while stepwise regression is applied in combination with three data set alternatives, some technique and data set combinations are not reported based on our preliminary analysis. We run each method with three scope alternatives. We later evaluate the accuracy performance of these methods against a benchmark method and against each other. During our evaluation, we also consider other dimensions of importance for adopting a forecasting method, namely the complexity of the models and the cost of data preparation.

1. Regression: Stepwise linear regression using raw data
2. SVR Poly1: Support Vector Regression with polynomial kernel of degree one using raw data
3. SVR Poly2: Support Vector Regression with polynomial kernel of degree two using raw data
4. SVR RBF: Support Vector Regression with RBF kernel using raw data
5. Regression+SM: Stepwise linear regression using raw data plus smoothed non-promotion sales
6. SVR Poly1+SM: Support Vector Regression with polynomial kernel of degree one using raw data plus smoothed non-promotion sales
7. SVR Poly2+SM: Support Vector Regression with polynomial kernel of degree two using raw data plus smoothed non-promotion sales
8. SVR RBF+SM: Support Vector Regression with RBF kernel using raw data plus smoothed non-promotion sales
9. Feature stepwise: Stepwise linear regression using explicit features
10. RT Features: Regression tree using explicit features

The scope alternatives are:

i. One model for the category across stores
ii. A model for each store across category SKUs
iii. A model for each subcategory of SKUs across stores

In total we evaluated 10 x 3 = 30 alternative methods, and compared their accuracy with each other and with the benchmark method.

Description of Techniques and Implementation Environments

The benchmark method

In exponential smoothing with lift adjustment, the forecast is simply the smoothed value of the past non-promotion weeks if there is no promotion in the coming week; otherwise, the last observed lift amount is added to the smoothed value to obtain the forecast.

$$\hat{D}_{ist} = \begin{cases} M_{ist}, & \text{if } Disc_{ist} = 0,\ TV_{it} = 0,\ Radio_{it} = 0,\ Window_{it} = 0 \\ M_{ist} + \hat{L}_{ist}, & \text{otherwise} \end{cases}$$

where $\hat{D}_{ist}$ is the demand forecast for SKU i in store s for week t; $S_{ist}$ is the number of SKU i items sold in store s in week t; $TV_{it}$, $Radio_{it}$ and $Window_{it}$ are dummy variables indicating whether there is a TV ad, radio ad, or window sheet application for SKU i in week t, respectively; and $Disc_{ist}$ stands for the discount percentage applied to SKU i in store s in week t. Note that while the chain determines which SKUs to advertise on store windows and/or promote in flyers with deep discounts, the stores are free to decide whether and how much to discount. $M_{ist}$ is the smoothed number of SKU i items sold in store s in week t, based on non-promotion weeks. We used a smoothing constant α of 0.2, the traditional default value for exponential smoothing. For weeks with a promotion or discount we simply carry forward the most recent smoothed value:
$$M_{ist} = \begin{cases} (1-\alpha) M_{is(t-1)} + \alpha S_{ist}, & \text{if } Disc_{ist} = 0,\ TV_{it} = 0,\ Radio_{it} = 0,\ Window_{it} = 0 \\ M_{is(t-1)}, & \text{otherwise} \end{cases}$$

The last observed lift amount is calculated as the difference between the actual sales and the smoothed non-promotion sales expectation at the time of the last promotion or discount.
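A minimal sketch of this benchmark logic, as one illustrative implementation (plain Python; initializing the smoothed level with the first observed week is our own assumption):

```python
def benchmark_forecast(sales, promo, alpha=0.2):
    """Exponential smoothing with lift adjustment, for one store-SKU series.
    sales[t]: units sold in week t; promo[t]: True if week t has any
    discount, TV, radio or window sheet activity."""
    m = float(sales[0])   # smoothed non-promotion level M (initial guess)
    lift = 0.0            # last observed promotion lift L-hat
    forecasts = [m]       # trivial forecast for week 0
    for t in range(1, len(sales)):
        # One-week-ahead forecast, made before observing week t.
        forecasts.append(m + lift if promo[t] else m)
        if promo[t]:
            lift = sales[t] - m   # actual minus smoothed expectation
        else:
            m = (1 - alpha) * m + alpha * sales[t]
    return forecasts
```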
Stepwise linear regression

The search for a linear relationship between an output variable and multiple input variables has resulted in stepwise selection of input variables in a regression setting. The goal is to build a function that expresses the output variable as a linear function of the input variables plus a constant. Two general approaches in stepwise regression are forward and backward selection. In forward selection, variables are introduced one at a time based on their contribution to the model according to a predetermined criterion. In backward selection, all input variables are built into the model to begin with, and then input variables are removed from the regression equation if they are judged as not contributing to the model, again based on a predetermined criterion. We have implemented stepwise linear regression models using the PROC REG module of the SAS software (Copyright 2002-2003), version 9.1.3, with the stepwise model option.

Support Vector Regression

SVM and its regression version, Support Vector Regression (SVR), implicitly map instances into a higher dimensional feature space using kernel functions. In its most basic form, SVR ideally seeks to identify a linear function in this space that is within ε distance of the mapped output points. The so-called soft margin formulation allows and penalizes deviations beyond the predetermined ε distance, and minimizes the sum of violations along with the norm of the vector that identifies the linear relationship (Smola and Scholkopf, 2004). Although it is still an area of active research, there are many implementation issues that are addressed by experimentation in applications. One key question is the choice of the kernel function used in mapping the data. The Radial Basis Function (RBF) kernel has been used successfully in many different settings. Apart from the RBF kernel, the polynomial kernel family is usually available in implementations. In this paper, we experimented with the RBF kernel and the polynomial kernels of degree one and two. Another issue in SVR implementation is the choice of the parameters ε and C. The parameter C controls the trade-off between the generalization captured by the regression relationship and the sum of violations of individual observations beyond the ε band. We have conducted our runs using the SMOreg function, which relies on the sequential minimal optimization technique (Smola and Scholkopf, 2004), as implemented in Weka 3.5.2 (Witten and Frank, 2005), with the default values for all kernel parameters.
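The runs above use Weka's SMOreg; purely as an illustration of the model family, an analogous setup in scikit-learn (a different implementation with its own defaults, so results would not match the paper's) might look like:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 10))   # stand-in for the pooled design matrix
y_train = rng.normal(size=200)         # stand-in for weekly sales

# The three kernels evaluated in the paper; C and epsilon left at their
# defaults, mirroring the default-parameter choice made with SMOreg.
models = {
    "SVR Poly1": SVR(kernel="poly", degree=1),
    "SVR Poly2": SVR(kernel="poly", degree=2),
    "SVR RBF":   SVR(kernel="rbf"),
}
for name, model in models.items():
    make_pipeline(StandardScaler(), model).fit(X_train, y_train)
```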
One major shortcoming of the SVR methodology is its difficulty in providing explanations and insights beyond predictions. Although the regression equation may be obtained, the kernel mapping makes it difficult to interpret the findings directly.

Regression Tree

The regression tree approach partitions the data into smaller subsets in a decision tree format, and each leaf has a regression function that is used to predict the outcome. While trees are transparent, in the sense that the prediction for a particular case can be traced back to the conditions in the tree and the regression function that applies to cases satisfying those conditions, trees with many layers are not easy to interpret in a generalizable manner. Yet the technique is known to be powerful and has been implemented commercially. We have used the M5P tool, based on Wang and Witten (1997), in the Weka environment (Witten and Frank, 2005), which develops and prunes the tree.
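M5P builds model trees with linear regressions in the leaves; there is no drop-in equivalent in scikit-learn, but a CART-style regression tree (constant predictions in the leaves) sketches the partitioning idea and keeps the splits inspectable:

```python
from sklearn.tree import DecisionTreeRegressor, export_text

# Reusing the stand-in arrays X_train, y_train from the SVR sketch above;
# depth and leaf-size limits are illustrative, not the paper's settings.
tree = DecisionTreeRegressor(max_depth=4, min_samples_leaf=20)
tree.fit(X_train, y_train)
print(export_text(tree))  # the learned split conditions remain readable
```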

Evaluation of candidate methods

We evaluate the accuracy of the candidate methods on the supermarket sales data described earlier. Individual SKU models with the exponential smoothing with lift adjustment technique provide the benchmark accuracy, as described earlier. It is a common best practice, and its different flavors are implemented in commercial applications (Bucklin and Gupta, 1999).

Experiment

Each of the 30 candidate methods described earlier is fed with the first 51 weeks of data as a training set. The subsequent 25 weeks constitute the test data. The resulting training data set contains 7766 observations, while the test set has 4170 observations. While in non-time series applications the dataset is partitioned into training and test data randomly, in time series applications, to measure the true accuracy of a method, the test needs to simulate real life, where the model is trained on earlier data and tested on later occurrences (Gür Ali, 2008). This way, the negative impact of any concept drift, i.e. changes in the dynamics represented in the data that may have occurred between the time the model is trained and the time it is used, is reflected in the test accuracy figures.
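The chronological split, rather than a random one, is the essential point; a sketch (assuming the hypothetical `panel` frame with a `week` column from the earlier sketches):

```python
TRAIN_WEEKS = 51  # weeks 1-51 for training, weeks 52-76 for testing

train = panel[panel["week"] <= TRAIN_WEEKS]
test = panel[panel["week"] > TRAIN_WEEKS]
# No shuffling: training strictly precedes testing, so any concept drift
# between the two periods shows up in the test accuracy, as it would in use.
```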
Measures

We use the mean absolute error (MAE), calculated over all SKU-store-week combinations, as the measure of overall accuracy of a candidate method. While other measures such as the mean absolute percentage error (MAPE) are typically used in forecasting applications, with SKUs whose average number sold ranges from 1 to 91, such a relative measure would downplay the accuracy of the large volume SKUs. MAE, on the other hand, reflects the weight of individual SKUs in the mix and corresponds to the mistake in the ordering process in terms of the overall number of items. While currency-weighted results may be closer to the cost incurred, for the sake of focusing on method performance, and considering that the prices of the SKUs are not too different from each other, we retain the MAE as the accuracy measure.
To measure the significance of accuracy differences between methods, we use paired differences of absolute errors, and report the significance level of the test of no difference between method accuracies. Beyond accuracy, the candidate methods vary in terms of their data preparation cost and model complexity. The data preparation cost involves collecting, cleaning and storing the sales, price and promotion information for the methods using the raw data. Extending the input data by calculating and updating the smoothed sales increases the data preparation cost marginally. Using the explicit features requires substantially more effort and cost, due to the coding of the calculations, and the weekly update and storage of the features.
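A sketch of the two evaluation computations, the MAE and the paired-difference Z test (we show a two-sided p-value, an assumption, since the paper does not state the sidedness of the test):

```python
import numpy as np
from scipy.stats import norm

def mae(y_true, y_pred):
    return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))))

def paired_z_test(abs_err_a, abs_err_b):
    """Z test of 'no difference' on paired differences of absolute errors."""
    d = np.asarray(abs_err_a, dtype=float) - np.asarray(abs_err_b, dtype=float)
    z = d.mean() / (d.std(ddof=1) / np.sqrt(len(d)))
    return d.mean(), z, 2 * norm.sf(abs(z))  # mean difference, Z, p-value
```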

Results

First, we evaluate the methods in terms of ease and cost of implementation. Other things being equal, retailers should use methods with lower data preparation cost and complexity; for the retailer to assume the additional costs, a method has to perform significantly better in terms of accuracy. Next, we evaluate the accuracy of the models over all SKU-store combinations. Finally, we analyze the accuracy data further to identify which method performs better when. While we have performed analyses along a number of dimensions, we report only those where we have seen a meaningful difference.

Overall results

Figure 1 indicates the relative position of each method in the 'data preparation cost' versus 'technique complexity' plane. Clearly, the preparation cost increases as more input variables are required. It increases slightly going from the benchmark, which uses only the SKU-store specific sales and promotion/discount indicator, to the methods using raw data, i.e. the sales and promotion variables (sales, TV, radio, discount variables) for all SKUs; and it increases slightly again when the smoothed non-promotion sales information is added to the raw data. The large jump in data preparation cost comes when we move to explicit features. The complexity of the technique increases as we move from the benchmark method of exponential smoothing with last promotion adjustment, which can be implemented with a simple (spreadsheet) program, to stepwise regression, which requires a statistical package or at least a spreadsheet add-on and needs to be updated regularly. The data mining techniques embody further complexity, as they are not necessarily linear models and require a special purpose computer program to train and to provide predictions.

As a first step in the accuracy analysis, we report the average MAE of each method over the 25 weeks and 168 store-SKU combinations in the test dataset, focusing only on the scope version that pools all the observations from all stores and subcategories together. As can be seen in the boxes in Figure 1, the benchmark method has an MAE of 4.40. Interestingly, working with traditional statistical methods,
the accuracy performance does not improve compared to this benchmark by adding more features or variables.

FIGURE 1 ABOUT HERE

However, going right along the horizontal axis by increasing the technique complexity, gains are observed. More specifically, SVR with the Poly1 and RBF kernels performs better than or similar to the benchmark; SVR Poly2 cannot compete with the benchmark. Adding the smoothed value from the benchmark, i.e. the smoothed value calculated based on non-promotion weeks, further improves all SVR techniques except Poly2. The Regression Tree with Features significantly improves the performance compared to the benchmark: an average percentage improvement of 24% is observed. Table 2 shows the significance values of the paired sample Z test for the differences of the means (note that since the number of observations is large enough, we can use the Z probabilities rather than the t probabilities). In this table, for clarity of presentation, we considered only those methods that were not clearly outperformed by the benchmark based on Figure 1. As we can see from Table 2, the Regression Tree with Features has significantly better accuracy than all the other methodologies.

TABLE 2 ABOUT HERE

Based on the overall analysis presented in Figure 1 and Table 2, we can conclude that the more advanced machine learning / data mining techniques outperform the benchmark, but at the expense of increased data manipulation and technique complexity. Spending more time and effort on data preparation without employing complex techniques does not result in an accuracy improvement, while feeding data similar to that of the benchmark into complex machine learning techniques already gives a significant improvement over the benchmark. The Regression Tree with Features is the best on the accuracy criterion, but comes at the expense of using more complex techniques and a large amount of data preparation. The following methods emerge as being on the efficient frontier: the benchmark, SVR Poly1, SVR RBF, SVR Poly1+Sm, SVR RBF+Sm and RT with Features. These methods provide an improvement in accuracy in return for the complexity and/or data preparation costs that the user has to incur.

While we have performed the analyses for each method with the other scope options, namely by subgroup or by store, keeping all the observations in one model performs as well as or better than the other alternatives. The MAE figures in Table 3 indicate that the largest scope is either the best or within 1.8% of the best in all cases.

TABLE 3 ABOUT HERE

Effect of promotions on accuracy

Next, we evaluate how the relative method performance is affected by promotions. We examine this effect in two ways: first, we evaluate accuracy in promotion and non-promotion weeks; second, we evaluate the accuracy performance for store-SKU combinations with fewer or more promotions.


To evaluate the accuracy of promotion forecasts, we split the test dataset into two parts: weeks with promotions and weeks without promotions. We start from the complete test set of 4170 store-SKU-week combinations and separate the weeks based on the presence of a promotion. This results in two subsets of size 384 (weeks with a promotion) and 3786 (weeks without a promotion) store-SKU-week combinations. Table 4 shows the accuracy, measured by the MAE, for these two subsets for the methods on the efficient frontier.

TABLE 4 ABOUT HERE

Interestingly, none of the complex techniques based on machine learning and data mining is able to improve the forecasting accuracy in weeks without a promotion; only SVR Poly1 with smoothing matches the benchmark's performance in those weeks. This suggests that the benchmark performs well when sales are relatively stable and free of the interference of promotions. Focusing on the promotion weeks, on the other hand, we see that the benchmark performs poorly, with an MAE of 22.19. All methods considered in the table significantly improve the performance compared to the benchmark (improvements range from 21.13% to 65.17%). Moreover, the Regression Tree with Features completely outperforms all other methods: an improvement of 65.17% is observed in forecast accuracy. This is an important improvement over the current practice in this type of grocery retail chain. This conclusion is similar to that of Alon et al. (2001), who found that simple methods perform well for relatively stable data series, while advanced (artificial neural network) models do better for dynamic demand series with strong seasonality and trends. Alon et al. (2001), however, looked at aggregate US sales data, while we see similar results at the store-SKU level.

The above results suggest a combined forecasting approach: for the non-promoted weeks use the benchmark method, while for the promoted weeks the RT with Features can be used. Jointly this would yield an MAE of 3.07, which corresponds to an overall improvement of 30.2% in forecast accuracy. The combination would not deteriorate the forecast quality for the non-promotion weeks, but would improve the average forecast quality by 65.17% for the promoted items.
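A sketch of this suggested combination (hypothetical inputs: boolean promotion flags and the two methods' forecasts, aligned by store-SKU-week):

```python
import numpy as np

def combined_forecast(promo, benchmark_pred, rt_features_pred):
    """Benchmark forecast for non-promotion weeks, RT-with-Features
    forecast for promotion weeks."""
    promo = np.asarray(promo, dtype=bool)
    return np.where(promo, rt_features_pred, benchmark_pred)
```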
Another way of evaluating the impact of promotions on method performance is to consider the accuracy for store-SKU combinations with fewer and more promotions. Here, we look into the number of promotions per store-SKU and its effect on forecasting accuracy. The number of promotions ranges from 1 to 6 over the 25 weeks in our test set. We use three equally sized classes based on the number of promotions, resulting in [1,2], [2,3] and [4,6], each containing 1/3 of the total number of observations. Figure 2 shows the MAE performance of the non-dominated methods by promotion class.

FIGURE 2 ABOUT HERE
Figure 2 shows that, regardless of the method, more promotions per SKU lead to higher forecasting errors. Intuitively, this is reasonable, as forecasting is harder when sales change frequently due to discounts and other promotion activity. On the other hand, the accuracy of the advanced data mining techniques is less affected by the volume of promotions. Overall, the Regression Tree with Features again outperforms all other methods for all promotion classes, while SVR RBF+Sm provides robust performance as a less costly method.

Other observations

As expected, across all methods, we see that the MAE increases with the size of the store-SKU, i.e. the average sales; however, the MAE as a percentage of average sales goes down. Similarly, across methods, as the sales variability increases, the forecasting accuracy decreases. The more dynamic the SKU, the larger the difference in performance in favor of the data mining methods.

Conclusions

In grocery retailing, forecasting sales is a very important problem. Assortments have increased, making the forecasting problem more challenging as it involves more and more products. At the same time, forecasting methods deployed in industry are mainly based on exponential smoothing, possibly with a lift adjustment for promotion periods. Advances in data mining / machine learning techniques over the past decades have hardly been applied in grocery retailing, while opportunities appear to abound due to the wide availability of point of sales data.

In this study, we evaluated the benefits of using machine learning techniques in forecasting grocery retail demand. Our main finding is that using machine learning techniques considerably increases forecast accuracy: we observed an improvement of 24% using the Regression Tree with explicit features reflecting the current SKU and category dynamics. Further analysis shows that this improvement is caused in its entirety by the forecast improvement in the promotion periods. In fact, exponential smoothing cannot be improved upon by any of the machine learning techniques in the periods without promotions. This is in line with earlier (limited) studies on macro-economic forecasts.

A second main finding concerns the use of more detailed data in the models. Our results clearly indicate that using more detail is only beneficial if more advanced techniques are used. Using a linear regression model with elaborate features does not add any benefit, while doing the same with a complex machine learning model that can apply the appropriate features to appropriate subsets of the data brings significant benefits.


Thirdly, pooling the observations across SKUs and stores does increase accuracy. While in this research we have evaluated the methods on a particular grocery chain and category, the results show the potential of pooling, and of using machine learning techniques along with elaborate features, borrowed from the marketing literature, that express the potential factors affecting demand. Our results give a very strong and clear suggestion as to how to make better use of available data. Retailers can improve forecast accuracy during promotion periods substantially, with improvements of up to 65% in our data set. At the same time, for periods without promotions it is not worth investing in more advanced techniques, and exponential smoothing serves as a very good forecasting method. Further research is needed to re-evaluate and improve these methods in diverse situations, involving retailers potentially from different segments and cultures, and diverse SKUs in terms of, e.g., price range, category, and seasonality.

Acknowledgement

This research has been partially supported by KUMPEM research funds. KUMPEM is the joint Professional Education Center of Koc University and Migros. We thank Gokhan Tekiner for his assistance in performing the technique runs.

References

Alon I, Qi M, and Sadowski R J (2001). Forecasting Aggregate Retail Sales: a Comparison of Artificial Neural Networks and Traditional Methods. Journal of Retailing and Consumer Services 8: 147-156.

Baltas G (2005). Modeling Category Demand in Retail Chains. Journal of the Operational Research Society 56: 1258-1264.

Bayus B and Putsis W P J (1999). Product proliferation: An empirical analysis of product line determinants and market outcomes. Marketing Science 18(2): 137-153.

Blattberg R C, Briesch R, and Fox E J (1995). How Promotions Work. Marketing Science 14: 122-132.

Breiman L, Friedman J H, Olshen R A and Stone C J (1983). Classification and Regression Trees. Wadsworth.

Bucklin R E and Gupta S (1999). Commercial Use of UPC Scanner Data: Industry and Academic Perspectives. Marketing Science 18: 247-273.

Bunn D W and Vassilopoulos A I (1999). Comparison of Seasonal Estimation Methods in Multi-Item Short-Term Forecasting. International Journal of Forecasting 15: 431-443.


Christen M, Gupta S, Porter J C, Staelin R, and Wittink D R (1997). Using Market-Level Data to Understand Promotion Effects in a Nonlinear Model. Journal of Marketing Research 34: 322-334.

Cooper L G and Giuffrida G (2000). Turning Datamining into a Management Science Tool: New Algorithms and Empirical Results. Management Science 46: 249-264.

Cooper L G, Baron P, Levy W, Swisher M, and Gogos P (1999). PromoCast™: A New Forecasting Method for Promotion Planning. Marketing Science 18: 301-316.

Cordo V and Pindyck R S (1979). An Econometric Approach to Forecasting Demand and Firm Behavior: Canadian Telecommunications. In S. Makridakis and S.C. Wheelwright (eds.), TIMS Studies in the Management Sciences 12: 95-111.

Cui D and Curry D (2005). Prediction in Marketing Using the Support Vector Machine. Marketing Science 24: 595-615.

Fader P S and Hardie B G S (1996). Modeling Consumer Choice Among SKUs. Journal of Marketing Research 33: 442-452.

Foekens E W, Leeflang P S H and Wittink D R (1994). A comparison and an exploration of the forecasting accuracy of a loglinear model at different levels of aggregation. International Journal of Forecasting 10: 245-261.

Gür Ali Ö (2008). Evaluation Techniques for Data Mining. In F. Ruggeri, R. Kenett, and F. Faltin (eds.), Encyclopedia of Statistics in Quality and Reliability. J. Wiley.

Hansen J V, McDonald J B, and Nelson R D (2006). Some evidence on forecasting time-series with support vector machines. Journal of the Operational Research Society 57: 1053-1063.

Hansen J V and Nelson R D (2003). Forecasting and recombining time-series components by using neural networks. Journal of the Operational Research Society 54: 307-317.

Hughes G D (1980). Sales Forecasting Requirements. In S. Makridakis and S.C. Wheelwright (eds.), The Handbook of Forecasting: A Manager's Guide, John Wiley & Sons, Inc., 13-24.

Lattin J M and Bucklin R E (1989). Reference Effects of Price and Promotion on Brand Choice Behavior. Journal of Marketing Research 26: 299-310.

Lee C B (2003). Demand chain optimization: pitfalls and key principles. EVANT White Paper Series.

Lee W Y, Goodwin P, Fildes R, Nikolopoulos K, and Lawrence M (2007). Providing support for the use of analogies in demand forecasting tasks. International Journal of Forecasting 23(3): 377-390.

Makridakis S and Wheelwright S C (1982). Introduction to Management Forecasting: Status and Needs. In S. Makridakis and S.C. Wheelwright (eds.), The Handbook of Forecasting: A Manager's Guide, John Wiley & Sons, Inc., 13-24.

Mulhern F J and Leone R P (1991). Implicit Price Bundling of Retail Products: a Multiproduct Approach to Maximizing Store Profitability. Journal of Marketing 55: 63-76.


Narasimhan C, Neslin S A, and Sen S K (1996). Promotion Elasticities and Category Characteristics. Journal of Marketing 60: 17-30.

Nerlove M and Arrow K J (1962). Optimal Advertising Policy under Dynamic Conditions. Economica 29: 129-142.

Neslin S A (1990). A market response model for coupon promotions. Marketing Science 9: 125-145.

Newbold P (1979). Time-Series Model Building and Forecasting: A Survey. In S. Makridakis and S.C. Wheelwright (eds.), TIMS Studies in the Management Sciences 12: 59-73.

Parackal M, Goodwin P, and O'Connor M (2007). Judgment in forecasting. International Journal of Forecasting 23: 343-345.

Quinlan J R (1992). Learning with continuous classes. Proceedings of the Australian Joint Conference on Artificial Intelligence, 343-348. World Scientific, Singapore.

Quinlan J R (1986). Induction of Decision Trees. Machine Learning 1: 81-106.

SAS Software (2002-2003). SAS Institute Inc., Cary, NC, USA.

Sethuraman R and Tellis G (2002). Does Manufacturer Advertising Suppress or Stimulate Retail Price Promotions? Analytical Model and Empirical Analysis. Journal of Retailing 78: 253-263.

Shawe-Taylor J and Cristianini N (2000). Support Vector Machines and other kernel-based learning methods. Cambridge University Press.

Smola A J and Scholkopf B (2004). A tutorial on support vector regression. Statistics and Computing 14: 199-222.

Struse R W (1987). Commentary: approaches to promotion evaluation: a practitioner's viewpoint. Marketing Science 6: 150-151.

Van Heerde H J, Leeflang P S H and Wittink D R (2002). How promotions work: SCAN*PRO-based evolutionary model building. Schmalenbach Business Review (ZFBF) 54(3): 198-220.

Walters R G (1991). Assessing the Impact of Retail Price Promotions on Product Substitution, Complementary Purchase, and Interstore Sales Displacement. Journal of Marketing 55: 17-28.

Wang Y and Witten I H (1997). Induction of model trees for predicting continuous classes. Proceedings of the poster papers of the European Conference on Machine Learning. University of Economics, Faculty of Informatics and Statistics, Prague.

Witten I H and Frank E (2005). Data Mining: Practical machine learning tools and techniques, 2nd edition. Morgan Kaufmann, San Francisco.

Zhang G P and Qi M (2005). Neural network forecasting for seasonal and trend time series. European Journal of Operational Research 160: 501-514.

Zhang G P (2003). Neural Networks in Business Forecasting. Information Science Publishing.


Zotteri G, Kalchschmidt M and Caniato F (2005). The impact of aggregation level on forecasting performance. International Journal of Production Economics 93-94: 479-491.

FIGURES

Figure 1 [methods positioned by data preparation cost (vertical, low to high) and technique complexity (horizontal, traditional to machine learning); MAE values with percentage difference versus the benchmark in brackets]:

Feature stepwise: 4.86 (+10.4%) | RT w features: 3.35 (-23.8%)
Regression+Sm: 4.50 (+0.2%) | SVR Poly1+Sm: 3.91 (-11.2%) | SVR Poly2+Sm: 5.65 (+28.4%) | SVR RBF+Sm: 3.85 (-12.5%)
Regression: 5.03 (+14.3%) | SVR Poly1: 4.38 (-0.5%) | SVR Poly2: 5.45 (+23.8%) | SVR RBF: 4.18 (-4.9%)
Benchmark: 4.40

Figure 2 [line chart of MAE (y-axis, roughly 2 to 6) by promotion class (x-axis, 1 to 3) for the non-dominated methods: Benchmark, RT Feat, SVR RBF+Sm, SVR RBF, SVR Poly1+Sm, SVR Poly1]


TABLES

Table 1

                              Mean    Minimum  Maximum
Weekly sales (no promotions)  7.10    0        127
Weekly sales (promotions)     32.11   1        303
Number of promotions          2.28    1        6
Regular price                 1.53    0.65     2.45
Price cut (absolute)          0.31€   0.05€    1.02€
Price cut (relative)          21%     5%       52%

Table 2

              SVR Poly1        SVR RBF          SVR Poly1+Sm     SVR RBF+Sm       RT Feat
Benchmark     0.0151 (0.3964)  0.2166 (0.1041)  0.4924 (0.0002)  0.5503 (0.0000)  1.0458 (0.0000)
SVR Poly1     -                0.2015 (0.0000)  0.4773 (0.0000)  0.5352 (0.0000)  1.0308 (0.0000)
SVR RBF       -                -                0.2758 (0.0000)  0.3337 (0.0000)  0.8293 (0.0000)
SVR Poly1+Sm  -                -                -                0.0579 (0.0883)  0.5535 (0.0000)
SVR RBF+Sm    -                -                -                -                0.4956 (0.0000)

Table 3

                  SVR Poly1  SVR RBF  SVR Poly1+Sm  SVR RBF+Sm  RT Feat
All observations  4.39       4.18     3.91          3.69        3.35
By store          4.64       4.16     4.42          3.95        3.86
By subgroup       4.31       4.20     3.90          3.88        7.61


Table 4

               Benchmark  SVR Poly1        SVR RBF          SVR Poly1+Sm     SVR RBF+Sm       RT Feat
Promotions     22.19      17.50 (-21.13%)  15.43 (-30.48%)  16.94 (-23.66%)  14.98 (-32.50%)  7.73 (-65.17%)
No promotions  2.60       3.05 (+17.46%)   3.04 (+17.23%)   2.58 (-0.64%)    2.72 (+4.83%)    2.91 (+12.12%)

CAPTIONS FOR FIGURES AND TABLES

Figure 1: The forecasting accuracy (MAE) for the different methods (values in brackets are the percentage difference compared to the benchmark; boldfaced methods are non-dominated).

Figure 2: The number of promotions versus forecasting accuracy (MAE).

Table 1: Descriptive statistics.

Table 2: Means and significance levels (between brackets) for the paired sample Z tests on the mean difference, for the non-dominated methodologies.

Table 3: The forecasting accuracy (MAE) for the non-dominated methods by model scope.

Table 4: The forecasting accuracy (MAE) for promotion versus no-promotion weeks.
