arXiv:1607.02177v2 [cs.LG] 8 Aug 2016
Applying Deep Learning to the Newsvendor Problem
Afshin Oroojlooyjadid Lehigh University Bethlehem, PA 18015
[email protected]
Lawrence Snyder Lehigh University Bethlehem, PA 18015
[email protected]
Martin Takáˇc Lehigh University Bethlehem, PA 18015
[email protected]
Abstract The newsvendor problem is one of the most basic and widely applied inventory models. There are numerous extensions of this problem. One important extension is the multi-item newsvendor problem, in which the demand of each item may be correlated with that of other items. If the joint probability distribution of the demand is known, the problem can be solved analytically. However, approximating the probability distribution is not easy and is prone to error; therefore, the resulting solution to the newsvendor problem may be not optimal. To address this issue, we propose an algorithm based on deep learning that optimizes the order quantities for all products based on features of the demand data. Our algorithm integrates the forecasting and inventory-optimization steps, rather than solving them separately as is typically done. The algorithm does not require the knowledge of the probability distributions of the demand. Numerical experiments on real-world data suggest that our algorithm outperforms other approaches, including data-driven and SVM approaches, especially for demands with high volatility.
1 Introduction The newsvendor problem optimizes the inventory of a perishable good. Perishable goods are those that have a limited selling season; they include fresh produce, newspapers, airline tickets, and fashion goods. The newsvendor problem assumes that the company purchases the goods at the beginning of a time period and sells them during the period. At the end of the period, it must dispose of unsold goods, incurring a holding cost. In addition, if it runs out of the goods in the middle of the period, it incurs a shortage cost, losing potential profit. Therefore, the company want to know the order quantity that minimizes the expected sum of the two costs described above. The optimal order quantity for the newsvendor problem can be obtained by solving following optimization problem: min C(y, d) = Ed [cp (d − y)+ + ch (y − d)+ ] , (1) y where d is the unknown demand, y is the order quantity, cp and ch are the per-unit shortage and holding costs (respectively), and (a)+ := max{0, a}. In the classical version of the problem, the shape of the demand distribution (e.g., normal) is known, and the distribution parameters are either known or estimated using available (training) data. If F (·) is the cumulative density function of the demand distribution and F −1 (·) is its inverse, then the optimal solution of (1) can be obtained as cp ∗ −1 y =F . (2) cp + ch (See e.g., Snyder and Shen [2011]) There are many extensions to this problem. In real-world problems, companies rarely have only a single item, so the model should consider many items. In this case, the demand of each item is correlated with that of other items. Therefore, the demand follows a multivariate probability
distribution. Another common extension of the newsvendor problem considers the availability of some additional data—called features—along with demand information. This might include weather conditions, day of the week, month of the year, store location, etc. . One interesting extension of the newsvendor problem—the version we consider in this paper—is a combination of these two variants, i.e., a multi-item newsvendor problem with multi-dimensional features. The remainder of this paper is structured as follows. A brief summary of the current state of the art on this problem is presented in Section 2. Then, Section 3 summarizes the drawbacks of the current algorithms and clarifies our contribution. Section 4 presents the details of the proposed algorithm. Numerical experiments are provided in Section 5, and the conclusion and a discussion of future research complete the paper in Section 6.
2 Current State of the Art Currently, there are several approaches in the literature for solving the multi-item newsvendor problem with data features. One common approach is to consider a multi-variate normal distribution for demand. Since the cost function can be decomposed into n separate cost functions, each of which corresponds to one item, the problem can be solved easily. Indeed, for each item the optimal solution can be obtained with (2). A broad list of research on this approach is in Turken et al. [2012]. However, approximating the probability distributions is prone to error, resulting in a sub-optimal solution of the newsvendor problem. Therefore, the classical approach is not always appropriate, and an efficient algorithm which takes advantage of multiple features and does not require approximating probability distribution would be more effective. In order to address this issue, Bertsimas and Thiele [2005] proposed a data driven model to solve the classical newsvendor problem. The algorithm sorts the demand in ascending order and selects the cp demand dj such that j = ⌈n cp +c ⌉. The model does not approximate the probability distribution, h so it avoids that pitfall. They also provided an extension of this model for problems with shortage and holding costs and a problem with multiple items. However, their model does not perform well in cases in which there is not much historical data available and/or the demands are volatile. In addition, the model only works with categorical features and cannot consider continuous features. To overcome the problems of the classical approach and data driven models, some SVM models and demand prediction methods with SVM have also been proposed. Rudin and Vahn [2013] deal with the newsvendor problem when there are p features for each of n historical observations. They presenteded three algorithms to determine the order quantities. Two of their algorithms are SVM algorithms with a revised objective function which is the newsvendor objective function plus the norm of the SVM variables. They compared their models with Separated Estimation and Optimization (SEO) and Sample Average Approximation (SAA) models. Their data comes from a nurse staffing problem with about 2000 records. Their features are the day of the week, time of the day, m days of past demands, the sample average of past demands, and the differences in the order statistics of past demands, called Operational Statistics (OS). Their results show that the non-linear kernel with ℓ1 regularization yields the best results and has a smaller CPU time than the SVM models. As mentioned, one of the most common approaches for solving the multi-feature newsvendor problem is to predict the demand with SVM and then solve the classical newsvendor model using the forecast demand distribution. Yu et al. [2013] proposed a SVM model to forecast the newspaper demand. They considered the sales data of newspapers at different types of stores. In addition to the store type, they considered 32 factors such as education distribution in the area of the shop, age distribution, average income level, and gender distribution. A model is proposed to forecast demand for each newspaper at each store. Their numerical results show that they have about a 12% prediction error. The number of warranty claims in each year can also be modeled as a newsvendor problem. Time series models, artificial neural network models, piecewise models of exponential distributions and Weibull and log-linear Poisson models are the most common approaches to forecast the number of warranty claims. However, these models are not designed to consider available features. Thus, Wu and Akbarov [2011] proposed a weighted support vector regression model, considering all of the available information (features), to forecast the number of warranty claims. Their model gives more priority to the most recent warranty claims, which was new in the literature. In order to solve the model and obtain the related parameters, Wu and Akbarov [2011] presented a Genetic Algorithm. 2
They did not intend to minimize the entire supply chain’s costs; however, they claimed that a high quality forecast leads to overall cost savings. Viaene et al. [2000] proposed a model to predict whether or not a purchase would result from any offered direct mailing company in a different product category. They proposed a LS-SVM classifier model, and a comprehensive feature selection experiment was performed. They mentioned that removing irrelevant or redundant features often improves the quality of a classifier and makes the problem easier. Finally, three main categories and 25 features were selected to build the LS-SVM model. In their numerical analysis, the matrix of the samples was huge, so it could not easily be stored. To address this problem, an iterative approach based on the Hestenes-Stiefel conjugate gradient algorithm was proposed to solve the model. Using the LS-SVM model, they reduced the 25 features to 9 features with almost the same performance. Finally, based on those features, they evaluated their model. Chi et al. [2007] proposed a SVM model to determine the replenishment point in a Vendor Managed Replenishment system. They considered eight features to build the model: Order-up-to level, average time between merchandiser visits, time between shipments, shipment lead time, range of time between merchandiser visits, minimum shipment quantity, day shipments are stopped, and initial inventory. Based on the training data, a SVM model is provided. In order to obtain the parameter values, a Genetic Algorithm is used, and the model is built and solved to obtain the replenishment point. Carbonneau et al. [2008] presented a SVM model to forecast the manufacturer’s demand, which is an estimation of demand in the next three days. They compared the result of Naive Forecast, Average, Moving Average, Trend, and Multiple Linear Regression algorithms with the results of Neural Networks, Recurrent Neural Networks, and SVM algorithms. According to their numerical experiments, the Recurrent Neural Network and the Least Squares Support Vector Machine (LSSVM) have the best results. Ali and Yaman [2013] proposed an algorithm for demand forecasting of groceries that have large datasets. Mostly, those datasets contain millions of records, and for each record there are thousands of features. Thus, the general SVM methods is not able to solve these problems. To solve this problem, the authors proposed an algorithm to reduce the number of rows and columns of the datasets without losing the accuracy of the model. Finally, an extensive numerical experiment is presented with about 170 columns and 100,000 records. Lu and Chang [2014] proposed a sales forecasting algorithm by integrating Independent Component Analysis (ICA) with K-means clustering and Support Vector Regression (SVR). In their algorithm, they used ICA to obtain hidden features of datasets. Then, the algorithm clusters the sales data into several disjoint clusters. Finally, in each cluster, the SVR forecasting models are applied and the final forecasting results are provided. Another common approach to solve the problem is using deep learning to obtain a demand forecast. This is a similar approach to the work of Yu et al. [2013]. Vieira [2015] proposed a deep learning algorithm to predict online activity patterns that result in an online purchase. The results can be directly used to predict the number of buy sessions or the demand.
3 Drawbacks of the Current Approaches In the previous section, we reviewed three approaches for solving the multi-product newsvendor problem with data features. The classical approach obtains an estimate of the probability distribution of demand and uses (2) to obtain the order quantity for each item. However, obtaining the probability distributions with any approximation procedure involves some level of errors, especially when there are insufficient historical data. In addition, sometimes there is not enough data to approximate a probability distribution so in each of these cases, the classical approach cannot be used. Alternatively, Bertsimas and Thiele [2005] proposed a data driven model that directly optimizes the order quantity. However, this model fails when there is volatility among the training data. In addition, similar to the classical newsvendor algorithm, it only considers categorical features and cannot deal with continuous features, which are common in supply chain demand data. We saw such continuous 3
features in the datasets used by Ali and Yaman [2013] and Rudin and Vahn [2013], which had price and operational statistics in their inputs, respectively. Finally, SVM and deep learning models have been proposed to obtain a forecast of the demand. The works of Viaene et al. [2000], Chi et al. [2007], Carbonneau et al. [2008], Wu and Akbarov [2011], Ali and Yaman [2013], Yu et al. [2013], Lu and Chang [2014] use this approach. However, the aim of the newsvendor problem is to obtain the order quantity, so those models require a second step in which the demand forecast is used to derive an order quantity. Most of the approaches cited above set the order quantity equal to the demand forecast. However, this ignores a core insight from the newsvendor model, namely, that the order quantity should be larger (or, in rarer cases, smaller) than the demand forecast, with the actual quantity driven by the relative magnitudes of the shortage and holding costs. To address this drawback, Rudin and Vahn [2013] proposed a revised SVM model, which considers the cost coefficients to obtain the order quantities. However, their model does not work for a large number of instances, and its learning is limited to the current training data. In addition, if the training data contain only a small number of observations for each combination of the features, the model learns poorly. Therefore, there is a need for an algorithm that works with even small amounts of historical data. The model must not only work with large datasets, but also solve them in a reasonable amount of time. 3.1 Our Contribution In this research, we develop a new approach to solve the multi-product newsvendor problem with data features. Our algorithm is based on deep learning. Deep learning has many applications in image processing, speech recognition, drug and genomics discovery, time series forecasting, weather prediction, and—most relevant to our work—demand prediction. For a comprehensive review on the algorithm and its applications, see Schmidhuber [2015], LeCun et al. [2015], Deng et al. [2013], Qiu et al. [2014a], SHI et al. [2015], and Längkvist et al. [2014]. In this study, we introduce a new application of deep learning to solve the multi-feature newsvendor problem. To adapt the deep learning algorithm for the newsvendor problem, we propose a revised loss function, which considers the impact of shortage and holding costs. The revised loss function allows the deep learning algorithm to obtain the minimizer of the newsvendor cost function directly, rather than first estimating the demand distribution and then choosing an order quantity. In the presence of sufficient historical data, this approach can solve problems with known probability distributions as accurately as formula (2) solves them. However, the real value of our approach is that it is effective for problems with small quantities of historical data, problems with unknown/unfitted probability distributions, or problems with volatile historical data—all cases for which the current approaches fail.
4 Methods In this section, we present the details of our approach for solving the multi-product newsvendor problem with data features. Assume there are n demand observations for m items, and for each demand observation, we also have an observation of p features. That is, the data can be represented as {(x1j , d1j ), · · · , (xnj , dnj )}, where xij ∈ Rp , dij ∈ R, i = 1, · · · , m, j = 1, · · · , n. The problem mathematically is formulated in (3), obtaining the order quantities y1 , · · · , ym for each period j. # "m X (3) ch (yi − di )+ + cp (di − yi )+ . min Ed y1 ,··· ,ym
i=1
As noted above, there are many studies on the application of deep learning for demand prediction (see SHI et al. [2015]). Most of this research uses the Euclidean loss function. (See Qiu et al. [2014b].) However, the demand forecast is an estimate of the first moment of the demand probability distribution; it is not, however, the optimal solution of model (3). Therefore, another optimization problem must be solved to translate the demand forecasts into a set of order quantities. This approach is called the separated estimation and optimization problem, which may result in a non-optimal order quantity (Rudin and Vahn [2013]). To address this issue, we revised the Euclidean loss function 4
such that instead of predicting the demand, it minimizes the newsvendor cost function. Thus, running the corresponding deep learning algorithm gives the order quantity directly. Let us note that from our experience, using the loss function given by (3) leads to slower training and is giving a little bit worse results when compared with squared cost function. Therefore, we use "m # X + + 2 cp (di − yi ) + ch (yi − di ) , min Ed y1 ,··· ,ym
i=1
which penalizes the order quantities that are far from di much more than those that are close. Since at least one of the two terms in this function must be zero, we have N
1 X Ei N i 1 ||cp (di − yi )||22 Ei = 21 2 2 ||ch (yi − di )||2 E=
(4) yi < di , di ≤ yi .
To determine the gradient of the network’s weights, we first note that in the case of excess inventory (di ≤ yi ), ∂Ei ∂(yi − di ) ∂(gk (zk )) = ch (yi − di ) = ch (yi − di ) ∂wjk ∂wjk ∂wjk ∂(zk ) ′ g (zk ) = ch (yi − di )aj gk′ (zk ), = ch (yi − di ) ∂wjk k k) where ∂(z ∂wjk = gj (zj ) = aj . Similar analysis holds for the inventory shortage case (yi < di ), so the gradient is: ∂Ei cp (di − yi )aj gk′ (zk ) yi < di , = (5) ch (yi − di )aj gk′ (zk ) di ≤ yi . ∂wjk
With the proposed loss function (4), deep learning using (5) iteratively updates the weights of the networks, to obtain an approximate optimal solution for the newsvendor problem.
5 Numerical Experiments In this section, we discuss the results of our numerical experiments. In addition to implementing our deep learning algorithm, we implemented the data-driven model by Bertsimas and Thiele [2005], the SVM model by Rudin and Vahn [2013], and the classical solution (2) with parameters µ and σ set to the mean and standard deviation of the training data in each data cluster. Hereinafter, we call these algorithms DNN, DDM, SVM, and NV, respectively. In order to compare their results, the order quantities were obtained with each algorithm and the corresponding cost function cost =
m X n X
cp (dji − yji )+ + ch (yji − dji )+
i=1 j=1
was calculated. We implemented the loss function and the corresponding gradient for the deep learning algorithm in caffe (Jia et al. [2014]) using C++. All of the deep learning experiments were done with caffe. Note that the deep learning and SVM algorithms are scale dependent. So, the tunned parameters of the problem for a given set of cost coefficients does not necessarily work for other values of the coefficients. Thus, the cost coefficients were scaled. Also, the features were translated to their corresponding binary representations. In this way, the maximum accuracy of the learning algorithms was achieved. In what follows, we demonstrate the results of the four algorithms in three separate experiments. First, in subsection 5.1, a small problem is used to show the situations in which deep learning gives a smaller cost than the other three algorithms. Second, the results of the four algorithms on a realworld dataset is presented in subsection 5.2. Finally, in subsection 5.3, to determine the conditions under which deep learning outperforms the other algorithms on larger instances, several datasets were generated randomly and the results of the four algorithms were compared. 5
Table 1: Demand of one item through three weeks. Week 1 Week 2 Week 3
Mon 1 6 3
Tue 2 10 6
Wen 3 12 8
Thu 4 14 9
Fri 3 12 8
Sat 2 10 6
Sun 1 10 5
Table 2: Order quantity proposed by each algorithm for each day and the corresponding cost. The bold costs indicate the best newsvendor cost for each instance. (cp , ch ) (1,1)
(2,1)
(10,1)
(20,1)
Algorithm DNN DDM SVM NV DNN DDM SVM NV DNN DDM SVM NV DNN DDM SVM NV
(1,3) 3.5 1.0 1.3 3.5 5.8 6.0 6.0 5.0 5.6 6.0 8.0 8.2 5.8 6.0 8.0 9.4
(2,6) 6.0 2.0 2.2 6.0 7.3 10.0 7.0 8.4 9.3 10.0 10.0 13.6 9.6 10.0 10.0 15.4
(Day,Demand) (3,8) (4,9) (5,8) 7.5 9.0 7.5 3.0 4.0 3.0 3.1 4.0 4.9 7.5 9.0 7.5 8.9 9.7 8.9 12.0 14.0 12.0 8.0 9.0 10.0 10.2 12.0 10.2 11.2 13.1 11.2 12.0 14.0 12.0 12.0 14.0 16.0 16.0 18.4 16.0 11.6 13.5 11.6 12.0 14.0 12.0 12.0 14.0 16.0 18.1 20.8 18.1
(6,6) 6.5 2.0 5.8 6.5 8.1 11.0 11.0 9.2 10.2 11.0 18.0 15.0 10.6 11.0 18.0 17.1
(7,5) 5.5 1.0 6.7 5.5 7.0 10.0 12.0 8.2 9.2 10.0 20.0 14.0 9.6 10.0 20.0 16.1
cost 2.5 29.0 20.3 2.5 10.7 30.0 18.0 18.5 24.6 30.0 53.0 56.2 27.2 30.0 38.0 70.1
5.1 Numerical Experiments on a Small Problem Consider the small, single-item instance whose demands are contained in Table 1. In order to obtain the results of each algorithm, the first two weeks are used for training data and the third week is used for testing. To train the corresponding deep network, a fully connected network with one hidden layer is used. The network has nine binary input nodes for the day of week and item number. The hidden layer contains one sigmoid node, and in the output layer there is one inner product function. Thus, the network has 10 variables. Table 2 shows the results of the four algorithms. The first column gives the cost coefficients. (Note that we assume cp ≥ ch since this is nearly always true for real applications.) For each cost coefficient, table lists the order quantity generated by each algorithm for each day. The last column lists the total cost of the solution returned by each algorithm, and the minimum costs for each instance are given in bold. First consider the results of the DDM algorithm. The DDM algorithm uses ch and cp and provides the index of a historical data. In this problem there are only two observed historical data. Thus, DDM’s output is one of the two training records. In particular, for cp /ch ≤ 1, the DDM algorithm chose the smaller demands were selected as the order quantities and for cp /ch > 1 it chose the larger demands. Since the testing data is nearly equal to the average of the two demand vectors, the difference between DDM’s output and the real demand values is quite large. Thus, it does not give a competitive cost. This is an example of how the DDM algorithm can fail when the historical data are volatile. Now consider the results of the NV algorithm. For the case in which ch = cp (which is not very realistic), NV’s output is approximately the mean demand, resulting in a cost of 2.5, which is equal to DNN’s cost. In all other cases, however, the increased value of cp /ch results in a larger order quantity. Therefore, the distance between the actual demand and the proposed order quantity is getting larger, resulting in higher costs. As shown in the table, for all other values of cp and ch , the cost of DNN is significantly smaller than NV’s cost. Finally, DNN outperforms the SVM algorithm because SVM uses a linear kernel, while DNN uses a linear and non-linear kernel. Also, there are only two features in this data set, so SVM has some difficulty to learn the relationship of the inputs and output. Finally, the small quantity of historical data negatively affects the performance of SVM. 6
Relative Cost Ratio To DNN DNN DDM SVM NV
2.2
cost ratio
2 1.8 1.6 1.4 1.2 1 1
2
3
4
5 6 cp / ch
7
8
9
10
Figure 1: Ratio of each algorithm’s cost to DNN cost on a real-world dataset.
This small example shows some conditions under which DNN outperforms the other three algorithms. In the next section we show that similar results hold even for a real-world dataset. 5.2 Numerical Experiments on a Real-World Dataset We tested the four algorithms on a real-world dataset consisting of basket data from a retailer in 1997 and 1998 from Foo. There are 9288 records for the demand of 24 different departments in each day and month, of which 75% is considered for training and the remainder is for testing. The data were transformed to their binary equivalent, so there are 43 input features. A fully connected network, which has 350 and 100 sigmoid nodes in the first and second layers, respectively, was used for training of the DNN algorithm. The results of each algorithm for 82 values of cp and ch are shown in Figure 1. In the figure, the vertical axis shows the normalized costs, i.e., the cost value of each algorithm divided by the corresponding DNN cost. The horizontal axis shows the ratio cp /ch for each instance. As before, we set cp ≥ ch to reflect real-world settings. As shown in Figure 1, for this data set, for every value of cp /ch the DNN algorithm outperforms the other three algorithms. Among the three remaining algorithms, the NV algorithm has the closest results to DNN. On average, its corresponding cost ratio is 1.09, whereas the ratios for DDM and SVM are 1.26 and 1.29, respectively. However, neither DDM nor SVM are stable: their cost ratios fluctuate between 1.10 and 2.30. Finally, consider the CPU time. All of the computation are done in 1.8 GHz machine with 18 GB of Ram. For the basket dataset, NV and DDM algorithms are giving the solution, each in about 10 seconds. The SVM algorithm works about 1500 seconds for the training. DNN also requires about 14000 seconds for the training. The testing time of SVM and DNN is about 2 seconds. 5.3 Numerical Experiments on Simulated Data In order to further explore the conditions under which DNN works well, we generated 10,000 records from a normal distribution and divided them into training (75%) and testing (25%) sets. The data were categorized into clusters, each representing a given combination of features, and we varied the number of clusters while keeping the total number of records fixed at 10,000. In each cluster data are generated according to N (50i, 25i2) which i is the cluster number. Like the real-world dataset, we considered three features: the day of the week, month of the year, and department. Each problem was solved for three values of cp and ch using all four algorithms. Figure 2 shows the results. As shown in Figure 2, if there is only a single cluster, all four algorithms produce nearly the same results. This case is essentially a classical newsvendor problem with 10,000 data observations, for which DNN does a good job of learning the distribution. However, as the number of clusters in7
1
10 number of clusters
200
DDM NV SVM
7 5 3 1
DDM NV SVM
7 cost ratio
cost ratio
cost ratio
DDM NV SVM
7 5 3 1
5 3 1
1
10 number of clusters
200
1
10
200
number of clusters
Figure 2: Ratio of each algorithm’s cost to DNN cost on randomly generated data.
crease, i.e. the numbers of samples in each cluster decreases, the performance of DNN deteriorates somewhat. This is because there is not enough data for DNN to learn the distribution well. SVM also has a similar problem and as the number of samples decreases, its performance deteriorates. However, Figure 2 shows that DNN outperforms SVM, particularly as the number of clusters increases. In Figure 2, cp = 1 and ch is equal to 0.5, 1, 1.1, respectively for the figures from left to right. Although NV and DDM outperform DNN in Figure 2, we note that this experiment is somewhat biased in favor of those methods. In particular, the NV algorithm assumes the demands are normally distributed, and these data are in fact normally distributed, thus giving NV an advantage since it only takes around 12 samples to get a good estimate of the mean and standard deviation. Moreover, as Bertsimas and Thiele [2005] mentioned, DDM works well with normally distributed data and is almost as good as NV. In sum, when the data come from a known distribution, NV and DDM work well. However, when the distribution is unknown, DNN should be used, especially if there are sufficient training data available.
6 Conclusion In this paper, the multi-item newsvendor problem is considered. If the probability distribution of the demands is known, there is an exact solution for this problem. However, approximating a probability distribution is not easy and produces errors; therefore, the solution of the newsvendor problem also may be not optimal. Moreover, other approaches from the literature, such as data-driven and SVM algorithms, do not work when the historical data are scant and/or volatile. To address this issue, an algorithm based on deep learning is proposed to solve the multi-item, multi-feature newsvendor problem. The algorithm does not require knowledge of the demand probability distribution and uses only historical data. Furthermore, it integrates parameter estimation and inventory optimization, rather than solving them separately. Extensive numerical experiments on real-world and random data demonstrate the conditions under which our algorithm works well compared to the algorithms in the literature. The results suggest that when the volatility of the demand is high, which is common in real-world datasets, deep learning works very well. Moreover, when the data can be represented by a well-defined probability distribution, in the presence of enough training data, deep learning works as well as other approaches. Motivated by the results of deep learning on the real world data-set, we suggest that this idea can be extended to other supply chain problems. For example, since general multi-echelon inventory optimization problems are very difficult, deep learning may be a good candidate for solving these problems. On the other hand, there other machine learning algorithms that can be applied to the newsvendor problem. Another direction for future work could be applying other learning algorithms to exploit the available data.
References FoodMart foodmart’s database tables. http://pentaho.dlpage.phi-integration.com/mondrian/mysql-foodmart-database. Accessed: 2015-09-30. Özden Gür Ali and Kübra Yaman. Selecting rows and columns for training support vector regression models with large retail datasets. European Journal of Operational Research, 226(3):471–480, 2013. Dimitris Bertsimas and Aurélie Thiele. A data-driven approach to newsvendor problems. Technical report, Technical report, Massechusetts Institute of Technology, Cambridge, MA, 2005.
8
Real Carbonneau, Kevin Laframboise, and Rustam Vahidov. Application of machine learning techniques for supply chain demand forecasting. European Journal of Operational Research, 184(3):1140–1154, 2008. Hoi-Ming Chi, Okan K Ersoy, Herbert Moskowitz, and Jim Ward. Modeling and optimizing a vendor managed replenishment system using machine learning and genetic algorithms. European Journal of Operational Research, 180(1):174–193, 2007. Li Deng, Jinyu Li, Jui-Ting Huang, Kaisheng Yao, Dong Yu, Frank Seide, Mike Seltzer, Geoffrey Zweig, Xiaodong He, Julia Williams, et al. Recent advances in deep learning for speech research at microsoft. In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on, pages 8604– 8608. IEEE, 2013. Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093, 2014. Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436–444, 2015. Chi-Jie Lu and Chi-Chang Chang. A hybrid sales forecasting scheme by combining independent component analysis with k-means clustering and support vector regression. The Scientific World Journal, 2014, 2014. Martin Längkvist, Lars Karlsson, and Amy Loutfi. A review of unsupervised feature learning and deep learning for time-series modeling. Pattern Recognition Letters, 42:11 – 24, 2014. ISSN 0167-8655. doi: http://dx.doi.org/10.1016/j.patrec.2014.01.008. URL http://www.sciencedirect.com/science/article/pii/S0167865514000221. X. Qiu, L. Zhang, Y. Ren, P. N. Suganthan, and G. Amaratunga. Ensemble deep learning for regression and time series forecasting. In Computational Intelligence in Ensemble Learning (CIEL), 2014 IEEE Symposium on, pages 1–6, Dec 2014a. doi: 10.1109/CIEL.2014.7015739. Xueheng Qiu, Le Zhang, Ye Ren, Ponnuthurai N Suganthan, and Gehan Amaratunga. Ensemble deep learning for regression and time series forecasting. In CIEL, pages 21–26, 2014b. Cynthia Rudin and Gah-yi Vahn. The big data newsvendor : practical insights from machine learning analysis, 2013. URL http://hdl.handle.net/1721.1/85658. Systemvoraussetzungen: Acrobat Reader. Jürgen Schmidhuber. Deep learning in neural networks: An overview. Neural Networks, 61:85–117, 2015.
Xingjian SHI, Zhourong Chen, Hao Wang, Dit-Yan Yeung, Wai-kin Wong, and Wang-chun WOO. Convolutional lstm network: A machine learning approach for precipitation nowcasting. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 802–810. Curran Associates, Inc., 2015. URL http://papers.nips.cc/paper/5955-convolutional-lstm-network-a-machine-learning-approach-for-precipitation Lawrence V Snyder and Zuo-Jun Max Shen. Fundamentals of supply chain theory. John Wiley & Sons, 2011. Nazli Turken, Yinliang Tan, Asoo J Vakharia, Lan Wang, Ruoxuan Wang, and Arda Yenipazarli. The multiproduct newsvendor problem: Review, extensions, and directions for future research. In Handbook of Newsvendor Problems, pages 3–39. Springer, 2012. Stijn Viaene, Bart Baesens, Tony Van Gestel, Johan AK Suykens, Dirk Van den Poel, Jan Vanthienen, Bart De Moor, and Guido Dedene. Knowledge discovery using least squares support vector machine classifiers: a direct marketing case. In Principles of Data Mining and Knowledge Discovery, pages 657–664. Springer, 2000. Armando Vieira. Predicting online user behaviour using deep learning algorithms. CoRR, abs/1511.06247, 2015. URL http://arxiv.org/abs/1511.06247. Shaomin Wu and Artur Akbarov. Support vector regression for warranty claim forecasting. European Journal of Operational Research, 213(1):196–204, 2011. Xiaodan Yu, Zhiquan Qi, and Yuanmeng Zhao. Support vector regression for newspaper/magazine sales forecasting. Procedia Computer Science, 17:1055–1062, 2013.
9
A
Experiments on Real Dataset
In the real-world basket dataset used for the experiment reported above, we used three features: the day of the week, month of the year, and the item department. Each of the features is transformed to its corresponding binary representation, so there are 43 binary inputs for each record. A fully connected network is designed with two hidden layers. The first layer has 350 nodes and the second layer has 100 sigmoid nodes. In the real world we may need to see quickly the result for a new pair of cost coefficients. Since the training process may take a long time, we may not be able to tune the parameters of the network for every pair of coefficients. Thus, we tuned the network only for cp = 2 , ch = 1 and obtained the results for all other combinations of cp and ch when cp /ch ≤ 10. For the case of cp /ch > 10 we tuned another network. As we saw, in the real-world data set, DNN always beats the other three algorithms. If we tuned the network for each pair of cp and ch , we could get even more competitive results. In addition, Table 3 shows the details of each algorithm’s result. For each combination of cost coefficients ch and cp , the newsvendor cost is reported. In fact, Figure 1 shows the normalized values of this table.
B Experiments on Small Problem As mentioned in Section 1, supply chain and retail data commonly have a periodic behavior. In general, the retailers’ demand during weekends is more than in the middle of the week. The small instance in subsection 5.1 reflects this pattern. Figure 3 shows the demand values over time. Demand of three weeks 14
week 1 week 2 week 3
12
Demand
10 8 6 4 2 0 1
2
3
4 5 Demand of three weeks
6
7
Figure 3: Demand over three weeks for the small problem.
C
Experiments on Simulation Data
In order to obtain the results on the simulation data, a similar tuning procedure as mentioned in Appendix A is used. Table 4 shows the detailed results of the simulated data which we discussed in Section 5.3. As shown, as the number of training samples in each cluster decreases, the performance of DNN degrades. 10
Table 3: The results of the real case with different cost coefficients. cp 2.00 2.00 2.00 2.00 2.00 2.00 2.00 2.00 2.00 3.00 3.00 3.00 3.00 3.00 3.00 3.00 3.00 3.00 4.00 4.00 4.00 4.00 4.00 4.00 4.00 4.00 4.00 5.00 5.00 5.00 5.00 5.00 5.00 5.00 5.00 5.00 6.00 6.00 6.00 6.00 6.00 6.00 6.00 6.00 6.00 7.00 7.00 7.00 7.00 7.00
ch 1.00 2.00 3.00 4.00 5.00 6.00 7.00 8.00 9.00 1.00 2.00 3.00 4.00 5.00 6.00 7.00 8.00 9.00 1.00 2.00 3.00 4.00 5.00 6.00 7.00 8.00 9.00 1.00 2.00 3.00 4.00 5.00 6.00 7.00 8.00 9.00 1.00 2.00 3.00 4.00 5.00 6.00 7.00 8.00 9.00 1.00 2.00 3.00 4.00 5.00
DNN 145982 203413 241495 267478 292685 309279 326053 340411 355492 181425 257774 310990 350744 384183 413324 434356 455627 476297 208853 303277 371184 424856 463805 501022 531870 558302 590499 235894 343795 420278 481698 532351 576090 620807 653398 672019 258166 378015 465128 538514 594726 650301 691547 733754 766955 277706 411321 506029 589867 654340
NV 171861 229364 263583 287826 305934 320590 333229 344184 354420 212549 292997 344046 380439 408737 431738 450610 466753 480884 244665 343722 410161 458728 496377 527167 553270 575651 595040 271893 387037 466606 526125 573411 611854 644350 672516 697512 296150 425097 515583 585994 641589 688093 727046 760879 790750 318109 458719 559830 639237 703597
DDM 278400 301999 304426 327309 324034 346765 342280 362417 360488 210762 310187 344046 395896 415168 455043 461163 490662 493159 446884 457896 514419 515171 545868 560729 589656 598576 613648 275569 403855 462326 554790 573411 632418 648840 712588 704984 571745 569232 687225 653671 732471 721913 784374 778486 842247 326662 472719 554484 669949 699860
SVM 226533 284687 313177 329504 339955 346939 351863 355560 358751 285696 377514 427030 458086 479402 494256 505575 513812 520408 331496 453045 523018 569373 602144 626353 644744 659014 670755 368704 516360 606570 667124 711716 745420 771767 792866 809882 399201 571392 679562 755002 810650 854048 888441 916171 939529 426128 619832 744296 834352 900997
cp 7.00 7.00 7.00 7.00 8.00 8.00 8.00 8.00 8.00 8.00 8.00 8.00 8.00 9.00 9.00 9.00 9.00 9.00 9.00 9.00 9.00 9.00 10.00 10.00 10.00 10.00 10.00 10.00 10.00 10.00 10.00 10.00 10.00 10.00 10.00 10.00 10.00 10.00 10.00 10.00 10.00 10.00 10.00 10.00 10.00 10.00 10.00 10.00 10.00 10.00
11
ch 6.00 7.00 8.00 9.00 1.00 2.00 3.00 4.00 5.00 6.00 7.00 8.00 9.00 1.00 2.00 3.00 4.00 5.00 6.00 7.00 8.00 9.00 1.01 1.03 1.05 1.07 1.09 1.12 1.16 1.20 1.23 1.26 1.29 1.33 1.36 1.40 1.44 1.49 1.53 1.60 1.68 1.75 1.81 1.88 1.96 2.10 2.30 2.50 2.70 2.80
SNN 710888 759641 806940 843726 298432 440162 544107 634443 704777 764201 822020 875504 917273 318671 466534 583398 673437 752239 824902 881032 933141 986566 350981 355068 363281 363791 370950 375173 377104 389306 393389 401133 404206 411927 414991 421303 426285 432080 442021 450798 461421 473549 481458 492298 498295 518847 544003 571818 591779 602138
NV 756843 802775 842125 876971 338263 489331 600479 687444 760038 820322 871999 917457 957156 357153 517578 637646 732238 811460 878991 936420 987020 1032139 377026 381021 384971 388891 392774 398538 406148 413626 419167 424649 430079 437217 442491 449482 456408 464886 471533 482945 495701 506647 515847 526324 538035 558073 585587 611664 636587 648585
DDM 800742 802775 868940 880424 658316 658468 809265 767554 874809 857057 984169 931920 1029415 370365 541569 632400 764686 806376 929661 933363 1047311 1032139 376916 380816 384716 388616 392517 398367 406167 413967 419817 425414 431259 439051 444895 452687 461751 471317 478970 492362 505532 517986 528661 541115 555348 582477 614004 628680 654794 667852
SVM 953750 996389 1031198 1060148 449809 662991 803337 906083 984824 1046027 1096832 1138731 1173953 470851 702305 857088 971696 1061575 1132503 1190322 1239500 1281072 492851 498970 504983 510991 516942 525777 537354 548543 556766 564971 572998 583567 591313 601580 611645 623947 633571 650061 668414 684349 697602 712545 729142 757329 794536 828739 861039 876799
Table 4: # clusters cp 1 1 1 1 1 1 10 1 10 1 10 1 200 1 200 1 200 1
Detailed results of the four algorithms on simulated data. ch Deep Learning cost SVM DDM NV 0.5 13678 0.98287 0.98287 0.9828120691 1 19674 0.99997 0.99997 1.0000068489 1.1 20608 1.00050 1.00050 1.0005414481 0.5 76693 2.70304 0.99705 0.9957713397 1 111080 2.63191 1.00065 0.9999455625 1.1 116529 2.60453 0.99921 0.9989134561 0.5 710868 6.07176 0.99120 0.9798522263 1 1035399 6.18563 0.99421 0.9839775101 1.1 1092147 6.14383 0.98802 0.9785558906
12