An application of data mining for overbooking optimization

Francesco Virili⁰, Bernd Freisleben¹, Kai-Uwe Kalka², Ulrich Oppitz³

⁰Department of Information Systems and ¹Department of Computer Science, University of Siegen, Hoelderlinstr. 3, D-57068 Siegen, Germany
²Lufthansa Systems Berlin GmbH, Fritschestr. 27-28, D-10585 Berlin, Germany
³Lufthansa Germany Airlines, Lufthansa Base, D-60546 Frankfurt, Germany

Keywords
AK01 models, AK02 optimization, AK03 statistics, AL02 neural networks, HB32 airline reservation systems, KDD, data mining.

Abstract
This report describes research in progress at a major airline company. The project applies advanced data mining techniques to large corporate databases containing passenger and booking information (mainly the so-called Passenger Name Records, 'PNR') in order to predict passenger behavior, optimize capacity planning and improve overbooking management.

In section 1, the motivation for this contribution is given. Our study belongs to the relatively new field of “Knowledge Discovery in Databases” (KDD, or data mining) [SIGKDD 00], which has recently come to the attention of the IS community with the diffusion of so-called ‘business intelligence’ applications, aimed at discovering, using, disseminating and maintaining business knowledge ‘hidden’ in the corporate databases of an information system. Moreover, some journals now focus on a new IS area, sometimes called “Intelligent Information Systems” (IIS), at the confluence of Information Systems, Artificial Intelligence and Data Mining; examples are the Journal of Intelligent Information Systems (Kluwer), Knowledge and Information Systems (Springer) and Intelligent Data Analysis (Elsevier).

Here we focus on an application of a business intelligence system to the well-known problem of seat inventory control and optimization for a major airline company. The project goal is to investigate whether the data contained in the PNRs can improve the prediction of passengers' no-show probability. Such predictions are currently used in the yield management system for capacity planning and overbooking optimization; more accurate estimates would have a considerable impact on corporate revenues.
The weighted mean method (WM henceforth), described in section 2, is currently used to produce the no-show predictions. The WM method uses the booking history of the flights (typically around two years) to compute show-up predictions for each triple (flight, day of week, booking class). The computation is based on the mean of the historical show-up rates, weighted by the number of final bookings; the show-up rate (SUR) is simply the ratio between the number of final shows (so-called ‘booked out’) and the number of final bookings. Since the passenger name records contain about 100 data items in addition to the three data fields used by the WM method, it is natural to ask whether (some of) these items could be exploited for forecasting purposes. The basic assumption is that the booking and passenger data in the PNRs give some indication of whether a passenger is likely to show up.
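The weighted mean computation described above can be sketched as follows. This is a minimal illustration, not the airline's production code: the record layout `(key, final_bookings, final_shows)`, the function name `weighted_mean_sur` and the sample flight key are assumptions made for the example. The key stands for the grouping used by the WM method (flight, day of week, booking class).

```python
from collections import defaultdict


def weighted_mean_sur(history):
    """Return the booking-weighted mean show-up rate (SUR) per key.

    `history` is an iterable of (key, final_bookings, final_shows) tuples,
    one per past departure. The layout is hypothetical.
    """
    weighted_sum = defaultdict(float)   # sum of bookings * SUR per key
    weight_total = defaultdict(float)   # sum of bookings per key
    for key, bookings, shows in history:
        if bookings <= 0:
            continue  # SUR is undefined without final bookings
        sur = shows / bookings          # show-up rate of this departure
        weighted_sum[key] += bookings * sur
        weight_total[key] += bookings
    return {key: weighted_sum[key] / weight_total[key] for key in weight_total}


# Hypothetical history: two past departures of the same flight/day/class.
sample_key = ("XX100", "Mon", "Y")
sample_history = [
    (sample_key, 100, 90),  # SUR = 0.90
    (sample_key, 50, 40),   # SUR = 0.80
]
predictions = weighted_mean_sur(sample_history)
# predictions[sample_key] is (90 + 40) / (100 + 50), roughly 0.867
```

Because the weights are the final bookings, the weighted mean of the per-departure SURs reduces to total shows divided by total bookings for each key; the predicted no-show rate is one minus this value.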

In section 3, the crucial issues related to collecting and managing the new data sets are described. Before the project started, the PNR data were not permanently stored: after flight departure, the passenger data were discarded and erased. A new database structure and new procedures for storing, checking and maintaining the data therefore had to be designed before PNR data could be collected, which was not without difficulties. The project is an evaluation study whose final aim is to assess whether using PNR data is worthwhile; to limit project cost and duration, the evaluation was based on sample data sets of 42 selected flights. During the first project phases, an important activity was data cleaning, which led to modifications of the initial structure of the relational database and of the keys used, and to the detection and elimination of duplicates, inconsistencies and “dirty” data items.

In section 4, the model building phase is described. After enough checked and stable data had been collected, an appropriate temporal window for training and test data was selected. Given that the show/no-show event to be predicted is binary, appropriate predictive methods were chosen, including conditional averages, linear probability models, logit models [Aldrich & Nelson 84] and neural networks [Bishop 95], [Venables & Ripley 97]. We had to face two important issues in model building and selection: 1) the number of input variables (around 100) was too high to obtain results with most of the methods above; 2) a reduction of the number of variables was also highly desirable to reduce data handling and execution time. Applying a powerful business intelligence tool based on CHAID [Brosius 96] finally resulted in the selection of a set of 12 input variables with a significant impact on the show/no-show event.
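To illustrate the kind of test that underlies CHAID-style variable screening, the sketch below computes the Pearson chi-square statistic between a categorical predictor and the binary show/no-show outcome, and ranks predictors by it. It is a deliberately simplified stand-in, not the SPSS CHAID procedure used in the project: it does not merge categories, adjust p-values or grow a tree, and raw statistics are only roughly comparable across predictors with different numbers of categories. The field names (`booking_channel`, `meal_request`, `show`) are hypothetical.

```python
from collections import Counter


def chi_square(records, predictor, target="show"):
    """Pearson chi-square statistic of the predictor/target contingency table."""
    observed = Counter((r[predictor], r[target]) for r in records)
    row_totals = Counter(r[predictor] for r in records)
    col_totals = Counter(r[target] for r in records)
    n = len(records)
    statistic = 0.0
    for row, row_n in row_totals.items():
        for col, col_n in col_totals.items():
            expected = row_n * col_n / n  # cell count under independence
            statistic += (observed[(row, col)] - expected) ** 2 / expected
    return statistic


def rank_predictors(records, predictors, target="show"):
    """Sort predictors by decreasing chi-square statistic."""
    scores = {p: chi_square(records, p, target) for p in predictors}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical PNR extract: 1 = passenger showed up, 0 = no-show.
sample_records = [
    {"booking_channel": "agency", "meal_request": "yes", "show": 1},
    {"booking_channel": "agency", "meal_request": "no", "show": 1},
    {"booking_channel": "direct", "meal_request": "yes", "show": 0},
    {"booking_channel": "direct", "meal_request": "no", "show": 0},
]
ranking = rank_predictors(sample_records, ["booking_channel", "meal_request"])
```

In this toy extract `booking_channel` separates shows from no-shows perfectly and ranks first, while `meal_request` is independent of the outcome and scores zero.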
We decided to build different forecasting models, including (a) the WM method (conditional average for each flight (leg), day of week and booking class), (b) an extended version of the WM method, called XWM, in which we added the top 3 variables according to the CHAID analysis, (c) a logit model, which is a regression model, linear in the log-odds, well suited to conditional probability estimation, and (d) a neural network model to capture the possible presence of more complex nonlinear relationships among the observed data. For (c) and (d), all 12 finally selected variables were used as predictor variables.

In section 5, our main findings are summarized: compared with the original WM method, the methods adopted reduce the average number of mispredicted no-shows (measured using the Mean Absolute Deviation) by 10-20%. Performance measurements based on the Mean Squared Error gave similar results.

In section 6, some conclusions are drawn: the preliminary results show that it is possible to successfully use PNR data and appropriate models to improve the overbooking optimization process. The critical success factors were:
1. data collection and preparation;
2. the method used for exploratory analysis and data reduction;
3. the models: complex models performed better, but simple models might be preferred for their lower computational requirements and overall cost.

Bibliography
[Aldrich & Nelson 84] Aldrich J. and Nelson F., Linear Probability, Logit and Probit Models, Sage Publications, 1984.
[Bishop 95] Bishop C., Neural Networks for Pattern Recognition, Oxford Univ. Press, 1995.
[Brosius 96] Brosius F., SPSS CHAID – Statistische Datenanalyse für Segmentierungsmodelle und Database Marketing, International Thomson Publishing, 1996.
[SIGKDD 00] ACM SIGKDD Organization: Charter page, http://www.acm.org/sigkdd/charter.html.
[Venables & Ripley 97] Venables W.N. and Ripley B.D., Modern Applied Statistics with S-Plus, Springer, NY, 1997 (NN sw lib: http://www.stats.ox.ac.uk/pub/MASS2/Software.html).
