Understanding patterns and structures in data: Analytics and Machine Learning and application to the Travel Industry

Shibaji Mukherjee, Bratati Ghosh, Kalpana Sitaraman

Overview: Analyzing consumer transactions, intent and behavior patterns has become extremely important for online commerce. This effort has taken on new meaning with the advent of web-scale data stores and multiple data channels, and it has thrown up new challenges: understanding the scenario, the technology and the science behind it, and, most importantly, making the implementation and investment decisions needed to build such capability into the web platform. The activity mixes traditional Business Intelligence (BI), large-scale data analytics, statistical modeling and Machine Learning (ML). There is significant debate in the industry about the potential of Big Data, Analytics and learning machinery to improve revenue and customer satisfaction, and analysts differ in their opinions. Gartner thinks the field has reached the peak of its hype cycle and that expectations are too high; IDC thinks it has huge potential and is poised for tremendous growth. The signals coming from the market in terms of investment, adoption and ROI are still at too early a stage to give a definite indication. One of the main reasons for this, we think, lies in the way the domain has evolved. Unlike traditional business solutions it is complex and draws heavily on the more theoretical and esoteric parts of Computer Science, with significant inputs from Statistics, Mathematics and even Theoretical Physics. It is one of the areas that has been dominated by academics and is only now making the transition to industry. However, one thing is sure: the data channels are not going to die, data production and accumulation are not going to stop, and data volume is only going to increase with time. Given that some early adopters have gained significant competitive advantage through the smart use of data to make better decisions, it is impossible to just sit on this data volume and do nothing. The ROI may not be very well defined now, but the scope and future gains are too promising to neglect. Stepping into this territory and deciding to implement and invest comes with many decision-making issues involving infrastructure, expertise, regulatory constraints and patience.

We present a comprehensive overview of the analytic apparatus, the science behind it and its applicability in industry. We hope to give the reader a sufficiently broad view of the subject, the solution frameworks used and the business impact of implementing a learning framework, with particular emphasis on recommendation engines. Our domain of focus is the Travel industry, and we use retail and social analytics as the base cases, as they are the most mature domains in adopting Analytics and Machine Learning. Our intention is not to present a scientific review, a specific case study or a formal treatment of machine learning or statistical analytics. Accordingly we present no mathematical equations or models, and we neither refer to original sources nor provide a reference list; we expect the reader to consult the relevant mathematical, technical or business case study references to go deeper into the details. The domain is very jargon-heavy, and we hope to give a gentle introduction to the terms and concepts along with an overall high-level overview of the problems, solutions and experiences in this area. Each segment of the paper can be read mostly separately and in any order. The paper is divided into two parts. In Part 1 we describe the theoretical framework, explain the terms, drawing examples from travel wherever possible, and give a general introduction to the problem and methodologies. In Part 2 we discuss the application scope in the travel industry and how machine learning and analytics are being adopted and used by different players in the domain.

Image Source: Dilbert comic strip, Scott Adams

Analytics & Big Data: The Travel Industry & Other Industries

The Travel industry is at a technology transition point. Generic travel workflow management, which is very much transaction-oriented, is robust and involves complex distributed solutions with strong reliability and consistency built in. The solutions have mostly been based on big machines, mature RDBMS products (Oracle, SQL Server) and web-based portals. User navigation and user experience have been intensely focused areas of development, as has the building of huge data warehouse solutions and BI tools churning out reports and charts. The industry has a very rich data source, a significant volume of which is mapped to user profiles. This data set has been an obvious target of information mining, and classical market basket analysis, product bundling and ranked pricing solutions have been the major focus points. That is an appropriate line of enquiry, tied directly to the business use cases of the industry. The point that has been missed is that the stored data holds more meaning, information and revenue than has been mined from it. A comparable space with huge data storage is search, retail and social media (the Google, Facebook, LinkedIn, Twitter, Amazon space). The way those companies have developed, adopted, modeled and extracted information from their data sets is years ahead of the travel industry. This is not so much a failure as an untapped opportunity, one that some smart travel analytics companies have started exploring. There are some fundamental differences between the Travel and social media domains that we think have contributed to this. The business use cases are significantly different. The travel industry is transactional and RDBMS-based, which the media companies are not: they were able to start with solutions aligned to deriving meaning out of multi-format, multi-mode, multi-source data, whereas travel has mostly been forced to convert and synchronize data representations to support transactions across multiple nodes, formats and devices. Travel data carries strong privacy and identity constraints, which are considerably diluted (and still subject to debate) in the media industry. Travel data is also very contextual in terms of location, user profile and scope, and furthermore it is time-variant. Retail, on the other hand, is similar to Travel in being transaction-based, and was mostly brick and mortar until Amazon arrived (while Travel, without the fulfillment and shipping challenge, actually operated more complex networks: global distribution systems and other online distribution mechanisms). Now almost every retail player has a mature BI, if not analytics and learning, solution in place.

This is mostly because an early market leader in the retail space drove the adoption of analytics. Retail transactions also differ a bit from travel: they mostly involve collections of average- to low-value items, with scope to squeeze more into a basket. Homogeneity is also a factor, as retail data is relatively static and depends on a smaller number of parameters. We should also mention some other industries that have been quite advanced in applying statistical methods and data mining: Telecom, Utilities and Finance (particularly equity and derivatives), but we are not going to discuss them here. Their focus areas are dominantly modeled around risk profiling, risk management, consumption pattern modeling and product bundling. So rather than looking at the scenario as a missed opportunity, we are of the opinion that it is still early going, and that the Travel industry needs to redefine the way it looks at its data; the need and relevance will then be very clear, with the added bonus of higher returns and stable predictive models. We will look into slices of the situation one at a time and break it up roughly like this: the state now, what the problem is, why it matters, and where one should look for solutions.

Scope, Problem, Solution

The State Now and the Problem

Travel is a very mature industry in the traditional sense, generating direct and indirect combined revenue of $2.0 trillion and 14.6 million jobs in the US alone in 2012. International travel to the US generated a net trade surplus of $45 billion in the US export-import account in the same year. Workflow solutions are robust and in place, technology solutions are well known, and BI is providing the maximum information that it can. Traditional workflow modes through travel agency and call center channels have in many cases been all but closed, and online portals have become the only window for interaction. Both the booking process and subsequent customer service touch-points have been systematically moved online. The online revolution in Travel has mostly resulted in the commoditization of travel inventory. Travelers typically shop for price, especially in the case of airline tickets, and jump from site to site in search of a better bargain. Stickiness is low (the average traveler still visits about 3 sites before making a booking), conversion is low (the look-to-book ratio for online channels has remained in the 2% range), profiling is vanilla (static, demographic profile data is all that is collected), push marketing is random (offers are sent in a blanket way, or with some rudimentary A/B testing, when much more intensive customization is possible), solution bundling is static (dynamic packaging technology has not had many breakthroughs), crowd sentiment feedback is decoupled (TripAdvisor and Google are seen as the keepers of review and sentiment scores, while Expedia and its ilk are the transaction facilitators), and information is scattered.

The problem exists partly because the scale of data wastage is too high. The travel seller assumes that revenue generation has a largely stable component based on static need (people will attend business meetings and family gatherings, ship goods and move residence whether feedback exists or not), and there is no major adopter in the market of an alternative, more dynamic view of travel planning. In the social media space you cannot survive if you don't mine user behavior and profiles to make customized recommendations; the same is true to some extent in retail, but not so in travel.

Why it matters: the stable segment of an industry is always a low-growth area and may one day be completely disrupted by a start-up. New revenue and growth for travel will come from niche segments, unexplored areas and untapped customers.
1. Volume-driven blockbusters are temporal and low margin.
2. Information is a publicly available and sourcable commodity; information-provider companies like Google can integrate into multiple domains without being a brick and mortar player in the domain, and they can garner market share.
3. Lastly, it is a business fallacy not to extract revenue-growth information from a source that is already lying with you (the large travel industry retailers) in databases.

Looking for a solution: we don't need to look far. The solution is in the data, in its volume, velocity and variation, and the data is mostly with the company and in integrable sources. A new strategy of working with data needs to be worked out, the start of which is to understand that things have changed. As we mentioned earlier, the data was always there, data warehouses were there, and BI has been there since the spreadsheet age. What has changed is the mode, source and frequency. The volume of data has simply become too large; it is beyond the scope of traditional statistics to analyze it. Petabyte-scale (roughly 58,292 movies, or 1,000 times the storage of a standard laptop), mixed-format, multi-source data requires a new way of manipulating data at large scale: approximate models and parallelized computing environments (lots of computers tied together as collective computing nodes). The flow and frequency of data has changed; it can be streaming data with a very limited lifetime and huge volume, where the old practice of canning it, cleaning it and mathematically sampling it no longer applies. The source of data is no longer only POS terminals or the forms clients fill in: comments, observations, tweets and recommendations have become strongly integrated into user decision making and transactions, and this new reality needs to be factored in.

What do we get by doing all this? You know your client better: you have a sense of what she may buy, what she may comment on, whose opinion she follows, how she makes decisions, what her pain points are, whether she can be a future customer, whether she will be interested in seeing a movie when she is on a business trip, whether she will read your promotional mails, whether she will delete the coupon in the mail, and a lot more. How do you know all this? Statistics at scale, Analytics, Machine Learning: that's the holy grail.

The Holy Grail

Data and Statistics

We can start by asking the question “What is data?” Naive as it may sound, this is the central and most important concept: the majority of any analytics work is getting the data and putting it in good shape, with experts claiming that 80% of analytics is cleaning and polishing the data and just 20% is the mathematics. In any problem we deal with a set of things, events or opinions (dog food, a breakfast menu, food quality, a book title, and so on); all of them need to be reduced to some number on some relative or absolute scale to be of any use, which is why you see all the id's and “on a scale of 1 to 5” questions. Data is broadly classified into two types, numeric and non-numeric; however, for the purposes of analysis, everything needs to be somehow converted into a number (absolute or relative). Mathematically we also speak of continuous and discrete data and variables; it suffices to know that this mostly refers to what sort of mathematical model applies. Once we have data we put it in some file or spreadsheet and do statistics on it. Statistics has traditionally been built on manageable sample sets with a rigorous mathematical foundation, and is strongest at exact results derived from stochastic models applied to clean (or cleaned) data sets. Collecting the data, testing its validity, extracting a representative sample and defining error limits are the fundamental problems of statistics. After getting the data in shape we calculate various measures on it, the most common being the centrality measures: the mean and the median. The mean is our good old average, and the median is the centre of the data when it is arranged in order. These two simple measures are the dominant parameters reported and calculated in data analytics. Next we calculate measures of data variation, such as the variance, standard deviation, covariance and correlation coefficients. We will see all of them widely used in Analytics and Machine Learning. They are not difficult concepts; essentially they try to quantify how spread out the data is about some reference point, or how two data sets are related to each other. They have exact deterministic formulas that any statistics book will give and any spreadsheet will calculate, given the data set. The exact formulas are of no relevance to us; it suffices to know that they are well defined, serve our purpose, and are easily calculated by tools.
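As a minimal illustration, the following Python snippet computes these measures with NumPy on a tiny invented data set (the fare and fuel-price numbers are made up purely for illustration).

```python
import numpy as np

# Invented example: daily airfare (USD) and aviation fuel price (USD/gallon)
fare = np.array([310, 295, 340, 420, 388, 305, 450, 415])
fuel = np.array([2.9, 2.8, 3.0, 3.4, 3.3, 2.9, 3.6, 3.5])

print("mean fare:   ", np.mean(fare))         # the good old average
print("median fare: ", np.median(fare))       # middle value when sorted
print("variance:    ", np.var(fare, ddof=1))  # sample variance
print("std dev:     ", np.std(fare, ddof=1))  # spread around the mean
# Correlation coefficient between fare and fuel price (ranges from -1 to +1)
print("correlation: ", np.corrcoef(fare, fuel)[0, 1])
```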

Next in increasing order of difficulty come distributions, which are extremely important for data analytics. The basic idea is that in most cases the data set we get is the outcome of some collection of random events (like coin tosses), so there is a chance of getting different values each time; that chance is called probability. Each data point therefore comes with an associated chance, and if you plot the data points together with their chances you get curves of different shapes; these curves are the distributions. The exact definition carries lots of conditions, but the crux of the matter is plotting the data arising from a repeated random event. There are various distributions, the most famous being the Normal or Gaussian distribution. Other distributions encountered in analytics papers and reports are the Log, Exponential, Binomial, Levy and a host of others. Again, the basic point is that they are mathematical models whose internals we don't need to bother about; tools can calculate and plot them. Once the data fits one of them, the properties of the data set can be read off from any book that lists the properties of that particular distribution. We will elaborate a bit on three distributions that are very common: the Normal, the Exponential and the Power Law distribution.

Normal Distribution

In the case of the Normal distribution the values are peaked about the mean and die down smoothly and quite rapidly as they move away from the mean; technically the curve only reaches zero at infinity, but for all practical purposes most of the data is contained within 3 standard deviations. The curve becomes flatter or sharper depending on the standard deviation. This is a very popular distribution, and the ideal case is when everything turns out to be normally distributed.

Exponential Distribution

The Exponential distribution is a rapidly decaying curve; it dies out quickly as values move away from the origin. It depends on a scale parameter and reaches a higher peak for a higher value of that parameter, and vice versa. This is also a pretty common distribution, and most of the logarithmic distributions are variations on it.

Power Law Distribution

The Power Law distribution follows a constant-power relationship between two parameters; it generally has a high density of points near the origin and falls off as points move away from the origin. The fall is not as steep as the exponential, and there are a significant number of points at the far end. The high-density region near the origin is often called the head (the "Short Head" in retail terminology), while the slowly decaying region far from the origin is the famous Fat Tail or Long Tail of the curve. This distribution is scale free, which means that stretching or shrinking the parameters only pushes the curve in or out and does not change its basic shape. Image Source: Wikipedia
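For a flavor of how tools handle this, the short Python sketch below draws samples from these three distributions with SciPy/NumPy and fits a normal curve; all the parameters are arbitrary and chosen only for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

normal_data = rng.normal(loc=100.0, scale=15.0, size=10_000)  # peaked around the mean
expon_data  = rng.exponential(scale=2.0, size=10_000)         # rapid decay from the origin
pareto_data = (rng.pareto(a=2.0, size=10_000) + 1) * 10.0     # power law: long, fat tail

# Fit a normal distribution to the first sample and read back its parameters
mu, sigma = stats.norm.fit(normal_data)
print(f"fitted normal: mean={mu:.1f}, std={sigma:.1f}")

# Most normal data falls within 3 standard deviations of the mean
within3 = np.mean(np.abs(normal_data - mu) < 3 * sigma)
print(f"fraction within 3 sigma: {within3:.3f}")

# The power-law sample has far more extreme values than the exponential one
print("99.9th percentile, exponential:", np.percentile(expon_data, 99.9))
print("99.9th percentile, power law:  ", np.percentile(pareto_data, 99.9))
```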

This is pretty much all that is of relevance to the domain we are discussing (don't think this is all there is to Statistics!), and Statistics is the mother discipline of Data Analytics (Big or Small) and Machine Learning. So why can't we stop here, and why do we need to bother with Machine Learning? There may not be a universally accepted answer, but the basic issue is that real-life data is not always well behaved, complete or small, and a lot of the time we need to work with the whole volume rather than a sample.

Also, the focus of data analysis may not be exact mathematical formulas but algorithms or computer programs to process the data, and most importantly, a lot of the time we don't know what we are looking for, what to calculate, or what the answer means. There is a hidden learning component to data that needs to be discovered; statistics is not exactly the subject for that, as it concentrates more on the mathematics of data than on exploring meaning in data. However, there is nothing in Machine Learning, apart from the process and formulation, that is not already in Statistics. Statistics is the foundation of Machine Learning, and their relationship is something like that between a language and a book written in that language. One useful bit of information: R is the most famous statistical package, it is free, and it works great. SAS and SPSS are commercial alternatives, and they are great tools with features going well beyond statistical calculations.

Big Data

We move on to Big Data, a term which is everywhere now. Big Data is just lots of data in all possible mixed formats (but it still needs to be converted to numbers to be of any use). How big is a relative term: in 1990 it was 1 GB, in 2000 1 TB, now perhaps a few hundred petabytes (PB), and in future maybe millions of PB. The data channels for this volume of data are obviously not a statistician collecting field data with a workbook or user forms; they are automated terminals, web logs, satellite feeds, high energy physics machines and biological sequence readers. In our domain they will mostly be web logs, social network data (tweets, opinions, blogs) and POS data. The important point to note is that the number of traceable events and event channels has increased a million-fold, and that is the major source of the data explosion; it is not that individual data terminals suddenly started working a million times faster (a debatable point, but not of relevance to us).

File or Data Size Table

Name        Symbol   Binary measurement   Decimal measurement   Number of bytes                        Equal to
kilobyte    KB       2^10                 10^3                  1,024                                  1,024 bytes
megabyte    MB       2^20                 10^6                  1,048,576                              1,024 KB
gigabyte    GB       2^30                 10^9                  1,073,741,824                          1,024 MB
terabyte    TB       2^40                 10^12                 1,099,511,627,776                      1,024 GB
petabyte    PB       2^50                 10^15                 1,125,899,906,842,624                  1,024 TB
exabyte     EB       2^60                 10^18                 1,152,921,504,606,846,976              1,024 PB
zettabyte   ZB       2^70                 10^21                 1,180,591,620,717,411,303,424          1,024 EB
yottabyte   YB       2^80                 10^24                 1,208,925,819,614,629,174,706,176      1,024 ZB

Table Source: Webopedia (^ means "to the power of")

Big Data comes with its own set of issues: where do we store the data, how do we manipulate it, and how do we integrate it? These problems take up a big section of the data analytics research and technology space. Big Data has forced us to think about data storage and manipulation in a radically new manner. Traditional data storage, at the logical level, is either a spreadsheet, a file or a database. Spreadsheets are simply not made for this and are out of consideration. Databases are in principle designed to store unlimited data, but the catch is that this comes with preconditions, cost and a lot of data logistics that are not valid or affordable in the present context. The biggest issue with databases is that they assume a well defined structure for the data, with integrity constraints and relationships, without which the data does not go in; they also favor a dense data representation in normalized form, which means a small set of attributes in each table. In the Big Data domain we usually have no structure, relation or constraint defined; that is caveat number one. In theory databases can store an unlimited amount of data, but only by spending more and more money, which is always in short supply, so a better way needs to be found. Traditional databases are also very much transaction oriented; they take great pains to provide transaction guarantees and recovery, which requires a lot of extra overhead.

Lastly, the language of interaction with databases is mostly SQL, which is a nice, easy language, but slow, limited and not easily integrated with traditional programming languages like Java, Python or C/C++/C#. The data sets for most Big Data use cases come from multiple channels, so they arrive in different formats; the data comes with a huge attribute set, and the data set is usually not at all dense. Two more very important points: data usage here is mostly non-transactional and seldom involves writes; it is mostly read-only data with no need for logs, transaction support or exclusive manipulation rights (the technical term is locks). And the bulk of the computation or programming in data analytics is done in traditional languages and requires aggressive optimization. Another surprising thing is that data loss is relatively unimportant in this domain, and a lot of data is sometimes simply thrown away, so keeping transaction logs is a non-starter since there is seldom any requirement for recovery. Another important property is that the data generally has a separable character, technically called vectorization, for which a matrix representation is very suitable. (If you don't know what a matrix is, just think of it as a spreadsheet with many rows and columns.) Matrix-based tools are very mature and matrix algorithms are a very well researched topic, so very good solutions exist. If data can be vectorized, parallel computing becomes easy, and parallel computing gives far greater throughput than sequential computing. The last item left from the classical storage paradigm is the file, and it is still there, in a different form. The file is the basic unit of logical storage for an operating system and is not going away. The problem in this space is that individual files have a capacity limit, which varies between operating systems but can be assumed to be pretty small by Big Data standards. So you cannot store all your data in one file; you have to distribute it over many files. That is fine as long as you can treat them as isolated pieces, but data has coupling, needs to stay in sync, and needs to look like one logical piece even when it is broken into many physical pieces. So Big Data requires new tools and a new science, and as expected there are new terms which are now all-pervasive: NoSQL, Hadoop, MapReduce, HDFS, GFS. All of them are some sort of computing or computing-infrastructure solution, and they mostly work together. NoSQL is the data model, Hadoop and MapReduce are the computing framework, and HDFS and GFS are the new file systems. NoSQL is the new data model and database software; the term originally meant "no use of RDBMS", but the meaning is now debated and has shifted (often read as "not only SQL") to look more correct. The problems it addresses are that transaction support is not required, data is mostly read many times with offline loading done once in a while, there is no need for logs or recovery, no requirement for a schema, data formats vary, the data set looks somewhat like a matrix, and the data representation is sparse.
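To make the vectorization and sparsity point concrete, here is a tiny sketch of a sparse user-item matrix in SciPy; the users, items and counts are invented for illustration.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Invented example: 4 users x 5 travel products, most entries empty (sparse)
# Rows = users, columns = products, values = number of bookings/clicks
rows = np.array([0, 0, 1, 2, 3, 3])
cols = np.array([0, 3, 1, 4, 0, 2])
vals = np.array([2, 1, 5, 1, 3, 1])

user_item = csr_matrix((vals, (rows, cols)), shape=(4, 5))

print(user_item.toarray())                  # dense view, only sensible for small examples
print("stored non-zeros:", user_item.nnz)   # sparse storage keeps just 6 values, not 20

# Vectorized operations work directly on the sparse form
bookings_per_user = np.asarray(user_item.sum(axis=1)).ravel()
print("activity per user:", bookings_per_user)
```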

There are various implementations of NoSQL systems, such as MongoDB, CouchDB, DynamoDB, HBase and many others. They don't solve every problem and have their positives and negatives, but the important point is that they get the work done and we need not bother about the internals. One important concept that gets mentioned in this space is the CAP theorem (it looks like maths!), which in plain terms says that in a distributed system with many nodes that can fail, you cannot have Consistency, Availability and Partition tolerance together; you have to give up one or another. In a similar spirit there is the notion of eventual consistency, which says that if you are making changes on multiple nodes and propagating the information, then once you stop making changes the nodes will, with time, mostly converge to the same state. There is not much to read into these results, because giving up strict transaction guarantees is a starting assumption here anyway. HDFS, GFS and so on are acronyms for file systems; they are specifically developed file systems that allow the system to store a lot of data in big chunks and keep track of file segments stored on various disks across various nodes, while presenting a logically uniform view. They are variations on the distributed file system, a classical operating systems concept. Hadoop and MapReduce are probably the most used jargon in Big Data, and they look pretty esoteric with their open source status and Apache/Yahoo/Google tags. Combined, they are a batch computing framework for parallelizable tasks expressed in a specific way. Hadoop is an infrastructure pattern that tells you how to couple lots of machines (generally cheap boxes are strung together, though there is no harm in stringing up expensive boxes if you can afford to) so that they can work together. It is a shared-nothing architecture: each box has its own memory, storage and channel. A master node takes care of the bookkeeping for all the boxes, and a series of data-feeding channels feeds units of computation to each box. Commodity hardware makes the setup cheap but also prone to failure, so significant redundancy is built into the overall framework: if a node dies, others kick in. MapReduce is a typical functional computation pattern that takes a data unit and a function, computes that function on that data unit and emits the result. So the overall architecture is like this: you have a huge volume of work that can be broken down into many independent units (think of spell-checking a 20-million-page document in batches of 1,000 pages each); you feed each unit of work to a box, design redundancy into the system, monitor the health of the boxes, and if one dies you replace it and re-compute. Each unit applies the same routine function to a different piece of data and emits its output; a layer collects all the outputs, does some arrangement and sorting, and produces the final result. This makes huge computation possible: parallelization gives a non-linear improvement in throughput, and huge volumes of data-crunching work get done.
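As a toy illustration of the map/shuffle/reduce pattern just described, here is a plain-Python word-count simulation; a real deployment would of course run on Hadoop or a similar framework rather than in a single script, but the shape of the computation is the same.

```python
from collections import defaultdict
from itertools import chain

documents = [
    "cheap flights to miami",
    "cheap hotels in miami",
    "flights and hotels bundle",
]

# Map: each document independently emits (word, 1) pairs; these calls could run on different boxes
def map_doc(doc):
    return [(word, 1) for word in doc.split()]

mapped = list(chain.from_iterable(map_doc(d) for d in documents))

# Shuffle: group intermediate pairs by key (the framework normally does this for us)
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce: combine the values for each key into a final result
def reduce_word(word, counts):
    return word, sum(counts)

word_counts = dict(reduce_word(w, c) for w, c in groups.items())
print(word_counts)   # e.g. {'cheap': 2, 'flights': 2, 'miami': 2, ...}
```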

Scaling is a very easy job here: just add a box, feed it a data pipeline, throw in a mapper function and link it up with the final output. No hassles, no changes (well, it is not quite as simple as it sounds!). The good part is that people have already written most of the code that needs to be written for this; there are solid distributions from Cloudera and Hortonworks, and while this is not child's play, it can be done using off-the-shelf solutions with some coding and tweaking (though you will need some external or internal expertise for implementation, planning and maintenance).

Machine Learning

We now move on to our last topic, Machine Learning, but before that we want to touch briefly on BI and Data Mining. BI is a very well established subject, both theoretically and technically; it does a fantastic job, and Teradata and Informatica are great tools. The focus of BI, however, is mostly reporting on the data set and predicting exact parameters, with an implicit assumption of a database behind it, namely a data warehouse (an aggregated, redundant, read-only database, usually very large in volume). It is not scoped for dealing with a learning scenario. Data Mining and Machine Learning, some experts will say, are the same subject, and where they overlap they probably are, but there is a significant difference in scope. Machine Learning deals extensively with image processing and robotics, which neither BI nor Data Mining have the framework to deal with; sometimes Machine Learning is described as an extension of neural networks, in which case the difference becomes obvious. In implementation terms, data mining is again generally coupled with a data warehouse and some sort of regular database schema, and mostly concentrates on association-type relationships; it is also not scoped for multiple data source integration, varied learning scenarios, or a Hadoop-type framework. However, in the quest for more market share, the tools in both the BI and Data Mining areas have morphed considerably, and the differences between BI, Mining and Machine Learning tools are now quite fuzzy. The tools in BI and Mining are more mature and tend to be complete products, whereas Machine Learning solutions are generally software libraries; still, all three can be considered different offerings from the same basket. From an implementation perspective we suggest looking into the domain, business use case, data volume, data sources, application integration requirements, infrastructure capability and expert advice to make a choice. Machine Learning itself is a fairly mature branch of theoretical Computer Science, with some easy and some esoteric material in it.

There seems to be no common agreement about exactly what Machine Learning is; rather than worrying about the definition, it is sufficient to know that it has the bag of tools, sound theory and proven results to mine meaning out of mixed-format, multi-source, high-volume, high-frequency data sets. The industrial use cases for Machine Learning are finding patterns, structure and correlations, behavior profiling, recommendation engines, user grouping, segmentation and clustering of data sources, typically over very large data volumes using a Hadoop/MapReduce framework. Sometimes the name Predictive Analytics is used in place of Machine Learning; they essentially mean the same thing. Machine Learning is also used extensively for fraud detection and anomaly detection, though travel and retail generally may not use those features much.

Machine Learning (figure). Image Source: Max Planck Research Group for Machine Learning, University of Hamburg

Cases and Workflows for Machine Learning

We will explain Machine Learning through the use cases it tries to solve. Let's look at some workflow diagrams and then discuss the individual cases in detail.

Case 1 (predicting airfare): This would be a very straightforward problem if you had an exact equation for airfare in terms of known variables, like a height-to-weight ratio; you would just plug in the variables (assuming you knew them) and get the fare. The problem is that in most cases you will not know the variables, nor will you have any equation; what you will have is a historical data set of airfares along with the time of year, say, the aviation fuel price, and some seasonal information. The problem at hand is to make this data useful, and this is called a regression problem in Machine Learning. Regression belongs to a class of problems called supervised learning. The basic idea is to take the data set, treat a portion of it as ground truth and fit it to some relationship model. We then take most of the remaining data (keeping aside a portion for validation), put it into the model and note the predicted value of the parameter of interest (the fare in this case). The prediction will not exactly match the observed value, so we choose a measure of deviation, generally the square of the deviation from the true value, and keep adjusting the free parameters in the model to get new sets of predictions. We repeat the error calculation until it reaches an acceptable limit, and then validate the model on the validation set. If everything is within acceptable limits, we have a model that predicts the unknown given a set of parameters. We may find, for example, that the fare depends on the second power of the aviation fuel cost and on the difference of the flight time from peak hours; if so, given the fuel price and the flight time, a value of the fare can be predicted within the error limit.

This problem can also be solved graphically by fitting a curve to the data set, which is called a best-fit curve. The ideal case is single-variable linear regression, where the mathematics and computation are simple; most of the time that is not achievable, and multi-variable regression is the norm.
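A minimal sketch of this supervised regression workflow, using scikit-learn with invented fare and fuel-price numbers:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Invented historical data: [fuel price USD/gallon, hours from peak time] -> fare USD
X = np.array([[2.8, 0], [2.9, 1], [3.0, 3], [3.2, 0], [3.3, 2],
              [3.4, 5], [3.5, 1], [3.6, 4], [3.7, 0], [3.8, 2]])
y = np.array([420, 395, 330, 470, 410, 300, 480, 350, 520, 455])

# Hold out a validation set, fit on the rest, then check the squared-error measure
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)
model = LinearRegression().fit(X_train, y_train)

predictions = model.predict(X_val)
print("validation MSE:", mean_squared_error(y_val, predictions))
print("predicted fare for fuel=3.1, 2h off peak:", model.predict([[3.1, 2]])[0])
```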

Case 2 (classifying junk mail): Once we understand Case 1, Case 2 becomes simpler; we could just as well think of having an equation that predicts junk/not junk. The difficulty is, how do you quantify a mail? So the problem shifts to identifying characteristics and labeling content as junk or not. A simple example: if the content has the phrase 'zero price', then it's junk. A set of words and phrases like this is collected, and the training mail set is labeled junk/not junk accordingly. The machine parses a large set of content, optimizes the choice of words and phrases, and comes up with a model. Mails from the test set are then fed to the system to be labeled; depending on the number of correct and wrong classifications, the model is tweaked and a final optimal set is generated. The model is now ready to classify: given an unknown mail, it will label it and accordingly send it to the inbox or the junk folder. (This is how your web mail classifies your mail, though it looks at many other parameters such as size and sender address.) This is called a classification problem, and it belongs to the class of supervised learning problems. So if we want our mail to reach the customer, we have to be careful about its content and metadata.

One interesting case is what happens if the user has set a filter like 'if the mail is from [email protected] it's junk'; this condition overrides all learning and the problem becomes deterministic. This is one example where applying machine learning is not suitable (learning could be applied trivially to exact-match the mail address, but that would be the wrong application).
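A small sketch of such a supervised text classifier using scikit-learn (Naive Bayes is one common choice; production mail filters use far more signals). The tiny training set is invented.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Invented training mails and their labels (1 = junk, 0 = not junk)
mails = [
    "zero price holiday offer click now",
    "win a free cruise zero price",
    "your flight itinerary for next monday",
    "invoice for your recent hotel booking",
]
labels = [1, 1, 0, 0]

# Turn each mail into a vector of word counts, then fit the classifier
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(mails)
clf = MultinomialNB().fit(X, labels)

# Classify an unseen mail
new_mail = ["zero price tickets just for you"]
print("junk" if clf.predict(vectorizer.transform(new_mail))[0] == 1 else "not junk")
```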

Case 3 (learning from the browsing session): This is a case of peeping over the surfer's shoulder: content-based learning in a reinforcement or online learning scenario. We have to assume that a model program receives feedback, or is able to read what the surfer is browsing; the system then learns from the environment or the event and picks matching items or content to push to the user. This is not an easy scenario in an uncontrolled environment.
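One simple way to picture this kind of online learning is a bandit-style loop that keeps adjusting which content to push based on observed clicks. The sketch below is a generic epsilon-greedy illustration with made-up click rates, not a description of any specific production system.

```python
import random

random.seed(0)

# Three candidate content items to push; hidden true click-through rates (unknown to the learner)
true_ctr = {"hotel_ad": 0.05, "show_tickets": 0.12, "car_rental": 0.08}

counts = {item: 0 for item in true_ctr}   # how often each item was shown
rewards = {item: 0 for item in true_ctr}  # how often it was clicked
epsilon = 0.1                             # fraction of the time we explore at random

for _ in range(5000):
    if random.random() < epsilon:
        item = random.choice(list(true_ctr))  # explore a random item
    else:
        # exploit: pick the item with the best observed click rate so far
        item = max(counts, key=lambda i: rewards[i] / counts[i] if counts[i] else 0.0)
    counts[item] += 1
    rewards[item] += 1 if random.random() < true_ctr[item] else 0  # simulated user feedback

best = max(counts, key=counts.get)
print("item pushed most often:", best)   # should usually converge to show_tickets
```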

Case 4 (grouping items without labels): This is the classic scenario where the system has a lot of data but does not know which parameters are relevant, what exactly it is trying to learn, or whether there is any pattern in the data at all. The approach is to assume there is some regularity, and to try to aggregate useful information and classify it. The difference from Case 2 is that there are no labels: the system does not know of any way to classify the news stories (or whatever the items are) in advance. This class of problem is called unsupervised learning. The system takes the data set, assumes some centrality measure and groups items by their nearness or sameness with respect to that measure. The choice of measure and the sameness computation are repeatedly refined, so that over multiple iterations the data set tends to divide into groups with learned labels. For travel packages the groups might turn out to be 'bad', 'good', 'excellent' and 'value for money'; for a food menu they might be 'spicy', 'hot' and 'continental'. Note that if the stories or menu items come with predetermined labels, the problem becomes a supervised classification problem; here the system learns the labels from the data as well.
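A minimal unsupervised clustering sketch with scikit-learn: short item descriptions are vectorized and grouped by k-means into two clusters. The descriptions and the choice of two clusters are invented for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Invented item descriptions with no labels attached
items = [
    "budget beach package shared dorm street food",
    "luxury beach resort spa fine dining",
    "budget city hostel walking tour street food",
    "luxury city hotel suite fine dining spa",
]

# Vectorize the text, then let k-means discover two groups on its own
X = TfidfVectorizer().fit_transform(items)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

for item, cluster in zip(items, km.labels_):
    print(cluster, item)   # budget-style items should land in one cluster, luxury in the other
```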

Case 5 (recommending items): This is the hottest case in Travel, Retail and Social Networks. The use cases are cross-selling, enhancing the shopping cart, pushing items of likely interest, and, in modified scenarios, product bundling, advertising, promotional sales and discounts. This is a separate class of problem, built on the assumption that variables are related and that the relationship can be measured through various similarity measures. A crucial factor is knowing the profile of the customer beforehand (recommendation works even for anonymous profiles, but it is not as predictive). The idea is that a customer is a vector, a collection of properties (remember, they are somehow numbers), and so is an item. Since everything is a number over an approximately common parameter set, any two quantities can be compared, and the comparison says how close they are. If it has been observed that customers who buy travel suitcases also buy umbrellas, then just flash an offer for an umbrella. The more the system knows about the customer, the more accurate the prediction becomes; the interface can even be dynamically customized for a customer, and a customer with a high gullibility quotient can also be sold junk. The mathematical machinery behind this is quite involved and still evolving; it is part art, part science and a bit of black magic, and almost all commercial models in this domain are proprietary and unpublished.
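To make the "everything is a vector" idea concrete, here is a tiny cosine similarity sketch; the attribute encodings are invented for illustration.

```python
import numpy as np

def cosine_similarity(a, b):
    # 1.0 means identical direction, 0.0 means nothing in common
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Invented attribute vectors: [business trips, leisure trips, avg spend level, travels with family]
customer_x = np.array([8, 1, 3, 0])
customer_y = np.array([7, 2, 3, 0])
customer_z = np.array([0, 9, 1, 1])

print("x vs y:", round(cosine_similarity(customer_x, customer_y), 3))  # high: similar travelers
print("x vs z:", round(cosine_similarity(customer_x, customer_z), 3))  # low: different profiles
```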

Case 6 (a deterministic lookup): This is not a Machine Learning problem, and we mention it here to show the difference. The client is buying an explicitly chosen, known product, and the system is just pushing content about that product; it can just as well be a simple SQL query with a condition (a WHERE clause) for a Jamaica holiday package. You can forcibly make it a machine learning case if you are in love with the subject.

So, in summary, Machine Learning is the subject of constructing learning algorithms and applying them to data sets to derive non-explicit knowledge from the data. Technically there are classes of problems called Supervised (we include regression here as a subclass), Unsupervised and Reinforcement Learning. Supervised learning is learning with a teacher (the training set), who already has the knowledge. Unsupervised learning is learning by oneself without any teacher, meaning the knowledge is not explicit and needs to be constructed. Reinforcement learning sits in between: it is collective learning by the environment and the system, a setting common in Game Theory and Economics (various players collectively work towards an optimal state). This completes our survey of Machine Learning in the context relevant to our use case of interest. It is just the tip of the iceberg; Machine Learning has a huge theoretical and applied content that we have not even mentioned. In particular, we have not discussed the algorithms and models (Gradient Descent, Logistic Regression, K-Means, Support Vector Machines, Kernels, PCA, Hidden Markov Models, Boltzmann Machines), which we believe are best left to a mathematical exposition and require a significant mathematical background to understand.
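Purely to give a flavor of the simplest of these, here is gradient descent fitting a least-squares line to made-up data; a proper treatment belongs in the mathematical references.

```python
import numpy as np

# Made-up one-variable data: x = fuel price, y = fare; fit y ~ w * x + b by gradient descent
x = np.array([2.8, 3.0, 3.2, 3.4, 3.6, 3.8])
y = np.array([400, 430, 455, 480, 505, 535])

w, b = 0.0, 0.0
learning_rate = 0.01

for _ in range(20000):
    pred = w * x + b
    error = pred - y
    # Gradients of the mean squared error with respect to w and b
    grad_w = 2 * np.mean(error * x)
    grad_b = 2 * np.mean(error)
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

print(f"fitted slope w={w:.1f}, intercept b={b:.1f}")  # roughly the least-squares line
```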

Recommender Systems

“If I have 2 million customers on the Web, I should have 2 million stores on the Web.” (Jeff Bezos)

What is it and what's the impact:

Recommender systems are a type of learning system that takes in a data set about items and users, learns a relationship between the two sets, and predicts a pattern of coupled relationships, behavior and development over time. They are the most important learning engines applied in data analytics, with a measurable, direct impact on sales, product marketing and user retention. Online shoppers are now accustomed to getting a plethora of personalized suggestions: Amazon suggests books to buy, Netflix suggests videos to watch, Facebook suggests friends to connect to, LinkedIn suggests jobs to apply for, groups to join and professionals to connect with, TiVo records programs on its own for later viewing, and Pandora builds personalized music streams by predicting what the user wants to listen to. Recommenders have actually moved well beyond this and are nowadays surprisingly sophisticated. They are customizing and dictating the user experience, almost reaching the stage where a robot holds your hand and shows you what to buy, what to see, where to go, whom to date, whom to connect with and which mails to read; in short, any transaction or interaction of an individual is controlled, defined, monitored and archived. The individual web is the last frontier left for recommender systems to take over completely. Recommender systems have long been a topic of theoretical interest in academia, and the first prototypes came out of industry labs and universities in the early 1990s. The earliest is probably Tapestry, a system developed at Xerox PARC which mainly recommended emails to users. The GroupLens research lab at the University of Minnesota has been a large research group working on these systems for a long time. Pattie Maes at MIT was one of the early researchers in this field, and she also founded a company, Firefly, to analyze and predict user actions. Paul Resnick at the University of Michigan is another early researcher and pioneer. This is now a vibrant domain of academic research: ACM has run a dedicated yearly conference on recommender systems, RecSys, since 2007. Industry has also been a very important participant, and all the important players (Google, Amazon, Facebook, LinkedIn, Yahoo, Twitter, eBay, Netflix) have research labs or research groups dedicated to this effort. There is a vast literature on the subject and various commercial implementations; it is a complex subject both theoretically and technically. We will give a brief overview of the domain so the reader gets a flavor of what is behind the ecosystem that suggests products to buy, look up or consider, or specific people to connect with, while users are browsing, buying or checking friends and connections. The crux of the matter is that recommender systems have become important decision aids in both the electronic and the brick and mortar marketplace, and an integral part of the business models of many firms.

Interestingly, this has a parallel with the content delivery and search engine domain. In search, once the data is dressed up and indexed properly, fetching the million matching pages is not a problem; at the base level it reduces to a string matching problem. The problem is which results to show first and which are most relevant; this is the famous ranking problem, and the engine with the better algorithm wins. Similarly, in our case, getting the product (say, '3 star hotels in Miami') from the inventory once its description is known is not a problem; the problem is which one to show first, which one the user may prefer, and what additional things the user may want, and that is the job of the recommendation engine. Unlike search, this has the added difficulty of depending on user preference, taste and choice, all of which are very hard to quantify. So a recommender system is basically an information-filtering technology for determining the items that most closely match a client's preferences and choices within a category of product. These systems learn about user behavior through historical transaction records, profile data, shopping cart content and product browsing patterns, and build a model of a user in terms of certain attributes. The system then compares users and tries to predict the preferences of one user on the basis of other users, incorporating further feedback in the form of item ratings. One thing that needs to be clear is that this is mostly heuristic; there is no exact mathematical model behind it (if there were, it would no longer be a machine learning problem but an algebraic equation in many variables). It is argued that recommender systems help businesses in the following ways:

- Customers spend less time searching for products
- Customer satisfaction is increased
- Customer loyalty is increased
- Cross-selling is increased
- Browsers turn into buyers

Various case studies have been done on the use and adoption of recommender systems as part of the business workflow, and the numbers are pretty encouraging. We will mention some of the results reported in the literature, with the caveat that the majority of the studies were done in academic departments and hence had some control conditions built in. Even where exact numbers are not known, the continued success of Amazon, Netflix, Facebook and others is proof enough that in real scenarios recommender systems impact business significantly. A study of online Korean consumers found that online shoppers put more trust in information from comments than in information from other sources. Subjects who interacted with some form of recommendation system reported more positive shopping outcomes than those who did not. Analyst Jack Aaronson of the Aaronson Group estimated that investments in recommenders bring returns of 10% to 30%, due to the increased sales they drive. An increase of 50% to 60% in cross sales has been reported for various systems studied. Studies have also claimed that a 5% improvement in customer retention could increase profits by somewhere between 25% and 85%. An empirical study of music recommendation systems showed that receiving suggestions tailored to individual listeners widened exposure to new products. Netflix has reported that more than 60% of its rentals stem from recommendations, while 35% of Amazon's sales originate from systems that suggest products an individual consumer might like.

In a controlled case study on iTunes done at Wharton, the authors saw the volume of purchases increase 50% more than expected. They also found an interesting effect of users coming closer together in terms of taste and choice: plotting relationships between thousands of users and millions of songs, they found a 23% increase in the percentage of listeners with an artist in common compared to the control group. All kinds of users, close as well as far apart, became closer to one another on these networks in the treated group relative to the control group; the group that received recommendations showed more user pairs becoming closer (36%), while fewer pairs (9.2%) moved farther apart. Long tail or niche items have become another very widely discussed recommender engine success story. The idea here is that items and sales may not follow a normal or an exponential curve (where only a few popular things sell a lot and the rest are negligible) but rather a power law distribution. The term long tail in our context means a retailing strategy of selling a large number of unique items in relatively small quantities each, usually in addition to selling fewer popular items in large quantities (blockbuster items, limited in time). The long tail was popularized by Chris Anderson in an October 2004 Wired magazine article, in which he mentioned Amazon.com, Apple and Yahoo as examples of businesses applying this strategy. He explains it as follows in his blog: "Traditional retail economics dictate that stores only stock the likely hits, because shelf space is expensive. But online retailers (from Amazon to iTunes) can stock virtually everything, and the number of available niche products outnumber the hits by several orders of magnitude. Those millions of niches are the Long Tail, which had been largely neglected until recently in favor of the Short Head of hits. When consumers are offered infinite choice, the true shape of demand is revealed. And it turns out to be less hit-centric than we thought. People gravitate towards niches because they satisfy narrow interests better, and in one aspect of our life or another we all have some narrow interest (whether we think of it that way or not)." It has been claimed in various studies that long tail sales account for almost 30% of the revenue of online retailers.

How it works:

The working of a recommender system is quite a black box, though the overall architecture and workflow are well known. There are two levels at which to approach such a system: one is the ecosystem as a whole, where the recommendation engine is one component; the other is the internal working of the recommender itself, the algorithms that drive the model analytics and predict the outcome. At the workflow level, websites generally incorporate recommendation systems through directed suggestions, the most popular being "Customers who bought this item also bought…", "Recommendations based on your browsing history", "Jobs you may be interested in", "People you may know" and "What other customers are looking at now". The phrases are self-explanatory, and the purpose is to guide a user's item choices, influence the shopping cart, or connect similar people and groups for greater exchange and focused group formation.

All these activities turn into direct or indirect revenue in the form of sales or focused advertisement. We will now discuss the internals of the engine, that is, how it measures similar users or items. To an engine a user or an item is a variable, and hence one number or a set of numbers, preferably a collection of numbers called a vector or simply a sequence. There is another representation, the user/item pair, which ranks items per specific user. This is a matrix whose size depends on the number of items and users; it can be very large (think 1 million users and half a million items) and will generally be very sparse, since a single user rates only a handful of items. The target here is twofold: to find similarity between items or users, and to predict the ratings a user would give to items she has not already rated. There are definite methodologies for getting these measures (which we will discuss), but the essence is that once we have the numbers and set up some scale of measurement, we can say that user X is like user Y, or that item A is like item B; if user X has bought item C, then user Y may like item C, and similarly, since items A and B are alike, a user buying item A may buy item B. The engine thus learns which item to suggest to whom and what the push information and targeted advertisement should be. So if you book a business class flight ticket, you will be shown an offer from a five star hotel at the destination; similarly, a customer in her teens or twenties will be suggested the hottest music album. How does the system quantify users or items? That is an art and a matter of heuristics. The general idea is that any item or user has a set of inherent and contextual attributes, all of which can at least be classified as 0 or 1, and if possible on a larger scale such as 1 to 3; assign those numbers to the specific attributes and an item or a user becomes a string of numbers. Mathematics and statistics already have definitions of similarity in place, called variously a distance function (sometimes a metric), a correlation coefficient or a cosine similarity. The recommender system picks suitable measures and calculates the similarity factor. The profiling is usually an offline process and the prediction an online process, and in real situations various pre-computed values and optimizations are applied. The important point to note is that there are a lot of heuristics here, such as encoding parameters numerically and comparing similarity measures on a scale, and that is why a recommender engine is so dependent on the domain and the specific model. That takes care of similar variables; what about the huge matrix we have? Generally a method from linear algebra called SVD, or singular value decomposition, is used. It suffices to know that this involves breaking the matrix up into three matrices and choosing a small set of values rather than the whole large matrix; the idea is that noise in the matrix is removed by this technique, so predictions become more accurate. The decomposition can be used to compute similarities between users; this is an offline process, while the actual prediction of items is the online process. What we have just discussed is referred to in the literature as collaborative filtering; we define the basic terms so they map to each other.
Collaborative filtering is a method of making automatic predictions about the interests of a user by collecting preference or taste information from many users. The collection of user profiles, item profiles and user-item preferences is the collaboration part, and filtering the items through mathematical measures defined on the quantified data is the filtering part.

Collaborative filtering is broadly of two classes, memory based and model based, and in practice a mixed model is sometimes used. Memory based engines generally collect user ratings, which are the ranking variables, and then calculate similarity between users based on their quantified attributes; the system then uses this information to predict item-preference similarity on the basis of the user similarity measure. This is generally an offline process, and the system keeps building up the data repository as more items and users get added and more transaction data becomes available. The most popular measures used here are the vector cosine similarity and the Pearson correlation coefficient; they are well understood mathematical measures and tools can easily calculate them. In the model based case the idea is to learn patterns in the data using machine learning methods: the system mines the data, learns the patterns and fits a model involving the dependent and independent variables (user/item attributes and preference decisions). There are sophisticated learning algorithms for this, such as Markov Chains, Bayesian Networks, Latent Semantic models and SVD. In either case the target is an approximate relationship between the determining variables and the decision variable; memory based engines can be thought of as item-wise calculators, while model based systems try to give a generic representation.

We will now discuss a general web system that includes a recommender engine, basing our discussion on the simplified block diagram in the figure below. In a real-life engine there would be a separate processing block between the offline and online modules, but we neglect that for simplicity. The workflow case is a user trying to buy a round trip flight ticket from New York to California. We assume the user has no predetermined itinerary and will mostly look for suggestions; in the case of a predetermined choice, the information may be delivered directly from some content store (generally the VAS in the diagram below). Apart from the primary requirement of closing the definite sale, the system will treat this case as a potential cross-sell opportunity. The starting point is that the user needs to be logged in (we will not consider the case where the user does not have an account; that is a valid case, but it makes the use case more generic). A logged-in user has a profile set which tells us about the user (business or leisure traveler, family, past booking summary). So the user has a profile stored in DB_Pr, which can be an Oracle store or a NoSQL store. We have profile information and a transaction intent (book a round trip ticket), and we push that information to the recommender engine. The recommender engine does some basic parsing and tries to find out parameters like the planned duration of stay, the type of booking this user generally makes (DB_Tr), whether he is going alone or with family, and similar domain-specific information.
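Returning briefly to the matrix factorization step mentioned above, it can be pictured with a small sketch: a truncated SVD applied to a toy user-item rating matrix with invented ratings.

```python
import numpy as np

# Toy user-item rating matrix (rows = users, columns = items, 0 = not rated); values invented
R = np.array([
    [5, 4, 0, 1, 0],
    [4, 5, 0, 0, 1],
    [0, 1, 5, 4, 0],
    [1, 0, 4, 5, 0],
], dtype=float)

# SVD breaks the matrix into three pieces: R = U * diag(s) * Vt
U, s, Vt = np.linalg.svd(R, full_matrices=False)

# Keep only the top k singular values (the "chosen set of values"), discarding the noisy rest
k = 2
R_denoised = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# The reconstructed matrix now holds non-zero scores even where users gave no rating;
# those filled-in scores can be read as predicted preferences.
print(np.round(R_denoised, 2))
```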

A basic block diagram of a Web based system using a Recommendation Engine. DB_Tr -- Transaction Database, DB_Pr -- Profile Database, ML -- Machine Learning, VAS -- Value added service provider (Hotel Aggregator/GDS/Media Library/Cruise Line)

ML is the workhorse of this engine; it supplies the algorithms which the engine cranks through. The engine will seek further data from the VAS, which provides hotels, car rentals, happenings around the destination, places to see, places of interest, ticket bookings and more, depending on the user profile and transaction history. The engine will also pull information from opinion sites like Twitter and Facebook about events happening in that location during the travel window, what others are saying about different hotels, what the general travel experience is like, and other relevant comments and tweets. The general architecture for getting data from sites like Facebook is to call their open APIs. These are just programming interfaces; you can think of them as functions which the sites provide to access data from their data stores. You cannot access anything and everything: it is controlled access and requires an account, token authentication and sometimes a specific communication protocol. In the case of Facebook it is the Facebook Graph API and for Twitter it is a set of REST APIs; we can safely consider them to be a kind of data service function over the net.
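As a hedged sketch of what such a call looks like, the snippet below uses the Python requests library against the Graph API's /me endpoint. The access token is a placeholder that would come from the platform's OAuth flow, and the exact fields, API version and rate limits are governed by the provider's current documentation rather than by anything shown here.

```python
import requests

# Placeholder token obtained via the provider's OAuth flow; never hard-code real tokens.
ACCESS_TOKEN = "YOUR_ACCESS_TOKEN"

# Ask the Graph API for basic fields about the authenticated user.
# The endpoint path and field names follow the public Graph API conventions,
# but versioning and available fields change over time.
resp = requests.get(
    "https://graph.facebook.com/me",
    params={"fields": "id,name", "access_token": ACCESS_TOKEN},
    timeout=10,
)
resp.raise_for_status()
profile = resp.json()
print(profile.get("name"), profile.get("id"))
```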

All this information is fed to the ML algorithms, and the recommender engine ultimately comes up with a list of suggestions and pointers to relevant ads. The system pushes those ads and pieces of information to the user screen, which dynamically repaints to present a contextual interface: the personal web store, the ultimate target. The hope is that this will influence the user to buy additional items, plan a more organized trip, leave happy about the site, and also leave a huge trail of information for the system to capture and store for future profiling. The system can send an information package along the lines of: this will be the best place for you to stay, you will enjoy going to the following events during your stay, retail chain X has planned a huge discount sale there while you are in town, you have the following contacts in that place, maybe you want to inform them and meet up. The possibilities are endless; they depend on how much more you can learn and store and how much the algorithms can crank through in a limited time frame (no user is going to wait more than a couple of minutes for suggestions). We need to remember that there is also a parallel data flow of information about specifically chosen items and events; for example, you have to send the user a list of tickets from New York to California arranged from lowest to highest price, irrespective of any recommendation. We stress again that there is a mix of deterministic and predictive information flow in any booking scenario. The goals of the two are ultimately the same, namely closing the deal and increasing the deal size, but the type of information supplied by each flow is different. We can take the recommendations further and keep sending promotional mails, event updates and weather updates to the user during the time frame of the visit. There are infinite possibilities, and the ideal case is a customized transaction page, a customized agent, a customized advisor who follows you faithfully through your search, decision making, purchase and consumption. You can also consider it a monitor sitting over your shoulder if you like!
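To illustrate the two parallel flows described above, here is a minimal sketch that merges a deterministic result set (fares sorted by price) with a separate list of predictive suggestions into one response payload. The data, the build_response helper and the field names are all invented for illustration.

```python
# Hypothetical fare results (deterministic flow) and engine suggestions (predictive flow).
fares = [
    {"carrier": "AA", "price": 412.00},
    {"carrier": "UA", "price": 389.00},
    {"carrier": "DL", "price": 455.00},
]
suggestions = [
    {"type": "hotel", "name": "Example Bay Hotel", "score": 0.91},
    {"type": "event", "name": "Jazz festival during your stay", "score": 0.84},
]

def build_response(fares, suggestions, max_suggestions=3):
    """Deterministic results are always returned, sorted by price;
    recommendations ride alongside them and may legitimately be empty."""
    return {
        "results": sorted(fares, key=lambda f: f["price"]),
        "recommendations": sorted(
            suggestions, key=lambda s: s["score"], reverse=True
        )[:max_suggestions],
    }

print(build_response(fares, suggestions))
```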

Issues: Recommender systems may seem all-powerful from the previous discussion, but they have their issues. We list some of the major ones below.

Privacy: Recommender systems need profile data, and need to read over the user's shoulder, to make

useful predictions. The system needs to run an agent on the user's machine to read user behavior, or set cookies in the browser to track the session. This is not always possible and can lead to privacy violations. The best case is a user who always signs into the site, so that the user can be linked to a history and is uniquely mapped through an identifier. This is not always the case, though most transactional systems will force the user to have an account and sign in before making any purchase. Even when the user is logged in, that does not entitle an application to record every user event, though users may or may not care. The bottom line is that a recommender system needs true data to learn and predict, so this is a circular situation. The definition of privacy also varies from country to country, and a recommender system needs to adjust accordingly.

Real-life data sets: If we read papers on recommender systems, one thing strikes us: most of

the results are for movie ratings and video rentals. The reason is not that recommender system researchers are particularly movie-loving, but that there are only two widely used, publicly available real-life data sets: MovieLens, a movie data set with 6,040 users and 3,900 movies, and the Netflix data set, with 480,189 users and 17,770 movies. Both have been extensively researched and are good data sources. There are quite a number of other data sets, but most of them are products of controlled academic studies and may or may not be publicly available. The golden data sets are the private retailer data sets (Amazon and others), which are not publicly available, even though company insiders may have full or selective access to them.
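Since MovieLens is mentioned above as one of the two workhorse data sets, here is a small sketch of loading its ratings with pandas. It assumes a local copy of the file and the ratings.dat layout documented for the 1M release (UserID::MovieID::Rating::Timestamp, "::"-separated); the path and separator would need adjusting for other releases.

```python
import pandas as pd

# Assumes the MovieLens 1M ratings file has been downloaded locally and follows
# the "UserID::MovieID::Rating::Timestamp" layout documented for that release.
ratings = pd.read_csv(
    "ml-1m/ratings.dat",
    sep="::",
    engine="python",               # needed because "::" is a multi-character separator
    names=["user_id", "movie_id", "rating", "timestamp"],
)

print(ratings.shape)                               # roughly one million rows
print(ratings["rating"].describe())                # 1-5 star distribution
print(ratings.groupby("user_id").size().mean())    # average number of ratings per user
```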

Other large publicly available data sets include particle physics experiment data, biological sequence data, weather data, OCR data, speech recognition data and census data. They are all very large, but they do not help much in modeling what a user will prefer to buy or where they will prefer to go on vacation. So verifying a general machine learning model is not a problem, but domain specific model testing is always a problem for the general research community.

Data assumption: A common underlying assumption in most of the recommender systems

literature is that consumers have preferences for products and services that are developed independently of the recommendation system. It is therefore assumed that user-reported ratings can be trusted, and most research directly uses user-reported ratings as true and authentic user preferences without evaluating whether the submitted ratings represent users' true preferences. However, researchers in behavioral decision making, behavioral economics and applied psychology have found that people's preferences are often influenced by elements of the environment in which those preferences are constructed. The ratings may therefore be biased, something most studies ignore.

Selective view: Some researchers are of the opinion that recommenders only reinforce the

popularity of already popular products, which is called the blockbuster effect. Selective view is another reported issue, where a vendor might incorporate the profitability of items into its recommendations. A naive approach is to give the most profitable items the highest recommendations; these items would then presumably be bought more often and the business would make more money. The user is always presented a controlled view, and the interface is profiled according to the user's attributes, which some researchers complain amounts to a balkanization of the user experience. Similarly, a business trying to clear overstocked items may rank them highly and present them repeatedly to influence buying behavior. These are some of the ethical and business issues that concern recommender systems.
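As a toy illustration of the concern above, the following sketch re-ranks items by blending the engine's relevance score with the vendor's margin. The items, scores, margins and blending weight are all invented; the point is only to show how easily a business objective can reshape what the user sees.

```python
# Hypothetical engine output: a relevance score per item, plus the vendor's margin.
items = [
    {"name": "Hotel A", "relevance": 0.92, "margin": 0.05},
    {"name": "Hotel B", "relevance": 0.74, "margin": 0.30},
    {"name": "Hotel C", "relevance": 0.81, "margin": 0.15},
]

def rerank(items, profit_weight=0.5):
    """Blend relevance with profitability; profit_weight=0 keeps the pure ranking."""
    return sorted(
        items,
        key=lambda it: (1 - profit_weight) * it["relevance"] + profit_weight * it["margin"],
        reverse=True,
    )

# With a heavy profit weight the most relevant hotel drops to the bottom of the list.
for it in rerank(items, profit_weight=0.6):
    print(it["name"])
```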

Cold start problem: This is a famous problem which is very simple to state. A new item added to the catalog has never been bought before, so its rating starts at zero; hence it never shows up in the recommendation list and is never bought or listed. In a circular manner the item never gets rated and so never comes up in the list. The same applies to a new user who has no relevant transaction history and hence never gets a recommendation. Strictly speaking this is an unsolved problem, but it is handled through heuristics. A recommendation engine and its rankings are always manually tweaked to fit the domain and business requirements, so nothing actually starts at (0, 0). Also, the recommender engine is not the only transaction workflow: users transact without any recommendation simply by knowing the item or reading some generic content description. So although this is an unsolved issue, it is unsolved more in theory than in practice.
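One common family of heuristics is to back off to a default prior or to popularity until enough interactions accumulate. The sketch below is one such hand-rolled fallback, with made-up thresholds and catalog data, not a standard algorithm from any particular library.

```python
# Made-up catalog: interaction counts and average ratings per item.
catalog = {
    "city_tour":    {"num_ratings": 120, "avg_rating": 4.3},
    "wine_tasting": {"num_ratings": 45,  "avg_rating": 4.7},
    "new_kayaking": {"num_ratings": 0,   "avg_rating": 0.0},   # cold item
}

MIN_RATINGS = 10      # below this we do not trust the item's own average
PRIOR_SCORE = 3.5     # editorial/default prior assigned to cold items

def item_score(item):
    """Use the observed average once there is enough data; otherwise fall back
    to a prior so that new items still surface in recommendation lists."""
    if item["num_ratings"] >= MIN_RATINGS:
        return item["avg_rating"]
    return PRIOR_SCORE

ranked = sorted(catalog.items(), key=lambda kv: item_score(kv[1]), reverse=True)
for name, item in ranked:
    print(name, round(item_score(item), 2))
```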

Summary: We have given a contextual overview of the general framework of Analytics, Machine Learning and Recommender Systems in this article. In the next part we will look at various travel portals to see how these concepts are playing an important role in deriving more value out of the business, and at the readiness and adoption scenario in the domain.

References
1. US Travel Industry Data (http://www.ustravel.org/sites/default/files/page/2009/11/US_Travel_Answer_Sheet_March_2013.pdf)
2. Instead of listing a set of research papers or monographs, we mention what someone needs to know to go deeper into this area. If the focus is mainly business decisions and market share, it is best to look up reports from Gartner, IDC, Forbes and other analysts. The travel industry hosts various seminars, generally with parallel focus segments, and their reports are a good place to look for recent developments. Popular books, blogs and websites on Big Data and Neural Networks give a good overview of Machine Learning and Data Analytics.
3. If the focus is serious Machine Learning or Data Analytics, it is best to look up standard references in the subject and then follow up with recent papers. A solid foundation in advanced undergraduate level Linear Algebra, Calculus, Algorithms and Statistics is required for a serious study of Machine Learning. Good knowledge of a programming language such as Java, Python or R is needed for implementation, along with tools like Matlab and libraries like Apache Mahout and scikit-learn for serious implementation work.
4. Chris Anderson's blog -- http://www.longtail.com/the_long_tail/2005/09/long_tail_101.html
5. An industry scale recommender system description from Amazon -- http://www.cs.umd.edu/~samir/498/Amazon-Recommendations.pdf