
Pattern Recognition in Multiple Bikesharing Systems for Comparability

To the Carl-Friedrich-Gauß-Faculty
Technische Universität Carolo-Wilhelmina zu Braunschweig

Author: Md Athiq Ur Raza Ahamed
Matriculation Number: 3094160
Degree: Internet Technologies and Information Systems (M. Sc.)
First Examiner: Prof. Dr. Dirk Christian Mattfeld
Second Examiner: Prof. Dr. Sven Hartmann
Date: Braunschweig, 11 May 2016
First Guide: Prof. Dr. Dirk Christian Mattfeld
Second Guide: M.Sc. Jan Brinkmann

Abstract

Bikesharing is a sustainable short-term bicycle rental service that extends urban public transport options. The major issue for bikesharing providers is the imbalance in the distribution of bikes; reliable provisioning of bikes and docks at the stations is essential to a bikesharing system. This thesis presents a novel comparability model intended to improve bikesharing system design. The model was developed by analyzing extensive operational data from multiple bikesharing systems to derive bike activity patterns. Because the data is recorded automatically by the bikesharing systems, advanced Geo BI and Data Mining techniques are used to clean it and to gain insight into the complex activity patterns of multiple systems. These insights lead to a better understanding of complex bikesharing systems. The patterns are then shown to be interesting by means of several hypotheses. As a result, multiple bikesharing systems are shown to be comparable, which leads to the development of a novel comparability model (NyDc). The NyDc model helps determine trip purpose from the spatio-temporal distribution of bike trips. It reveals the imbalance in the distribution of bikes in multiple systems and helps in understanding the systems better. NyDc could serve as a benchmark for comparability analysis, and it contains useful metadata that could be used in future comparability research. Hence, NyDc could be used to improve the design, planning, and management of existing bikesharing systems or of a new bikesharing system in another city.

Keywords

Bikesharing, Data Mining, Geo BI, Activity pattern, Comparability model

Table of contents

Tables
Figures
1 Introduction
2 Motivation for bikesharing systems
  2.1 Public transportation, private transportation, and personal vehicle
  2.2 Shared mobility in the context of bikesharing system
    2.2.1 Business models and models of provision for shared mobility in the context of bikesharing system
  2.3 History of bikesharing system
    2.3.1 Recent developments in bikesharing systems
    2.3.2 Summarizing and mapping the history of bikeshare
    2.3.3 How does bikesharing system work
    2.3.4 Benefits of bikesharing system
    2.3.5 Case study: Real time need of bikesharing system
  2.4 Hypotheses
3 Knowledge discovery in databases and using Data Mining to improve BSS design
  3.1 KDD process
    3.1.1 Learning the application domain or business understanding
    3.1.2 Creating a target dataset
    3.1.3 Data cleaning and preprocessing
    3.1.4 Outlier detection
    3.1.5 Data reduction or transformation
    3.1.6 Choosing the function of Data Mining
    3.1.7 Data Mining
    3.1.8 Interpretation or deployment
    3.1.9 Using discovered knowledge
  3.2 Data Mining
    3.2.1 Data Mining as predictive and descriptive tasks
    3.2.2 Clustering
    3.2.3 Classification
    3.2.4 Unsupervised algorithms
    3.2.5 Supervised algorithms
4 Related Work
5 Why comparability
  5.1 Reason for choosing Citi bike and Capital bikeshare
    5.1.1 Citi Bike
    5.1.2 Capital bikeshare
6 Use case
  6.1 Pattern recognition architecture
    6.1.1 Software’s
    6.1.2 Data set
    6.1.3 Data preprocessing for creating target dataset
    6.1.4 Preprocessing analysis
    6.1.5 Knowledge discovery using Data Mining
  6.2 Evaluation and validation for determining the trip purpose
    6.2.1 Complete Process
    6.2.2 Accuracy evaluation
    6.2.3 Cluster evaluation
    6.2.4 Temporal validation
    6.2.5 Spatial validation
    6.2.6 NyDc and prediction model prototype
7 Conclusion and future work
Literature

Tables

2.1 Carbon dioxide emission for a 10-mile round trip commute for 5 days a week [11]
2.2 2013 summary of Capital bikeshare survey results [32]
3.1 Actual value and prediction outcome
5.1 Quantitative comparison of Citi Bike New York City and Capital bikeshare Washington, D.C.
6.1 Software’s
6.2 Classification using Naive Bayes showing precision value
6.3 Classification using Naive Bayes showing recall value
6.4 Classification using k-nn showing precision value
6.5 Classification using k-nn showing recall value
6.6 Accuracy of different algorithms
6.7 Sample attributes of NyDc

Figures

2.1 Means by which residents commute to work in the USA [23]
2.2 Bikesharing in the context of mobility [68] combined with models of provision [36]
2.3 The USA Air quality [22]
2.4 World growth of bikesharing systems [20]
2.5 Timeline of North American bikesharing system [66]
2.6 Graph showing real time need of bikesharing system [13]
3.1 Process diagram of Cross Industry Standard Process for Data Mining [9]
3.2 Knowledge discovery process combined with [41] and [67]
3.3 Geo BI process adopted from [69]
3.4 Predictive and descriptive Data Mining tasks [46, 67]
3.5 Example of a three clusters adopted from [46]
3.6 Classification [46]
3.7 Example of k-means clustering [46]
3.8 Davies-Bouldin index for increasing number of clusters adopted from [68]
3.9 Davies-Bouldin example for two clusters [68]
5.1 Rainfall and daytime max degree comparison for New York City and Washington, D.C. [25]
5.2 Real bikeshare station of Citi Bike and Capital bikeshare
5.3 Bikeshare by their purposes [1]
6.1 Architecture used for NyDc
6.2 Partitioning the dataset
6.3 Normalized activity for one station throughout the day
6.4 Weekday activity of Citi Bike (top) and Capital bikeshare (bottom)
6.5 Weekday activity of Citi Bike (top) and Capital bikeshare (bottom) for subscribers
6.6 Weekday activity of Citi Bike (top) and Capital bikeshare (bottom) for customers
6.7 Weekend activity of Citi Bike (top) and Capital bikeshare (bottom)
6.8 Weekend activity of Citi Bike (top) and Capital bikeshare (bottom) for subscribers
6.9 Weekend activity of Citi Bike (top) and Capital bikeshare (bottom) for customers
6.10 Example of complete knowledge discovery process for comparability
6.11 K-means being the best accuracy algorithm
6.12 Davies-Bouldin index value for different algorithms
6.13 Average activity for the course of day top (Citi bike), bottom (Capital bikeshare)
6.14 Average cluster activity for the course of day top (Citi Bike), bottom (Capital bikeshare)
6.15 Average cluster activity for top stations throughout the week (Citi Bike), bottom (Capital bikeshare)
6.16 Average activity for the course of day top (Citi Bike), bottom (Capital bikeshare) for subscribers
6.17 Average cluster activity for the daily course of day top (Citi Bike), bottom (Capital bikeshare) for subscribers
6.18 Average activity for the daily course of day top (Citi Bike), bottom (Capital bikeshare) for customers
6.19 Average cluster activity for the daily course of day top (Citi bike), bottom (Capital bikeshare) for customers
6.20 Average cluster activity for the weekend throughout the day (Citi Bike), bottom (Capital bikeshare)
6.21 Average cluster activity for the weekend throughout the day (Citi Bike), bottom (Capital bikeshare) for subscriber
6.22 Average cluster activity for the weekend throughout the day (Citi Bike), bottom (Capital bikeshare) for customers
6.23 Geographical distribution of clusters top (Citi Bike), bottom (Capital bikeshare)
6.24 Geographical distribution of clusters top (Citi Bike), bottom (Capital bikeshare) for subscribers
6.25 Geographical distribution of clusters top (Citi bike), bottom (Capital bikeshare) for customers

1 Introduction

A bikesharing system is a cost-effective and flexible form of transportation that promotes a sustainable lifestyle in cities [32, 55, 30]. Motivated by the growing challenges of pollution, congestion, global obesity, and climate change, international organizations such as the World Health Organization have shown interest in such systems [59]. These organizations initiate multisector and multidisciplinary approaches to reduce car usage and increase physical activity [59]. Many other studies have also shown that bikesharing systems reduce pollution and congestion and increase physical activity [64, 61, 36]. These systems provide the first-mile or last-mile connection (a short-term rental service) to other modes of transit, thus improving the efficiency of the transportation system and accessibility to the neighborhood. As a result of these benefits and the interest of international organizations, bikesharing systems have received increasing attention and have more than doubled since 2008 [49].

The first bikeshare program was deployed on July 28, 1965, in Amsterdam with ordinary bikes painted white [37]. These white bikes were used by the public in an unmanaged fashion, and the program failed because it depended on the integrity of individuals. Later bikesharing systems also failed because they were unmanaged and the bikes were taken for private use. Recently, with advances in technology, bikesharing systems have been automated to overcome the issues of the earlier systems. Current bikesharing systems are typically single systems with multiple bike stations situated in and around the city. The stations consist of racks and bikes, where a user can rent and return a bike.

The major concern in bikesharing systems is user satisfaction, and the authors of [47] illustrate that there is plenty of room for improvement. Two major issues were found to frustrate users: 1) the unavailability of bikes at a station when the user needs one, and 2) the unavailability of free racks when the user wants to return a bike. To solve these problems when designing a new bikesharing system or extending an existing one, the bike activity pattern must be understood. Several studies have analyzed the bike activity pattern [31, 43, 69, 56]. These studies analyzed ride data using Geo BI (Geographical Business Intelligence), which can help in planning station locations and in understanding the temporal distribution of bikes. However, most of them examined a single system; the few that considered multiple systems did so only in an abstract fashion, and none took customer profiles into consideration.

This thesis focuses on comparability studies between multiple bikesharing systems. Its major contribution is the development of a comparability model (NyDc), which could serve as an input for operational research (location decisions). The model is developed using one year of data from two bikesharing systems, namely Citi Bike in New York City [8] and Capital bikeshare in Washington, D.C. [6]. The analysis is carried out using advanced Data Mining and Geo BI techniques and results in a clustering of stations based on temporal and spatial factors. The results are then validated using Geo BI techniques in the final part of the work.

The thesis is structured as follows. First, a brief introduction to and motivation for bikesharing systems is given, followed by an explanation of the benefits and issues of bikeshare. Then Data Mining is briefly introduced for the purpose of developing the comparability model. Next, related work that addresses the issues of bikeshare is discussed in detail, which leads to the need for comparability. The architecture and work process for developing the comparability model are explained after that. Finally, empirical results are presented and validated, and the thesis concludes with a discussion.
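To make the idea of clustering stations on temporal factors more concrete, the following minimal sketch builds a normalized hourly pickup profile for each station from raw trip records. It is an illustration under stated assumptions, not the procedure used in this thesis: the record layout (station name, pickup time), the function name `hourly_profiles`, and the choice to normalize by the station's total pickups are all hypothetical.

```python
from collections import Counter, defaultdict
from datetime import datetime


def hourly_profiles(trips):
    """Return {station: list of 24 floats}, the share of a station's
    pickups that fall into each hour of the day (assumed normalization)."""
    counts = defaultdict(Counter)
    for station, start_time in trips:
        counts[station][start_time.hour] += 1

    profiles = {}
    for station, per_hour in counts.items():
        total = sum(per_hour.values())
        # A Counter returns 0 for hours without any pickup.
        profiles[station] = [per_hour[h] / total for h in range(24)]
    return profiles


# Hypothetical trip records: (start station, pickup time).
trips = [
    ("A", datetime(2015, 6, 1, 8, 5)),
    ("A", datetime(2015, 6, 1, 8, 40)),
    ("A", datetime(2015, 6, 1, 17, 15)),
    ("B", datetime(2015, 6, 1, 13, 0)),
]
profiles = hourly_profiles(trips)
```

Each profile is a 24-element vector that sums to one, so stations of very different size become comparable by the shape of their daily activity. Vectors of this kind are one possible input for a clustering algorithm.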


2 Motivation for bikesharing systems

The proportion of people living in urban areas has been increasing gradually worldwide. Cities are where the people are, along with economic, social, and environmental problems [10]. In 2007, the earth’s population became more urban than rural [26]. The United Nations predicts that by 2050 about 64 percent of the developing world and 86 percent of the developed world will be urbanized [27]. This population growth has caused several transport problems in urban areas, and extending traditional urban transportation concepts will not suffice to solve them. With urbanization progressing faster than predicted, transportation has advanced considerably to serve the urban population. Urban transportation can be broadly classified into four types: private, public, shared mobility, and personal vehicle. In addition, new sharing practices have recently emerged, for example flexible carpooling, green travel, share taxi, ridesharing, trucksharing, and vanpooling. The four types are discussed below.

2.1 Public transportation, private transportation, and personal vehicle

Public transportation is a mobility service that is open to all. In contrast, private transportation is owned by an individual or a group and used for private purposes [58]. Both modes have advantages and disadvantages. The advantages of private transportation are high flexibility and accessibility: owners keep their cars and bikes in a private parking space or nearby, with easy access. At the same time, the private mode is expensive and burdensome to maintain. Public transport, on the contrary, is much cheaper than the private mode. Nevertheless, public modes have disadvantages as well: they have spatial and temporal limitations, and their accessibility and flexibility are not as good as those of the private mode. It is difficult to determine why individuals choose a particular mode, because the reasons vary widely; examples are cost, flexibility, and holding a license. For instance, even if a family has a car, a family member without a license cannot use it and is therefore restricted from that mode.

The existing public and private transport have many shortcomings and cannot satisfy the demand of the increasing population. There has always been innovation in urban transportation aimed at serving the population with a flexible, sustainable, and affordable mode of transport. To resolve these transportation problems and make use of the existing infrastructure, the purpose of mobility needs to be understood (for example, the daily home-to-work pattern). Once the trip purpose is understood, people can be served with improved means of transport. Figure 2.1 gives an overview of the modes by which residents commute to work in the USA as a whole and in a few important cities. Across the USA, 90 percent of people use cars. In big cities such as New York City or Washington, D.C., however, a higher percentage of people use public transport, for reasons that include traffic, insufficient parking space, pollution, and cost. Two important problems in cities are congestion and parking difficulties caused by the increase in private vehicles [58]. Public transportation is considered the most efficient mode of transportation in cities; its only problem is its limited temporal and spatial flexibility. In smaller cities such as Portland, public transport accounts for only 12 percent, compared with 56 and 38 percent in New York City and Washington, D.C., respectively, as shown in Figure 2.1. It is difficult to serve cities whose decentralization has produced scattered, low-density areas. As a result, public modes cannot meet the expectations of flexibility, and a novel transport mode is required.

Access instead of ownership (shared mobility) is now a trend in mobility; it is analyzed together with social trends and calls for innovative mobility. The authors of [65] argue that shared mobility is a ground-breaking concept in the mobility sector. With a growing population and a need for intelligent modes of transport, public and private modes alone are not the solution. Shared mobility is one solution to the problems of the traditional modes, and it differs from the existing modes of transport [68].

2.2 Shared mobility in the context of bikesharing system

Shared mobility is gaining importance because it combines the benefits of public and private transport. The objective of such a system is a highly flexible, accessible, cheaper, and quicker mode of transport that uses the existing infrastructure. In this mode, individuals can share vehicles with others at a lower cost and with high flexibility. Bikesharing and carsharing are two important systems in urban areas that are gaining popularity. Bikesharing, being an eco-friendly, go-green, and cheaper mode of transport with positive health effects, has gained popularity over the last decade. In a carsharing system, an individual rents a car for short periods of time at a lower price. The providers of these cars are, for example, commercial companies and public or cooperative agencies. Carpooling [53] is another concept, but for collective trips with privately owned cars, similar to personal vehicle sharing [65]. Sharing mobility resources has major benefits: pollution control, reduced vehicle ownership, and economic and ecological benefits [62]. To understand the process more clearly, it is necessary to understand the basic definitions of bikesharing systems.

Figure 2.1: Means by which residents commute to work in the USA [23]

Basic definitions

Casual user: A user who holds a short-term membership of one to thirty days.
Member: A user who holds a membership for a longer period of time.
Occasional membership: A membership introduced in March 2013 in which short-term users receive a free key fob. When the key fob is swiped, the user receives a discounted 24-hour pass.
Docking station: A station consisting of a terminal and docking points where users rent and return bikes. If there is a problem with a rented bike, it can be docked at a docking point of the nearest station, and a new bike can be taken at no extra cost.
Kiosk or paying station: The place where users rent a bike, either with a subscriber key obtained through an online subscription (30 days or annual) or by paying as a casual user. At the kiosk, users can also view a record of their journeys, find nearby docking stations, and get extra time without charge if the docking station is full when they return a bike. The kiosk also shows a local street map, scheme costs, the code of conduct, and information in different languages.
Online map: A near real-time online map that shows the availability of bikes and free docking spaces.

Figure 2.2: Bikesharing in the context of mobility [68] combined with models of provision [36]

For shared mobility to succeed, however, the business models and their models of provision have to be clearly tailored; they are described as follows.

2.2.1 Business models and models of provision for shared mobility in the context of bikesharing system

Several different business models have evolved since the advent of IT-based bikesharing systems, and similar business models can be applied to other vehicles as well [36]. The six main business models, which summarize the others, are 1) government model or publicly owned, 2) non-profit, 3) privately owned, 4) publicly owned and contractor operated, 5) vendor operated, and 6) advertising companies. The pricing in these business models can be linear, progressive, or flat-rate. In some models there are stations where bikes can be rented and returned, while others are station-less. Station-less systems are, for example, university-based systems in which the bikes can be used only within the campus. In the context of bikesharing systems there have been various models of provision: bikesharing providers have included governments, quasi-governmental agencies, universities, non-profits, advertising companies, and for-profits [36]. For a clearer understanding, these models of provision are combined with the types of urban transport [68] in Figure 2.2 and are explained further below.

• Government model: In this model, the local government has the upper hand and maintains the liability for the program. “The government of Burgos, Spain, acquired and runs the off-the-shelf bikesharing system called Bicibur” [36].
• Non-profit: The City Bike foundation of Copenhagen, which operates Bycyklen, is an example of a non-profit organization. This type of organization collects funding through grants, sponsorships, and loans from the jurisdiction for the service.
• Privately owned: The system is owned and operated privately; DecoBike MB is an example of this model. Equipment, maintenance, and all operations are handled privately. The owners have contracts or agreements with the public entities.
• Publicly owned and contractor operated: This is similar to the government model in that the government owns and administers the entire system. However, the operations are contracted to a private operator. Capital bikeshare is an example of this model.
• Vendor-operated: Vendors are the ones who design or manufacture the system equipment. In this model, the vendors operate the entire system. Broward B-cycle is an example of this business model.
• Advertising companies: This model is popular and benefits the company, the jurisdiction, and the people. Companies such as JCDecaux and Clear Channel Outdoor need public space to display revenue-generating advertisements on billboards, kiosks, and so on. In return, they offer a bikesharing program to the jurisdiction.

With these models discussed, it is important to understand the history of bikesharing systems, since it offers significant lessons to learn before adopting any of these models.

2.3 History of bikesharing system

Bikeshare initiatives have developed significantly since their introduction in Amsterdam in the 1960s. Since the beginning of the bikeshare initiative, there have been three generations of bikeshare programs over the past 45 years [37].

The first generation of bikeshare programs was deployed on July 28, 1965, in Amsterdam. The bikes were intended for general use and were painted white for identification. They were used by the public in an unmanaged fashion, and the scheme depended entirely on the integrity of individuals to use the bikes responsibly: one could find a bike, ride it to any destination, and leave it for the next user. The scheme collapsed within a few days, as the bikes were taken for private use, thrown into the canals, or stolen.

In 1991 and 1993, a few bikesharing programs were started in Denmark [54]. These programs were small (the one in Nakskov in 1993 had 26 bikes at 4 stations), but they marked the beginning of the second generation of bikesharing systems. Years later, Copenhagen introduced a large coin-based system known as Bycyklen, or City Bikes. This system incorporated many improvements and lessons learned from the previous generation. The coin deposit provided a degree of control. The bikes were designed for intense utilitarian use and could be picked up and returned at specific locations throughout the central city. Although the system was more formalized, with stations, it still suffered from theft. Since there was no time limit, individual users kept bikes for long periods. These problems led to the failure of the system and to the rise of a new generation of bikesharing with improved bike tracking.

The third generation of bikesharing systems resolved the shortcomings of the previous systems. It was introduced in 1996 with Bikeabout at Portsmouth University in England. Third-generation systems were subsequently improved with a variety of technological advancements such as smartphone access, electronically locking racks or bike locks, smart cards, and on-board computers, exploiting the capacity of information and communications technologies. The whole system is completely automated and easy to use. To make the systems safe, subscriptions were introduced, along with a new scheme of 30 minutes of free bike use. Subscriptions made it possible to know who the customers are, which reduced the likelihood of theft. Bikesharing grew slowly, with a few new programs deployed annually, such as Vélo à la Carte in Rennes (France) in 1998 and Munich’s Call a Bike in 2000. In 2005, JCDecaux (a multinational advertising corporation) launched a large third-generation bikesharing system, Vélo’v, with 1,500 bikes in Lyon, France. The distance traveled, the condition of the bikes (lights, brakes, etc.), and detailed usage statistics were tracked. The program had 15,000 members, and each bike was used 6.5 times a day on average. Its success led to the development of similar systems in other cities. On 15 July 2007, Paris launched its own bikesharing program, Vélib’; the system started with around 7,000 bikes and has since expanded to 23,600 bikes in the city and suburbs. Vélib’ is the world’s twelfth largest bikesharing program by number of bicycles in circulation. By 2007, there were about 60 third-generation programs worldwide [36].

Fourth-generation bikeshare could include dockless systems and enhanced transit integration. The fourth generation is expected to raise efficiency, sustainability, usability, and customer satisfaction. These cutting-edge systems require complicated business models. A major advancement intended for this generation is mobile docking stations, which can be removed and transferred to different locations, so that stations can be relocated according to usage patterns and user demand. Solar-powered stations are another concept intended for implementation [63, 14]. These are a few of the advancements suggested in recent research. Several recent developments in bikesharing systems are described below.

2.3.1 Recent developments in bikesharing systems

Bikesharing systems have developed enormously, and several advanced features have been added to them in order to satisfy the customer. The authors of [66] list eight recent developments: 1) the expansion of pay-as-you-go services; 2) membership portability and interoperability; 3) increased community involvement; 4) developments related to equity and access improvement; 5) the advent of helmet dispensing options; 6) research and development of dynamic pricing; 7) a public bikesharing system’s recent filing for bankruptcy protection; and 8) additional research. These developments are explained below.

Pay-as-you-go services

In 2012, BIXI Montreal introduced a new membership type known as the “occasional” user. Users of this type are provided with a program key free of charge. Other bikesharing systems, Nice Ride Minnesota and Capital Bikeshare, also considered this type of membership to encourage ridership.

Membership portability and interoperability

Interoperability among programs has brought large benefits and has contributed to the growth of bikesharing systems. Interoperability is a feature that allows annual members to use the bicycle programs of other cities when they travel away from home. For example, B-cycle expanded its interoperability program, known as “B-connected”, to 15 of its programs in the USA.

Community involvement

Feedback from users is always useful for improving user satisfaction. To involve the community, providers developed websites on which users can suggest a station location and “like” or “dislike” suggested locations; one example is Bike Nation, a bikesharing system in the USA.

Developments related to equity and access improvement

These developments help people whose income is low and who cannot otherwise afford the service. One example is the “Bank on DC” program launched by Capital Bikeshare. It provides United Bank or District Government Employees Federal Credit Union account holders with a gift certificate of 25 US dollars, which can be used towards the cost of an annual membership [6].

Advent of helmet dispensing options

User safety is the most important concern in bikesharing programs. Several initiatives promote helmet dispensing to encourage helmet use. The city of Vancouver introduced a helmet vending machine that integrates helmet vending with a return receptacle. The machine can hold up to 30 helmets in two different sizes for comfort, and the helmets are tracked by RFID.

Research and limited deployment of dynamic pricing

The goal of dynamic pricing is to use pricing mechanisms to encourage self-rebalancing of bikesharing systems. Several studies have examined heuristic methods and pricing strategies for rebalancing optimization. The latest successful feature of bikeshare systems is the monitoring of cycle availability and free docking spaces in near real time on online maps.

Public Bike System Company (PBSC) files for bankruptcy protection in January 2014

On January 21, 2014, BIXI, North America’s largest equipment supplier, filed for bankruptcy protection. In 2013, there had been speculation about PBSC’s inability to repay its debts. A few months later, the company filed for bankruptcy protection. The impact of this filing has not yet fully materialized.

Additional bikesharing research and resources

New research on bikesharing systems continues to appear, and the field is still evolving. For example, the Institute for Transportation and Development Policy (ITDP) published its bikeshare planning guide in 2013 [44]. The bikesharing systems and models are summarized in the following section.

2.3.2 Summarizing and mapping the history of bikeshare

All bikesharing systems developed so far can be assigned to one of the models below.

Free bikes

This model is similar to the first-generation bikesharing system. Bikes are released in a certain area and offered to people free of charge. A special case is bikes released on a university or work campus, where they are used within a particular boundary.

Small amount

This model is comparable to the second-generation bikesharing system. A small amount, typically a coin, is deposited, and anyone can use a bike and return it to a certain station. Because the amount is small and the bikes are not tracked, this model suffers from theft and from bikes being kept for private use.

Subscription

This is the modern model, as in third- and fourth-generation bikesharing systems. The user obtains a membership or an access code to take a bike from a station and return it to any station. The bikes are tracked, which benefits both the individual user and the provider. There are different proposals within this model, for example solar-assisted bikesharing systems or e-bikes. E-bikes in particular have recently been gaining popularity. The idea is simple: e-bikes are recharged after they are left at a station, and once recharged they can be reused.

Governmental collaboration

This is a large program run in collaboration between the government and any of the above models. For example, a railway provider collaborates with a bikesharing provider. A primary example of this type of partnership is the Call a Bike program in Germany, which collaborates with the national rail company.

Private collaboration

Services of this type are rare, but the approach is clever. Car parking operators provide bikes to the customers who park with them, which combines parking with bikesharing. Vinci Park in France is a prime example of this type of sharing. Collaborations with other providers also exist, such as carsharing programs that offer bikes to their customers; City CarShare in San Francisco is an example of such a private collaboration.

This thesis is about profit-based bikeshare; hence, it is necessary to understand how a bikesharing system works.

2.3.3 How does bikesharing system work

Bikesharing is open to the public: anyone can join the bikesharing organization on an annual, monthly, or daily basis. Once registered, a user can rent a bike at any dock using a credit card, membership card, key, or key fob. Trips can be point-to-point, round-trip, or both. Thus, the bikes can be used for one-way transport and for multimodal connectivity. Once members finish their journey, they can return the bike to any station that has a free dock. After this brief summary of bikesharing systems and how they work, the following section presents the benefits that motivate users to adopt bikesharing.

2.3.4 Benefits of bikesharing system

Bikesharing is an economical, flexible, and versatile program for urban areas, with several benefits. A few of the most important benefits are discussed in this work.


Figure 2.3: The USA Air quality [22]

Health benefits

The health effects of cycling are a major cause of the rapid growth of bikesharing systems. Bicycling is a physically active and environmentally friendly means of transport. In cities, green, multipurpose, and versatile public transportation has become increasingly popular. All over the world, steps have been taken towards the “healthy city”. Physical inactivity is a major reason for an unhealthy city and a cause of several diseases. Bikesharing has been identified as a strong candidate for achieving a “healthy city”: it addresses several issues by reducing pollution, traffic congestion, and greenhouse gas emissions [72, 24].

Pollution control

The map of the United States of America (USA) in Figure 2.3 shows that most major cities are polluted; blue marks the cleanest air, cyan good, green moderate, yellow bad, and red the worst. Table 2.1 shows how much carbon dioxide is emitted by a 10-mile round-trip commute on 5 days a week. Pollution in urban areas is dangerous and has a serious effect on health. Commuting by bicycle uses no fuel, toxic batteries, or motor oil, and so contributes to a clean and green environment. All these health effects pave the way for the rapid rise in popularity of bikesharing systems.


Table 2.1: Carbon dioxide emission for a 10-mile round trip commute for 5 days a week [11]

Vehicle         MPG       Gasoline       Emissions
Small car       35 mpg    68 gallons     0.7 tons of CO2
Mid-size car    20 mpg    124 gallons    1.3 tons of CO2
SUV/4×4         14 mpg    170 gallons    1.9 tons of CO2

Economic benefits

Bikesharing is an economical form of transport for first-and-last-mile trips that connect to other modes. It is also flexible for both short- and long-distance destinations in cities. As of 2014, bikesharing systems operated in around 712 cities, with approximately 806,200 bicycles at 37,500 stations [66]. Figure 2.4 suggests that there is great interest in bike programs in almost all major countries of the world. Since this thesis is about bikesharing systems in the USA, the growth of bikesharing in North America is of particular interest. The timeline depicted in Figure 2.5 shows a boom in bikesharing programs in North America. There have been hundreds of new bikesharing programs every year, which clearly shows positive growth. Bikesharing lowers transportation costs, reduces fuel use, increases the use of public transit, and aids economic development [66].

Several studies suggest that cyclists shop more frequently and spend less per trip than drivers [32, 33]. Consumer spending was analyzed with respect to the transport mode using surveys [34]; this study also concluded that non-driving customers spent amounts similar to or greater than customers arriving in vehicles. The authors of [60] found that bikesharing users generated a new travel and spending pattern: they often travel to spending locations (shopping trips). Moreover, bikesharing systems had a positive impact on the sales of neighboring shops.

A graduate student report from Virginia Tech [32] analyzed the impact of Capital Bikeshare on business traffic in Dupont Circle in Washington, D.C. It found that bikesharing had a positive impact: daily sales traffic increased by 11 percent and sales increased by 13 percent. Other reports analyzed cyclists visiting supermarkets. According to [32], cyclists visited the supermarket 3.2 times per week and spent around 50 euros per trip, while drivers visited only 2.5 times per week but spent more than 50 euros per trip. The weekly share of business was found to be 48 percent from cyclists and 52 percent from drivers.

Table 2.2 summarizes the 2013 Capital Bikeshare user survey. The table shows that 73 percent of the users used bikesharing in Washington, D.C., because it is faster than other modes for certain trips. Other important reasons for using bikeshare are enjoyment (42 percent) and exercise (41 percent). The survey also found that 25 percent of the users used it because it is economical, and that 70 percent of the businesses in the neighborhood saw a positive impact. Overall, bikesharing programs benefit both individual users and businesses through a new pattern of trips and spending. Businesses gained both monetary and non-monetary benefits from bikesharing programs.

Table 2.2: 2013 summary of Capital bikeshare survey results [32]

User survey                                                       %
Top reasons for using Capital bikeshare
  Travel time                                                     73%
  Enjoyment                                                       42%
  Exercise                                                        41%
  Travel costs                                                    25%
Share of users traveling to spending destination                  66%
  Spending less than $10                                          6%
  Spending $10 to $49                                             65%
  Spending $50 or more                                            29%
  Spending within 2 blocks of station                             34%
  Spending within 4 blocks of Capital bikeshare station           45%
  Spending greater than 4 blocks or did not know                  22%
Share of users making new or induced trip                         16%
Share of users making a trip regardless of Capital bikeshare      78%
Share of users spending more because of Capital bikeshare         23%

Another important factor for bikesharing programs is traffic. Problems such as traffic congestion and pollution have arisen, in major urban areas in particular, because of the increased use of cars. Traffic has grown steadily, and with it serious environmental harms such as noise and exhaust emissions. Physical intrusion, barrier effects, and congestion are further effects of these developments. Faced with such serious problems, politicians, planners, and the general public have shown growing interest in finding acceptable solutions. Bikesharing systems address most of these problems. In summary, according to [66], there are several other benefits: “1) increased mobility; 2) lower transportation costs; 3) reduced traffic congestion on roads and public transit during peak periods; 4) reduced fuel use; 5) increased use of public transit and alternative modes (e.g. rail, buses, taxis, carsharing, ridesharing); 6) economic development; 7) health benefits; and 8) greater environmental awareness”. With these benefits discussed, the following case study illustrates the real-time need for a bikesharing system.

Figure 2.4: World growth of bikesharing systems [20]

Figure 2.5: Timeline of North American bikesharing system [66]

2.3.5 Case study: Real time need of bikesharing system

This case study helps to understand the importance of bikesharing systems. In this use case, an individual compares the daily use of different modes of transport in New York City. He travels daily from his apartment to his office in New York City, a distance of around 1.3 miles. Over the week he uses three modes (car, private bike, and bikesharing system) to find out which is best for commuting to his office. He describes how bad the road traffic is, how difficult it is to travel by car during peak hours, and how expensive parking is. He then describes the disadvantages of his private bike: it is expensive, not comfortable, and frequently gets a flat tire or is damaged. He uses four factors (speed, cost, comfort, and ease of use) to judge which mode is best. The three modes are as follows.

Taxi: The major problem is finding a taxi; the streets are crowded, and taxis are difficult to find during peak hours. Nevertheless, taxis are comfortable and relaxing, with music and other features. However, a ride costs him 11.30 dollars. At two rides a day, five days a week, and 52 weeks a year, this amounts to around 5,876 dollars. The taxi is expensive, and its travel time of 14 minutes 55 seconds is not impressive.

Private bike: It is easy to find, since it is parked outside the apartment, but it is not that comfortable. The bike costs 300 dollars, its lock 80 dollars, and repairs 50 dollars; nevertheless, it is cheaper than a taxi. It is fast in the crowded city: the trip took him only 9 minutes and 2 seconds, faster than the taxi.

Citi Bike: He initially had no idea about bikesharing. He found the kiosk difficult to use because it was empty, and he had to wait for some time to get a bike. Once he had a bike, it was comfortable to use, and an annual pass costs only 95 dollars, so it is cheaper than the other two modes. The trip took him 16 minutes because he had to find a kiosk with a free dock to return the bike. After he learned about the mobile app and the online map, it became easy.

He concludes that for cities like New York, bikesharing is promising. Figure 2.6 [13] shows the real-time need for a bikesharing system by comparing taxi, private bicycle, and Citi Bike with respect to comfort, speed, cost, and ease of use. This use case motivates the use of bikesharing. This thesis in turn is intended to help providers understand bikesharing and satisfy their customers better.

In recent years, bikesharing systems have developed enormously and have grown in importance. Operating and managing these programs has been difficult, and considerable effort has been made to solve problems such as planning the location of stations or relocating bikes. This leads to the hypotheses in the next section.
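The annual cost figures quoted in the case study of Section 2.3.5 can be checked with a few lines of arithmetic. The Python sketch below is only an illustration and not a tool used in this thesis. It assumes two taxi rides on each of 5 working days for 52 weeks, and it treats the purchase price, lock, and repairs of the private bike as a single first-year cost, which is a simplification.

```python
# Annual commuting cost per mode, using the figures quoted in the case study.
TRIPS_PER_DAY = 2
DAYS_PER_WEEK = 5
WEEKS_PER_YEAR = 52


def annual_taxi_cost(fare_per_ride: float) -> float:
    """Fare multiplied by the number of rides in a working year."""
    return fare_per_ride * TRIPS_PER_DAY * DAYS_PER_WEEK * WEEKS_PER_YEAR


costs = {
    "taxi": annual_taxi_cost(11.30),
    "private bike": 300 + 80 + 50,  # bike + lock + repairs (assumed first-year cost)
    "Citi Bike": 95,                # annual pass
}

# Print the modes from cheapest to most expensive.
for mode, cost in sorted(costs.items(), key=lambda item: item[1]):
    print(f"{mode:12s} {cost:8.2f} USD")
```

Under these assumptions the taxi comes to about 5,876 dollars, the private bike to 430 dollars, and the Citi Bike pass to 95 dollars, which matches the ordering described in the case study.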

Figure 2.6: Graph showing real time need of bikesharing system [13]

2.4 Hypotheses

Three hypotheses are presented in this work. They involve spatial relations that are relevant to location planning and operational research:

1. Rentals and returns depend on spatial and temporal factors.
2. The profile of users influences rentals and returns.
3. The profile of users influences business development in the neighborhood.

According to hypothesis one, the rentals and returns of bikes depend on location factors (geography, temporal population, weather, and so on). If hypothesis one is proved, this is significant for mapping or designing a new location. If hypothesis two is validated, customers can be predicted according to location. If both hypotheses are proved, then according to hypothesis three the users influence business development in the neighborhood. Proving these hypotheses requires several definitions.

Definitions for proving the hypotheses

Four types of users and two types of profiles are predicted:

• Commuters, who regularly commute to the office or university on weekdays
• Leisure users, who ride for exercise, fun, or other leisure purposes
• Tourist users, who visit tourist spots or parks for leisure purposes
• Utility users, who shop for daily household needs on weekdays and weekends and are regular visitors to the neighborhood
• Subscribers: regular commuters to work or university, who benefit the bikesharing system economically
• Customers: tourists, shoppers, or household users visiting the neighborhood; their trip duration is short on weekdays and long on weekends

Proving these hypotheses requires knowledge discovery in databases, for which advanced Data Mining techniques are used. Geo BI is vital for evaluating the hypotheses. Geo BI is based on knowledge discovery in databases [41] and gains valuable insights from spatial data. These insights are understandable patterns in the data. If the hypotheses are proved, the patterns are interesting. These advanced techniques are elaborated in the following chapter.
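As a hedged illustration of how hypothesis one might be examined, the following Python sketch counts rentals per hour, separately for weekdays and weekends and for the two profiles. The trip records, the record layout, and the function name `rentals_by_hour` are hypothetical and serve only as an example; they are not the datasets or tools used in this thesis.

```python
from collections import Counter
from datetime import datetime

# Hypothetical trip records: (rental start time, profile).
trips = [
    ("2014-06-02 08:10", "Subscriber"),  # Monday
    ("2014-06-02 08:40", "Subscriber"),
    ("2014-06-02 17:30", "Subscriber"),
    ("2014-06-07 11:05", "Customer"),    # Saturday
    ("2014-06-07 14:20", "Customer"),
    ("2014-06-08 14:45", "Subscriber"),  # Sunday
]


def rentals_by_hour(records):
    """Count rentals per (day type, profile, hour of day)."""
    counts = Counter()
    for start, profile in records:
        ts = datetime.strptime(start, "%Y-%m-%d %H:%M")
        # weekday() returns 5 for Saturday and 6 for Sunday.
        day_type = "weekend" if ts.weekday() >= 5 else "weekday"
        counts[(day_type, profile, ts.hour)] += 1
    return counts


profile_counts = rentals_by_hour(trips)
for key in sorted(profile_counts):
    print(key, profile_counts[key])
```

A temporal dependence of rentals would show up in such a table as peaks at particular hours, for example a morning peak for subscribers on weekdays. Adding the station identifier to the key would extend the same idea to the spatial dimension.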


3 Knowledge discovery in databases and using Data Mining to improve BSS design

Data Mining is an important (analysis) step in the knowledge discovery in databases (KDD) process, producing a particular enumeration of patterns. KDD is necessary to prove the hypotheses discussed in the previous section. According to [41], the KDD process consists of nine steps: 1) learning the application domain; 2) creating a target dataset; 3) data cleaning and preprocessing; 4) data reduction and projection; 5) choosing the function of Data Mining; 6) choosing the Data Mining algorithm; 7) Data Mining; 8) interpretation; and 9) using discovered knowledge. Today these steps are commonly grouped into three broad categories: 1) preprocessing (selection, cleaning, data integration, and transformation); 2) Data Mining (data exploration, cluster analysis, association analysis, classification and regression, outlier detection) [46]; and 3) postprocessing (pattern evaluation and knowledge presentation, i.e. visualization). Apart from these three categories, learning the application domain is important because it guides the iterations of the whole KDD process [41]. CRISP-DM (Cross Industry Standard Process for Data Mining) is the methodology most used by data miners today and follows the steps listed above. Its process diagram is shown in Figure 3.1. Another such methodology is SEMMA. These methodologies closely resemble the steps of [41], which are discussed in the next section.

3.1 KDD process

According to [41], KDD is an interactive and iterative process that involves the steps shown in Figure 3.2.

Figure 3.1: Process diagram of Cross Industry Standard Process for Data Mining [9]

Figure 3.2: Knowledge discovery process combined with [41] and [67]

3.1.1 Learning the application domain or business understanding

It is important to understand the domain or application so that the customer's needs can be met. This step is the basis of KDD and is crucial to the accuracy of the process. The learning is carried out through discussion with the domain expert. If this step is skipped, the requirements and goals of the system remain unclear. Hence, it is the first and a crucial step towards customer satisfaction.

3.1.2 Creating a target dataset

Selecting a proper dataset is important, for instance selecting a dataset by year (the 2014 New York City dataset) or choosing a subset of it (specific months of 2014). The target dataset serves as the input for the core analysis in KDD. If it is not selected properly, the results of the knowledge discovery process will be misleading, with many outliers, and proper knowledge will not be discovered.

3.1.3 Data cleaning and preprocessing

Data collected for analysis are usually inaccurate, incomplete, and inconsistent. Preprocessing, which removes noisy data and outliers, is therefore an important step for later analysis. The major tasks in data preprocessing are data cleaning, data integration, data reduction, and data transformation. Database issues such as handling missing data or creating schemas in the database are addressed in this step. The output of this step is properly cleaned and selected data. The better the cleansing, the higher the data quality. Data quality comprises many factors, such as accuracy, completeness, consistency, timeliness, believability, and interpretability [46].

In the cleaning phase, missing or null values are filled in or handled by other strategies, noisy data (random error in a measured variable) are smoothed, outliers (exceptional data) are detected, and inconsistencies are removed from all data sources. It is common to have data from multiple sources. In the data integration phase, data from multiple sources are merged into a coherent data store. Integration may introduce redundancy into the data, which is why data reduction is performed in later phases. Data integration is difficult because the same attribute may have different names in different data sources. It should be done carefully to avoid inconsistency and redundancy.
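The cleaning tasks described above can be illustrated with a minimal Python sketch. The records, the field names (`trip_id`, `duration_s`, `start_station`), and the chosen strategies (dropping duplicates and impossible durations, filling missing durations with the median) are assumptions made for this example only; the thesis does not prescribe these particular rules.

```python
from statistics import median

# Hypothetical raw trip records; None marks a missing value.
raw = [
    {"trip_id": 1, "duration_s": 600,  "start_station": "A"},
    {"trip_id": 2, "duration_s": None, "start_station": "B"},
    {"trip_id": 2, "duration_s": None, "start_station": "B"},   # duplicate
    {"trip_id": 3, "duration_s": -45,  "start_station": "A"},   # inconsistent
    {"trip_id": 4, "duration_s": 900,  "start_station": None},  # no station
    {"trip_id": 5, "duration_s": 1200, "start_station": "C"},
]


def clean(records):
    """Drop redundant and inconsistent rows, then fill missing durations."""
    seen, kept = set(), []
    for rec in records:
        if rec["trip_id"] in seen:
            continue                      # redundant record
        seen.add(rec["trip_id"])
        if rec["start_station"] is None:
            continue                      # trip cannot be located
        if rec["duration_s"] is not None and rec["duration_s"] <= 0:
            continue                      # impossible duration
        kept.append(dict(rec))            # copy, so the raw data stay untouched
    known = [r["duration_s"] for r in kept if r["duration_s"] is not None]
    fill = median(known)
    for rec in kept:
        if rec["duration_s"] is None:
            rec["duration_s"] = fill      # fill missing value with the median
    return kept


cleaned = clean(raw)
for rec in cleaned:
    print(rec)
```

Whether a missing value is filled or the record is dropped depends on the attribute and the application; the median is used here only because it is robust against a few extreme durations.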

3.1.4 Outlier detection

In this step the outliers are identified. Outliers are exceptional data that differ from all other data. In certain applications outliers are important and provide valuable insights. In most cases, however, they are removed in the data cleansing or reduction phase, because certain Data Mining algorithms are sensitive to this type of data and might perform badly as a result.
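The text does not name a specific detection rule. As one common possibility, the sketch below flags values that lie outside the interquartile range (IQR) fences; the rule, the factor `k = 1.5`, and the sample trip durations are assumptions chosen for illustration and are not taken from the thesis.

```python
from statistics import quantiles


def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4)   # quartiles of the sample
    spread = q3 - q1                     # interquartile range
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]


# Hypothetical trip durations in minutes; the last trip is exceptionally long.
durations_min = [8, 9, 10, 11, 12, 12, 13, 14, 15, 480]
print(iqr_outliers(durations_min))
```

Such a trip might be a bike that was not returned properly. Whether it is removed or kept as an interesting case depends on the application, as noted above.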


3.1.5 Data reduction or transformation

Finding the useful features is important for clustering or classification. Aggregating or eliminating redundant features reduces the size of the data. Dimensionality reduction (for example, matrix factorization) is one such data reduction technique. Different normalization techniques are also applied to transform the data for better accuracy, for example transforming a particular field (such as converting a date field to a common format).
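Two of the transformations mentioned above, bringing a date field into a common format and normalizing a numeric field, could look as follows. This is a sketch under stated assumptions: the list of input formats in `KNOWN_FORMATS` is hypothetical, and min-max scaling is only one of several possible normalization techniques.

```python
from datetime import datetime

# Hypothetical timestamp formats that might occur across data sources.
KNOWN_FORMATS = ("%Y-%m-%d %H:%M:%S", "%m/%d/%Y %H:%M", "%d.%m.%Y %H:%M")


def to_iso(text):
    """Parse a timestamp in any known format and return it as ISO 8601 text."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text, fmt).isoformat()
        except ValueError:
            continue                      # try the next format
    raise ValueError(f"unrecognized timestamp: {text!r}")


def min_max(values):
    """Rescale numeric values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]      # constant field carries no information
    return [(v - lo) / (hi - lo) for v in values]


print(to_iso("07/04/2014 09:30"))         # month/day/year source
print(to_iso("04.07.2014 09:30"))         # day.month.year source
print(min_max([10, 20, 30]))
```

The order of the formats matters when a value could match more than one of them, so in practice the format should be fixed per data source rather than guessed per value.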

3.1.6 Choosing the function of Data Mining

This step decides the purpose of the model derived by the Data Mining algorithm (e.g. classification, clustering, or summarization), which is necessary for further analysis.

• Choosing the Data Mining algorithm(s): The Data Mining algorithm should be chosen according to the application. It is needed to find patterns in the existing dataset. The authors of [73] identified the top 10 algorithms, namely C4.5, k-means, SVM, Apriori, EM, PageRank, AdaBoost, kNN, Naive Bayes, and CART. These 10 algorithms are among the most significant Data Mining algorithms in the research community.

3.1.7 Data Mining

Here, one searches for patterns of interest, using the chosen algorithm, in a particular representational form or a set of such representations. These include classification rules or trees, regression, clustering, dependency, and line analysis. Data Mining is the most important step in knowledge discovery and is therefore described in more detail in Section 3.2.

3.1.8 Interpretation or deployment

After the search for patterns of interest, the patterns must be made understandable to humans. Visualization is of core importance in business as well as in research, because visualized patterns are easily understood by users. The deployment phase can be as simple as generating a report.

3.1.9 Using discovered knowledge

Using the discovered knowledge is the purpose of KDD (decision support). After knowledge has been discovered, appropriate action based on it has to be taken, or the knowledge is simply documented or reported.

The whole process is presented as a Geo BI process in Figure 3.3. This process served as the basis for the development of the comparability model.

Figure 3.3: Geo BI process adopted from [69]

3.2 Data Mining

Data Mining is an analytic process that resolves problems by analyzing data already present in databases [71]. The data to be explored are usually large. Exploration is a process of searching for consistent patterns or systematic relationships between variables and then validating them [21]. The information in such massive data is usually “hidden”, that is, not readily evident. The goal of Data Mining is knowledge discovery: extracting information or patterns that are non-trivial, implicit, previously unknown, and potentially useful from large data in databases, in a form that is easy to interpret for predictions or other applications. Predictive Data Mining is the most important and most common type of Data Mining. Data Mining has applications in areas such as financial data analysis, market analysis or cross-market analysis, resource planning, biological data analysis, the telecommunication industry, and the transportation industry. These applications are discussed as follows.

• Market analysis: Customer profiling is a significant task in market analysis. Customers with the same characteristics (location, income level, spending habits, temporal characteristics) are identified and clustered with an appropriate algorithm.
• Financial data analysis: Data from the banking and financial industry are highly reliable and of high quality, which makes Data Mining easy to apply and useful. It can be applied to cluster customers for marketing or to detect money laundering.
• Resource planning: In a company, planning plays a vital role in its economy. The basic idea of resource planning is to summarize and compare different resources and their spending.
• Biological data analysis: Data analysis in biology has recently been gaining popularity. It has a wide variety of applications, for example in biomedicine and genetics. It is valuable for discovering structural patterns in genetic networks as well as for analyzing the networks. Appropriate visualization tools for genetic data analysis are gaining immense attention.
• Telecommunication industry: With the development of new communication technologies, a huge amount of data is being generated and the telecommunication industry is growing. Data Mining can support location-based services, mobile telecommunication services, and fraudulent pattern analysis.
• Transportation industry: Data Mining plays a vital role in the transportation industry for determining distribution schedules among warehouses, analyzing loading patterns, managing traffic, and profiling users based on characteristics (for example, location or time).

Several Data Mining tools are available. Commercial tools include, for example, SAS Enterprise Miner, SPSS, and IBM Intelligent Miner. There are also tools that are open source or offer free student versions, such as RapidMiner [17] and Tableau [19], which are used in this work. Each of these tools has advantages and disadvantages and should be chosen according to the specific domain or application. With the importance of Data Mining discussed, it is necessary to discuss its tasks.

3.2.1 Data Mining as predictive and descriptive tasks

Data Mining tasks can be predictive as well as descriptive, as shown in Figure 3.4, and descriptive and predictive models can often be paired. Three types of learning are possible in Data Mining: supervised, unsupervised, and semi-supervised learning. Data Mining has incorporated many techniques from other domains such as statistics, machine learning, pattern recognition, databases, data warehouse systems, information retrieval, visualization, algorithms, high-performance computing, and many application domains. Examples of paired predictive and descriptive paradigms are association rules with probabilistic rules, or clustering with classification. In this thesis, it is most relevant to explain supervised and unsupervised learning from the machine learning domain for pattern recognition, since only these algorithms are used. Descriptive mining tasks characterize properties of the data in the database. Predictive tasks perform induction on the data in order to make predictions [46]. Clustering is an example of a descriptive model and classification of a predictive model. Descriptive methods find human-understandable patterns in the data. Predictive methods use some variables to predict future values or the values of other variables (as in recommendation systems). Since clustering and classification are implemented in this thesis to develop the comparability model, they are discussed in more detail in the following sections.

Figure 3.4: Predictive and descriptive Data Mining tasks [46, 67]

3.2.2 Clustering

Clustering is an unsupervised technique in which the training data, such as observations or measurements, is not accompanied by labels indicating the classes of the observations. Clustering groups objects so that the objects in one cluster are similar to each other and dissimilar to the objects belonging to other clusters. Given only measurements or observations, the aim is to identify classes or clusters in the data. This technique is applied when the given instances are to be grouped naturally, not when classes are to be predicted. Clustering is a form of learning by observation rather than learning by example. It is a common technique in statistical data analysis and in other fields such as machine learning, pattern recognition, and image analysis. The primary goal of clustering is a low inter-cluster similarity, i.e., data objects from different clusters should be dissimilar, and a high intra-cluster similarity, i.e., all data points within a cluster should be mutually similar. Figure 3.5 shows an example of clustering adopted from [46]: a 2-D plot of sample customer data is clustered with respect to the customers' locations in a city, which results in three clusters. Different clustering algorithms are available, as discussed below.

Figure 3.5: Example of three clusters, adopted from [46]

Cluster analysis

Several clustering algorithms exist, and there are likewise different methods of clustering. Most clustering algorithms can be broadly classified into the methods below. Cluster analysis evaluates the clustering and the application of different clustering algorithms using similarity measures. There are several similarity measures; choosing a proper one depends on the attribute type or the application [67]. The three important similarity measures described below are the Euclidean distance, the Manhattan distance, and the cosine similarity.

Similarity measure

Euclidean distance: This is an important and popular measure for numerical attributes. Let $i = (x_{i1}, x_{i2}, \ldots, x_{ip})$ and $j = (x_{j1}, x_{j2}, \ldots, x_{jp})$ be two objects described by p numeric attributes.

$$ d(i, j) = \sqrt{(x_{i1} - x_{j1})^2 + (x_{i2} - x_{j2})^2 + \ldots + (x_{ip} - x_{jp})^2} \tag{3.1} $$

Manhattan (or city block) distance:

$$ d(i, j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \ldots + |x_{ip} - x_{jp}| \tag{3.2} $$

Cosine similarity:

$$ sim(x, y) = \frac{x \cdot y}{\|x\| \, \|y\|} \tag{3.3} $$

where $\|x\|$ is the Euclidean norm of the vector $x = (x_1, x_2, \ldots, x_p)$, defined as $\sqrt{x_1^2 + x_2^2 + \ldots + x_p^2}$. These similarity measures can be used in any of the following methods. The datasets (rentals and returns) used in this thesis consist of numerical attributes, and the most popular similarity measure for numerical attributes is the Euclidean distance. Hence, the Euclidean distance was used throughout the process.
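As a minimal sketch of Equations 3.1 to 3.3, the three measures could be written in Python as follows. The function names and the sample vectors are assumptions made for illustration only; they are not taken from the thesis implementation.

```python
import math

def euclidean(i, j):
    """Euclidean distance between two objects with p numeric attributes (Eq. 3.1)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(i, j)))

def manhattan(i, j):
    """Manhattan (city block) distance (Eq. 3.2)."""
    return sum(abs(a - b) for a, b in zip(i, j))

def cosine_similarity(x, y):
    """Cosine similarity (Eq. 3.3). Raises ZeroDivisionError for a zero vector."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

# Hypothetical two-attribute objects, e.g. (rentals, returns) of two stations.
station_a = (3.0, 4.0)
station_b = (0.0, 0.0)
print(euclidean(station_a, station_b), manhattan(station_a, station_b))
```

Note that the two distances grow with dissimilarity, whereas the cosine similarity grows with similarity and ignores vector length.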

Flat clustering or partitioning method

This type of clustering determines all clusters at once, so that k clusters are formed. Flat clustering creates a flat set of clusters without any explicit structure. The algorithm iterates to improve the quality of the clusters. Flat clustering is efficient and conceptually simple to implement, but it returns an unstructured set of clusters, which is less informative. However, it becomes more informative when combined with Business Intelligence techniques. Examples of flat clustering are the k-means algorithm [52] and k-medoids [48].

Hierarchical clustering

This type of clustering finds a structure (tree) that is more informative than the result of flat clustering. The number of clusters does not need to be specified. New clusters are found using previously found ones, with objects being more related to nearby objects than to objects farther away. Hierarchical clustering is either agglomerative (bottom-up) or divisive (top-down). In agglomerative clustering, each item is initially its own cluster, and clusters are then merged to form larger clusters, e.g., by the single-linkage clustering algorithm. In divisive clustering, on the contrary, all items initially form one cluster, which is then split into smaller clusters. The result is usually a dendrogram (tree diagram). Prominent algorithms are BIRCH [29] and CURE [45].
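The agglomerative (bottom-up) procedure described above can be sketched in Python as follows. This is a naive single-linkage illustration under stated assumptions: the function name, the stopping rule (a target number of clusters), and the sample points are hypothetical, and it is neither BIRCH nor CURE.

```python
import math

def single_linkage(points, num_clusters):
    """Repeatedly merge the two closest clusters until num_clusters remain.

    The distance between two clusters is the smallest Euclidean distance
    between any pair of their members (single linkage).
    Returns the final clusters and the list of merges (the dendrogram steps).
    """
    clusters = [[p] for p in points]  # bottom-up: every item starts as a cluster
    merges = []
    while len(clusters) > num_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(math.dist(p, q) for p in clusters[a] for q in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((list(clusters[a]), list(clusters[b]), d))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters, merges

# Hypothetical 2-D sample data.
sample = [(1.0, 1.0), (2.0, 1.0), (6.0, 5.0), (7.0, 5.0), (15.0, 9.0)]
final_clusters, merge_history = single_linkage(sample, 3)
print(final_clusters)
```

The recorded merge history is what a dendrogram visualizes: each entry names the two clusters that were joined and the distance at which they were joined.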

Hard clustering

In this type of clustering, each item can belong to only a single cluster; in hierarchical clustering, this holds at the lowest level. It is easier to use, and there is no overlap between the clusters. An example is k-means, where each data item belongs to exactly one cluster.

Soft clustering

In this type of clustering, an object can belong to several clusters, so the clusters may overlap. Fuzzy and probabilistic clustering are examples of this type, which is more difficult to use. For example, fuzzy c-means is similar to k-means but performs soft clustering.
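The difference between hard and soft clustering can be illustrated with a short Python sketch. It assumes the standard fuzzy c-means membership formula with a fuzzifier m (here m = 2); the function names, the parameter m, and the sample values are assumptions for illustration and are not part of the thesis.

```python
import math

def hard_assignment(point, centroids):
    """Hard clustering: the index of the single nearest centroid."""
    distances = [math.dist(point, c) for c in centroids]
    return distances.index(min(distances))

def soft_memberships(point, centroids, m=2.0):
    """Soft clustering: one membership degree per centroid, summing to 1.

    Uses the fuzzy c-means membership formula with fuzzifier m > 1 (assumed).
    """
    distances = [math.dist(point, c) for c in centroids]
    if 0.0 in distances:
        # The point coincides with at least one centroid: share full membership there.
        hits = [1.0 if d == 0.0 else 0.0 for d in distances]
        return [h / sum(hits) for h in hits]
    exponent = 2.0 / (m - 1.0)
    return [1.0 / sum((d_i / d_k) ** exponent for d_k in distances)
            for d_i in distances]

# Hypothetical point and centroids.
centroids = [(0.0, 0.0), (4.0, 0.0)]
print(hard_assignment((1.0, 0.0), centroids))
print(soft_memberships((1.0, 0.0), centroids))
```

In the hard case the point is simply reported as a member of one cluster, whereas in the soft case it keeps a small degree of membership in the more distant cluster as well.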


Model-based methods

These methods are based on mathematical models, which are applied to each cluster. Instances are mapped to clusters using probabilistic functions. Expectation-maximization [38] and COBWEB [42] are two important model-based methods.

Density-based methods

Most existing methods cluster objects based on distance and therefore find only spherical-shaped clusters; they cannot find clusters of arbitrary shape. Hence, clustering methods based on density have been developed. The idea is to keep growing a given cluster as long as the number of data points in the neighborhood (the density) exceeds some predefined threshold. DBSCAN [40] is a prominent density-based algorithm.

Grid-based methods

These methods quantize the object space into a finite number of cells. Clustering operations are performed on the quantized space, the grid structure. Processing is fast, since it depends only on the number of cells in each dimension of the grid structure. STING [70] is a popular grid-based clustering algorithm. Having discussed several methods, the difficulty of choosing the number of clusters for these methods is illustrated next.

How many clusters

An important aspect of clustering is choosing the number of clusters. Let k denote the number of clusters. One can either fix k before searching for clusters (expert decision) or leave k undefined and let it depend on some measure of clustering quality [67]. In either case, the right choice depends on the application problem to be solved, or it is made using evaluation techniques (Davies elbow criterion) discussed in Section 3.2.4.
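The quantization step of the grid-based methods described in this section, combined with a density threshold per cell, can be sketched in Python as follows. This is only an illustration of the idea and not an implementation of STING or DBSCAN; the function names, the cell size, and the threshold are assumptions.

```python
from collections import Counter

def quantize(points, cell_size):
    """Map each 2-D point to the integer index of its grid cell and count points per cell."""
    return Counter((int(x // cell_size), int(y // cell_size)) for x, y in points)

def dense_cells(points, cell_size, min_points):
    """Return the cells whose point count reaches the (assumed) density threshold."""
    counts = quantize(points, cell_size)
    return {cell for cell, n in counts.items() if n >= min_points}

# Hypothetical sample points in a 2-D object space.
sample = [(0.2, 0.3), (0.8, 0.9), (0.5, 0.1), (3.1, 3.2), (3.4, 3.9), (7.5, 0.5)]
print(dense_cells(sample, cell_size=1.0, min_points=2))
```

After quantization, further processing operates on the cell counts only, which is why the cost depends on the number of cells rather than on the number of objects.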

3.2.3 Classification

Classification is a supervised learning technique. It is the process of finding a model or function that correctly describes and distinguishes data classes, using data that carries class labels. The data consists of a set of attributes together with a special class attribute that is known only in the training data. The given data is usually divided into training data and test data. The training data is accompanied by a class label (known class label). The learning algorithm takes the training data as input and returns a learned classification model by identifying the relationship between the attributes and the class labels of the training data (induction). This classification model is then applied to new data (test data), which is thereby classified (deduction), as shown in Figure 3.6. Furthermore, the correctness of the classifier is calculated by different measures such as precision, recall, and accuracy. A few of the popular classifiers are the Naive Bayes classifier, support vector machines, the k-NN classifier, rule-based classifiers, and neural networks [46, 67, 71, 21]. The classifier is chosen according to the use case.

Figure 3.6: Classification [46]

Definitions for classification

• Cross-validation: Depending on the data set (e.g., its attributes), it is split into training and test data. One usual technique is k-fold cross-validation, in which the initial data are randomly partitioned into k mutually exclusive subsets, S_1, S_2, ..., S_k, each of roughly equal size. Training and testing are then performed k times. In iteration i, the subset S_i is held out as the test set, and the remaining subsets are used together as the training set.
• Training data: The input data for which the class labels are known. Classification models are built from this data: the learning algorithm takes this data set as input and returns the learned classification model or function. The model generated by the learning algorithm should fit this data set and also correctly predict the class labels of the test data set (data that it has never seen).
• Test data: The input data whose class labels are not shown to the model. The model generated by the learning algorithm is applied to this data set and should correctly predict its class labels; accuracy (precision and recall) is calculated from these predictions.

Table 3.1: Actual value and prediction outcome

                     Condition: A      Condition: Not A
Test says “A”        True Positive     False Positive
Test says “Not A”    False Negative    True Negative

The quantities required to calculate precision, recall, and accuracy are as follows.

◦ True Positives (TP): The number of objects of the positive class (correct class) that were correctly labeled as positive.
◦ False Positives (FP): The negative instances that were incorrectly labeled as positive by the classifier.
◦ True Negatives (TN): The negative instances that were correctly labeled by the classifier.
◦ False Negatives (FN): The positive instances that were mislabeled as negative.

With these quantities, precision and recall can be calculated.

◦ Recall: It is the measure of completeness, that is,

$$ Recall = \frac{TP}{TP + FN} \tag{3.4} $$

◦ Precision: It is the measure of exactness, that is,

$$ Precision = \frac{TP}{TP + FP} \tag{3.5} $$

Recall, precision, and accuracy are all calculated from the counts shown in Table 3.1.

◦ Accuracy: The percentage of test set tuples that are correctly classified by the classifier, i.e., $(TP + TN)/(TP + FP + TN + FN)$. The classification can be carried out iteratively to improve accuracy.

Having discussed the basics of clustering and classification, it is necessary to understand the algorithms that were implemented in this work.
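As a minimal sketch of the evaluation measures defined in Section 3.2.3, the counts of Table 3.1 and Equations 3.4 and 3.5 could be computed in Python as follows. The function names and the class labels "commuter" and "leisure" are hypothetical and chosen only for illustration.

```python
def confusion_counts(actual, predicted, positive):
    """Count TP, FP, TN, FN for one positive class (see Table 3.1)."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1
        elif p == positive:
            fp += 1          # predicted positive, actually negative
        elif a == positive:
            fn += 1          # predicted negative, actually positive
        else:
            tn += 1
    return tp, fp, tn, fn

def recall(tp, fn):
    return tp / (tp + fn)            # Eq. 3.4

def precision(tp, fp):
    return tp / (tp + fp)            # Eq. 3.5

def accuracy(tp, fp, tn, fn):
    return (tp + tn) / (tp + fp + tn + fn)

# Hypothetical test data with known and predicted class labels.
actual    = ["commuter", "commuter", "leisure", "leisure", "commuter", "leisure"]
predicted = ["commuter", "leisure", "leisure", "commuter", "commuter", "leisure"]
tp, fp, tn, fn = confusion_counts(actual, predicted, positive="commuter")
print(recall(tp, fn), precision(tp, fp), accuracy(tp, fp, tn, fn))
```

The sketch does not guard against empty denominators, which occur when no instance is predicted positive or no positive instance exists.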


3.2.4 Unsupervised algorithms

In this thesis, three clustering algorithms, namely k-means, the EM algorithm, and k-medoids, were tested on the dataset, and the best algorithm was finally chosen to develop the comparability model. The algorithms are as follows.

K-means

The k-means algorithm [52] is an exclusive clustering algorithm and one of the partitioning methods. It is a simple flat clustering algorithm in which each object belongs to precisely one cluster (hard clustering). The primary goal of k-means clustering is a high intra-cluster similarity. The most important notion of k-means is the centroid (the center of a cluster), which is why it is also called a centroid-based technique: each cluster C_i is represented by its centroid c_i. K-means clustering can use different measures, such as the Euclidean distance or the cosine similarity, or nominal measures such as the Jaccard similarity. The similarity is calculated with respect to the centroid of a cluster, which is frequently an imaginary point whose attribute values are the averages of the values of all instances in the cluster. The iteration continues, and data points are reassigned to clusters, until the assignments no longer change. The goal is to minimize the sum E of the squared (Euclidean) distances between each data object p and the centroid c_i of its cluster C_i, as given in Equation 3.6. The measure depends on the application and the data used. The Euclidean distance is a common and easy-to-implement measure. The first step of the k-means algorithm is to decide the number of clusters k. The right choice of k depends on the problem to be solved and on experience (several iterations). Several studies suggest that k-means is appropriate for medium-sized datasets, which is one reason it was chosen in this work.

$$ E = \sum_{i=1}^{k} \sum_{p \in C_i} dist(p, c_i)^2 \tag{3.6} $$
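The objective in Equation 3.6 and the iterative reassignment that reduces it can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in this thesis: the function names, the optional `initial_centroids` parameter, the `seed` and `max_iter` parameters, and the sample data are all hypothetical.

```python
import math
import random

def mean_point(points):
    """Centroid of a cluster: attribute-wise mean of its points."""
    n = len(points)
    return tuple(sum(coord) / n for coord in zip(*points))

def assign(data, centroids):
    """Assign every object to the cluster with the nearest centroid."""
    clusters = [[] for _ in centroids]
    for p in data:
        nearest = min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
        clusters[nearest].append(p)
    return clusters

def sse(clusters, centroids):
    """Sum of squared distances E between objects and their centroid (Eq. 3.6)."""
    return sum(math.dist(p, c) ** 2
               for cluster, c in zip(clusters, centroids) for p in cluster)

def k_means(data, k, initial_centroids=None, seed=0, max_iter=100):
    # Arbitrarily choose k objects as initial centroids unless they are given.
    centroids = list(initial_centroids) if initial_centroids else random.Random(seed).sample(data, k)
    for _ in range(max_iter):
        clusters = assign(data, centroids)
        # An empty cluster keeps its previous centroid (a simplifying assumption).
        new_centroids = [mean_point(c) if c else centroids[i] for i, c in enumerate(clusters)]
        if new_centroids == centroids:   # no change: converged
            break
        centroids = new_centroids
    return assign(data, centroids), centroids

# Hypothetical 2-D data with two visible groups.
data = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
clusters, centroids = k_means(data, k=2)
print(centroids, sse(clusters, centroids))
```

Because the result depends on the initial centroids, the sketch accepts them explicitly so that a run can be repeated; in practice several random initializations are often compared by their value of E.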

Algorithm: The k-means algorithm for clustering, where each cluster's center is represented by the centroid (mean value) of the instances in the cluster. Pseudo code of the k-means algorithm is given below.

Input: k, the number of clusters, and a data set D containing n objects
Output: A set of k clusters
Method:
• arbitrarily choose k objects from D as the initial cluster centroids
• repeat
• based on the selected cluster centroids, (re)assign each object to the cluster to which the object is the most similar
• update the cluster centroids by calculating the mean value of the objects for each cluster
• until no change

Working:
• It begins by randomly selecting k data points as the initial centroids, where k is the number of clusters (centroids are frequently imaginary points).
• Furthermore, it creates k empty clusters.
• It then assigns exactly one centroid to each cluster.
• After that, it iterates over all instances and assigns each data point to the one cluster with the nearest centroid (mean).
• After each iteration, it recomputes the cluster centroids based on the newly assigned data points.
• Finally, it checks whether the clustering is good enough (no change); if not, it repeats the assignment and update steps.

It is relatively efficient: O(nkt)
• n: number of objects
• k: number of clusters
• t: number of iterations t