Ulm University | 89069 Ulm | Germany
Machine Learning Master Thesis at Ulm University
Submitted by: Shima Zarei
[email protected]
Reviewers: Prof. Dr. Manfred Reichert, Dr. Rüdiger Pryss
Advisor: Klaus Kammerer
2017
Faculty of Engineering, Computer Science and Psychology
Institute of Databases and Information Systems
Revision August 15, 2017
© 2017 Shima Zarei
This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 3.0 License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-sa/3.0/de/ or send a letter to Creative Commons, 543 Howard Street, 5th Floor, San Francisco, California, 94105, USA. Typesetting: pdfLaTeX 2ε
Abstract

The demand for Machine Learning applications is increasing, as marketing today relies heavily on computer engineering to handle changing trends, and Machine Learning algorithms and tools are continuously extended to this end. Machine Learning algorithms are also developed in the field of Artificial Intelligence to build robots and electronic devices, and in the field of information systems to develop information retrieval systems. This thesis introduces Machine Learning, Data Mining, and cloud computing. Different types of learning, namely supervised learning, unsupervised learning, semi-supervised learning, and reinforced learning, are examined, and the algorithms associated with each learning type, such as Decision Trees, Support Vector Machines, K-Means, Neural Networks, Random Forests, clustering algorithms, regression and classification algorithms, Temporal Difference learning, Sarsa, and Monte Carlo methods, are explained in detail together with the corresponding methods. In addition, applications of Machine Learning are described to illustrate the use of the algorithms. Big Data handling techniques such as Apache Hadoop, Hive, Pig, and cloud computing are explained, and software tools for implementing these methods are assessed and compared. The data preparation process and its issues are examined as well. Furthermore, several algorithms, namely linear regression, logistic regression, Support Vector Machines, neural networks, decision trees (with bagging and pruning), Random Forests, and K-Means, are implemented on the Forest Fires and Iris data sets with RapidMiner Studio and the R language. Finally, the results of these algorithms are compared and analyzed.
Acknowledgement

I would like to thank Prof. Dr. Manfred Reichert, Dr. Rüdiger Pryss, and Klaus Kammerer, who advised me throughout this thesis. I am deeply grateful to my beloved parents, who encouraged me to extend my knowledge and have supported me throughout my life, and to my friends, who gave me heart.
Contents

1 Introduction
  1.1 Problem Settings
  1.2 Objective
  1.3 Structure of Work
2 Fundamentals
  2.1 What is Machine Learning
  2.2 Varieties of Machine Learning
    2.2.1 Supervised Learning
    2.2.2 Unsupervised Learning
    2.2.3 Semi-supervised Learning
    2.2.4 Reinforced Learning
  2.3 Other Types of Learning
    2.3.1 Active Learning
    2.3.2 Learning to Rank
    2.3.3 Structured Learning
    2.3.4 Transductive Learning
    2.3.5 Self-Learning
    2.3.6 Time-series Learning
  2.4 Big Data Handling Techniques
    2.4.1 MapReduce
    2.4.2 Hadoop
    2.4.3 Hive
    2.4.4 Pig
    2.4.5 WibiData
    2.4.6 SkyTree
    2.4.7 Cloud Computing
3 Applications for Machine Learning
  3.1 Machine Learning Applications
    3.1.1 Information Retrieval
    3.1.2 Computer Vision
    3.1.3 Bioinformatics
    3.1.4 Machine Learning Control
    3.1.5 Credit Card Fraud
    3.1.6 Economics
4 Machine Learning Algorithms
  4.1 Introduction
  4.2 Supervised Learning Algorithms
    4.2.1 Regression
    4.2.2 Classification
    4.2.3 Artificial Neural Network
    4.2.4 Bayesian Statistics
    4.2.5 Maximum Entropy Classifier
    4.2.6 K-Nearest Neighbor Algorithm
    4.2.7 Support Vector Machines
    4.2.8 Conditional Inference Trees
  4.3 Unsupervised Learning Algorithms
    4.3.1 Clustering
    4.3.2 Mixture Model
    4.3.3 Anomaly Detection
    4.3.4 Neural Networks
  4.4 Reinforced Learning Algorithms
    4.4.1 Temporal Difference Learning
    4.4.2 Monte Carlo Algorithm
    4.4.3 Sarsa On-policy TD Control
    4.4.4 Learning Automata
    4.4.5 Q-learning
    4.4.6 Deep Q-learning Training
5 Data Preparation
  5.1 Implementing Machine Learning Algorithms
    5.1.1 Define the Problem
    5.1.2 Prepare Data
    5.1.3 Modeling
    5.1.4 Evaluation
    5.1.5 Deployment
  5.2 Handling Data Quality Issues
    5.2.1 Handling Missing Values
  5.3 Visualizing Pairs of Categorical Features
    5.3.1 Visualizing Categorical and Continuous Features
  5.4 Measuring Covariance and Correlation
    5.4.1 Normalization
    5.4.2 Binning
    5.4.3 Sampling
6 Software Tools for Machine Learning
  6.1 Machine Learning Implementation Software
    6.1.1 Azure Studio Software
    6.1.2 R Language
    6.1.3 RapidMiner Studio
    6.1.4 KNIME
    6.1.5 File Format Comparison
7 Examples for Machine Learning Application
  7.1 Implementing Examples
  7.2 Forest Fires Data Set
    7.2.1 Data Set Explanation
    7.2.2 SVM Model
    7.2.3 Decision Tree
    7.2.4 Random Forest
  7.3 Comparison of Decision Tree Bagging and Random Forest
    7.3.1 Neural Network
    7.3.2 Comparison of Models
  7.4 Iris Data Set
    7.4.1 Summary Statistics
    7.4.2 K-means Implementation
8 Discussion
  8.1 Pros and Cons of the Software Tools
    8.1.1 Microsoft Azure
    8.1.2 R Language
    8.1.3 RapidMiner Studio
    8.1.4 KNIME
9 Conclusion
A Sources
  A.1 Linear Regression R Code
  A.2 Logistic Regression R Code
  A.3 Decision Tree R Code
  A.4 Decision Tree Bagging R Code
  A.5 Random Forest R Code
  A.6 Neural Network R Code
  A.7 Comparison of Decision Tree Bagging and Random Forest
  A.8 Forest Fires SVM R Code
  A.9 Iris K-means R Code
1 Introduction
Learning, like intelligence, is a concept that covers processes which are difficult to define precisely; broadly, it means extracting patterns from data as the data are processed [1]. A machine learns whenever it changes its structure, program, or data in a way that improves its future performance [1]. Machine Learning is typically concerned with changes in systems that perform tasks related to artificial intelligence (AI), such as recognition, prediction, robot control, diagnosis, and planning. In data mining, it is used to discover relationships hidden in data. There are several views on how to define Machine Learning. First, the Artificial Intelligence view sees learning as a way to capture human knowledge and intelligence; it is vital for constructing intelligent machines, and automatic learning is prominent in this field, a mobile device and its software being one example [1]. Second, the Software Engineering view regards learning as a way to build systems that would be harder to program by hand; web design, which aims to develop web applications and websites, is one instance. Third, the Statistics view holds that Machine Learning is closely related to computer science and statistics, applying computational techniques to statistical problems; here, speed is often more important than accuracy. Machine Learning consists of two phases, training and application: in the training phase a model learns from a collection of training data, and in the application phase the model is used to make decisions about new test data. Machine Learning is a sub-field of information systems engineering with a strong focus on data manipulation. It divides into major categories such as supervised learning, unsupervised learning, and reinforced learning [1].
In the following chapters, the fundamentals of Machine Learning, including the different types of learning, are presented, together with various applications and the algorithms corresponding to each of them. In addition, data preparation techniques, the tools for implementing them, and example data sets on which they are applied are explained.
1.1 Problem Settings
This work gives a general overview of Machine Learning and related topics. The practical problem considered in this work concerns two data sets. The first is the Forest Fires data set, which contains a set of features describing different areas of a park that are in danger of burning. In this work, an SVM (eps-regression) is applied to this data set in order to find the areas most likely to burn. Moreover, several other algorithms are applied to this data set, their results are reported, and the best-performing algorithm based on these results, Decision Tree with bagging, is identified. RapidMiner Studio and the R language are used to apply the various algorithms to the Forest Fires data set. The second is the Iris data set, which consists of various features of Iris flowers and their corresponding species. The K-means algorithm is applied to this data set to cluster the different species of Iris flowers based on their features. RapidMiner Studio and the R language are used for this purpose as well, and the results are illustrated and analyzed. The R code for these experiments is available in Appendix A.
1.2 Objective
The objective of this work is to show how Machine Learning methods can be used and implemented and what each algorithm is suited for. Furthermore, the algorithms are executed with the RapidMiner Studio software, and R code for several of the models is provided.
1.3 Structure of Work
Chapter 2 introduces Machine Learning, defines learning, presents the various types of learning, and explains Big Data processing technologies. Chapter 3 describes applications associated with the different types of Machine Learning algorithms. Chapter 4 then explains a variety of algorithms for each type of learning: supervised, unsupervised, and reinforced learning. Chapter 5 covers the data preparation needed to implement these algorithms, and Chapter 6 introduces tools for executing them. In Chapter 7, as an illustration, the Forest Fires and Iris data sets are examined, several algorithms are applied to them, and the results are analyzed. The discussion in Chapter 8 lists the advantages and disadvantages of each tool in order to compare them. Finally, the conclusion in Chapter 9 considers the future of Machine Learning and cloud computing. Appendix A contains the R code for the algorithms applied in the data set examples.
2 Fundamentals

2.1 What is Machine Learning

Machine Learning (ML) builds on probability to extract patterns from data. It is closely related to statistics and to other disciplines such as brain models, psychological models, artificial intelligence, and evolutionary models [1]. The point of Machine Learning is to make computers learn without extensive hand-coding. Computer vision, handwriting recognition, face recognition, pattern recognition, and data mining are among the most important examples of Machine Learning. Machine Learning is used for different purposes, for example when the relationship between data items is not recognizable or the outcome of a system is not predictable. It is also essential for handling noise or delay in the operation of a device, when prior knowledge about a task is insufficient for a human to encode it, or when the environment of a task changes over time [1]. Machine Learning tries to classify and organize data in a way that is understandable for humans [2]. ML is closely tied to programming, and applying ML techniques requires certain skills; the main strategies for getting started are to study and to implement an ML algorithm [3]. It is recommended to use an environment that provides tools for data preparation, ML algorithms, and the presentation of results, since this gives a good feel for the ML process and for data preparation techniques. Useful tactics include summarizing the capabilities of each tool and comparing them with one another, reading the documentation, and following video tutorials on these techniques. To study ML data sets, it is recommended to use medium-sized data sets from a library of ML data sets such as the UCI Machine Learning repository. The corresponding tactics are to clearly describe the problem that the data set represents, to summarize the data using descriptive statistics, and to tune algorithms in order to discover configurations that perform well on the problem. To study an ML algorithm, one should select an algorithm, understand it intimately, and discover parameter configurations that are stable across different data sets. Selecting an algorithm of modest complexity makes it easier to understand; such algorithms also tend to have many open-source implementations and few parameters to explore. To this end it helps to study an ML library, which keeps the focus on the behaviour of the algorithm. Related tactics are summarizing the system parameters and their expected influence on the algorithm, designing small experiments that combine one or two data sets, algorithm configurations, and behaviour measures in order to answer a specific question, and reporting the results. To implement an ML algorithm, one should select an algorithm, preferably one of low complexity, and implement it in a language such as R. Small projects are recommended: small in time, small in scope, and small in resources [3].
2.2 Varieties of Machine Learning
In data mining, learning is understood through changes in an object's behaviour and by comparing the consequences with previous outcomes [4]. Learning thus relates to performance rather than knowledge. This remains problematic, because objects change continuously and such change cannot simply be called learning. It should be mentioned that data mining presents learning in a practical rather than theoretical way and is closely related to the field of Machine Learning [4]. Typically, Machine Learning algorithms try to describe the procedure of learning, to show what has been learned, and to express it as a set of rules [4]. The learning procedure has four major types: supervised learning, unsupervised learning, semi-supervised learning, and reinforced learning [1].
2.2.1 Supervised Learning

In supervised learning, the data are split into inputs and targets, and the task is to map the input data, which serve as training data, to the target values [5]. These data are labeled, meaning each example carries a title; in a web application, for instance, the training data carry labels or results such as spam/not-spam or a stock price at a given time. A model entering the learning process makes predictions and is corrected when those predictions are wrong. The training process continues until the model achieves a desired level of accuracy on the training data [3].
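As a minimal illustration of this workflow, the following R sketch trains a model on labeled examples and checks its predictions on held-out data; the use of the built-in iris data and of the rpart decision-tree package is purely illustrative and not tied to the experiments reported later in this thesis.

```r
# Supervised learning sketch: learn a mapping from labeled inputs to targets.
library(rpart)

set.seed(42)
idx   <- sample(nrow(iris), round(0.7 * nrow(iris)))  # random train/test split
train <- iris[idx, ]                                  # labeled training examples
test  <- iris[-idx, ]                                 # held-out examples for evaluation

model <- rpart(Species ~ ., data = train)             # Species is the label (target)
pred  <- predict(model, test, type = "class")         # predict labels for unseen inputs
mean(pred == test$Species)                            # accuracy on the test data
```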
2.2.2 Unsupervised Learning

In unsupervised learning, the data are unlabeled, meaning they carry no titles and it is not explicitly clear what the data represent; for example, they may simply be a table of numbers. These data are assumed to be independently and identically distributed in feature space. The aim of unsupervised learning is to identify structure in the feature space in order to learn the process that generated the data, and during learning the data are typically split into clusters. For instance, Google News uses clustering to categorize its news into groups, which makes it easier for people to find stories they have not seen before. Dimensionality reduction, quantile estimation, outlier detection, and clustering are typical examples of unsupervised learning [5]. Supervised and unsupervised learning can be contrasted through an example. Suppose there is a basket of different fruits, such as apples, bananas, cherries, and grapes, and the task is to split them into groups. Based on previous experience or knowledge, fruits of the same type, judged by characteristics like color, taste, and size, are arranged into the same groups. This task is supervised learning, because there is prior knowledge about the types of fruit. Now suppose instead that there is a basket of fruits about which no prior knowledge exists and the fruits are observed for the first time. To arrange the fruits into groups, one fruit is considered and the others are arranged according to its characteristics. For instance, if color is taken into account and there are two colors, red and green, then apples and cherries end up in one group and grapes and bananas in another [3].
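A corresponding unsupervised sketch in R uses the base kmeans function on the iris measurements with their species labels removed; the settings here are illustrative, and the k-means experiment reported in Chapter 7 may differ.

```r
# Unsupervised learning sketch: cluster unlabeled feature vectors with k-means.
features <- iris[, 1:4]              # drop the labels, keep only numeric features
set.seed(42)
km <- kmeans(features, centers = 3)  # the analyst chooses the number of clusters

table(km$cluster, iris$Species)      # compare discovered clusters with the true species
```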
2.2.3 Semi-supervised Learning

Semi-supervised learning is a type of learning in which one part of the data set is labeled and the other part is unlabeled [5]. Four important assumptions are associated with semi-supervised learning. The first is the smoothness assumption, which states that if two inputs are close to each other, then the corresponding outputs are close as well. The second is the cluster assumption, which states that points in the same cluster are likely to belong to the same class. The third is the low-density assumption, which implies that decision boundaries should pass through low-density regions. The fourth is the manifold assumption, which states that high-dimensional data lie on a low-dimensional manifold. The main problem for learning algorithms is dimensionality: the volume of the input space grows exponentially with the number of dimensions, and statistical tasks such as reliable density estimation require an exponentially growing number of examples. Classification and regression problems are examples of this kind of learning; the manifold assumption helps here by allowing accurate density and distance estimation. The data structure associated with this learning type can be represented as a graph in which the data points are the nodes and the edges are labeled with the pairwise distances of the incident nodes. A typical example of semi-supervised learning is speech recognition, where recording a huge amount of data is easy, but labeling it requires humans to listen to it and prepare transcripts [5].
2.2.4 Reinforced Learning

Reinforced learning has two main foundations, the Markov Decision Process (MDP) and the bandit problem. An MDP is an optimal control problem [6]; it can be solved with dynamic programming as well as with reinforced learning methods. An MDP consists of four parts: a set of states S, a set of actions a, a real-valued reward function R(s, a), and a description of the effects of each action in each state. According to the Markov property, the effect of an action depends only on the current state, not on the prior history. In an MDP, the state is determined first, and then an action is executed according to a policy; actions can be deterministic or stochastic. An MDP also involves objective functions, which map infinite sequences of rewards to single real numbers. In the bandit problem, there are different possible actions, and choosing an action yields a numerical reward; the objective is to maximize the expected total reward over some period of time. This is the original bandit problem. An example is treatment selection: there are T patients with the same symptoms waiting for treatment and two kinds of drug, one of which is better than the other, and the treatment can be chosen based on past successes and failures [6]. Reinforced learning tries to connect situations to actions so as to maximize a numerical reward. In contrast to most algorithms, a reinforced learner is not told which action to select; it must try actions to discover which one yields the highest reward [7]. In some cases, actions also affect future situations and subsequent rewards. The two main characteristics of reinforced learning are delayed reward and trial-and-error search. In some cases supervised learning is not suitable and reinforced learning is used instead [7]. For stochastic tasks, each action must be tried several times to estimate its expected reward precisely. Another view of the problem is that of a goal-oriented agent interacting with an uncertain environment; reinforced learning is thus a kind of interaction between artificial intelligence and other disciplines. The elements of a reinforced learning problem are a policy, a reward function, a value function, and, optionally, a model of the environment [6].
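To make the MDP ingredients concrete, the following R sketch solves a tiny, entirely made-up two-state MDP with value iteration, one of the dynamic programming solution methods mentioned above; the states, actions, rewards, and discount factor are all illustrative assumptions.

```r
# Value iteration on a toy MDP with 2 states and 2 actions.
# P[s, a, s2] = probability of moving from state s to s2 under action a.
P <- array(0, dim = c(2, 2, 2))
P[1, 1, ] <- c(0.9, 0.1); P[1, 2, ] <- c(0.2, 0.8)
P[2, 1, ] <- c(0.1, 0.9); P[2, 2, ] <- c(0.8, 0.2)
R <- matrix(c(1, 0,
              0, 2), nrow = 2, byrow = TRUE)  # immediate reward R(s, a)
gamma <- 0.9                                  # discount factor
V <- c(0, 0)                                  # initial value estimates

for (i in 1:100) {
  # Q(s, a) = R(s, a) + gamma * sum over s2 of P(s2 | s, a) * V(s2)
  Q <- R + gamma * apply(P, c(1, 2), function(p) sum(p * V))
  V <- apply(Q, 1, max)                       # Bellman optimality update
}
V                                             # estimated optimal state values
apply(Q, 1, which.max)                        # greedy action in each state
```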
Key Ingredients of Reinforced Learning

First, a deterministic greedy policy will not explore all actions [8]. Second, the agent knows nothing about the environment at the beginning. Third, it needs to try all actions in order to find the optimal one. To maintain exploration, soft policies are used instead, with π(s, a) > 0 for all s and a. An important special case is the ε-greedy policy: with probability 1 − ε the optimal (greedy) action is performed, with probability ε a random action is performed, and the policy is slowly moved towards the greedy policy as ε → 0. This is also a philosophical motivation for deep reinforced learning: the takeaway from supervised learning is that neural networks are great at memorization but not yet great at reasoning, so reinforced learning can be used [8].
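The ε-greedy idea can be sketched in a few lines of R on a made-up three-armed bandit; the reward probabilities, ε, and the number of steps below are illustrative assumptions only.

```r
# Epsilon-greedy action selection on a toy bandit problem.
set.seed(42)
true_reward <- c(0.2, 0.5, 0.8)   # hypothetical expected reward of each action
Q <- rep(0, 3)                    # estimated action values
N <- rep(0, 3)                    # number of times each action was tried
epsilon <- 0.1

for (t in 1:1000) {
  if (runif(1) < epsilon) {
    a <- sample(3, 1)             # explore: pick a random action
  } else {
    a <- which.max(Q)             # exploit: pick the current best action
  }
  r <- rbinom(1, 1, true_reward[a])   # observe a stochastic reward
  N[a] <- N[a] + 1
  Q[a] <- Q[a] + (r - Q[a]) / N[a]    # incremental average of observed rewards
}
round(Q, 2)                           # estimates approach true_reward
```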
2.3 Other Types of Learning

There are further types of Machine Learning that build on the four main categories of Machine Learning algorithms:
2.3.1 Active Learning

Active learning is a category of supervised learning in which the data are divided into labeled and unlabeled groups, and some of the unlabeled data are labeled during the learning procedure [5]. In statistics it is also called optimal experimental design [9]. For instance, in a protein engineering problem the data set consists of different types of proteins, each performing one activity; during the learning procedure some data are labeled, some remain unlabeled, and some acquire labels [10].
2.3.2 Learning to Rank

Learning to rank is a branch of supervised, unsupervised, or reinforced learning that aims to construct ranking models for information retrieval systems [11]. The training data consist of lists of items with a partial order, where the order is given by a numerical or ordinal score or by a binary judgment for each item. The goal is to rank the items of new, unseen lists in a way that is consistent with the rankings in the training data [11]. In this method, documents are represented as numerical vectors, called feature vectors, whose features can be split into three groups: query-independent (static) features, which describe a document's relevance regardless of the query, such as shape-based features obtained from a static frame [12]; query-dependent (dynamic) features, which rank results based on their likelihood with respect to the query, such as speech features or noise obtained from a dynamic frame; and query-level features, which quantify a query and have the same value across all documents in a sample [12]. Examples of such features are term frequency (TF), which counts how often a term occurs in a document; term frequency-inverse document frequency (TF-IDF), which increases with the number of times a word appears in a document but is offset by the frequency of the word in the whole collection [13] and is used to rank documents with respect to a query [14]; and language-modeling scores of a document's zones [15].
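As a small illustration of the TF-IDF feature mentioned above, the following base-R sketch computes term frequencies and inverse document frequencies for a made-up toy collection of three "documents":

```r
# TF-IDF sketch: terms frequent in a document but rare in the collection score high.
docs <- list(c("machine", "learning", "ranking"),
             c("learning", "to", "rank", "documents"),
             c("document", "ranking", "with", "machine", "learning"))

vocab <- unique(unlist(docs))
tf    <- sapply(docs, function(d) table(factor(d, levels = vocab)) / length(d))
df    <- rowSums(tf > 0)          # number of documents containing each term
idf   <- log(length(docs) / df)   # rare terms get a high inverse document frequency
round(tf * idf, 3)                # TF-IDF weight of each term in each document
```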
2.3.3 Structured Learning

Structured learning predicts structured objects rather than scalar discrete or real values [16]. One of the easiest ways to understand this type of learning is the structured perceptron of Collins [17]; this algorithm combines the perceptron algorithm for learning linear classifiers with an inference algorithm (classically the Viterbi algorithm when used on sequence data). In fact, structured learning is a generalized form of supervised learning. The goal is to recover the structure of a large and complicated body of data, such as a sequence or a graph; regression and classification are related problems. An example of structured learning is producing text from spoken sentences. One could try to assign a class to each possible output, but the huge number of classes would lead to incorrect classification, so it is better to consider a class per word. Structured prediction tries to overcome these problems by using a loss function that is suitable for the domain [16].
2.3.4 Transductive Learning

Transduction refers to reasoning from observed, specific training cases to specific test cases [5]. Recall that labeled data carry a title and unlabeled data do not; the training set is the set of data from which the learner tries to learn the behaviour of the objects, and the test set is a set of data split off from the data set to test the result of learning. In the transductive setting, the training data are labeled, the test set is unlabeled, and the task is to predict the outcomes for the test set only. This is in contrast to inductive learning, where the goal is to produce a prediction function defined on the entire space. For instance, suppose there are three groups of data points, A, B, and C. Some data points are labeled A, B, or C, and some are not labeled at all. To solve this, the notion of a cluster can be used: each unlabeled data point is assigned to the corresponding cluster and labeled A, B, or C according to that cluster's label (that is, a separate cluster is formed for the A, B, and C data points) [5].
2.3.5 Self-Learning

Self-learning addresses the use of unlabeled data in classification problems and is also known as self-training, self-labeling, or decision-directed learning [5]. It is implemented as a wrapper algorithm that repeatedly applies a supervised learning method, initially training on the labeled data only. In each step, a part of the unlabeled points is labeled according to the current decision function, and the supervised learner then uses its own predictions as additional labeled points. Note that if self-learning is combined with plain empirical risk minimization, the unlabeled data have no influence on the solution; in that case a margin-maximizing method can be used, which pushes the decision boundary away from the unlabeled points. Self-learning is also related to unsupervised learning [5].
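The wrapper idea can be sketched in R as follows; the use of iris, the rpart package, the 30-example labeled subset, and the 0.95 confidence threshold are all illustrative assumptions rather than a prescribed recipe.

```r
# Self-training sketch: train on labeled data, pseudo-label confident predictions, retrain.
library(rpart)

set.seed(42)
labeled   <- iris[sample(nrow(iris), 30), ]   # small labeled subset
unlabeled <- iris[, 1:4]                      # the rest is treated as unlabeled features

model <- rpart(Species ~ ., data = labeled)   # 1) supervised learning on labeled data only
prob  <- predict(model, unlabeled)            # 2) class probabilities for unlabeled points
pick  <- apply(prob, 1, max) > 0.95           # 3) keep only very confident predictions

pseudo         <- unlabeled[pick, ]
pseudo$Species <- factor(colnames(prob)[max.col(prob[pick, , drop = FALSE])],
                         levels = levels(iris$Species))
labeled <- rbind(labeled, pseudo)             # 4) add pseudo-labeled points ...
model   <- rpart(Species ~ ., data = labeled) #    ... and retrain the supervised learner
```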
2.3.6 Time-series Learning

A time series is a sequence of values taken by a variable over periods such as months, quarters, or years [18]. Time-series data occur wherever the same measurements are recorded regularly, and they generally have a temporal ordering; the heights of ocean tides are one example [18]. Time-series learning is a form of time-series analysis based on time: it uses the temporal differences between data points to analyze them. It differs from spatial analysis, which is concerned with geographical locations. A time-series model generally reflects the fact that observations close together in time are more closely related than observations further apart, and values are often modeled in terms of past values. Time series are very frequently plotted with line charts and are used in statistics, signal processing, pattern recognition, and econometrics [18]. Methods for time-series analysis fall into two categories, frequency-domain methods and time-domain methods [19]: the former include spectral and wavelet analysis, the latter auto-correlation and cross-correlation analysis. In addition, time-series techniques can be divided into parametric and non-parametric techniques. Parametric techniques estimate the parameters of a model that describes the underlying stochastic process, whereas non-parametric approaches estimate the spectrum of the process explicitly, without assuming any particular structure. Furthermore, methods of time-series analysis can be split into linear and non-linear, and univariate and multivariate. Time-series analysis can be applied to real-valued continuous data, discrete numeric data, and discrete symbolic data [19], and it comprises exploratory analysis, prediction and forecasting, classification, regression analysis, and signal estimation. Common time-series models are the auto-regressive (AR) model, the moving average (MA) model, the auto-regressive moving average (ARMA) model, and the auto-regressive integrated moving average (ARIMA) model. Time-series learning is also used for anomaly detection in various problems and data sets, as mentioned previously [20].
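A minimal R illustration of these ideas uses the built-in AirPassengers series and the base ar function (an AR model, one of the families listed above); the choice of series and of a log transform is purely illustrative.

```r
# Time-series sketch: plot a monthly series and fit an autoregressive model.
data(AirPassengers)               # built-in monthly airline passenger counts, 1949-1960
plot(AirPassengers)               # time series are very frequently plotted as line charts

fit <- ar(log(AirPassengers))     # fit an AR(p) model; the order is chosen by AIC
predict(fit, n.ahead = 12)$pred   # forecast the next 12 months (on the log scale)
```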
2.4 Big Data Handling Techniques
Big data refers to amounts of data so large that they cannot be handled with traditional computing techniques. Big data is a complete package that involves various tools, techniques, and frameworks. Several techniques can be applied to huge data sets; they are able to fetch and organize data from data sets that reside on different servers, in different clusters, and even in different parts of the world. The most important ones are explained here:
2.4.1 MapReduce

MapReduce is a programming pattern that allows massive jobs to be executed scalably across thousands of servers or clusters of servers [21]. Each MapReduce implementation involves two tasks. The first is the Map task, in which the input data are converted into a set of key/value pairs, or tuples. The second is the Reduce task, in which several outputs of the Map task are combined, for example by aggregation, to produce a reduced set of tuples. The Reducer has three phases. The first is Shuffle: the input to the Reducer is the sorted output of the Mappers, and the framework fetches the relevant partition of the output of all the Mappers via HTTP. The second phase is Sort, which groups the Reducer inputs by key; shuffle and sort happen simultaneously, as mapper outputs are merged while they are being fetched [21]. The third phase is the secondary sort: if the equivalence rules for ordering the intermediate keys must differ from those for grouping keys before reduction, a separate comparator can be specified. With this pattern, various algorithms can be implemented: Stripes and Pairs, two design patterns for computing the word co-occurrence matrix of a large text collection, where with Pairs each co-occurring word pair is stored separately and with Stripes all words co-occurring with a conditioning word are stored together in an associative array; Merge join, which reads and compares two sorted inputs one row at a time, emits merged rows when they are equal, and then advances to the next row of each input; Inverted index, an index data structure storing a mapping from content such as words or numbers to its location in a database or a set of documents; Nested loop join, which joins two data sets with two nested loops; as well as median counts, term frequency, and document frequency [21].
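The map and reduce steps can be imitated in plain R on a toy word-count example; this only mirrors the programming model on a single machine, whereas a real job would be distributed by Hadoop across a cluster.

```r
# MapReduce-style word count: map emits (word, 1) pairs, reduce sums counts per key.
documents <- c("machine learning on big data",
               "big data needs big clusters")

# Map: one (key, value) pair per word in each document
pairs <- do.call(rbind, lapply(documents, function(doc) {
  data.frame(key = strsplit(doc, " ")[[1]], value = 1)
}))

# Shuffle/Reduce: group the pairs by key and aggregate the values
tapply(pairs$value, pairs$key, sum)
```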
2.4.2 Hadoop

Apache Hadoop is an open-source platform for handling big data and the best-known implementation of MapReduce. It can work with multiple data sources, aggregating them for large-scale processing, or even read data from a database in order to run processor-intensive Machine Learning jobs. With Hadoop, different MapReduce tasks containing various algorithms can be run in parallel on large clusters [21].
2.4.3 Hive

Hive is a big data technology that organizes reading, writing, and managing data in a distributed fashion. It also includes a command-line interface and a JDBC driver for connecting users to Hive [22]. Hive allows SQL-like queries to be run against a Hadoop cluster. Originally developed by Facebook, it has been open source for some time now; it is a higher-level abstraction of the Hadoop framework that allows anyone to issue queries against data stored in a Hadoop cluster just as if they were working with a conventional data store [23].
2.4.4 Pig

Pig is similar to Hive in that it tries to bring Hadoop closer to the realities of developers and business users. Instead of an SQL-like language, Pig provides a Perl-like language that allows queries to be executed over data stored on a Hadoop cluster. It was developed by Yahoo and, like Hive, has been made fully open source [21].
2.4.5 WibiData

WibiData is a combination of web analytics and Hadoop, built on top of HBase, which is itself a database layer on top of Hadoop. It develops applications based on the open-source technologies Apache Hadoop, Apache Cassandra, Apache HBase, Apache Avro, and the Kiji project [23].
2.4.6 SkyTree

SkyTree is a high-performance Machine Learning and data analytics platform built for handling big data, aiming to deliver high predictive model accuracy. SkyTree is intended to let data scientists build more accurate models faster. It simplifies the data preparation process and uses the entire data set, including structured and unstructured data, to run more experiments and identify high-value patterns. It takes advanced analytics to the next level, using artificial intelligence to produce sophisticated algorithms, model training, and automated experiments [24].
2.4.7 Cloud Computing

Cloud technology can move data between a large number of cloud servers and process it outside a single server. An essential property of cloud computing is that it delivers computational resources from a location other than the one where the computation is needed. Cloud computing also enables aggregators and integrators that assemble multiple cloud services into one service, packaging services behind a single entry point, with products offered in the cloud [25]. Cloud computing means extending the capacity or capability of an infrastructure dynamically without having to invest in or purchase new infrastructure, license new software, or train new personnel [25]. It has several benefits, such as reduced implementation and maintenance costs, increased mobility for a global workforce, high-speed computing, and increased availability of high-performance applications for small and medium-sized businesses. There are also pros and cons to executing tasks in the cloud that should be considered, such as data ownership, performance, and availability; the challenge is to execute tasks starting from an unknown state, to give them the necessary knowledge, and to make educated decisions about cloud initiatives [25]. Cloud computing is related to grid computing and to software-as-a-service (SaaS) for computing on huge data sets. Cloud computing can provide virtual data services, but it is not the same thing; Amazon's S3 storage service, for example, is built as a web-scale computing service that makes such computing easier. SaaS is another cloud-based service, using a multi-tenant architecture and focusing on the end user; there are no server or software licensing costs for users. Salesforce.com is the best-known example of a SaaS application, and SaaS is commonly used for enterprise applications [25]. One of the oldest forms of cloud computing is the Managed Service Provider (MSP); services include virus scanning for email, anti-spam services such as Postini, desktop management services such as CenterBeam or Everdream, and application performance monitoring. Additionally, platform-as-a-service (PaaS) is a variation of SaaS that offers a web-based development platform in the cloud, exposed through Application Programming Interfaces (APIs). Such cloud platforms provide development environments for programmers and analysts; a disadvantage of this approach is that it is limited by the vendor's design and capabilities. Google App Engine is an example of this kind of service and provides applications that can handle large amounts of data; some of its important features are automatic scaling and load balancing, APIs for authenticating users and sending email using Google Accounts, and a fully featured local development environment that simulates Google App Engine on the developer's computer [25]. Cloud computing is often confused with grid computing, because grid computing combines machines into a virtual supercomputer to perform very large tasks and cloud computing is often powered by grid computing [25]. A benefit of using cloud computing is that computing capacity can be raised dramatically. In addition, users can benefit from cloud computing regardless of their location and the device they choose. Reliability is often enhanced because service providers use multiple redundant sites, and capacity can scale with user demand. Cloud computing also provides data centralization, which increases the focus on protecting the customer resources maintained by the service provider [25].
3 Applications for Machine Learning

3.1 Machine Learning Applications

Nowadays, Machine Learning is used for many different purposes, such as computing on huge data sets, medicine, marketing, security applications, chemistry, designing websites, and controlling electronic devices. These applications relate to classification, regression, and clustering problems. The most famous applications of Machine Learning lie in the field of computer vision, for example face recognition and speech recognition, as well as in handling noise in different devices and in weather forecasting. In this chapter, the most important and useful applications of Machine Learning are considered and their procedures examined; the algorithms for implementing these applications are also mentioned. Applications to which Machine Learning algorithms can be successfully applied include the following:
3.1.1 Information Retrieval

Information retrieval (IR) is the process of obtaining information based on an index, which may be a full-text index; the underlying database can contain images, videos, and text in different formats [26]. Automated information retrieval was introduced to cope with the problem of information overload, and web search engines are its most visible application. In IR, the database is regarded as a collection of objects. The difference between information retrieval and a classical SQL query is that the results of an information retrieval query may or may not be relevant to the query [27]. Data are not stored in the database directly; instead, information exchange or meta-data is used. The most common ranking technique in information retrieval systems is numerical ranking, which denotes how well each object matches a query, and iteration is used to refine query results [28].

One of the mathematical foundations of information retrieval is the category of set-theoretic models, which represent documents as sets of words or phrases and derive similarities from set-theoretic operations; examples are the Standard Boolean Model, the Extended Boolean Model, and fuzzy retrieval [29]. Algorithms used in information retrieval include clustering, regression, classification, and decision tree algorithms (Section 4.2.8). There are various models for ranking documents.

The Boolean Model expresses the query as a Boolean expression of terms, i.e., using the operators AND, OR, and NOT, while documents are represented as sets of words. The goal is to address the ad-hoc retrieval task, the most standard IR task, in which the user specifies an information need through a query and the model tries to find the documents that match it [29].

The second category, algebraic models, represents documents and queries as vectors, matrices, or tuples, and expresses the similarity between a query vector and a document vector as a scalar value. Examples are the Vector Space Model, an algebraic model that represents text documents as vectors of index terms; the Generalized Vector Space Model, which extends the vector model with additional information about the documents; the (enhanced) topic-based Vector Space Model, which derives term vectors from an ontology; the Extended Boolean Model, which uses partial matching and term weights within the vector space model; Latent Semantic Indexing, a mathematical method for determining the relationships between terms and concepts in content; and Latent Semantic Analysis, which analyzes the relationships between documents and their terms by generating a set of concepts related to the documents and terms [29].

The third category, probabilistic models, treats document retrieval as probabilistic inference: similarities are interpreted as probabilities that a document is relevant to a given query, and Bayes' theorem is often used. Examples are the binary independence model; the probabilistic relevance model, on which the Okapi (BM25) relevance function is based; uncertain inference, which defines the document-query relationship in information retrieval; language models, which are probability distributions over sequences of words; the divergence-from-randomness model, in which term weights are computed by measuring the divergence between a term distribution produced by a random process and the actual term distribution; and latent Dirichlet allocation, in which sets of observations are explained by unobserved groups that account for why some parts of the data are similar [29].

The last category, feature-based retrieval models, views documents as vectors of feature functions (or simply features) and seeks the best way to combine these features into a single relevance score, typically with learning-to-rank methods. These functions are arbitrary functions of the document and the query, so other retrieval models can be incorporated as just another feature [29].

Additionally, there are various measures of importance and correctness in information retrieval: precision, the fraction of retrieved results that correspond to the information need [29]; recall, the fraction of the relevant documents in the collection that were returned by the system [29]; the inverted index, which builds, for each term, a list recording the documents in which the term occurs, where each entry of this list (called a posting) also records how often the term occurs and each posting list is sorted by document ID; fall-out, a function of the number of retrieved non-relevant documents; the F-score (F-measure), which combines precision and recall into a single score; average precision, a single-value metric based on the whole list of documents returned by the system; precision at k, which is computed over the top k documents; R-precision, the proportion of the top R retrieved documents that are relevant, where R is the number of relevant documents for the current query [30]; mean average precision, the mean of the average precision scores over a set of queries; discounted cumulative gain, which uses a graded relevance scale of the documents in the result set to evaluate their usefulness (gain) depending on their position in the result list; mean reciprocal rank, the reciprocal of the rank at which the first relevant document was retrieved [30]; Spearman's rank correlation coefficient, a non-parametric measure of rank correlation indicating the statistical dependence between the rankings of two variables [31]; and GMAP (geometric mean average precision), the geometric mean of the average precision values of an information retrieval system over a set of n query topics [30]. For visualizing information retrieval performance, one can use graphs that chart precision on one axis and recall on the other [32], histograms of average precision over various topics [32], the receiver operating characteristic (ROC) curve, which characterizes a binary classifier at different thresholds [33], and the confusion matrix, which summarizes the performance of an algorithm [32].
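For the most basic of these measures, a short R sketch with made-up document identifiers shows how precision, recall, and the F-measure relate the retrieved and relevant document sets:

```r
# Precision, recall and F-measure for one query (document IDs are invented).
retrieved <- c("d1", "d2", "d3", "d4")   # what the system returned
relevant  <- c("d1", "d3", "d5")         # ground-truth relevant documents

tp        <- length(intersect(retrieved, relevant))
precision <- tp / length(retrieved)      # fraction of retrieved documents that are relevant
recall    <- tp / length(relevant)       # fraction of relevant documents that were retrieved
f_measure <- 2 * precision * recall / (precision + recall)
c(precision = precision, recall = recall, F = f_measure)
```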
Adaptive Websites

Artificial intelligence and statistical methods are used to build a model of user interaction, which serves as the fundamental object for constructing models that capture the patterns of user interaction [34]. These patterns may be extracted from web server logs or from the website itself. Two techniques are used for this application. The first is collaborative filtering, which collects user data from multiple users and applies Machine Learning techniques to cluster the interaction patterns into user models and to classify individual users' patterns against those models. The second is statistical hypothesis testing, in which A/B testing or similar methods are used together with a library of possible changes to the website or a change-generation technique such as random variation. Examples of this application are gentrify for website look and feel and Snap ad for online advertising. Algorithms corresponding to this application are RED and the clustering algorithms of Section 4.3.1 [34].
3.1.2 Computer Vision

Computer vision is an interdisciplinary field that deals with how computers can gain high-level understanding from digital images or videos. From an engineering perspective, it seeks to automate tasks that the human visual system can do. Computer vision is concerned with the automatic extraction, analysis, and understanding of useful information from a single image or a sequence of images, and it is closely related to artificial intelligence. The image data can take many forms, such as video sequences, views from multiple cameras, or multi-dimensional data from a medical scanner [35][36][37]. Pattern recognition, artificial intelligence, solid-state physics, deep learning, and biological vision are related fields [38]. Sub-domains of computer vision include scene reconstruction, event detection, video tracking, object recognition, object pose estimation, learning, indexing, motion estimation, and image restoration [38]. Applications of this field include assisting humans in identification tasks, process control, automatic inspection, event detection, interaction, modeling of objects or environments, navigation, organizing information, and autonomous vehicles [38]. A typical and important computer vision task is recognition, which tries to determine whether specific objects are present in an image; varieties of recognition include object recognition, identification, and detection. Some more specialized tasks are pose estimation, content-based image retrieval, optical character recognition, facial recognition, and shape recognition technology [38]. Motion analysis is another computer vision task, in which an image sequence is processed to estimate the velocity at each point of the image in the 3D scene, or even the motion of the camera that produced the images; examples of such tasks are ego-motion estimation, tracking, and optical flow [38]. There are also some typical functions found in many computer vision systems, such as image acquisition, pre-processing, feature extraction, detection/segmentation, high-level processing, and decision making. In these applications, algorithms such as SVM (Section 4.2.7) and AdaBoost are used [38].
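As a hedged sketch of the kind of SVM classifier mentioned here, the following R code uses the e1071 package on the iris measurements; in a real computer vision pipeline the inputs would be numeric image features (for example extracted descriptors), for which the iris columns only stand in.

```r
# SVM classification sketch with a radial-basis kernel (illustrative settings).
library(e1071)

set.seed(42)
idx   <- sample(nrow(iris), 100)                      # training subset
model <- svm(Species ~ ., data = iris[idx, ], kernel = "radial")
pred  <- predict(model, iris[-idx, ])
table(predicted = pred, actual = iris$Species[-idx])  # confusion matrix on held-out data
```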
Brain-Machine Interfaces

A brain-computer interface (BCI), sometimes called a mind-machine interface (MMI), direct neural interface (DNI), or brain-machine interface (BMI), is a direct communication pathway between an enhanced or wired brain and an external device. BCIs are often used for researching, mapping, assisting, augmenting, or repairing human cognitive or sensory-motor functions [39]. In general, the trigger in a BCI is a complex process in which certain brain areas are activated, and movement occurs when signals are sent via the peripheral nervous system to the related muscles; the output of this activity is called motor output or efferent output. A BCI is proposed as an alternative to natural communication and control: it translates brain activity directly into control signals for BCI applications, and this translation involves pattern recognition and signal processing. An important point is that a BCI has four parts; it records activity directly from the brain, invasively or non-invasively, must provide feedback to the user, and must do so in real time. The main goal of a BCI is to derive a signal from the cortex; this command is used to control devices such as computers or robotic limbs [40].
Moreover, a direct brain interface (DBI) can receive voluntary commands directly from the human brain without requiring physical movement and can be used to operate a computer or other technology [41]. BMI, BCI, and DBI describe the same process and are used as synonyms. Neuroprostheses, also called neural prostheses, are devices that can receive output from the nervous system and also provide input to it; they can interact with both the peripheral and the central nervous system. In summary, BCIs measure brain activity, process it, and produce control signals that reflect the user's intent. Brain activity can be measured with non-invasive or invasive methods: non-invasive methods measure the electrical signals generated by the brain, whereas invasive methods combine excellent signal quality, very good spatial resolution, and a higher frequency range [40]. The foundation of any brain-computer communication is the mental strategy: it determines the task the user performs to generate a pattern that the BCI can interpret, and it sets constraints, such as the required signal processing techniques, on the hardware and software of a BCI. The amount of training needed to use a BCI successfully depends on the mental strategy; the most common mental strategies are selective attention and motor imagery. Algorithms that can be applied here are classification algorithms such as generative (informative) classifiers like the quadratic Bayes classifier, discriminative classifiers like the Support Vector Machine, static classifiers such as the multilayer perceptron, and dynamic classifiers [42]. Selective attention is closely related to BCIs; it requires external auditory or somatosensory stimuli provided by the BCI system, although most BCI systems are based on visual stimuli. This type of BCI needs five stimuli: four are associated with cursor movements (left, right, up, and down) and the fifth is the select command, which enables two-dimensional navigation and selection on a computer screen. Algorithms corresponding to this application are AdaBoost and SVM (Section 4.2.7) [42].
3.1.3 Bioinformatics
Bioinformatics is an interdisciplinary field that develops methods and software tools for understanding biological data. Bioinformatics combines computer science, statistics, mathematics, and engineering to analyze and interpret biological data, and it uses mathematical and statistical techniques as well [43]. It has become an important part of many areas of biology. In experimental molecular biology, bioinformatics techniques such as image and signal processing allow the extraction of useful results from large amounts of raw data. It is used in the field of genetics and genomics, in text mining of the biological literature and in the development of biological and gene ontologies to organize and query biological data. It is also useful in the analysis of gene and protein expression and regulation. It should be mentioned that bioinformatics tools aid in the comparison of genetic and genomic data and, more generally, in the understanding of evolutionary aspects of molecular biology. Bioinformatics analyzes and catalogues the biological pathways and networks that are an important part of systems biology. It can simulate and model DNA [44], RNA [44][45] and proteins [44], as well as biomolecular interactions in the field of structural biology [46]. The primary goal of bioinformatics is to enhance the understanding of biological processes. Hence, it focuses on developing and applying computationally intensive techniques to achieve this goal; well-known examples are pattern recognition, data mining, machine learning algorithms, and visualization. Major research efforts in this field include sequence alignment, gene finding, genome assembly, drug design, drug discovery, protein structure alignment, protein structure prediction, prediction of gene expression and protein–protein interactions, genome-wide association studies, and the modeling of evolution and cell division/mitosis. Databases corresponding to bioinformatics can contain different types of information, such as DNA and protein sequences, molecular structures, phenotypes and biodiversity, and may contain empirical or predicted data. Some of the most commonly used databases are GenBank, UniProt, InterPro, Pfam, the Sequence Read Archive, functional networks, interaction analysis databases and GenoCAD. Additionally, there are various kinds of tools for bioinformatics, from single command-line tools to more complex software. A variety of open-source software packages has been introduced, such as Biopython, BioJava, BioJS, BioRuby, Bioclipse, .NET Bio, BioPerl, Orange, Bioconductor
, UGENE and GenoCAD. Another important fact about bioinformatics is that basic bioinformatics services are classified by the EBI into three categories: SSS (Sequence Search Services), MSA (Multiple Sequence Alignment) and BSA (Biological Sequence Analysis) [44]. Algorithms which can be applied to bioinformatics data sets are the genetic algorithm, mass spectrometry methods, the Needleman–Wunsch algorithm, the Smith–Waterman algorithm, the BLAST algorithm and the Maximal Segment Pairs (MSP) algorithm. The main example problem is classification [47].
Cheminformatics
Cheminformatics is a field which uses computational and informational techniques to deal with chemical problems. Examples of applications in this field are information retrieval and information extraction, which belong to the category of unstructured data; as structured-data applications one can mention data mining, graph mining, sequence mining, molecule mining and tree mining [48]. The third category is digital libraries. File formats like the XML-based Chemical Markup Language or SMILES are often used for storage in large chemical databases. Some other formats are suitable for visual representation in two or three dimensions, and others are more convenient for studying physical interactions, modeling and docking studies. Chemical data can relate to real or virtual molecules. Virtual libraries of classes of compounds were recently generated using the FOG (fragment optimized growth) algorithm. This was done by using cheminformatic tools to train the transition probabilities of a Markov chain on authentic classes of compounds, and then using the Markov chain to produce novel compounds that were similar to the training database [48]. Virtual screening involves computationally screening in silico libraries of compounds with various methods, such as docking, to identify those which have desired properties such as biological activity against a given target. In some cases, combinatorial chemistry is used in the development of the library to enhance the efficiency of mining the chemical space [48]. Another subject related to cheminformatics is the quantitative structure–activity relationship (QSAR), which offers the calculation of quantitative structure–activity relationship and quantitative structure–property relationship values, used to estimate the activity of compounds from their structures. Chemical expert systems are also relevant, because
they represent parts of chemical knowledge as an in silico representation. There is also the relatively new concept of matched molecular pair analysis, or prediction-driven MMPA, which is coupled with a QSAR model in order to identify activity. Algorithms for implementing cheminformatics are clustering algorithms 4.3.1 [49].
Classifying DNA sequences
DNA is a sequence of letters which represent the nucleotides; this sequence contains information about the vital functions of a living organism [50]. A DNA sequence is represented as a sequence of the letters A, C, G and T, which refer to adenine, cytosine, guanine and thymine, covalently linked to a phosphodiester backbone. In this sequence the letters appear one after another without gaps [50]. Once a DNA sequence is available in silico in digital format, the digital sequence may be stored in a sequence database. In bioinformatics, sequence alignment is used for arranging the sequences of DNA, RNA, or protein to identify regions of similarity that may be due to functional, structural, or evolutionary relationships between the sequences [51]. Although DNA and RNA nucleotide bases are more similar to each other than amino acids are, the conservation of base pairs can indicate a similar functional or structural role [52]. Examples of sequence motifs of functional importance are the C/D [53] and H/ACA boxes [54]. In addition, long-range correlations have been observed in non-coding base pair sequences of DNA, while they are largely absent in coding sequences [55]. Another important concept is sequence entropy, also denoted as sequence complexity or information profile [56]. It is a numerical sequence that provides a quantitative measure of the local complexity of a DNA sequence, independently of the direction of processing. Manipulation of information profiles enables the analysis of sequences with alignment-free techniques, for instance in motif and rearrangement detection. Classification algorithms 4.3.1 like the FASTA algorithm can be used for classifying DNA sequences [57].
Computational anatomy
In this field, machine learning algorithms are integrated with applied mathematics, pure mathematics and anatomy for the modeling and simulation of biological structures [58]. Moreover, it makes use of newer, interdisciplinary fields like bioinformatics and neuroinformatics, in such a way that its interpretations use metadata extracted from the original sensor imaging modalities (e.g. Magnetic Resonance Imaging). It focuses on the anatomical structures being imaged rather than on the medical imaging devices. In this respect it is similar to the history of computational linguistics, which concentrated on linguistic structures rather than on the sensor acting as the transmission and communication medium. In order to study different coordinate systems via coordinate transformations, as generated via the Lagrangian and Eulerian velocities of flow, the diffeomorphism group is used. A linear model of computational anatomy is given by an abstracted algebra [58]. Examples of these models are the Bayesian model of computational anatomy and the random orbit model of computational anatomy. In this field, shapes are the central objects; one set of examples are the 0-, 1-, 2- and 3-dimensional sub-manifolds of R3, and a second set are images arising from medical imaging, such as magnetic resonance imaging (MRI) and functional magnetic resonance imaging. 0-dimensional manifolds are landmark points, 1-dimensional manifolds are curves like sulcal or gyral curves in the brain, and 2-dimensional manifolds correspond to substructures of anatomy such as the subcortical structures of the mid-brain [58]. A main use case of computational anatomy is the image warping algorithm, which reshapes a brain atlas to match the anatomy of new individuals; it thus provides a rich source of morphometric data for data mining or hypothesis testing. Computational anatomy can be constructed with Bayesian algorithms [58].
Medical diagnosis
Diagnosis is a subfield of Artificial Intelligence which develops algorithms and technology to determine whether the behaviour of a system is correct. These algorithms must be able to recognize a failure of the system and its type. The computation is based on observations. Expert diagnosis is based on experience with the system, in which a map is built that uses
experience and associates it efficiently with the corresponding diagnosis. The experience can be provided by a human operator or by examples of the system behaviour. In addition, model-based diagnosis is an example of abductive reasoning using a model of the system. Tools used for medical diagnosis include medical algorithms, decision tree algorithms 4.2.8, calculators, for instance an online calculator for the body mass index (BMI), flowcharts, e.g. a binary decision tree for chest pain, and nomograms, for example a moving circular slide to calculate body surface area or drug dosage [59].
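As a small illustration of such a medical calculator, the following base-R sketch computes the body mass index and applies a simple threshold rule. It is my own illustration rather than code from the thesis, and the category cut-offs are assumptions of this sketch following the common convention.

# Minimal BMI calculator with a simple decision rule (illustrative sketch)
bmi <- function(weight_kg, height_m) weight_kg / height_m^2

bmi_category <- function(weight_kg, height_m) {
  b <- bmi(weight_kg, height_m)
  if (b < 18.5)    "underweight"   # cut-offs are assumptions of this example
  else if (b < 25) "normal"
  else if (b < 30) "overweight"
  else             "obese"
}

bmi(80, 1.80)            # about 24.7
bmi_category(80, 1.80)   # "normal"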
3.1.4 Machine learning control
Machine learning control is a subfield of machine learning and control theory which is used for solving optimal control problems in cases where linear control theory is not applicable, namely for complex nonlinear systems. There are four types of such problems: first, control parameter identification [60]; second, control design as a regression problem of the first kind, which tries to identify unknown parameters of a given control law [61]; third, control design as a regression problem of the second kind, which identifies a control law that minimizes the cost function of the plant [62]; and finally, reinforcement learning control, in which the control law is modified by the reward function of a reinforcement learning algorithm. Therefore, regression-type and reinforcement-type algorithms are useful for this field. Examples of this type of application are robotic applications which try to recognize and imitate human behaviour, such as fingerprint recognition applications or other computer vision applications [63].
Machine Perception
Machine perception is the ability of a computer system to interpret data in a manner similar to the way humans analyze it. The basic method for this task depends on the hardware of the system. The system input is traditionally presented through devices such as a keyboard or mouse; with advancements in hardware and software, computers are now able to take in sensory input in a way similar to humans [64][65]. Machine perception allows the use of sensory input together with other information with greater accuracy, and allows presenting it in a way that is more comfortable for
the user. This consists of computer vision, machine hearing and machine touch. The goal of machine perception is to give systems the ability to see, feel and perceive the world as humans do, and consequently the ability to explain in a human way why they are making their decisions. Machine vision, machine hearing and machine touch are fields closely related to machine perception. It can be implemented with convolutional networks and Support Vector Machines 4.2.7. Interfaces for machine learning include a GUI, which is a graphical user interface with windows, point-and-click interaction and a focus on visualization; an SDK, which is a set of APIs with additional tooling to create applications; and REST, which performs requests and receives responses via the HTTP protocol with methods such as GET and POST [66].
Game playing
Game playing uses strategic games and autonomous decision making to determine an outcome, and it uses decision trees inside its strategies [67]. There are various types of strategy games, such as abstract strategy games, team strategy games, Eurogames, simulation games, wargames and strategy video games. In abstract strategy games the rules do not simulate reality but implement the internal logic of the game, as in chess, Go and Arimaa. Moreover, some games that do not fully follow this criterion, like backgammon, Octiles, Can't Stop and Mentalis, are also categorized in this group. Team strategy games consist of two teams of two players, whose offensive and defensive skills are in flux as the game progresses; playing these games improves strategic awareness [67]. Eurogames, also known as German-style board games, lie between abstract strategy and simulation games. They are short games with indirect player interaction, abstract physical components and simple rules. Simulation games attempt to simulate the decisions and processes of some real-world situation; the rules determine the results of player actions and their situations in the real world. Wargames simulate military battles, entire wars or campaigns. Strategy video games are split into continuous real-time strategy and discrete turn-based strategy. The algorithms which are mainly used are the Minimax algorithm and decision trees 4.2.8.
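As a small illustration of the minimax idea, the following base-R sketch (my own example, not code from the thesis) evaluates a toy two-ply game tree in which the leaves are payoffs for the maximizing player and the opponent replies with minimizing moves.

# Minimal recursive minimax over a toy game tree (illustrative sketch).
# Inner nodes are lists of child subtrees; leaves are numeric payoffs.
minimax <- function(node, maximizing = TRUE) {
  if (!is.list(node)) return(node)                           # leaf: return its payoff
  values <- sapply(node, minimax, maximizing = !maximizing)   # evaluate children
  if (maximizing) max(values) else min(values)
}

# Depth-2 toy tree: the maximizer moves first, the minimizer replies
tree <- list(list(3, 12, 8), list(2, 4, 6), list(14, 5, 2))
minimax(tree)   # returns 3: the best value the maximizer can guarantee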
3.1.5 Credit card fraud
Credit card fraud is a term for theft and fraud committed using a payment card, such as a credit card or debit card, as a fraudulent source of funds in a transaction. The aim could be to obtain goods without paying, or to obtain unauthorized funds from an account [68]. Card information is stored in a number of formats. Card numbers, formally the Primary Account Number (PAN), are often imprinted on the card, and a magnetic stripe on the back contains the data in machine-readable format [69]. The main channels for fraud against merchants who sell and ship products, affecting legitimate mail-order and Internet merchants, are the Internet and mail. If the card is not physically present (called CNP, card not present), the merchant must rely on the holder presenting the card information indirectly, whether by mail, telephone or over the Internet. While there are safeguards against this, it is still riskier than a card presented in person, and indeed card issuers tend to charge a higher transaction rate for CNP because of the greater risk [70]. Identity theft is divided into two major categories: application fraud and account takeover. Application fraud occurs when a person uses stolen or fake documents to open an account in another person's name. The most common method of account takeover is a hacker gaining access to a list of user names and passwords. Other methods include dumpster diving to find personal information in discarded mail, and outright buying lists of 'Fullz', a slang term for full packages of identifying information sold on the black market [70]. The crime of obtaining private information about somebody else's credit card used in an otherwise normal transaction is called skimming. Skimming is difficult for the typical cardholder to detect, but with a large enough sample it is quite easy for the card issuer to detect: the issuer collects a list of all the cardholders who have complained about fraudulent transactions and then uses data mining to discover relationships among them and the merchants they use. Algorithms corresponding to this application are clustering 4.3.1, classification, Bayesian neural networks, fuzzy neural nets and automated data algorithms [70].
3.1.6 Economics
Economics concentrates on the interactions of economic agents and on how economies work. Economics is divided into two groups, microeconomics and macroeconomics, and
there is a difference between them. Microeconomics examines the behaviour of basic elements in the economy, including individual agents and markets, their interactions and the outcomes of those interactions. Other broad distinctions are made between positive economics, describing 'what is', and normative economics, advocating 'what ought to be'; between economic theory and applied economics; between rational and behavioural economics; and between mainstream economics and heterodox economics [71]. Mainstream applications use machine learning algorithms, for example to classify the content of images and videos. Economic analysis can be applied throughout society, in business, finance, health care and government. It is also applied to crime, education [72], the family, law, politics and religion. Algorithms utilized to implement such applications include Random Forest, which is used for the prediction of recessions in economic forecasting 4.2.8, and Monte Carlo 4.4.2 [73].
Marketing
Marketing is the study and management of exchange relationships [74]. Marketing is used to create, keep and satisfy customer requirements. Marketing is one of the premier components of business management; another one is innovation [75]. Other services and management activities, such as operations (or production), human resources, accounting, and law and legal aspects, can be "bought in" or "contracted out". There are several marketing orientations, such as product orientation, sales orientation, production orientation, customer orientation and organizational orientation. In the early 1960s, Professor Neil Borden at Harvard Business School identified a number of company performance actions that can influence the consumer decision to purchase goods or services. Borden suggested that all those actions of the company represented a "Marketing Mix". Professor E. Jerome McCarthy, at Michigan State University in the early 1960s, suggested that the Marketing Mix contained four elements: product, price, place and promotion [76]. Algorithms which are used to implement this application are Linear Regression 4.3, correlation coefficients, the chi-square test, hypothesis testing, quantitative research, qualitative research, frequency distributions, and the Poisson and binomial distributions.
4 Machine Learning Algorithms
4.1 Introduction
Machine learning contains different algorithms based on the type of learning, as mentioned in chapter 2. Supervised learning contains algorithms which are largely based on decision trees and are considered classifiers that also perform well on regression problems [5]. For classification problems there is likewise a variety of classification algorithms, which again involves decision trees. Unsupervised learning contains algorithms which are based on clustering and which try to group data into different clusters. This learning type involves different clustering algorithms, such as hierarchical clustering and the k-Means algorithm [5]. Reinforced learning involves algorithms which are grounded in the Bandit Problem or the Markov Decision Process, which in turn is based on Markov statistics. These algorithms try to solve complicated deep learning problems in an optimized form. The main types, such as Deep Learning, Sarsa, Q-Learning and Monte Carlo algorithms, are explained in this chapter.
Figure 4.1 indicates the three main categories of machine learning algorithms, namely supervised learning algorithms, unsupervised learning algorithms and reinforced learning algorithms, which are then grouped into different algorithm types [5].
Figure 4.1: Machine Learning Algorithms
4.2 Supervised Learning algorithms
As described, supervised learning corresponds to learning from a set of labeled data from which the user can approximately estimate the outcome of a system [5]. There are several ways in which the standard supervised learning problem can be generalized. There are two families of supervised learning algorithms. The first one is generative algorithms, which model the class-conditional density p(x|y); the predictive density can then be inferred by applying Bayes' theorem:
p(y|x) = p(x|y) p(y) / ∫ p(x|y) p(y) dy   [5]
The second one is discriminative algorithms, which focus on estimating p(y|x) directly; that is, the model takes the data and puts a probability over the hidden structure given the data, and the decision is based on whether this probability is greater or less than 0.5. The Support Vector Machine (SVM) is mentioned as an example of this kind of algorithm. The algorithms belonging to the supervised learning type are the following:
4.2.1 Regression
Regression algorithms come in two different types, linear regression and non-linear regression, which are used for solving supervised and unsupervised data set issues [77]. Depending on the type of regression algorithm, the activation function that maps the input data to the related output differs: in the case of linear regression the activation function is a linear function, and in the case of non-linear regression it is polynomial, for example a Gaussian activation function. Each regression model involves three kinds of variables: unknown parameters, which represent a scalar or a vector, independent variables and a dependent variable. Based on these, the function describing the relationship between the independent and dependent data is determined [77]. In the case that there are more unknown parameters than data points, most classical regression analyses, like classical linear regression, which predicts future target values (dependent variables) based on the behaviour of explanatory (independent) variables, cannot be performed [77]. In the case that the number of data points equals the number of unknown parameters and the relationship between them is linear, the dependent variable is the output of a linear function of the unknown parameters and the independent variables. The most common case is when the number of data points exceeds the number of unknown parameters and the relationship between the unknown parameters and the independent data is linear, so that there is enough information in the data to estimate the unknown parameters [77]. In terms of measures of accuracy of a regression model, there are several types of errors which are useful for evaluation: RMSE: The Root-Mean-Squared Error is a quadratic rule
that measures the average magnitude of the error: it is the square root of the average of the squared differences between predictions and actual observations. The numerical stability of an algorithm indicates how the error is propagated by the algorithm. RMSE is defined as

RMSE = sqrt( (1/n) Σ_{i=1}^{n} (y_i − y'_i)² )

where n is the total number of target values, y_i refers to the actual target values and y'_i refers to the predicted target values [78].
absolute-error: The absolute error is the difference between a measured value and the true value. In mathematical form it is defined as δ(x) = x_i − x, in which x_i is the measurement and x is the actual value [78].
relative-error: If the real value of a variable is x and the measured value is x_0, the relative error is defined as δ_x = Δx / x = (x_0 − x) / x = x_0/x − 1, where Δx is the absolute error. The relative error of a product of values is approximately equal to the sum of their relative errors. The percentage error is 100 times the relative error [78].
relative-error-lenient: The average lenient relative error is the average of the absolute deviation of the prediction from the actual value divided by the maximum of the actual value and the prediction. The values of the label attribute are the actual values [78].
relative-error-strict: The average strict relative error is the average of the absolute deviation of the prediction from the actual value divided by the minimum of the actual value and the prediction. The values of the label attribute are the actual values [78].
normalized-absolute-error: The absolute error divided by the error that would have been made if the average had been predicted [78].
root-relative-squared-error: The averaged root-relative-squared error [78].
squared-error: The averaged squared error [78].
correlation: It refers to the correlation coefficient between the label and prediction attributes [78].
squared-correlation: It refers to the squared correlation coefficient between the label and prediction attributes [78].
prediction-average: It is the average of all the predictions, calculated as the sum of the predicted values divided by the total number of predictions [78].
spearman-rho: It is a measure of the relationship between two variables, in this case the label and the prediction attribute; it is computed on the ranks of the actual values and the predicted labels [78].
kendall-tau: It measures the rank correlation between two variables, namely the actual values and the predicted labels [78].
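As a small illustration, several of the measures above can be computed in a few lines of base R; the vectors actual and predicted below are hypothetical placeholders, not results from the thesis experiments.

# Illustrative computation of regression accuracy measures in base R
actual    <- c(3.0, 5.5, 2.0, 8.0, 4.5)
predicted <- c(2.5, 6.0, 2.5, 7.0, 5.0)

rmse          <- sqrt(mean((actual - predicted)^2))           # root-mean-squared error
abs_error     <- mean(abs(predicted - actual))                # average absolute error
rel_error     <- mean(abs(predicted - actual) / actual)       # average relative error
squared_error <- mean((predicted - actual)^2)                 # averaged squared error
correlation   <- cor(actual, predicted)                       # correlation coefficient
spearman_rho  <- cor(actual, predicted, method = "spearman")  # rank correlation
kendall_tau   <- cor(actual, predicted, method = "kendall")   # rank correlation

c(RMSE = rmse, MAE = abs_error, RelErr = rel_error,
  r = correlation, rho = spearman_rho, tau = kendall_tau)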
Linear Regression
Linear regression is a linear model which represents a linear relationship between the input variables (X) and a single output (Y); that is, Y can be calculated from a linear combination of the input variables X. When there is a single input variable the method is simple linear regression, and when there are multiple input variables the method is multiple linear regression. Different techniques can be used to estimate the linear regression equation; the most common one is Ordinary Least Squares, and the resulting method is then called Ordinary Least Squares linear regression. In a simple linear regression the form of the model is Y = B0 + B1 * x, where B0 and B1 are the constant coefficients of the equation and x is an input variable, for instance the weight or height of people in a data set when we want to estimate their age from these attributes. In higher dimensions, when we have more than one input x, the fitted line is called a plane or a hyperplane, with one coefficient per input in addition to B0. The number of coefficients used in the model determines the complexity of the model. When a coefficient becomes zero it no longer has any
influence on the model and is therefore effectively removed from the prediction made by the model (0*X = 0). The Ordinary Least Squares procedure seeks to minimize the sum of the squared residuals: given a regression line through the data, the distance from each data point to the regression line is squared, all of the squared errors are summed, and ordinary least squares minimizes this sum. This approach uses linear algebra operations to estimate the optimal values for the coefficients and represents the data as a matrix. Another common technique is gradient descent [3]. Gradient descent is a method for optimizing linear regression; when there are one or more inputs, the values of the coefficients can be optimized by iteratively minimizing the error of the model on the training data. This approach starts with random values for each coefficient; the sum of squared errors is computed for each pair of input and output values, the learning rate serves as a scale factor, and the coefficients are updated in the direction that minimizes the error. This process is repeated until a minimum sum of squared errors is achieved or no further improvement can be made. A learning rate (alpha) parameter must be selected, which determines the size of the improvement step in each iteration. Gradient descent uses the linear regression model and is useful when the data set is huge [3]. Regularization methods are extensions of the training of the linear model. They minimize the sum of squared errors of the model on the training data, as with Ordinary Least Squares (OLS), and additionally reduce the complexity of the model (for example the absolute size of the coefficients in the model) [3]. Two common types of regularization procedures for linear regression are Lasso regression, in which Ordinary Least Squares is modified to also minimize the absolute sum of the coefficients (called L1 regularization); the objective ||y − Xb||², where ||·|| denotes the Euclidean norm, is minimized while only a subset of the covariates, instead of all of them, is used for prediction, which is based on Breiman's theory. The second is Ridge regression, where Ordinary Least Squares is modified to also minimize the squared sum of the coefficients (called L2 regularization). When there is collinearity in the data values and ordinary least squares would overfit the training data, these methods are effective. They are also considered when assessing the accuracy of a regression model [3].
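The gradient descent procedure described above can be sketched in a few lines of base R. The synthetic data, the learning rate alpha = 0.01 and the number of iterations are assumptions of this illustration; the result is compared with R's built-in lm() (ordinary least squares).

# Gradient descent for simple linear regression on synthetic data (illustrative sketch)
set.seed(42)
x <- runif(100, 0, 10)
y <- 3 + 2 * x + rnorm(100, sd = 1)   # true coefficients: B0 = 3, B1 = 2

b0 <- 0; b1 <- 0      # start from arbitrary coefficient values
alpha <- 0.01         # learning rate
for (step in 1:5000) {
  pred  <- b0 + b1 * x
  error <- pred - y
  # update the coefficients in the direction that reduces the squared error
  b0 <- b0 - alpha * mean(error)
  b1 <- b1 - alpha * mean(error * x)
}
c(b0 = b0, b1 = b1)   # close to 3 and 2
coef(lm(y ~ x))       # ordinary least squares estimate for comparison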
Additionally, there are other least squares estimation techniques, such as Percentage Least Squares, which focuses on reducing percentage errors; this is useful in the field of forecasting or time series analysis, and in situations where the dependent variable has a wide range without constant variance. When the percentage or relative error is normally distributed, least squares percentage regression provides maximum likelihood estimates. Percentage regression is linked to a multiplicative error model, while OLS is linked to models with an additive error [79]. Iteratively reweighted least squares (IRLS) is used when heteroscedasticity, or correlations, or both are present among the errors of the model, but where there is little information regarding the covariance structure of the errors independently of the data [80]. In the first iteration, OLS, or Generalized Least Squares (GLS) with a provisional covariance structure, is performed, and the residuals are obtained from the fit. Based on the residuals, an improved estimate of the covariance structure of the errors can usually be obtained. A subsequent GLS iteration is then performed using this estimate of the error structure to define the weights. The process can be iterated to convergence, but in many cases only one iteration is sufficient to achieve an efficient estimate of β [81][82]. Another case is instrumental variables regression (IV), which can be performed when the regressors are correlated with the errors. In addition, optimal instruments regression is an extension of classical IV regression to the situation where E[ξi|zi] is used and the zi are auxiliary instrumental variables. Total least squares (TLS) [83] is an approach to least squares estimation of the linear regression model that treats the covariates and the response variable in a more geometrically symmetric manner than OLS. It is one approach to handling the "errors in variables" problem, and it is also sometimes used even when the covariates are assumed to be error-free. Linear regression can be used, for example, in business for product sales [83]. An example of linear regression implemented on the Forest Fires data set is given in Appendix A, and the result of this regression is shown in Figure 4.2, which shows how well the residuals fit the linear regression model, so that the accuracy of the model can be assessed. As the plot shows, the red line is the linear regression line and the black points are the residuals of the model; they are mostly fitted to the linear regression line, and the plot reveals whether the residuals show non-linear patterns. In this model most of the predictor variables and the outcome variable have a strong linear relationship.
Figure 4.2: Linear Regression
Logistic Regression
Logistic regression is a generalized model of linear regression in which the dependent variable is categorical [84]. If the dependent variable is binary, the results are presented as 0 and 1 target values. For example, in the election of a political candidate the result is win or lose, so it is a binary case where the result can be encoded as 0 or 1. In the case where there are more than two outcomes, the model is known as multinomial logistic regression; for instance, entering university students make program choices among a general program, a vocational program and an academic program, and their choice might be modeled using their writing score and their socio-economic status. If the multiple categories are ordered, we have ordinal logistic regression, for instance for medals in Olympic swimming, where relevant predictors include training hours, diet, age, and the popularity of swimming in the athlete's home country, and where the distance between gold and silver may be larger than the distance
between silver and bronze. This model is useful in fields such as economics or medicine, for example when we try to find out whether or not a patient has some kind of disease [84]. A logistic regression model for the Forest Fires data set is implemented in the R code of Appendix A, and its result is illustrated in Figure 4.3. It indicates that the predicted values and the outcome values are not fitted well by the logistic regression model. In comparison with the linear regression model of Figure 4.2, the data show a more linear than logistic relationship.
Figure 4.3: Logistic Regression
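A minimal logistic regression can be fitted with R's built-in glm() function and a binomial link, as in the following sketch. The binary outcome and the predictor are hypothetical placeholders; the thesis' own Forest Fires model is the one referenced in Appendix A.

# Logistic regression with glm() on hypothetical binary data (illustrative sketch)
set.seed(1)
hours  <- runif(200, 0, 10)                          # hypothetical predictor
passed <- rbinom(200, 1, plogis(-2 + 0.6 * hours))   # hypothetical 0/1 outcome

fit <- glm(passed ~ hours, family = binomial)        # logistic regression
summary(fit)$coefficients
predict(fit, data.frame(hours = 5), type = "response")   # predicted probability at hours = 5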
Kernel estimators
Variable kernel density estimation in statistics is a form of kernel density estimation in which the size of the kernel is varied depending on the location of the samples or of the estimation point. It is also used for multi-dimensional sample spaces, for instance the sample space for choosing a card from a
deck [85]. The kernel, filter or mask can be thought of as a linear filter or box filter. Using a fixed filter width may mean that in regions of low density all samples fall in the tails of the filter with very low weighting, while regions of high density find an excessive number of samples in the central region with weighting close to unity. To fix this problem, the width of the kernel is varied in different regions of the sample space. Two methods of doing this are the balloon and the pointwise estimator. In a balloon estimator, the kernel width is varied depending on the location of the test point; in a pointwise estimator, the kernel width is varied depending on the location of the sample. For multivariate estimators the width parameter can be generalized to vary not just the size but also the shape of the kernel. The method is used for categorical and numerical data sets, and applications of this algorithm are bandwidth selection and shadow detection [85]. A linear filter is used for blurring, smoothing and shifting an image in different directions. In terms of blurring an image, box filters or mean filters are used; the box filter multiplies each pixel of a 3x3 image patch by 1/9 and is defined as:
1/9  1/9  1/9
1/9  1/9  1/9
1/9  1/9  1/9

Table 4.1: Box filter
For instance, there is a 5x5 matrix of image pixels to which the box mask is applied patch by patch. The pixel values are:

2  1  0  6  7
2  0  1  6  5
1  1  8  5  6
1  0  6  6  6
3  5  6  7  7

Table 4.2: Image Pixels
Therefore the process for the first patch is as follows:

        1  1  1        2  1  0
1/9  *  1  1  1   *    2  0  1
        1  1  1        1  1  8

Table 4.3: Box Mask
1/9 · (1·2 + 1·1 + 1·0 + 1·2 + 1·0 + 1·1 + 1·1 + 1·1 + 1·8) = 1/9 · 16 ≈ 1.77 ≈ 2. In the next step:
        1  1  1        1  0  6
1/9  *  1  1  1   *    0  1  6
        1  1  1        1  8  5

Table 4.4: Mask
and the result of this multiplication is 1/9 · 28 ≈ 3. In the next step, the process is:
        1  1  1        2  0  1
1/9  *  1  1  1   *    1  1  8
        1  1  1        1  0  6

Table 4.5: Mask
and the result is 1/9 · 20 ≈ 2. Continuing, the next step is:
        1  1  1        0  1  6
1/9  *  1  1  1   *    1  8  5
        1  1  1        0  6  6

Table 4.6: Mask
and the result is 1/9 · 33 ≈ 4. Hence, the new pixel values after box filtering are:
2  1  0
2  2  3
1  2  4

Table 4.7: New data after masking
Then the original matrix changes to:

2  1  0  6  7
2  2  3  6  5
1  2  4  5  6
1  0  6  6  5
3  5  6  7  7

Table 4.8: Result of Box filtering
A second blurring mask is the weighted average filter, which is defined as:

         1  2  1
1/16  *  2  4  2
         1  2  1

Table 4.9: Weighted Average Mask
Sharpening is a technique to increase the sharpness of an image; to this end, edge contrast must be added. The reasons for applying a sharpening technique are to overcome blurring introduced by the camera equipment, to draw attention to certain areas and to increase legibility. Laplacian filters are one kind of sharpening mask; they are defined as below:

0   1   0
1  -4   1
0   1   0

Table 4.10: Laplacian filter 1
1   1   1
1  -8   1
1   1   1

Table 4.11: Laplacian filter 2
The application of these filters proceeds like the previous box filter example [86].
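The worked box-filter example above can be reproduced with a short base-R sketch (my own illustration, not code from the thesis appendix); the function averages each interior 3x3 patch with the mask and rounds the result.

# Box filtering of the 5x5 example image in base R (illustrative sketch)
img <- matrix(c(2, 1, 0, 6, 7,
                2, 0, 1, 6, 5,
                1, 1, 8, 5, 6,
                1, 0, 6, 6, 6,
                3, 5, 6, 7, 7), nrow = 5, byrow = TRUE)

box_filter <- function(image, mask = matrix(1/9, 3, 3)) {
  out <- image
  for (i in 2:(nrow(image) - 1)) {
    for (j in 2:(ncol(image) - 1)) {
      patch <- image[(i - 1):(i + 1), (j - 1):(j + 1)]
      out[i, j] <- round(sum(mask * patch))   # element-wise product, then sum
    }
  }
  out
}

box_filter(img)   # the values 2, 3, 2 and 4 from the worked example appear at positions (2,2), (2,3), (3,2), (3,3)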
4.2.2 Classification
Classification is a type of supervised and unsupervised learning problem; classification algorithms are designed to be applied to discrete data sets in order to classify data into predefined targets, which are considered class labels [87]. When there are several different classes and the task is to classify new observations into an existing class, the problem belongs to the supervised learning category [87]. When existing data with the same target are categorized into one group, the task is related to unsupervised learning. Classification aims to handle a wide variety of problems and to construct a consistent general model for data handling. It often relies on decision trees, whose results are derived from a logical process. An example of supervised learning classification is a postal code machine that is able to sort the majority of letters, which is a difficult task for humans. Accuracy and speed are two important issues of the classification problem; classification is sensitive to accuracy, but in some cases a classifier with less accuracy and more speed is preferred over a classifier with higher accuracy and lower speed [87]. There are three ways to allocate classes for a set of data features. In the first, the class corresponds to a label of different attributes; for instance, animals like cats and dogs belong to quite different classes. The second case is when classes are derived from the result of a prediction problem, which means the class must be extracted from knowledge of the attributes, for instance the prediction of interest based on questions like: will there be interest (class = 0) or no interest (class = 1)? The third case is when classes are predefined by partitions of the attributes; based on a rule that classifies the data, labels are allocated for all attributes of the data. An example of this case is credit card data sets [87].
One important classifier is Fisher's Linear Discriminant, which splits the data set by a series of lines in two dimensions, planes in three dimensions and, generally, hyperplanes in many dimensions. The line which splits the data into two classes is drawn so as to bisect the line joining the centres of those classes. For instance, in the Iris data set, in order to classify Versicolor and Virginica the rule that is applied is the following: if petal-width < 3.272 − 0.3254 * petal-length then Versicolor; if petal-width > 3.272 − 0.3254 * petal-length then Virginica [87].
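The quoted rule can be checked directly on R's built-in iris data set. This is an illustrative sketch of my own and assumes that the thresholds apply to measurements in centimetres, as stored in the built-in data.

# Applying the quoted linear rule to the built-in iris data (illustrative sketch)
data(iris)
vv <- subset(iris, Species != "setosa")   # only Versicolor and Virginica
predicted <- ifelse(vv$Petal.Width < 3.272 - 0.3254 * vv$Petal.Length,
                    "versicolor", "virginica")
table(predicted, actual = droplevels(vv$Species))   # confusion table of the simple rule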
4.2.3 Artificial Neural Network
An Artificial Neural Network (ANN) is based on a large collection of connected units called artificial neurons. The network involves three different kinds of layers of neurons: an input layer, hidden layers and an output layer. Depending on the amount of data, the hidden part may have various numbers of layers and neurons [1]. In an ANN, an activation function is defined in order to determine the output of the network with respect to a given input. There are two basic types of activation units between data points in an ANN: the former is the sigmoid activation function and the latter is the perceptron activation function. To train an artificial neural network, a regularized squared-error objective function is used. The connections between layers carry weights, which represent the cost. The most common type of neural network is the feed-forward network trained with back-propagation, which feeds the error back into the system in order to modify the weights; back-propagation uses gradient descent to update the weights. ANNs are used to solve a wide variety of tasks, like computer vision and speech recognition, that are difficult to solve using ordinary rule-based programming, and they are also convenient for categorical data sets (text mining) [1].
Back-propagation
Back-propagation is used in artificial neural networks in combination with other optimization methods such as gradient descent. It repeats a two-phase cycle of propagation and weight update. The activation function in back-propagation determines the activation of each neuron
and is defined as

F(net_j) = 1 / (1 + e^(−net_j + θ_j))

where net_j = Σ_i w_ji a_i is the input of the neuron. The input vector is forwarded through the network layers until it reaches the output layer. Then, with the help of a loss function, the network output is compared with the expected output and an error value is computed for each neuron of the output layer. The loss function is defined as

E = 1/2 Σ_{p,j} (t_pj − o_pj)²

where t is the target of the network and o is the output of the network. Back-propagation calculates the gradient of the loss function with respect to the weights according to these error values. The gradient descent update is defined as

Δw_ji = −η ∂E/∂w_ji

where η is the learning rate. The initial weights of the network must be unequal, because some problems need unequal weights for their solution; if equal weights are chosen, the network cannot solve such problems. The errors are then propagated back through the network from the output neurons, until each neuron has an associated error value which reflects its contribution to the original output error [88]. In order to derive the learning rule of back-propagation, the chain rule is required to rewrite the error gradient for each pattern as the product of two partial derivatives, from which the error signal is derived. Gradient descent always moves in the direction of steepest descent; for a two-layer neural network the error surface is shaped like a bowl, so finding a solution is not a problem and the algorithm always finds the best solution, which is called the global minimum. When a hidden layer is inserted, the number of minima grows and it becomes harder for the network to find the global minimum, since some minima are deeper than others; instead of the global minimum, gradient descent may then find a local minimum as a solution. It should be noted that if the learning rate is large, the weights change greatly and consequently the network learns more quickly. The weights in back-propagation are updated either after each pattern is presented to the network or after all patterns of the training set have been presented. It is mentioned that the aim of developing back-propagation was to find a way of training multi-layer
neural networks such that they can learn an internal representation of any arbitrary mapping from inputs to outputs [88]. In addition, because a single perceptron cannot implement the XOR function, back-propagation in a multi-layer network is used to solve this problem. Back-propagation is useful for fast learning on large databases and for acoustic, speech and signal processing [88]. Let us make an example of the back-propagation algorithm.
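A minimal sketch of such an example in base R is given below. It trains a network with one hidden layer of three sigmoid units on the XOR problem mentioned above; the architecture, learning rate and number of epochs are assumptions of this illustration, not values from the thesis.

# Back-propagation on XOR with one hidden layer (illustrative sketch)
set.seed(7)
X <- matrix(c(0, 0, 0, 1, 1, 0, 1, 1), ncol = 2, byrow = TRUE)
y <- c(0, 1, 1, 0)                          # XOR targets

sigmoid <- function(z) 1 / (1 + exp(-z))

W1 <- matrix(runif(2 * 3, -1, 1), 2, 3)     # input -> hidden weights (unequal start values)
b1 <- runif(3, -1, 1)
W2 <- matrix(runif(3, -1, 1), 3, 1)         # hidden -> output weights
b2 <- runif(1, -1, 1)
eta <- 0.5                                  # learning rate

for (epoch in 1:10000) {
  # forward pass through the network
  H <- sigmoid(X %*% W1 + matrix(b1, 4, 3, byrow = TRUE))
  o <- sigmoid(H %*% W2 + b2)
  # backward pass: error signals for the output and hidden layers (chain rule)
  delta_o <- (o - y) * o * (1 - o)
  delta_h <- (delta_o %*% t(W2)) * H * (1 - H)
  # gradient descent weight updates
  W2 <- W2 - eta * t(H) %*% delta_o
  b2 <- b2 - eta * sum(delta_o)
  W1 <- W1 - eta * t(X) %*% delta_h
  b1 <- b1 - eta * colSums(delta_h)
}
round(o, 2)   # typically close to the XOR targets 0 1 1 0 (a run may also end in a local minimum)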
4.2.4 Bayesian Statistics
Bayesian statistics works on the basis of Bayesian probability, which reflects the state of belief about an object; the key idea is this interpretation of probability [89]. Moreover, the two concepts of prior distribution and posterior distribution are important in Bayesian statistics. Two main computational methods in this field are the Monte Carlo method and the Markov chain Monte Carlo (MCMC) method. Generally, Bayesian statistics is used under three conditions [90]. The first is where there is no alternative but quantitative prior judgments: due to a lack of data on some aspect of a model, some evidence has to be utilized for making assumptions about the biases involved [90]. The second concerns moderate-size problems with multiple sources of evidence, where hierarchical models can be constructed from prior distributions whose parameters can be estimated from the data. Common application areas include meta-analysis, disease mapping, multi-centre studies, and so on. With weakly informative prior distributions the conclusions may often be numerically similar to those of classical techniques, even if the interpretations differ. The third area concerns cases where a huge joint probability model is generated by relating possibly thousands of observations and parameters, and the only feasible way of making inferences about the unknown quantities is via a Bayesian approach [90].
Naive Bayes classifier
The Naive Bayes classifier is a family of simple probabilistic classifiers based on applying Bayes' theorem with strong (naive) independence assumptions between the features. It is also a popular method for text categorization [91]. Naive Bayes models are known under a variety of names, including Simple Bayes and Independent Bayes, but it should be taken into account that Naive Bayes
is not a Bayesian method [92][93]. Naive Bayes is a simple technique for constructing classifiers. The model assigns class labels to problem instances, represented as vectors of feature values, where the class labels are drawn from some finite set. All Naive Bayes classifiers assume that the value of a particular feature is independent of the value of any other feature, given the class variable. In many practical applications, parameter estimation for Naive Bayes models uses the method of Maximum Likelihood. Naive Bayes classifiers work quite well in many complex real-world situations, although they are also outperformed by other approaches such as boosted trees or Random Forests [94]. In addition, for classifying continuous features Gaussian Naive Bayes [93] is used, and for discrete features Multinomial Naive Bayes and Bernoulli Naive Bayes are used. Car theft detection is an example of an application of this algorithm [95][96].
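As an illustration, a Gaussian Naive Bayes classifier can be fitted with the naiveBayes() function of the e1071 package (assumed to be installed) on the built-in iris data; this is a sketch of my own, not code from the thesis appendix.

# Naive Bayes with the e1071 package on the built-in iris data (illustrative sketch)
library(e1071)
set.seed(3)
idx   <- sample(nrow(iris), 100)      # random train/test split
train <- iris[idx, ]
test  <- iris[-idx, ]

nb   <- naiveBayes(Species ~ ., data = train)   # Gaussian likelihoods for numeric features
pred <- predict(nb, test)
table(pred, actual = test$Species)              # confusion matrix on the held-out data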
Statistical relational learning (SRL)
Statistical relational learning is a subdiscipline of artificial intelligence and machine learning that is concerned with models which represent both uncertainty and complex, relational structure [97][98]. SRL uses first-order logic to describe relational properties of a domain in a general manner and draws upon probabilistic graphical models to model the uncertainty; some approaches also build on the methods of inductive logic programming. It is related to reasoning and knowledge representation. A number of standard tasks are concerned with statistical relational learning; the most common ones are collective classification, link prediction, link-based clustering, social network modeling and object identification/entity resolution/record linkage [99]. One of the fundamental goals of the representation formalisms in SRL is to represent general concepts that are universally applicable. Some of these formalisms are Bayesian logic programs, the BLOG model, Markov logic networks, multi-entity Bayesian networks, probabilistic soft logic, recursive random fields, relational Bayesian networks, relational dependency networks, relational Markov networks and relational Kalman filtering. SRL is utilized for both continuous and categorical data sets, for multi-dimensional, heterogeneous and noisy data, and for uncertain data like image processing data sets [100].
Minimum message length (MML)
Minimum message length is a computational implementation of Bayesian inference, an information-theoretic means of finding hypotheses of high posterior probability [101]. It is a formal information-theoretic restatement of Occam's razor: even when models are not equal in their fit accuracy to the observed data, the one that yields the shortest overall message is more likely to be correct. It has both theoretical and practical constructions and is used in decision trees, decision graphs, and so on. It is suitable for huge and multivariate databases and for categorical and continuous data sets. MML is useful in bioinformatics applications such as molecular biology [101].
Inductive logic programming
Inductive logic programming uses logic programming to represent background knowledge and hypotheses in a uniform way. It derives a hypothesis, in the form of a logic program, that entails all the positive examples and none of the negative examples [102]. In this type of programming, the background knowledge is given in the form of Horn clauses as used in logic programming. The positive and negative examples are given as conjunctions E+ and E− of positive and negative ground literals. Consequently, a hypothesis h has to satisfy some requirements, namely necessity, sufficiency, weak consistency and strong consistency. Necessity does not impose a restriction on h, but forbids any hypothesis construction as long as the positive facts are explainable without it. Sufficiency requires any constructed hypothesis h to explain all positive examples E+. Weak consistency forbids the generation of any hypothesis h that is incompatible with the background knowledge B. Strong consistency also forbids the generation of any hypothesis h that is inconsistent with the negative examples E−, given the background knowledge B; it implies weak consistency. If no negative examples are given, both requirements coincide. An example application of this method is learning from drug structures [102].
4.2.5 Maximum Entropy Classifier
A related approach is dimensionality reduction, which can be performed on a data tensor whose observations have been vectorized and organized into the tensor [103]. The mapping from a high-dimensional vector space to a set of lower-dimensional vector spaces is a multilinear projection. Multilinear subspace learning methods such as principal component analysis (PCA), a dimensionality reduction technique [104], independent component analysis (ICA), which separates multivariate signals into independent non-Gaussian signals [105], linear discriminant analysis (LDA), which tries to find linear combinations of features [105], and canonical correlation analysis (CCA), which aims to find linear combinations between two matrices of random data with maximum correlation with each other, are related to this classifier [106][107]. The Maximum Entropy classifier is used when there is no information about the prior distributions and when it is unsafe to make any such assumptions. Moreover, the Maximum Entropy classifier is used when we cannot assume conditional independence of the features. This algorithm follows the Principle of Maximum Entropy: from all the models that fit the training data, it selects the one which has the largest entropy. The Max Entropy classifier can be used to address a large variety of text classification problems such as language detection, topic classification, sentiment analysis and more [108].
4.2.6 K-Nearest Neighbor Algorithm
K-Nearest Neighbor (KNN) is utilized for large data sets where the data are categorical or continuous variables [4]. The idea is to find the K closest neighbors of an observation in the data set and then classify the observation based on these neighbors. It is useful to remove observations that do not contribute to the discrimination, as they do not affect the performance of the discriminant. It is also useful to normalize the marginal densities by transforming the observations; usually a monotonic transformation of the power-law type is applied. To this end, variables can also be combined, for instance by taking ratios or differences of key variables; background knowledge of the problem helps to derive such transformations. For instance, in the Iris data, the product of the variables Petal Length and Petal Width gives a single attribute which has the dimension of an area, and may be labelled
as Petal Area. Therefore a decision rule based on the single variable Petal Area is a good classifier [87]. Let's look through an example: there is a data set of different Aceton products, shown in Table 4.12, which records two attributes, durability and strength, for each product:
X1 = Aceton Durability   X2 = Aceton Strength   Y = Classification
7                        7                      bad
7                        4                      bad
3                        4                      good
1                        4                      good

Table 4.12: Aceton Data
A new Aceton product is produced and it is required to determine its quality classification based on its durability and strength, (X1, X2) = (5, 6). In this process K = 3, and the squared distance measure is used to compute the distance from each training product to the query (5, 6). The result is shown in Table 4.13.
X1   X2   Squared distance to (5,6)     Rank   Include in 3-NN?
7    7    (7−5)² + (7−6)² = 4+1 = 5     1      yes
7    4    (7−5)² + (4−6)² = 4+4 = 8     2      yes
3    4    (3−5)² + (4−6)² = 4+4 = 8     3      yes
1    4    (1−5)² + (4−6)² = 16+4 = 20   4      no

Table 4.13: Aceton similarity
This table also indicates the rank of each Aceton product based on its distance to the new product and determines whether it is included in the 3-nearest neighborhood. Hence, the category of each product is determined in Table 4.14:
X1   X2   Squared distance to (5,6)     Rank   Include in 3-NN?   Y = category of NN
7    7    (7−5)² + (7−6)² = 4+1 = 5     1      yes                bad
7    4    (7−5)² + (4−6)² = 4+4 = 8     2      yes                bad
3    4    (3−5)² + (4−6)² = 4+4 = 8     3      yes                good
1    4    (1−5)² + (4−6)² = 16+4 = 20   4      no                 -

Table 4.14: Result of Categorization
Since two of the three nearest neighbors are of bad quality and one is of good quality, and 2 > 1, it is concluded that the new Aceton product belongs to the bad category [87].
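The worked example can be reproduced with a few lines of base R, as in the following illustrative sketch; the same classification can also be obtained with the knn() function of the class package.

# 3-nearest-neighbor classification of the new Aceton product (illustrative sketch)
train <- data.frame(durability = c(7, 7, 3, 1),
                    strength   = c(7, 4, 4, 4),
                    class      = c("bad", "bad", "good", "good"))
query <- c(durability = 5, strength = 6)

d2 <- (train$durability - query["durability"])^2 +
      (train$strength   - query["strength"])^2     # squared distances: 5, 8, 8, 20
nn    <- order(d2)[1:3]                            # indices of the 3 nearest neighbors
votes <- table(train$class[nn])                    # 2 x "bad", 1 x "good"
names(which.max(votes))                            # majority vote -> "bad"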
4.2.7 Support Vector Machines
The Support Vector Machine, known as SVM, is a kind of supervised learning algorithm which solves an optimization problem [109]. The SVM tries to find decision boundaries between two classes of data which are maximally far away from the training data points. It is a very suitable method to apply on small training sets. It should be considered that for two classes of data several linear separators may exist; the perceptron algorithm, for example, tries to find any linear separator, whereas Naive Bayes 4.2.4 tries to find the best linear separator. The SVM looks for the decision boundary that is maximally far from the data points. The margin of the classifier is then determined by the distance from the decision boundary, and the points which determine the position of the separator are called support vectors. The best classifier is the one that has a large margin, because it can make classification decisions with high certainty. The linear classifier is defined as f(x) = sign(wᵀx + b) [109], where w is the weight vector (the normal vector of the decision hyperplane) and b is the offset of the decision hyperplane. The data set is denoted by D = (x_i, y_i), where D is the data set, x_i are the data points
and y_i indicates the target, which is either 1 or −1, where 1 is one class and −1 is the other class. Figure 4.4 shows a linear SVM classifier.
Figure 4.4: Linear SVM
[109] In a linear SVM we can define a hard margin and a soft margin, the latter for the case where it is not possible to split the instances with the two different target feature levels by a linear hyperplane even after using a kernel function [109]. In the case of a non-linear SVM, the optimization can be solved with quadratic programming (QP) libraries; there are also other libraries for building SVM models. A non-linear SVM can use different kernel types, such as polynomial kernels and radial basis functions. The most common form of radial basis function is the Gaussian distribution. A radial basis function (RBF) is equivalent to mapping the data into an infinite-dimensional Hilbert space (a Hilbert space generalizes the idea of Euclidean space; it extends the methods of vector algebra from the two-dimensional Euclidean plane and three-dimensional space to spaces with any finite or infinite number of dimensions). It should be mentioned that a polynomial kernel allows modeling feature conjunctions; for instance, it can be used to find occurrences of pairs of words that offer information about topic classification which is not given by the individual words alone. A cubic kernel is required if occurrences of triples of words carry distinctive information [109]. In order to generate an optimal hyperplane, the SVM employs an iterative
training algorithm that is used to minimize an error function. Based on the form of the error function, SVM models are categorized into four types:
• Classification SVM Type 1 (C-SVM classification)
• Classification SVM Type 2 (nu-SVM classification)
• Regression SVM Type 1 (epsilon-SVM regression)
• Regression SVM Type 2 (nu-SVM regression) [110]
For classification SVM Type 1, training consists of the minimization of the error function:
1/2 WᵀW + C Σ_{i=1}^{N} ξ_i

subject to the constraints:

y_i (Wᵀφ(x_i) + b) ≥ 1 − ξ_i   and   ξ_i ≥ 0,  i = 1, ..., N
[110] where C is the capacity constant, W is the vector of coefficients, b is a constant, ξ_i represents parameters for handling non-separable data points, and i labels the N training cases. The targets y_i ∈ {+1, −1} are the class labels and the x_i are the independent variables. The kernel φ transforms the data from the input space to the feature space. It should be noted that the larger C is, the more strongly errors are penalized; hence C should be chosen with care to avoid overfitting. In the case of classification SVM Type 2, in contrast with Type 1, the model minimizes the error function:
1/2 WᵀW − νρ + (1/N) Σ_{i=1}^{N} ξ_i

subject to the constraints:

y_i (Wᵀφ(x_i) + b) ≥ ρ − ξ_i,   ξ_i ≥ 0,  i = 1, ..., N   and   ρ ≥ 0
[110] In a regression SVM, the functional dependence of the dependent variable y on a set of independent variables x is estimated. It is assumed that the relationship between the independent and dependent variables is given by a deterministic function f plus some additive noise:

y = f(x) + noise

[110] The task is then to find a functional form for f that can correctly predict new cases. This is done by training the SVM model on a sample set, i.e. the training set, a process that consists of the sequential optimization of an error function, as in classification. Depending on the definition of this error function, two types of regression SVM models can be recognized. The first one is Regression SVM Type 1, whose error function is:

1/2 WᵀW + C Σ_{i=1}^{N} ξ_i + C Σ_{i=1}^{N} ξ_i*
which we minimize subject to : W T φ(xi ) + b − yi ≤ + ξi yi − (W T )φ(xi ) − bi ≤ + ξi ξi , ξi ≥ 0, i = 1, ..., N [110] The second type is Regression SVM Type 2 which error function is given by:
1/2(W T )W − C(V + 1/N ΣN i=1 (ξi + ξi )) which we minimize subject to : (W T φ(xi ) + b) − yi ≤ + ξi
56
4.2 Supervised Learning algorithms yi − (W T φ(xi ) + bi ) ≤ +ξi ξi ξi ≥ 0, i = 1, ...., N, ≥ [110] There are different types of kernels that can be used in SVM. These include Linear, Polynomial, Radial Basis Function(RBF) and Sigmoid which is demonstrated in math equation in Table 4.2.7.
Linear Kernel       x_i^T x_j
Polynomial Kernel   (λ x_i^T x_j + c)^d
RBF Kernel          exp(−λ ||x_i − x_j||^2)
Sigmoid Kernel      tanh(λ x_i^T x_j + c)

Table 4.15: Kernels
[110] The SVM model is suitable for both continuous and categorical data and can be used for regression as well as classification problems. An example application is human detection in computer vision [110]. Pros: SVM has several advantages: it handles both linearly and non-linearly separable data, it works well with semi-supervised learning, it guarantees that the optimal solution is a global rather than a local minimum, it can deal with very high-dimensional data, it can learn very elaborate concepts, and it usually works very well in practice. Cons: it cannot directly accommodate natural language processing approaches that operate on collections of words, it requires both positive and negative examples, a good kernel function has to be selected, and it needs a lot of CPU time. There can also be numerical stability problems when solving the constrained optimization problem.
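As a minimal illustration (not the thesis experiments), SVM classifiers with different kernels can be fitted in R with the e1071 package; the iris data set and the chosen cost and gamma values below are only stand-ins:

```r
# Minimal sketch: C-SVM classification with linear and RBF kernels (e1071).
library(e1071)

data(iris)
set.seed(1)
train_idx <- sample(nrow(iris), 100)          # simple hold-out split
train <- iris[train_idx, ]
test  <- iris[-train_idx, ]

# C-SVM with a linear kernel
svm_linear <- svm(Species ~ ., data = train, kernel = "linear", cost = 1)

# C-SVM with an RBF (Gaussian) kernel; gamma plays the role of lambda in Table 4.15
svm_rbf <- svm(Species ~ ., data = train, kernel = "radial", cost = 1, gamma = 0.5)

# compare test-set accuracy of the two kernels
mean(predict(svm_linear, test) == test$Species)
mean(predict(svm_rbf, test) == test$Species)
```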
Minimum Complexity Machines
The Minimum Complexity Machine (MCM) performs at least as well as conventional SVMs in terms of test set accuracy, while often using far fewer support vectors. The approach minimizes a bound on the machine capacity; on many data sets the MCM shows better test set accuracy while using less than one-tenth of the number of support vectors obtained by SVMs. The linear MCM can classify a linearly separable data set with zero error. Because data sets are often not linearly separable, a soft-margin equivalent of the MCM is obtained by introducing additional slack variables. The kernel MCM is a further variant that defines the separating hyperplane in the image (feature) space. It is also useful for categorical data sets [111].
4.2.8 Conditional Inference Trees
Conditional inference trees are based on statistics: they split the data using non-parametric tests and correct for multiple testing in order to avoid overfitting. No pruning is required, and the result is unbiased predictor selection [112]. Decision trees can vary along several dimensions. First, a test can be multivariate (testing several features at once) or univariate (testing only one input feature). Second, a test can have two or more outcomes; if every test has exactly two outcomes, the tree is a binary decision tree [112]. Third, the features can be categorical, numerical or binary. Finally, when the inputs are binary and there are two classes, the tree implements a Boolean function and is called a Boolean decision tree. A highly unbalanced multivariate decision tree corresponds to the k-DL class of Boolean functions. Regarding the choice of test: if the attributes are binary, the tests simply correspond to the values 1 or 0; if the attributes are categorical, tests may be formed by dividing the attribute values into mutually exclusive and exhaustive subsets [112]. Another method is to select tests by uncertainty reduction. With binary attributes, the main problem when learning a decision tree is the order of the tests; with categorical or numerical attributes, the type of test must also be decided [112]. Algorithms that build decision trees usually work top-down, by choosing at each step the variable that best splits the
set of items. Different algorithms use different metrics for measuring the best feature. Examples of such measures are:
Gini impurity: a measure of how often a randomly chosen element of a subset would be mislabeled if it were labeled according to the distribution of labels in that subset. In practice, the Gini index is computed for the whole data set and the sum of the weighted Gini index scores of the partitions created by the feature is subtracted from it [112]. It is described and formalized in the CART procedure in 4.2.8.
Information gain: used by the ID3 (4.2.8), C4.5 (4.2.8) and C5.0 tree-generation algorithms. Information gain is based on the entropy concept from information theory:
Information Gain = Entropy(parent) − Weighted Sum of Entropy(Children) [112]
It links a measure of the heterogeneity of a set to predictive power: the goal is a sequence of tests that divides the training data into sets that are pure with respect to the target values, so that samples can be labeled according to the test sequence. More precisely, information gain is computed as follows. First, the entropy of the whole data set with respect to the target values is calculated; this indicates how much information is required to organize the data set into pure sets [109]. The data set is then split into subsets for each value of the candidate feature and the entropy scores of the subsets are aggregated; this gives the remaining entropy, i.e. the information still needed to divide the samples into pure sets after splitting on that feature. The information gain is obtained by subtracting this remaining entropy from the original entropy [109].
Variance reduction: introduced in CART for the case where the target variable is continuous (regression trees); many other metrics would first require discretization before being applied [113]. Assuming that the set of training instances reaching a leaf node is representative of the queries that will be labeled by that node, it makes sense to grow regression trees so that the variance of the target feature values at each leaf node is reduced. This can be done by running an ID3-style algorithm with a variance measure instead of entropy. Based on the variance, the impurity of a node is computed as:
Var(t, D) = Σ_{i=1}^{n} (t_i − t̄)² / (n − 1)
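As an illustration of how these impurity measures are computed, the following is a minimal sketch in base R; the toy vectors outlook and play are hypothetical:

```r
# Minimal sketch (hypothetical data): entropy, Gini index and information gain
# for a single categorical split, in base R.
entropy <- function(y) {
  p <- table(y) / length(y)
  -sum(p * log2(p))
}

gini <- function(y) {
  p <- table(y) / length(y)
  sum(p * (1 - p))
}

info_gain <- function(feature, target) {
  h_parent <- entropy(target)
  # remaining entropy: weighted sum over the partitions induced by the feature
  h_children <- sum(sapply(split(target, feature), function(part) {
    (length(part) / length(target)) * entropy(part)
  }))
  h_parent - h_children
}

# hypothetical toy data: does "outlook" help predict "play"?
outlook <- c("sunny", "sunny", "overcast", "rain", "rain", "overcast")
play    <- c("no",    "no",    "yes",      "yes",  "no",   "yes")
info_gain(outlook, play)
```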
[109] The advantages of decision trees include the following: they are simple to understand and interpret [114], they can handle both numerical and categorical data [114], they need little data preparation [114], they are white-box models [114], statistical tests can be used to validate the model [114], they perform well on large data sets [114], and they reflect human decision making more closely than other approaches [114]. Among the drawbacks, information gain is biased towards attributes with more levels, and therefore towards categorical variables with many different levels, and computation can become very complex, especially when many values are uncertain or many outcomes are linked. There are further limitations. First, decision trees are often not as accurate as other approaches [114]. Second, they are not robust: a small change in the training data can change the tree significantly [114]. Third, learning an optimal decision tree is known to be NP-complete under several notions of optimality, even for simple concepts [115]. Practical decision-tree learning algorithms are therefore based on heuristics such as greedy search; to reduce the effect of greedy, locally optimal choices, methods such as the dual information distance (DID) tree were introduced [116]. Furthermore, decision-tree learners can create overly complex trees that do not generalize beyond the training set; mechanisms such as pruning are needed to avoid this problem [112] [117]. Moreover, decision trees cannot compactly represent concepts such as parity, XOR or the multiplexer problem, which makes the trees prohibitively large; in such cases statistical relational learning or inductive logic programming can be used, or the representation of the problem can be changed [118]. A decision graph is an extension of a decision tree that uses disjunctions to join two or more paths together, guided by minimum message length [119]. Decision graphs can be extended further so that new attributes are learned dynamically and used at different places within the graph [120]. A more general coding scheme results in better predictive accuracy and better log-loss probabilistic scoring. In general, decision graphs yield models with fewer leaves than decision trees [120].
It has been shown that a decision tree based on a DNF function is equivalent to a two-layer feed-forward neural network; in addition, multivariate decision trees whose nodes use linearly separable functions can be generated by feed-forward networks [1]. Because a decision tree can implement any Boolean function, it is prone to overfitting when many trees are consistent with the training set. Several methods address this challenge. The most direct one is to evaluate the tree on a separate test set, although when several learning systems are compared on the same test set there is a risk of simply selecting the one that happens to perform best on it [1]. Another way is to split the available data into 2/3 for training and 1/3 for estimating generalization performance; the difficulty with this method is that it reduces the size of the training set and therefore increases the chance of overfitting. Validation techniques attempt to mitigate this problem, for example:
Cross-validation: divide the training set into K subsets; for each subset, train on the union of all the other subsets and estimate the error rate on the held-out subset, then average the resulting error rates [1].
Leave-one-out validation: the same as cross-validation, except that each held-out subset consists of a single training pattern. Each pattern is tested in turn, the total number of errors is counted and divided by the number of patterns to obtain the estimated error rate. This is useful when a more precise estimate of a classifier's error rate is essential [1].
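A minimal sketch of k-fold cross-validation for a decision tree in base R follows; rpart on the iris data and the choice of k = 5 are only illustrative assumptions, not the thesis setup:

```r
# Minimal sketch: estimating a decision tree's error rate with k-fold cross-validation.
library(rpart)

k <- 5
set.seed(1)
folds <- sample(rep(1:k, length.out = nrow(iris)))   # random fold assignment

errors <- sapply(1:k, function(f) {
  train <- iris[folds != f, ]
  test  <- iris[folds == f, ]
  fit   <- rpart(Species ~ ., data = train, method = "class")
  pred  <- predict(fit, test, type = "class")
  mean(pred != test$Species)                          # misclassification rate on fold f
})

mean(errors)   # cross-validated error estimate
```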
Another cause of overfitting is having only a small number of patterns at a node, so that the decision at that node is based on few samples. To deal with this, testing should stop before all patterns are split into perfectly pure subsets; a leaf node may then contain more than one class, and its label is decided by the most numerous class [1]. Cross-validation can also be used to determine when to stop splitting nodes: if splitting a node increases the cross-validation error, the splitting procedure stops. The stopping point matters, because underfitting can lead to more error than overfitting [1].
Decision tree learning
Decision tree learning is used in data mining, machine learning and statistics; it uses a decision tree structure as a model to predict the target value corresponding to the represented data. In such a tree, each leaf represents a class label and each branch represents a conjunction of features that leads to that class label [121]. Decision trees are used for decision making and to illustrate and present decisions; in data mining, a decision tree describes the data, and the resulting classification tree can serve as an input for decision making. A tree is learned by splitting the data set into subsets based on an attribute-value test. This process is repeated on each derived subset recursively, which is called recursive partitioning, and it stops when the subset at a node has the same value of the target variable for all instances, or when splitting no longer adds value to the predictions. This process of Top-Down Induction of Decision Trees (TDIDT) is an instance of a greedy algorithm and is by far the most common strategy for learning decision trees from data [121]. In data mining, decision trees can be described as the combination of mathematical and computational techniques to aid the description, categorization and generalization of a given set of data. A data record has the form:

(x, Y) = (x_1, x_2, ..., x_k, Y)

[121] Y is the target (dependent) variable that we try to understand, classify and generalize; the vector x consists of the input variables. In a decision tree, internal nodes represent tests on the input features and leaf nodes represent the categories of the data points. Decision trees have two main categories: classification tree analysis, where the predicted outcome is the class to which the data belongs, and regression tree analysis, where the predicted outcome is a real number. Some techniques construct more than one decision tree, such as boosted trees, which repeatedly train weak learners on the training data, re-weighting the samples that were predicted wrongly so that the combined learner becomes stronger [121]. Bagged decision trees, also
known as bootstrap aggregation, average a given procedure over many bootstrap samples in order to decrease its variance; as a result, the decision boundaries become smoother. The procedure for bagging Classification and Regression Trees (CART) is as follows: first, construct many random sub-samples of the data set with replacement; second, train a CART model on each sub-sample; third, average the predictions of all the CART models generated in the previous step on a new data set [122]. Moreover, a random forest is a classifier based on a specific type of bootstrap aggregating (bagging): combining bagging over random feature subspaces with decision trees yields a random forest model. Depending on the kind of prediction, the ensemble combines the individual models by majority vote or by averaging (e.g. the median) once all models are built, and it also provides an out-of-bag error estimate for each tree. It can be seen as an improvement of bagged trees obtained by decorrelating the individual trees. Another technique is the rotation forest, a set of trees in which each tree uses a different subset of features and is trained separately, with the aim of building accurate and diverse classifiers. As in bagging, bootstrap samples are used as the training set for the individual classifiers. To construct the feature set for each classifier in the ensemble, the feature set is randomly split into K subsets, Principal Component Analysis (PCA) is applied to each subset, and a new set of linear features is built by pooling all principal components. The data is then transformed linearly into the new feature space and the classifier is trained on this transformed data set. The accuracy of the individual classifiers contributes to the accuracy of the ensemble, but does not guarantee it. Differences in performance between random projection and sparse random projection can be attributed to accuracy: projecting data randomly acts like adding noise, and the degree of non-sparseness of the projection matrix determines the amount of noise. To assess the impact of PCA, it can be exchanged for sparse random projection or Non-parametric Discriminant Analysis (NDA). In general, a decision tree has a flow-chart structure in which each internal (non-leaf) node denotes a test on an attribute, each branch represents an outcome of the test, and each leaf (terminal) node holds a class label; the topmost node is the root node [123]. The bagging procedure described above is sketched in the listing below; after that, a number of specific decision-tree algorithms are described, notably ID3, C4.5 and CART.
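A minimal sketch of bagged CART models in R follows; rpart on the iris data stands in for an arbitrary data set, and the number of trees is an arbitrary illustrative choice:

```r
# Minimal sketch: bagging CART models (rpart) over bootstrap samples with majority voting.
library(rpart)

set.seed(1)
n_trees <- 25
n <- nrow(iris)

# 1) draw bootstrap samples and 2) train one CART model per sample
models <- lapply(1:n_trees, function(b) {
  boot_idx <- sample(n, n, replace = TRUE)
  rpart(Species ~ ., data = iris[boot_idx, ], method = "class")
})

# 3) aggregate: majority vote of the individual tree predictions
bagged_predict <- function(models, newdata) {
  votes <- sapply(models, function(m) as.character(predict(m, newdata, type = "class")))
  apply(votes, 1, function(row) names(which.max(table(row))))
}

pred <- bagged_predict(models, iris)
mean(pred == iris$Species)   # resubstitution accuracy (optimistic)
```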
ID3 (Iterative Dichotomiser 3)
This algorithm constructs the shallowest decision tree recursively, in a depth-first manner. Initially, the best feature according to information gain is chosen as the test feature at the root [109]. The data set is then partitioned according to the outcomes of this root test, so that each partition contains the instances with the corresponding test outcome. The outcomes label the branches from the internal (child) nodes, and the procedure is repeated for each child node; the class is finally indicated at the leaf nodes. Note that, in the end, each leaf node should contain instances of a single class. As an example of the ID3 algorithm, consider a credit-risk problem with the following properties: Collateral, with possible values adequate, none; Income, with possible values "$0 to $15K", "$15K to $35K", "over $35K"; Debt, with possible values high, low; Credit History, with possible values good, bad, unknown. The resulting ID3 tree is shown in Figure 4.5.
Figure 4.5: ID3 example
[109]
C4.5 (successor of ID3)
C4.5 is an extension of the ID3 algorithm and is applied to classification problems [124]; it is therefore often referred to as a statistical classifier. It generates the decision tree in the same way as ID3, using the concept of information entropy. Let S be the set of training samples. Each sample s_i consists of a p-dimensional vector (x_{1,i}, x_{2,i}, ..., x_{p,i}), where the x_{j,i} are the attribute values of the sample, together with the class in which s_i falls. At each node, C4.5 chooses the attribute that most effectively splits the samples into subsets enriched in one class or the other. The splitting criterion is the normalized information gain (difference in entropy), and the attribute with the highest normalized information gain is chosen. The C4.5 algorithm then recurses on the smaller sub-lists. Some base cases are handled explicitly. First, if all samples in a list belong to the same class, C4.5 creates a leaf node for that class [125]. Second, if none of the features provides any information gain, C4.5 creates a decision node higher up the tree using the expected value of the class. Third, if an instance of a previously unseen class is encountered, C4.5 again creates a decision node higher up the tree using the expected value of the class [125]. The following example illustrates a C4.5 model for forecasting weather conditions, with an outlook attribute (sunny, overcast, rain) and the features temperature, windy and play status.
Figure 4.6: C4.5 example
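A model in this family can be fitted in R with the C50 package, which implements C5.0, the successor of C4.5. The sketch below uses the iris data set as a stand-in rather than the weather example above:

```r
# Minimal sketch: fitting a C5.0 tree (successor of C4.5) with the C50 package.
library(C50)

fit <- C5.0(Species ~ ., data = iris)
summary(fit)                       # prints the tree and attribute usage
predict(fit, head(iris))           # class predictions for a few rows
```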
CART (Classification And Regression Tree)
CART constructs a binary decision tree from the data set, which keeps the splitting criteria simple. At each internal node a single input variable is selected and the data points are split on that variable; each leaf node contains the predicted outcome for that variable. The algorithm uses a greedy strategy to choose the input variable and split point, which is called recursive binary splitting; a cost function is evaluated to test and train the candidate split points. For regression trees, the cost function to be minimized is the sum of squared errors within each resulting rectangle:

Sum(y − prediction)²

[3] where y is the target output. An example of a regression tree is shown in Figure 4.7, which predicts a numeric outcome for cars. Note that the order in which variables are examined depends on the answers to the previous questions, and the numbers in parentheses at the leaves indicate how many data points belong to each leaf.
Figure 4.7: Regression tree example
For classification, the Gini index is used as the cost function; it indicates how mixed the training data assigned to each node is:

G = Sum(pk · (1 − pk))

[3] where G is the Gini cost and pk is the proportion of training samples of class k in the rectangle of interest. A node that contains only a single class has G = 0, while a 50-50 split between two classes gives G = 0.5, which is the worst purity for a two-class problem. To stop the splitting of data points, a stopping criterion called the minimum count is defined: when the number of training samples at a node falls below this minimum, splitting stops. This count is important, because a very small count (e.g. a count of 1) causes overfitting and poor performance on the test set. To improve performance, pruning is used. In the CART algorithm simpler trees are preferred due to their
low complexity. One pruning technique is to use a hold-out test set and evaluate the effect of removing each leaf node from the tree. A more complex pruning method is cost-complexity pruning, which uses a parameter alpha to decide whether removing a leaf node is beneficial, based on the size of the sub-tree [3]. As an example of a classification tree, consider a data set split according to the Gini index, where G = 1 indicates that the data belongs to the same category as the previous data points and G = 0 indicates that it belongs to a different category. The resulting classification tree is illustrated below:
Figure 4.8: Classification tree example
Initially, a training set is created in which the classification label (i.e., purchaser or non-purchaser) is known for each record. The algorithm then assigns each record to one of two subsets on some basis (e.g., income above or below $75,000).
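For reference, a CART-style classification tree with Gini-based splitting and cost-complexity pruning can be fitted in R with the rpart package; the sketch below uses the iris data as an illustrative stand-in, not the purchaser example of the figure:

```r
# Minimal sketch: CART classification tree with rpart, Gini splitting and
# cost-complexity pruning via the complexity parameter cp.
library(rpart)

fit <- rpart(Species ~ ., data = iris, method = "class",
             parms = list(split = "gini"))

printcp(fit)                               # cross-validated error for each cp value
best_cp <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
pruned  <- prune(fit, cp = best_cp)        # cost-complexity pruning

predict(pruned, head(iris), type = "class")
```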
The Bellman operator T^π : R^X → R^X underlying a policy π is defined by

(T^π v)(x) = r(x, π(x)) + γ Σ_{y∈X} P(x, π(x), y) v(y),  x ∈ X

with the help of which value functions can be computed [155]. Values are the basis for evaluating decisions: we seek actions that yield states of highest value, not highest immediate reward; indeed, the most important component of almost all reinforcement learning algorithms is a method for efficiently estimating values [6]. Model: a model mimics the behavior of the environment; given a state and an action, it generates the next state and next reward of the system [6]. A basic model of reinforcement learning contains five elements: a set of environment and agent states, a set of agent actions, policies for transitioning from states to actions, rules that
determine the immediate reward of a transition, and rules that describe what the agent observes [7]. At each time t, the agent receives an observation o_t, which typically includes the reward r_t. It then chooses an action a_t from the set of available actions, which is subsequently sent to the environment. The environment moves to a new state s_{t+1}, and the reward r_{t+1} associated with the transition (s_t, a_t, s_{t+1}) is determined. The goal of a reinforcement learning agent is to collect as much reward as possible; the agent can choose any action as a function of the history and may even randomize its action selection [7]. The return underlying a behavior is defined as the total discounted sum of the rewards incurred:

R = Σ_{t=0}^{∞} γ^t r_{t+1}
[155]
Example of reinforcement learning: a master chess player makes a move. The choice is informed both by planning, anticipating possible replies and counter-replies, and by immediate, intuitive judgments of the desirability of particular positions and moves [6]. Reinforcement learning algorithms are as follows:
Policy iteration algorithm: a policy π can be improved using V^π to obtain a better policy π′; we can then compute V^{π′} and improve it again to an even better policy π″. The resulting sequence of policies and value functions can be written as:

π_0 →E V^{π_0} →I π_1 →E V^{π_1} →I π_2 →E ... →I π* →E V*

where →E denotes a policy evaluation step and →I a policy improvement step. Each policy is guaranteed to be a strict improvement over the previous one. Because a finite MDP has only a finite number of policies, this process must converge to an optimal policy and optimal value function in a finite number of iterations; in practice, policy iteration often converges in surprisingly few iterations [156].
4.4.1 Temporal difference learning
This algorithm is known as tabular TD(0). It can be seen as a combination of the Monte Carlo idea with dynamic programming, and TD learning is related to the temporal-difference model of animal learning [156]. The fundamental idea of TD learning is that it learns from samples, and in this respect it resembles the Monte Carlo method, but it also adjusts its predictions to match later predictions about the future; this is a form of bootstrapping. It can be derived as follows. Let r_t be the reward (return) at time step t and let v̄_t be the correct prediction, equal to the discounted sum of all future rewards, where discounting is done by powers of a factor γ so that rewards at distant time steps are less important [156]:

v̄_t = Σ_{i=0}^{∞} γ^i r_{t+i},  where 0 ≤ γ < 1.

This formula can be expanded by taking out the first term and re-indexing the sum so that i starts again from 0:

v̄_t = r_t + Σ_{i=1}^{∞} γ^i r_{t+i}
v̄_t = r_t + Σ_{i=0}^{∞} γ^{i+1} r_{t+i+1}
v̄_t = r_t + γ Σ_{i=0}^{∞} γ^i r_{t+i+1}
v̄_t = r_t + γ v̄_{t+1}

Thus, the reward is the difference between the correct prediction and the discounted next prediction:

r_t = v̄_t − γ v̄_{t+1}

[156] An example of this algorithm is the driving-home example; Table 4.22 lists the times for each stage of the journey:
State                 Elapsed Time    Predicted Time to Go    Predicted Total Time
Leaving office        0               30                      30
Reach car, raining    5               35                      40
Exiting highway       20              15                      35
Behind truck          30              10                      40
Home street           40              3                       43
Arrive home           43              0                       43

Table 4.22: Driving Home
Here, the rewards are the elapsed times on each leg of the journey. We are not discounting (γ = 1), and thus the return for each state is the actual time to go from that state. The value of each state is the expected time to go. The corresponding TD diagram is shown below:
Figure 4.10: TD Chart
[156]
V(s_t) ← V(s_t) + α [r_{t+1} + γ V(s_{t+1}) − V(s_t)] [156]
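The tabular TD(0) update rule above can be sketched in a few lines of base R; the states, rewards and the single episode below are hypothetical and only illustrate the update:

```r
# Minimal sketch: tabular TD(0) update V(s) <- V(s) + alpha*(r + gamma*V(s') - V(s)).
# The states, rewards and episode are hypothetical.
states <- c("A", "B", "C", "terminal")
V      <- setNames(rep(0, length(states)), states)   # value table initialized to 0
alpha  <- 0.1
gamma  <- 1.0

# one hypothetical episode: (state, reward on leaving it, next state)
episode <- data.frame(s  = c("A", "B", "C"),
                      r  = c(5, 15, 10),
                      s2 = c("B", "C", "terminal"),
                      stringsAsFactors = FALSE)

for (i in seq_len(nrow(episode))) {
  s  <- episode$s[i]; r <- episode$r[i]; s2 <- episode$s2[i]
  V[s] <- V[s] + alpha * (r + gamma * V[s2] - V[s])   # TD(0) update
}
V
```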
4.4.2 Monte-Carlo algorithm
Monte Carlo methods require only experience: sample sequences of states, actions and rewards from actual or simulated interaction with an environment. Learning from actual experience is striking because it requires no prior knowledge of the environment's dynamics, yet can still attain optimal behavior. Learning from simulated experience is also very powerful and only requires that sample transitions can be generated. Monte Carlo is a way of solving the reinforcement learning problem based on averaging sample returns. The method can only be defined
for episodic tasks, to ensure that well-defined returns are available. Value estimates and policies are changed only upon the completion of an episode, so these methods are incremental in an episode-by-episode sense, but not in a step-by-step sense. For the driving-home example, the Monte Carlo chart is as follows:
Figure 4.11: MC Chart
[156] V(s_t) ← V(s_t) + α [R_t − V(s_t)] [156]
Unifying Monte-Carlo and TD(0)
This is achieved by the TD(λ) family of methods (Sutton, 1984, 1988), where λ ∈ [0, 1] is a parameter that allows one to interpolate between TD(0) and the Monte Carlo method: λ = 0 gives TD(0), while λ = 1, i.e. TD(1), corresponds to the Monte Carlo method. The TD(λ) update is given as a mixture of the multi-step return predictions:

R_{t:k} = Σ_{s=t}^{t+k} γ^{s−t} R_{s+1} + γ^{k+1} v̂_t(X_{t+k+1})

[155] For large state spaces, related algorithms include TD(λ) with function approximation, gradient temporal-difference learning, LSTD (least-squares temporal difference learning), LSPE (least-squares policy evaluation), and comparisons of least-squares and TD-like methods [157].
4.4.3 Sarsa: On-policy TD control
TD prediction methods can also be used for the control problem. To this end, the pattern of generalized policy iteration (GPI) is used, with TD methods for the evaluation (prediction) part. As with Monte Carlo methods, the approaches fall into two major classes: on-policy and off-policy. Sarsa is an on-policy TD control method. To implement it, an action-value function is learned instead of a state-value function; in particular, for an on-policy method we must estimate Q^π(s, a) for the current behavior policy π and for all states s and actions a [156]. Learning the values of state-action pairs amounts to a Markov chain with a reward process, and the theorem assuring the convergence of state values under TD(0) also applies to the corresponding algorithm for action values:
Q(s_t, a_t) ← Q(s_t, a_t) + α [r_{t+1} + γ Q(s_{t+1}, a_{t+1}) − Q(s_t, a_t)]

[156] This update is performed after every transition from a non-terminal state s_t; if s_{t+1} is terminal, Q(s_{t+1}, a_{t+1}) is defined as zero. The rule uses the quintuple of events (s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}) that make up a transition from one state-action pair to the next, which gives the algorithm its name, Sarsa. The general form of the Sarsa control algorithm is as follows:
• Initialize Q(s, a) arbitrarily
• Repeat (for each episode):
• Initialize s
• Choose a from s using a policy derived from Q (e.g. ε-greedy)
• Repeat (for each step of the episode):
• Take action a, observe r, s′
• Choose a′ from s′ using a policy derived from Q (e.g. ε-greedy)
• Q(s, a) ← Q(s, a) + α [r + γ Q(s′, a′) − Q(s, a)]
• s ← s′; a ← a′; until s is terminal
[156] An example of this algorithm is the cliff-walking grid world: the goal state G is in the lower right-hand corner and the start square S is in the lower left-hand corner. There is a reward of −100 for stepping off the cliff and −1 for every other transition, including the steps along the top row of the world.
Figure 4.12: Cliff board
[156] Sarsa learns the safe path along the top row of the grid because it takes the action-selection method into account when learning. Because it learns the safe path, it actually receives the highest average reward per trial.
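A minimal sketch of the Sarsa update with ε-greedy action selection follows; the tiny one-dimensional corridor world below (states 1 to 5, terminal state 5, reward −1 per step) is hypothetical and stands in for the cliff-walking grid:

```r
# Minimal sketch: Sarsa on a hypothetical corridor world.
set.seed(1)
n_states <- 5; actions <- c(-1, +1)              # move left / move right
Q <- matrix(0, nrow = n_states, ncol = length(actions))
alpha <- 0.1; gamma <- 1.0; eps <- 0.1

eps_greedy <- function(s) {
  if (runif(1) < eps) sample(seq_along(actions), 1) else which.max(Q[s, ])
}
step <- function(s, a) {                          # environment dynamics
  s2 <- min(max(s + actions[a], 1), n_states)
  list(s2 = s2, r = -1, done = (s2 == n_states))
}

for (episode in 1:200) {
  s <- 1; a <- eps_greedy(s)
  repeat {
    out <- step(s, a)
    a2  <- eps_greedy(out$s2)
    target <- if (out$done) out$r else out$r + gamma * Q[out$s2, a2]
    Q[s, a] <- Q[s, a] + alpha * (target - Q[s, a])   # Sarsa update
    s <- out$s2; a <- a2
    if (out$done) break
  }
}
Q
```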
4.4.4 Learning Automata
Learning automata are a branch of adaptive control and were originally described explicitly as finite-state automata. A learning automaton selects its current action based on past experience with the environment. It falls within the scope of reinforcement learning if the environment is stochastic and a Markov Decision Process (MDP) is used. The automaton should learn to minimize the number of penalty responses; the feedback loop of automaton and environment is called a "P-model". More generally, a "Q-model" allows an arbitrary finite input set x, and an "S-model" uses the interval [0, 1] of real numbers as x, i.e. continuous feedback. An example application is building pronunciation models for spoken words [158].
4.4.5 Q-learning
Q-learning is a model-free reinforcement learning algorithm. It works by learning an action-value function that eventually gives the expected utility of taking a given action in a given state and following the optimal policy thereafter [156]. A policy is a rule by which the agent selects actions in a given state. Once such an action-value function has been learned, the optimal policy can be produced by simply selecting the action with the highest value in each state. One of the strengths of Q-learning is that it can compare the expected utility of the available actions without requiring a model of the environment. In addition, Q-learning can handle problems with stochastic transitions and rewards without requiring any adaptations. It has been proven that for any finite MDP, Q-learning eventually finds an optimal policy, in the sense that the expected value of the total reward over all successive steps, starting from the current state, is the maximum achievable [156]. Components related to Q-learning are the state-value function V^π(s) and the state-action value function Q^π(s, a); the latter is convenient for obtaining the optimal policy and can be determined from experience (as in Monte Carlo methods), so the best action can be inferred from Q^π(s, a). Q-learning is off-policy: the policy learned about need not be the same as the one used to select actions. In particular, Q-learning learns about the greedy policy while it typically follows a policy involving
exploratory actions, i.e. occasional selections of actions that are sub-optimal according to Q_t. Because of this, special care is required when introducing eligibility traces [156]. Suppose we are backing up the state-action pair (s_t, a_t) at time t, that on the following two time steps the agent selects the greedy action, but that on the third step, at time t+3, the agent selects an exploratory, non-greedy action. In learning about the value of the greedy policy at (s_t, a_t), we can only use subsequent experience as long as the greedy policy is being followed; thus we can use the 1-step and 2-step returns, but not, in this case, the 3-step return. The n-step returns for all n ≥ 3 no longer have any necessary relationship to the greedy policy [156]. Any policy can be used to estimate the Q-values that maximize future reward:

Q(s_t, a_t) = max R_{t+1}

Note that Q directly approximates Q* (the Bellman optimality equation) and is independent of the policy being followed; the only requirement is that every (s, a) pair keeps being updated. The value-iteration update of Q-learning is:

Q_{t+1}(s_t, a_t) = Q_t(s_t, a_t) + α (R_{t+1} + γ max_a Q_t(s_{t+1}, a) − Q_t(s_t, a_t))

As an example, consider again the cliff-walking example of the Sarsa algorithm (4.4.3) in Figure 4.12. Q-learning correctly learns the optimal path along the edge of the cliff, but falls off every now and then due to the ε-greedy action selection. Sarsa learns the safe path along the top row of the grid because it takes the action-selection method into account during learning. Because Sarsa learns the safe path, it receives a higher average reward per trial than Q-learning, even though it does not walk the optimal path. The reward per trial for both Sarsa and Q-learning is shown in Figure 4.13:
Figure 4.13: reward chart
[156]
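For comparison with the Sarsa sketch above, a minimal Q-learning sketch on the same hypothetical corridor world follows; the only difference is that the update uses the greedy target max_a Q(s′, a) instead of the value of the action actually chosen next:

```r
# Minimal sketch: Q-learning on the same hypothetical corridor world as the Sarsa sketch.
set.seed(1)
n_states <- 5; actions <- c(-1, +1)
Q <- matrix(0, nrow = n_states, ncol = length(actions))
alpha <- 0.1; gamma <- 1.0; eps <- 0.1

eps_greedy <- function(s) {
  if (runif(1) < eps) sample(seq_along(actions), 1) else which.max(Q[s, ])
}
step <- function(s, a) {
  s2 <- min(max(s + actions[a], 1), n_states)
  list(s2 = s2, r = -1, done = (s2 == n_states))
}

for (episode in 1:200) {
  s <- 1
  repeat {
    a   <- eps_greedy(s)                       # behavior policy (epsilon-greedy)
    out <- step(s, a)
    target <- if (out$done) out$r else out$r + gamma * max(Q[out$s2, ])  # greedy target
    Q[s, a] <- Q[s, a] + alpha * (target - Q[s, a])
    s <- out$s2
    if (out$done) break
  }
}
Q
```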
4.4.6 Deep Q-learning
Deep learning can automatically learn high-level features from a supervised signal, using networks with a large number of neurons and layers. Recent progress in reinforcement learning (RL) has successfully combined deep learning with value-function approximation by using a deep convolutional neural network to represent the action-value (Q) function [157]. The Q function is approximated as:

Q(s, a; θ) ≈ Q*(s, a)

[157] The approximator can be linear or non-linear; the Q-network is an example of non-linear deep Q-learning. A variant of reinforcement learning has been developed that is stable in practice in combination with a Q-network. Like Q-learning, it iteratively solves the Bellman equation by adjusting the parameters of the Q-network towards the Bellman target. The Bellman equation is:
Q(s, a) = r + γ max_{a′} Q(s′, a′)

[157] First, at each time step t of the agent's interaction with the environment, the experience tuple e_t = (s_t, a_t, r_t, s_{t+1}) is stored in a replay memory D_t = {e_1, ..., e_t}. Second, DQN maintains two separate Q-networks, Q(s, a; θ) and Q(s, a; θ⁻), with current parameters θ and old parameters θ⁻, respectively. The current parameters θ may be updated many times per time step and are copied into the old parameters θ⁻ after N iterations. At every update iteration i, the current parameters θ are updated so as to minimize the mean-squared Bellman error with respect to the old parameters θ⁻, by optimizing the loss function:

L_i = E[(r + γ max_{a′} Q(s′, a′; θ_i⁻) − Q(s, a; θ_i))²]

[157] i.e. the squared-error loss:

L = E[(r + γ max_{a′} Q(s′, a′) − Q(s, a))²]

[157]
5 Data Preparation

5.1 Implementing Machine Learning Algorithms

In order to apply machine learning algorithms, a procedure is needed to prepare, handle and manage the data features and attributes so that they fit the algorithm. This process consists of the following steps.
5.1.1 Define the problem
To implement a machine learning algorithm we should first decide on a predictive model. To this end the model criteria must be determined, including the inductive bias, which comes in two forms: restriction bias, which constrains the set of models from the data set that the algorithm will consider during learning, and preference bias, which makes the learning algorithm prefer certain models over others. Inductive bias can cause underfitting and overfitting problems: underfitting occurs when the model is too simple to represent the relationship between the descriptive features and the target values, while overfitting occurs when the prediction model is so complex that it also fits the noise. A model that strikes the balance between underfitting and overfitting is called a Goldilocks model. Moreover, it is important at this stage to recognize the different data sources and data types.
5.1.2 Prepare data
First, select the data: consider what type of data is available, what is missing, and which data can be removed from the database. The data must then be processed by
formatting, cleaning and sampling it, and by handling missing values. If the target size needs to be modified, sampling methods can be used. Other techniques for data preparation are binning (5.4.2) and normalization (5.4.1). After an appropriate analytics solution for the problem has been selected, the data structure must be designed, evaluated and extended. The analytics base table (ABT) is simply a table of rows and columns that represent the descriptive features and the target feature; it has to be constructed from the raw data, and each of its rows shows the feature values and the target of one instance [109].
5.1.3 Modeling
Models are created during the training procedure on training data that contains the target, i.e. the correct answer: the training algorithm finds patterns in the training data and uses them to map the input data to the target values. There are three main types of models. Binary classification models predict a binary outcome, for instance with logistic regression; an example problem is recognizing whether an e-mail is spam or not. Multiclass classification models generate predictions for multiple classes, for example with multiclass logistic regression; an example problem is determining whether a product is clothing, shoes or paper. Regression models predict a numeric value, for instance with linear regression; an example problem is predicting the price of a house [3].
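To make the three model types concrete, a minimal sketch in R using built-in data sets as stand-ins (mtcars for the binary and regression cases, iris for the multiclass case):

```r
# Minimal sketch: one example model per type, on built-in stand-in data sets.

# 1) binary classification: logistic regression (automatic vs. manual transmission)
bin_model <- glm(am ~ mpg + wt, data = mtcars, family = binomial)

# 2) multiclass classification: multinomial logistic regression on iris species
library(nnet)
multi_model <- multinom(Species ~ ., data = iris)

# 3) regression: linear regression predicting a numeric value (fuel consumption)
reg_model <- lm(mpg ~ wt + hp, data = mtcars)

predict(bin_model, type = "response")[1:3]   # predicted probabilities
predict(multi_model, head(iris))             # predicted classes
predict(reg_model, head(mtcars))             # predicted numeric values
```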
5.1.4 Evaluation
Before a prediction model is deployed, it is necessary to evaluate it and the tasks that led to it, to be sure that the model works accurately and does not suffer from problems such as overfitting and underfitting. Three questions must be considered: which model is best for the specific task, how well this model will perform, and how to convince the business for which the model is built that it will meet their needs. To measure the performance of a model, the hold-out test method can be used: a training set is generated from the original data set, a test set is constructed from the remaining samples that are not contained in the training set, and the
performance of the trained model is evaluated on this test set. This avoids the peeking issue, which arises when the performance of a model is evaluated on the same data that was used to train it. The misclassification rate is the simplest measure of model performance: the number of incorrect predictions divided by the total number of predictions made by the model. The confusion matrix is a useful tool for evaluating model performance; it reports the frequency of each possible outcome of the predictions made by a model on a test set, and thus shows the details of the model's performance. For binary data sets there are four outcomes: true positive (TP), an instance with a positive target value that is predicted positive; false positive (FP), an instance with a negative target value that is predicted positive; true negative (TN), an instance with a negative target value that is predicted negative; and false negative (FN), an instance with a positive target value that is predicted negative. Consequently, the misclassification rate is defined as [109]:

Misclassification rate = (FP + FN) / (TP + TN + FP + FN)

and the classification accuracy as:

Classification accuracy = (TP + TN) / (TP + TN + FP + FN)
[109] Confusion-matrix-based performance measures evaluate a predictive model from the raw confusion-matrix counts. The most basic measures are the true positive rate (TPR), true negative rate (TNR), false positive rate (FPR) and false negative rate (FNR), which convert the raw numbers of the confusion matrix into percentages:

TPR = TP / (TP + FN)
TNR = TN / (TN + FP)
FPR = FP / (TN + FP)
FNR = FN / (TP + FN)
[109] Precision and recall are then defined as:

Precision = TP / (TP + FP)
Recall = TP / (TP + FN)

The F1 measure is an alternative to the simpler misclassification rate and is the harmonic mean of precision and recall:

F1 = 2 · (Precision · Recall) / (Precision + Recall)
[109] Another technique for evaluating the performance of a model is k-fold cross-validation, in which k−1 folds are used as training data and the remaining fold is used to test the model; the result is recorded, and in the next iteration the second fold is chosen as the test set and the remaining k−1 folds as the training set. This process is repeated until k evaluations have been performed and k sets of performance measures recorded; finally, all k measures are aggregated to obtain an overall performance estimate. The leave-one-out method is an extreme form of k-fold cross-validation in which the number of folds equals the number of instances in the data set. Bootstrapping is another way of estimating model accuracy: m instances are chosen from the data set as a test set and the remaining data as the training set, and this process is repeated for k iterations; this method is suitable for small data sets. Out-of-time sampling is a method for large data sets in which the samples for the test set are drawn from one period of time and the samples for the training set from another period [109].
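The confusion-matrix measures above can be computed in a few lines of base R; the actual and predicted labels below are hypothetical:

```r
# Minimal sketch: confusion matrix and derived measures for a binary problem.
actual    <- factor(c("pos","pos","neg","neg","pos","neg","neg","pos"), levels = c("pos","neg"))
predicted <- factor(c("pos","neg","neg","neg","pos","pos","neg","pos"), levels = c("pos","neg"))

cm <- table(Predicted = predicted, Actual = actual)
TP <- cm["pos","pos"]; FP <- cm["pos","neg"]
FN <- cm["neg","pos"]; TN <- cm["neg","neg"]

misclassification <- (FP + FN) / sum(cm)
accuracy  <- (TP + TN) / sum(cm)
precision <- TP / (TP + FP)
recall    <- TP / (TP + FN)
f1        <- 2 * precision * recall / (precision + recall)

c(misclassification = misclassification, accuracy = accuracy,
  precision = precision, recall = recall, F1 = f1)
```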
5.1.5 Deployment
The evaluation of a model assumes that the patterns learned from the training set will still apply to the unseen samples presented to the model in the future. An ongoing model-validation process is therefore necessary to check whether this assumption remains valid. To monitor the ongoing performance of a model, a signal that indicates change is needed, which
can be extracted from the performance of the model, the distribution of the outputs of the model, and the distribution of the descriptive features in the query samples presented to the model. One way of detecting changes in the performance measures is to measure the performance of the model before deployment, measure it again after deployment, and compare the two. Changes in the distribution of the model outputs likewise indicate whether the model remains valid: the distribution of the model's outputs is measured on the original test set during evaluation, and the measurement is repeated on new sets of samples once the model has been deployed [109].
Assessing Feasibility
The first important aspect is the key objects, the second the connections between the key objects in the data model, and then the distribution of the data and the time and volume in which the data is presented. The data analyst must also be able to implement the proposed solution. After that, the design of the analytics base table (ABT) begins. For designing and implementing features, three keys are considered; as previously mentioned, these are data availability, the timing with which data becomes available, and the life duration of the features [109]. Note that there are six general data types, as shown in the following table:

Numeric       True numeric values that allow arithmetic operations
Interval      Values that allow ordering and subtraction
Ordinal       Values that allow ordering but do not permit arithmetic
Categorical   A finite set of values that cannot be ordered and allow no arithmetic
Binary        A set of just two values
Textual       Free-form, usually short, text data

Table 5.1: General Data Types
These data types can be grouped into two categories: continuous data (the numeric and interval types) and categorical data (the ordinal, binary, categorical and textual types). In an ABT there are two kinds of features: raw features, which are obtained directly from the raw data, and derived features, which are computed from descriptive features and are not present in the raw data set. Derived features must therefore
be constructed from data in one or more raw data sources. There are various types of derived features. Aggregates are defined over a group as a sum, count, average, minimum or maximum. Flags are binary features that indicate whether some data is present or absent in the database. Ratios are continuous features that capture the relationship between two or more raw data values; for example, in a mobile phone scenario, three ratio features might describe the mix of voice, data and SMS services a customer uses. Mappings convert continuous features into categorical ones and are used to reduce the number of unique values a model has to deal with. In addition, there is no limit to the ways in which data can be combined to derive features. Handling time also matters: in some prediction models there is a time element that must be taken into account. Two key periods are distinguished, the observation period, during which the descriptive features are observed, and the outcome period, over which the target is calculated. In some problems the observation and outcome periods are the same for all prediction subjects; in others they are defined relative to different dates for different subjects. There are also cases in which the descriptive features depend on time but the target does not. Furthermore, a practitioner may decide, on the basis of domain rules, not to use certain features. A second important point is the use of personal data: for the design of the ABT, three issues apply, namely the collection limitation principle, the purpose specification principle and the use limitation principle [109].
Implementing Features
Once the design of the ABT is complete, techniques must be implemented to extract, create and combine the features in the ABT. Key data manipulation operations include joining data sources, filtering rows in a data source, filtering fields in a data source, deriving new features by combining or transforming existing features, and merging data sources. Data manipulation is performed by database management systems, by data management tools that help to locate, share and reuse data (such as NWIS MAPPER), or by data manipulation tools and programming environments that organize data so that it is easy to read and use; together these steps are referred to as an extract-transform-load (ETL) process [109].
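A minimal sketch of these manipulations in base R follows; the two raw tables (customers and transactions) and the derived features are hypothetical:

```r
# Minimal sketch: typical ABT-building manipulations on two hypothetical raw tables.
customers    <- data.frame(id = 1:3, age = c(25, 40, 31))
transactions <- data.frame(id = c(1, 1, 2, 3, 3, 3),
                           amount = c(10, 20, 5, 7, 3, 15))

# join data sources
joined <- merge(customers, transactions, by = "id")

# filter rows and derive an aggregate feature (total spend per customer)
big    <- subset(joined, amount > 5)
totals <- aggregate(amount ~ id, data = joined, FUN = sum)

# merge back and derive a flag feature for the ABT
abt <- merge(customers, totals, by = "id", all.x = TRUE)
abt$high_spender <- abt$amount > 20      # derived binary flag
abt
```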
Data Exploration
Data exploration is another concept related to data preparation and data understanding. It has two main goals: first, to understand the characteristics of the data, such as the types, ranges and distributions of the features; and second, to determine whether the data suffers from any data quality issues. Typical data quality issues are missing values for one or more descriptive features, an instance with an implausibly high value for a feature, or an instance with an inappropriate level for a categorical feature. Some data quality issues arise from invalid data and should be corrected as soon as possible; others arise from valid data but still cause difficulties for some learning algorithms [109].
Data Quality Report
The data quality report is the most important tool for data exploration. It consists of tabular reports, one for continuous (numeric) features and one for categorical features, that describe the characteristics of each feature using statistical measures of central tendency (mean, mode, median) and of variation (standard deviation, percentiles), accompanied by standard data visualization plots such as bar plots, histograms and box plots. The table describing continuous features should contain, for each feature, the minimum, 1st quartile, mean, median, 3rd quartile, maximum and standard deviation, together with the number of instances in the ABT, the percentage of instances in the ABT that are missing a value for that feature, and the cardinality of the feature, i.e. the number of distinct values present in the ABT for that feature [109]. The table describing categorical features should include, for each feature, the two most frequent levels (the mode and the 2nd mode) and their frequencies, both as raw frequencies and as proportions of the total number of instances in the data set; each row should also include the percentage of instances in the ABT that are missing a value for the feature and the cardinality of the feature. The report is accompanied by a histogram for each continuous feature (or a bar plot if its cardinality is less than 10) and a bar plot for each categorical feature. To get to know the data, the data quality report is studied by examining the central tendency, the variation and the types of value of each feature [109]. From the histograms we can infer the
possible probability distributions of the data. A uniform distribution indicates that a feature is equally likely to take a value in any of the ranges present. A normal (Gaussian) distribution indicates that the feature values have a strong tendency towards a central value with symmetric variation; such a histogram is unimodal, since it has a single peak around the central tendency. A multimodal distribution indicates that a feature has two or more very commonly occurring ranges of values that are clearly separated; a bimodal distribution can be thought of as two normal distributions pushed together. Standard probability distributions are associated with probability density functions that express the characteristics of the distribution. The second goal of data exploration is to identify data quality issues, the most common of which are missing values, irregular cardinality and outliers. They fall into two groups: data quality issues due to invalid data, which arise in the process of generating the ABT, typically when calculating derived features, and are corrected by regenerating the ABT and recreating the data quality report; and data quality issues due to valid data, which usually arise for domain-specific reasons and do not necessarily require any corrective action [109].
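A minimal data-quality report for the continuous features of a data frame can be sketched in base R; the iris data set serves only as a stand-in:

```r
# Minimal sketch: data quality report for continuous features (iris as stand-in).
continuous <- iris[, sapply(iris, is.numeric)]

quality_report <- data.frame(
  feature     = names(continuous),
  count       = nrow(continuous),
  miss_pct    = sapply(continuous, function(x) 100 * mean(is.na(x))),
  cardinality = sapply(continuous, function(x) length(unique(x))),
  min         = sapply(continuous, min, na.rm = TRUE),
  q1          = sapply(continuous, quantile, probs = 0.25, na.rm = TRUE),
  mean        = sapply(continuous, mean, na.rm = TRUE),
  median      = sapply(continuous, median, na.rm = TRUE),
  q3          = sapply(continuous, quantile, probs = 0.75, na.rm = TRUE),
  max         = sapply(continuous, max, na.rm = TRUE),
  sd          = sapply(continuous, sd, na.rm = TRUE)
)
quality_report
```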
Missing values
Missing values become apparent while building the ABT; the percentage of missing values per feature is recorded in the corresponding column of the data quality report. They can be caused by data integration problems, by fields that were never filled, or by business rules. If a feature has a very high percentage of missing values (a rule of thumb is anything above 60%) and therefore carries little information, it is better to remove that feature from the ABT [109].
Irregular cardinality
In the data quality report, the cardinality column shows the number of distinct values of each feature in the ABT. Irregular cardinality occurs when the cardinality of a feature does not match our expectations. First, we must check for any feature
with a cardinality of 1, which implies that the feature has the same value for every instance and therefore contains no useful information for building a predictive model. We must make sure of the cause: if the cardinality of 1 is due to an ABT generation error, the error should be corrected; otherwise the feature should be removed from the ABT, because it will not be of any value in building predictive models. Second, check for categorical features that are wrongly labeled as continuous: the cardinality of a genuinely continuous feature is close to the number of instances in the data set, and if it is significantly lower, it should be examined whether the feature is really continuous or in fact categorical. Third, irregular cardinality can also be found when a categorical feature has a much higher cardinality than expected; in this case the ABT must be regenerated. Finally, a categorical feature with a very large number of levels (anything above 50 should be considered) is also irregular; because some machine learning algorithms struggle with such features, this should be noted in the data quality plan [109].
Outliers
Outliers are values that differ strongly from the rest of the data. There are two types: invalid outliers, which are noise in the data introduced by some error, and valid outliers, which are correct values that simply differ from the other data in the data set. There are two main ways of recognizing outliers in a data quality report. The first is to examine the minimum and maximum values of each feature and judge whether they are plausible. The second is to compare the median, minimum, maximum, 1st quartile and 3rd quartile values: if the gap between the 3rd quartile and the maximum value is significantly larger than the gap between the median and the 3rd quartile, the maximum value is unusual and likely an outlier; similarly, if the gap between the minimum value and the 1st quartile is much larger than the gap between the 1st quartile and the median, the minimum is likely an outlier. The outliers shown in box plots can be used to make this comparison. Exponential or skewed distributions are often used to describe features containing outliers.
5.2 Handling data quality issues
The most common issues are missing values and outliers, which are due to noise in the data. Because different predictive models tolerate different levels of noise, noise handling is necessary, and it is a good idea to choose the techniques for handling these issues already during data exploration [109]. For handling noise in a classification data set, noise is categorized into two groups. Label noise (class noise) occurs when an example is wrongly labeled; it appears in cases such as data entry errors, insufficiency of the information used to label the data, or subjectivity in the labeling process. Two types of label error are contradictory examples (duplicate instances with different class labels) and misclassifications (instances labeled with a class that differs from their real-world class). The second category is attribute noise, which arises from erroneous attribute values, missing or unknown attribute values, and incomplete attributes. Filtering methods are used to handle such noise. One approach is to adapt the learning algorithm: robust learners are characterized by being less influenced by noisy data. Another solution is to remove noisy samples by identifying mislabeled training data. Edited nearest neighbor is a technique that removes an instance if it does not agree with the majority of its neighbors, and All-kNN removes all nearest neighbors of a misclassified instance. Relative Neighborhood Graph Editing is also used to handle noise: it constructs an undirected proximity graph of the data set, the graph neighbors of a data point form its graph neighborhood, and instances misclassified by their graph neighborhood are removed from the data set [159].
5.2.1 Handling Missing Values
There are several approaches to handling missing values. One is to omit features containing missing values from the ABT. It is also possible to derive missing-indicator features from them: missing values can then be flagged by binary features, and when missing-indicator features are used, the original feature is usually deleted. Another approach is complete case analysis, which deletes any instance with missing values from the
ABT. Moreover, imputation replaces missing values with new values derived from the observed feature values, typically with a summary measure: for continuous features the most common measures are the mean or median, for categorical features the mode. However, imputation is not suitable when a large proportion of values is missing, because it can change the central tendency of a feature too much. There are more complex imputation approaches, but it is recommended to try the simple ones first and only move to complex ones if needed. It should be kept in mind that all imputation techniques suffer from the problem that they change the data in the ABT and can therefore distort the apparent relationship between a descriptive feature and the target feature. Other techniques for handling missing values include inference-based methods such as the Bayes formula or decision trees, identifying relationships among values with linear regression, multiple linear regression or non-linear regression, and nearest neighbor estimators, which find the k nearest neighbors of the instance and fill in the average or the most frequent value [109].
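A minimal sketch of mean and mode imputation with missing-indicator features in base R; the data frame below is hypothetical:

```r
# Minimal sketch: mean imputation for a continuous feature, mode imputation for a
# categorical feature, plus missing-indicator flags. Hypothetical data.
df <- data.frame(income  = c(50, NA, 70, 65, NA),
                 segment = factor(c("a", "b", NA, "b", "b")))

# missing-indicator (flag) features
df$income_missing  <- is.na(df$income)
df$segment_missing <- is.na(df$segment)

# mean imputation for the continuous feature
df$income[is.na(df$income)] <- mean(df$income, na.rm = TRUE)

# mode imputation for the categorical feature
mode_level <- names(which.max(table(df$segment)))
df$segment[is.na(df$segment)] <- mode_level
df
```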
Handling Outliers
The clamp transformation is the easiest way to handle outliers: it clamps all values above an upper threshold and below a lower threshold to those threshold values, thus removing the outliers. The thresholds are computed from the data. One common way is to set the lower threshold to the first quartile minus 1.5 times the inter-quartile range and the upper threshold to the third quartile plus 1.5 times the inter-quartile range; this works well even when the variation in a data set differs on the two sides of the center. Another common way is to take the mean of the feature and subtract or add two times the standard deviation; this assumes that the underlying data follows a normal distribution. It is recommended to use the clamp transformation whenever outliers are likely to harm the model.
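A minimal R sketch of both threshold choices for the clamp transformation, assuming a numeric feature vector x:

clamp_iqr <- function(x) {
  q1 <- quantile(x, 0.25, na.rm = TRUE)
  q3 <- quantile(x, 0.75, na.rm = TRUE)
  iqr <- q3 - q1
  pmin(pmax(x, q1 - 1.5 * iqr), q3 + 1.5 * iqr)   # clamp to the IQR-based thresholds
}

clamp_sd <- function(x) {
  m <- mean(x, na.rm = TRUE)
  s <- sd(x, na.rm = TRUE)
  pmin(pmax(x, m - 2 * s), m + 2 * s)             # assumes roughly normally distributed data
}

clamp_iqr(c(1, 2, 3, 4, 100))                     # 100 is pulled down to the upper threshold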
To examine the relationship between pairs of features, two kinds of visualization are useful: visualizing relationships between features, which shows which descriptive features are useful for predicting the target and which pairs of descriptive features are closely related, and visualizing pairs of continuous features, which helps to indicate the relationship between feature pairs and can lead to a smaller ABT. A scatter plot places one feature on the horizontal axis and a second feature on the vertical axis; every point shows the values of the two plotted features for one instance. A scatter plot matrix shows scatter plots for an entire set of features and is used to explore the relationships between groups of features; it is a quick way to see the relationships within a whole set of continuous features. When the total number of features exceeds about eight, a scatter plot matrix becomes hard to read, and interactive tools can be used instead [109].
Data Pre-processing
Data pre-processing is an important subject in the Machine Learning and Data Mining fields. It is used for handling issues such as inconsistent data combinations and missing values. Analyzing data that has not been carefully screened for such problems can produce misleading results, so the quality of the data must be assessed before analysis. Data pre-processing includes cleaning, instance selection, normalization, transformation, and feature extraction and selection; its product is the final training set. Kotsiantis et al. (2006) present well-known algorithms for each step of data pre-processing [160].
Handling imbalanced data sets
High class imbalance appears in real-world data sets whenever an important but rare class has to be detected. The problem arises, for example, in information retrieval, filtering tasks, text classification and learning word pronunciations. Solutions have been proposed both at the data level and at the algorithm level. Data-level solutions are re-sampling techniques: random over-sampling with replacement, random under-sampling, directed over-sampling (in which no new examples are created, but the choice of the samples to replicate is informed rather than random), directed under-sampling (in which the examples to delete are chosen in an informed way), over-sampling with newly generated samples, and combinations of these methods.
At the algorithm level, solutions include modifying the costs of the different classes, modifying the probabilistic estimates at the tree leaves for decision trees, and modifying the decision threshold. Mixture-of-experts approaches, which combine the results of several classifiers, are also used to address class imbalance. Random under-sampling is a non-heuristic method that adjusts the class distribution by eliminating examples of the majority class, whereas random over-sampling is a non-heuristic method that adjusts the class distribution through random replication of examples of the minority class [161].
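A minimal R sketch of random under- and over-sampling, assuming a data frame df and the name of its class label column; both the function and argument names are illustrative:

rebalance <- function(df, label, method = c("under", "over")) {
  method <- match.arg(method)
  groups <- split(seq_len(nrow(df)), df[[label]])            # row indices per class
  target <- if (method == "under") min(lengths(groups)) else max(lengths(groups))
  idx <- unlist(lapply(groups, function(i)
    sample(i, target, replace = length(i) < target)))        # replicate rows only when over-sampling
  df[idx, , drop = FALSE]
}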
5.3 Visualizing pairs of categorical features
A collection of bar plots, known as a small multiples visualization, is the simplest way to visualize the relationship between two categorical features. First, a bar plot is drawn showing the densities of the different levels of the first feature. Then, for each level of the second feature, a bar plot of the first feature is drawn using only the instances for which the second feature has that level. If the two features have a strong relationship, the bar plots for the different levels of the second feature will look considerably different from one another and from the overall bar plot of the first feature. If there is no relationship, the levels of the first feature will be evenly distributed among the instances having the different levels of the second feature, so all bar plots will look much the same. A stacked bar plot is an alternative to small multiples when the number of levels of one of the compared features is small [109].
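A minimal plotting sketch with ggplot2 (which is also loaded in the R listing later in this thesis), on a small synthetic data frame that stands in for two categorical features:

library(ggplot2)

df <- data.frame(first  = sample(c("A", "B", "C"), 200, replace = TRUE),
                 second = sample(c("low", "high"), 200, replace = TRUE))

ggplot(df, aes(x = first)) + geom_bar() + facet_wrap(~ second)   # small multiples bar plots
ggplot(df, aes(x = second, fill = first)) + geom_bar()           # stacked bar plot alternative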
5.3.1 Visualizing categorical and continuous features
The best way to visualize the relationship between a categorical and a continuous feature is to draw a histogram of the continuous values for each level of the categorical feature. If the features are unrelated, the histograms of the different levels look very similar; if they are related, the shapes of the histograms differ. An alternative is to draw a box plot of the continuous feature for each level of the categorical feature. This yields a set of box plots showing how the continuous feature changes across the levels of the categorical feature.
When there is no relationship between the features, the box plots look the same across the levels. Histograms show more detail about the relationship between a categorical and a continuous feature, but box plots make it simpler to see differences in central tendency and variation [109].
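A minimal sketch of both visualizations on the Iris data, which is also analyzed later in this work; Species is the categorical feature and Petal.Length the continuous one:

library(ggplot2)

ggplot(iris, aes(x = Petal.Length)) + geom_histogram(bins = 20) +
  facet_wrap(~ Species)                                    # one histogram per categorical level
ggplot(iris, aes(x = Species, y = Petal.Length)) +
  geom_boxplot()                                           # one box plot per categorical level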
5.4 Measuring Co-variance and Correlation
The relationship between two continuous features can also be quantified with co-variance and correlation, for instance the sample co-variance between features a and b. Most ABTs contain multiple continuous features whose relationships are of interest, and two useful tools for this are the co-variance matrix and the correlation matrix [109]. A co-variance matrix contains one row and one column for each feature, and every element gives the co-variance between the corresponding pair of features; the elements along the main diagonal therefore give the co-variance between a feature and itself, i.e. its variance. The correlation matrix is the normalized form of the co-variance matrix and describes the correlation between each pair of features. A scatter plot matrix can be used to visualize a correlation matrix. Correlation can be linear or non-linear: a linear relationship exists when an increase or decrease in one feature is accompanied by a corresponding increase or decrease in the other, while a non-linear relationship does not follow such a proportional pattern. The most important point about correlation is that it does not necessarily demonstrate causation; ignoring an important but hidden third feature is a common mistake that leads to incorrectly inferring causation between two features. During data preparation, instead of handling noise directly, several transformation and sampling techniques can be applied, namely normalization, binning and sampling [109].
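A minimal R sketch on the continuous features of the Iris data:

num <- iris[, 1:4]              # the four continuous features

cov(num)                        # co-variance matrix; the diagonal holds the variances
cor(num)                        # correlation matrix (normalized co-variance)
cor(num, method = "spearman")   # rank correlation, useful for non-linear monotone relationships
pairs(num)                      # scatter plot matrix for visual inspection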
5.4.1 Normalization
Continuous features in an ABT that cover very different ranges can cause difficulty for some Machine Learning algorithms. Normalization changes the values into a specific range while keeping the relative differences between them.
The simplest normalization is range normalization, which maps the original values of a continuous feature into a specified range by linear scaling:

a'_i = ((a_i - min(a)) / (max(a) - min(a))) * (high - low) + low [109]

where a'_i is the normalized feature value and a_i the original feature value. Another way to normalize data is to standardize it into standard scores, which measure how many standard deviations a feature value lies from the mean of that feature. To compute it, the mean and standard deviation of the feature are required, and the values are normalized with the equation

a'_i = (a_i - a_mean) / sd(a)

where a'_i is the normalized feature value, a_i the original value, a_mean the mean and sd(a) the standard deviation of a [109].
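A minimal R sketch of both normalizations defined above, for a numeric feature vector a:

range_normalize <- function(a, low = 0, high = 1) {
  (a - min(a)) / (max(a) - min(a)) * (high - low) + low   # linear scaling into [low, high]
}

standardize <- function(a) {
  (a - mean(a)) / sd(a)                                   # standard scores: mean 0, standard deviation 1
}

a <- c(2, 5, 9, 14, 20)
range_normalize(a)
standardize(a)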
5.4.2 Binning
Binning converts a continuous feature into a categorical one. To apply binning, a number of ranges, called bins, is defined; the continuous values are assigned to these bins, and the continuous feature in the data set is replaced by the resulting categorical feature. Two popular methods are equal-width binning, in which all bins have the same width, and equal-frequency binning, in which all bins contain roughly the same number of instances. For both methods the number of bins must be chosen, which can be difficult: using few bins loses a lot of information, while using more bins preserves more of the important characteristics of the original continuous feature. Equal-width binning can produce nearly empty bins in regions with few instances, so the result is most useful where instances are dense or clustered. For equal-frequency binning, the continuous values are sorted in ascending order and placed into bins of equal size, after which the continuous feature is removed and the categorical feature is used instead. Histograms are commonly used to display the bins [109].
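A minimal R sketch of both binning methods using cut(), with four bins chosen arbitrarily:

x <- c(1.2, 2.5, 2.9, 3.7, 5.1, 6.4, 8.0, 9.3, 9.9, 12.4)

equal_width <- cut(x, breaks = 4)                      # four bins of equal width
equal_freq  <- cut(x, breaks = quantile(x, probs = seq(0, 1, 0.25)),
                   include.lowest = TRUE)              # roughly the same number of values per bin

table(equal_width)
table(equal_freq)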
5.4.3 Sampling
When the data set is very large and not all instances are needed in the ABT, a sample data set can be created from the original; this is called sampling. Top sampling simply takes instances from the top of the data set, but it risks bias because the distribution of the sample may differ from the distribution of the original data, so it is not a suitable technique. Random sampling is a better method: instances are chosen at random to build the new sample, so the data distribution in the sample matches the distribution of the original set [109]. Stratified sampling is a sampling method that ensures that the relative frequencies of the levels of a particular stratification feature are preserved in the sampled data set. To generate it, the data set is first split into groups, one per level; samples are then drawn from each group and merged into the final sample set, where the number of samples per level may differ. Another method is over-sampling or under-sampling, which is useful when a sample is required in which the levels of a particular categorical feature are equally represented, regardless of the original distribution. Over-sampling addresses the same issue as under-sampling, but in the opposite direction: after splitting the data set into groups, the number of instances in the largest group becomes the target size, and samples are drawn from each group accordingly. To generate a sample that is larger than the group being sampled from, random sampling with replacement is used, so an instance may appear more than once in the sampled data set. Sampling is thus useful for reducing the size of the ABT for simpler exploratory analysis, for modifying the distribution of the target feature in an ABT, and for creating different portions of an ABT for training and evaluation [109]. As an example of data preparation, two different data sets are examined in Chapter 7.
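A minimal R sketch of the sampling variants on the Iris data, with Species as the stratification feature and arbitrary sample sizes:

set.seed(1)

random_sample <- iris[sample(nrow(iris), 30), ]                  # simple random sample

stratified <- do.call(rbind, lapply(split(iris, iris$Species),   # equal number of rows per level
                                    function(g) g[sample(nrow(g), 10), ]))

boot_sample <- iris[sample(nrow(iris), 200, replace = TRUE), ]   # larger than the source, so with replacement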
6 Software Tools for Machine Learning
6.1 Machine Learning Implementation Softwares
For implementing a model, there are two main choices: application-based solutions or programming languages [109]. Useful application-based platforms include RapidMiner Studio, Weka, Azure Machine Learning Studio, SAS Enterprise Miner, KNIME Analytics Platform, IBM SPSS, R, OpenNN, Mahout, Mallet, Orange, MXNet, ELKI, MLPACK and SMILE. For predictive data analytics, the two most common programming languages are R and Python. The advantage of a programming language for designing predictive models is its flexibility: everything the programmer requires can be implemented. The disadvantages are that it takes time and effort to learn and that there is very little infrastructural support, for instance for data management [109].
6.1.1 Azure Studio software
Microsoft Azure Machine Learning Studio is a collaborative, drag-and-drop tool for building, testing and deploying predictive analytics solutions on data [162]. Machine Learning Studio publishes models as web services that can easily be consumed by applications or tools such as Excel [162]. Azure Machine Learning Studio is where data science, predictive analytics, cloud resources and data come together. It provides an interactive, visual workspace to easily build, test and iterate on a predictive analysis model: data sets and analysis modules are dragged and dropped onto an interactive canvas and connected together to form an experiment that runs machine
learning algorithms. The experiment can be edited, saved as a copy and re-run in Machine Learning Studio, and the model can be published on the web so that others can access it. No programming is required; data sets and modules are connected visually to construct a predictive analysis model [162]. An overview diagram gives a high-level picture of how Machine Learning Studio can be used to develop a predictive analytics model and operate it in the Azure cloud. The studio offers a large variety of Machine Learning algorithms, along with modules that help with data input, output, preparation and visualization. An experiment consists of data sets that provide data to analytical modules, which are connected together to construct a predictive analysis model. Specifically, a valid experiment has these characteristics: it has at least one data set and one module [162]; data sets may be connected only to modules; modules may be connected to either data sets or other modules; all input ports of the modules must have some connection to the data flow; and all required parameters of each module must be set. Some important concepts of this software are explained below.
Data Sets: A data set is data that has been uploaded to Machine Learning Studio so that it can be used in the modeling process. Some data sets are included with Machine Learning Studio for experimentation, and more data sets can be uploaded as needed [162].
Modules: A module is an algorithm that is performed on data. Machine Learning Studio offers a range of modules, from data ingress functions to training, scoring and validation processes [162]. Examples of such modules are: Convert to ARFF, which converts a .NET serialized data set to Attribute-Relation File Format (ARFF); Compute Elementary Statistics, which calculates statistics such as mean and standard deviation; Linear Regression, which creates an online gradient-descent-based linear regression model; and Score Model, which scores a trained classification or regression model. A module may have a set of parameters that configure its internal algorithm. When a module is selected on the canvas, its parameters are displayed in the Properties pane to the right of the canvas, where they can be modified to tune the model [162].
Deploying a predictive analytics web service is another capability of Azure Studio [162]. The Machine Learning API service enables users to deploy predictive models, such as those built in Machine Learning Studio, as scalable, fault-tolerant web services. The web services created by the Machine Learning API service are REST APIs that provide an interface for communication between external applications and the predictive analytics models [162]. Azure Machine Learning offers two types of web services. The first is the Request-Response Service (RRS), a low-latency, highly scalable service that provides an interface to the stateless models created and deployed with Machine Learning Studio [162]. The second is the Batch Execution Service (BES), an asynchronous service that scores a batch of data records. There are several ways to consume the REST API and access the web service; for example, an application can be written in C#, R or Python using the sample code generated when the web service is deployed, and a Microsoft Excel workbook is available in the web service dashboard in Machine Learning Studio. Microsoft Azure supports the .NET, Python, R, SQL, Node.js and PHP languages [162]. The following data types can expand to larger data sets during feature normalization and are therefore limited to less than 10 GB: sparse, categorical, string and binary data. Supported file types include .pdf, .vsdx, .vsdm, .vssm, .vsd, .vdw, .vst, .mpp, .mpt, .pub, .xls, .xlt, .doc, .dot, .ppt, .pps, .pot, .xps, .oxps, .jpg, .jpe, .jif, .jfif, .jfi, .png, .tif, .tiff, .sldprt, .slddrw, .sldasm, .dwfx, .psd and .dng; text and image files as well as Microsoft Word, Excel and PowerPoint files are protected file formats. Furthermore, some modules are limited to data sets of less than 10 GB, such as the Recommender modules, the Synthetic Minority Oversampling Technique (SMOTE) module, the scripting modules (R, Python, SQL), modules whose output can be larger than their input, such as Join or Feature Hashing, and Cross-Validation, Tune Model Hyperparameters, Ordinal Regression and One-vs-All Multiclass when the number of iterations is very large [162].
6.1.2 R Language
R is a programming language and software environment for statistical computing and graphics. It is open source
and released under the GNU General Public License [163]. It can be run on platforms such as Windows, Mac and Unix/Linux, and supports multi-core, client-server and distributed computing. Most of the functionality of R is provided through built-in and user-created functions, and all data objects are kept in memory during an interactive session. R can be extended by creating R packages, writing R documentation files, tidying and profiling R code, interfacing functions to C and FORTRAN, and adding new generics. In comparison with RapidMiner, R ranks second in popularity, while supporting more input and output formats and more visualization types. In addition, R is open source, and R scripts can be used inside cloud computing tools such as Azure Studio and RapidMiner Studio [163].
6.1.3 RapidMiner Studio
RapidMiner Studio is another tool that offers cloud computing for applying Machine Learning algorithms to very large databases, and it can display a variety of charts about the output. Compared with Azure Studio, RapidMiner has better graphics and more options for building charts, bars and reports [78].
Data Management: RapidMiner supports the following data types [78]:

Attributes - parent of all possible types
Binomial - exactly two values (for example true/false or yes/no)
Date - date without time (for example 23.12.2016)
Date-time - both date and time (for example 23.12.2016)
File-path - nominal data type (rarely used)
Integer - a whole number (for example 23, -5, or 11,024,768)
Nominal - all kinds of text values; includes polynomial and binomial
Numeric - all kinds of number values; includes date, time, integer and real numbers
Polynomial - many different string values (for example red, green, blue, yellow)
Real - a fractional number (for example 11.23 or -0.0001)
Text - nominal data type that allows for more granular distinction
Time - time without date (for example 17:59)

Table 6.1: Data Types

Data formats can be categorical, numerical or time-series [78].
Visualization: RapidMiner provides visualization of arbitrary models with dimensionality reduction via a self-organizing map (SOM) of the data set and the given model. A self-organizing map (SOM), or self-organizing feature map (SOFM), is a type of artificial neural network trained with unsupervised learning to produce a low-dimensional (typically two-dimensional), discretized representation of the input space of the training samples, called a map. Because a SOM uses a neighborhood function to preserve the topological properties of the input space, it is suitable for visualizing low-dimensional views of high-dimensional data, akin to multidimensional scaling. RapidMiner offers various plot types as visualization schemes, such as bar charts, line plots, bubble plots, deviation plots, density plots, survey plots, pie charts, histograms, box plots and scatter plots [78].
Supported Algorithms: It supports different kinds of regression, classification and clustering algorithms, plus extensions such as Text Processing, Web Mining, the Weka Extension, Text Analysis by AYLIEN, the Series Extension, the Recommender Extension, RapidMiner Radoop and Python Scripting [78].
Supported Programming Languages and Extensions: It supports R and Python scripting, process control functions, optimization loops and branches, and provides data preparation and data cleaning approaches such as normalization and removal [78].
Deployment: It offers cloud computing, can connect to a server, and is also available as a desktop tool [78].
Costs: It is open source; cloud computing is free once, after which a subscription is required [78]. It is a fast and precise tool [78].
6.1.4 KNIME
The Konstanz Information Miner (KNIME) is also suitable for implementing Machine Learning algorithms, but it has some limitations: it requires more time and memory to process huge data sets due to the lack of built-in cloud computing, and it is not as graphical as RapidMiner Studio or Azure Studio. It has an extension called streaming with the following pros and cons:
Pros: less I/O overhead (process, pass & forget); parallelization.
Cons: no intermediate and interactive execution; not all nodes can be streamed.
Data Management: It supports data types such as Int, Boolean, String, Nominal, Double and Icon, as well as customized types, e.g. a DataCell that carries the representation of a molecule, and numerical, categorical, time-series and time-and-date types. It can handle file formats such as CSV, Excel, PMML, XML, R, PDF, Word, ARFF, XLS, MySQL, JSON and SDF.
Visualization: KNIME can display data in scatter plots, pie charts, box plots, histograms, tag clouds and network visualizations [164].
Supported Algorithms: It can run a wide range of Machine Learning, Data Mining and Deep Learning algorithms, including varieties of regression, classification and clustering algorithms [164].
Supported Programming Languages and Extensions: It offers a Big Data extension, can connect to Microsoft Access databases and supports Hive. KNIME Labs contains a Deep Learning integration, files in HDFS can be accessed, and multiple Apache Spark versions are supported. The programming languages supported by KNIME are JavaScript, R, Python and snippets.
Deployment: It is a desktop tool with a server for handling projects and offers cloud computing capability.
Costs: It is open source; a partner subscription costs €4,900 for the first year and €1,600 per year thereafter.
Performance: It has high performance and high speed [164].
6.1.5 File Formats Comparison
The file formats supported by each software are compared in the table below:
The table compares input and output support in R, RapidMiner, Azure and KNIME for the following formats: text files (ASCII, .dat), binary files, Excel spreadsheets and ODS (.CSV, .delim, .DIF), .CSV, SPSS, SAS (.ssd or .sas7bdat), Stata, EpiInfo (.REC), Minitab, S-PLUS, Systat (.sys, .syd), Octave, DBMS, ODBC (.dbf, .xls), DBF, XML, SAP, PDF/HTML, audio, WEKA, images, network connections (sockets), web pages, RSS feeds, and web-service/web-based reports.
Table 6.2: File Format
7 Examples for Machine Learning Application
7.1 Implementing examples
In this chapter, some Machine Learning algorithms are applied to two data sets, Forest Fires and Iris [165], and their results are compared using RapidMiner Studio and the R language.
7.2 Forest Fires Data Set
First, RapidMiner Studio is selected as the tool to apply the Support Vector Machine (Section 4.2.7) with a radial kernel function to the Forest Fires data set [166]. The aim of this regression task is to predict the burned area of forest fires in the northeast region of Portugal using meteorological and other data [166].
7.2.1 Data Set Explanation
In [Cortez and Morais, 2007], the output 'area' was first transformed with a ln(x + 1) function. Several Data Mining methods were then applied to fit the models, and the outputs were post-processed with the inverse of the ln(x + 1) transform. Four different input setups were tested, using a 10-fold cross-validation scheme repeated 30 times. The Mean Absolute Deviation (MAD) and Root Mean Squared Error (RMSE)
regression metrics were used. A Gaussian Support Vector Machine (SVM) fed with only four direct weather conditions (temperature, relative humidity, wind and rain) obtained the best MAD value of 12.71 ± 0.01 (mean and 95% confidence interval using a Student's t-distribution). The best RMSE was attained by the naive mean predictor. The Regression Error Characteristic (REC) curve shows that the SVM model predicts more examples within a lower admitted error; in effect, the SVM model is better at predicting small fires, which are the majority. Forest fires are one of the main environmental issues and can cause economic and ecological damage. Fast detection is the key element in dealing with this problem, and one alternative is to use automatic tools based on local sensors such as meteorological stations, since meteorological conditions such as wind and temperature affect forest fires and several fire indexes such as the Forest Fire Weather Index (FWI). In this paper, several data mining methods are examined to predict areas that are at risk of forest fires, such as Decision Trees (DT, Section 4.2.8), Random Forests (RF, Section 4.2.8), Neural Networks (NN, Section 4.2.3) and SVM (Section 4.2.7), together with four distinct feature selection setups using spatial, temporal, FWI components and weather attributes. SVM turned out to be the best method; using the four meteorological inputs temperature, wind, rain and humidity, it predicts the burned area of small fires. In related work, SVM achieved 75% accuracy at finding smoke at a 1.1 km pixel level; linear regression, RF and DT were used to detect fires in Slovenian forests using both satellite-based and meteorological data, where the best model was a bagging DT with an overall accuracy of 80%; and Neural Networks (NN) combined with infrared scanners reduced forest fire false alarms with 90% success [166].
Forest Fires Data Description
The Forest Fire Weather Index (FWI) is the Canadian system for rating fire danger and consists of six components, as shown in Figure 7.1.
Figure 7.1: Forest Fire Data Components
The first three components are related to the fuel codes: FFMC denotes the moisture content of surface litter and influences ignition and fire spread, while DMC and DC represent the moisture content of shallow and deep organic layers, which affect fire intensity. The ISI correlates with fire velocity spread, and higher index values generally suggest more severe burning conditions. The moisture codes require time lags of 16 hours for FFMC, 12 days for DMC and 52 days for DC. The forest fire data correspond to Montesinho Natural Park, in the Trás-os-Montes northeast region of Portugal. Located within a supra-Mediterranean climate, the park has an average annual temperature in the range of 8 to 12 °C. The Forest Fires data set was first collected from daily occurrences of fires in Montesinho, where several features were registered for every fire, such as time, date and spatial location within a 9×9 grid. A second database contains several weather observations recorded every 30 minutes by a meteorological station located in the center of the park. The two databases were stored in separate spreadsheets, under distinct formats, and an effort was made to integrate them into a single data set with a total of 517 entries.
The URL of this data set is: http://www.dsi.uminho.pt/~pcortez/forestfires/ [166]
Attributes Description
The attributes of the Forest Fires data set are described in [Cortez and Morais, 2007] and summarized in the table below:

X - x-axis spatial coordinate within the Montesinho park map: 1 to 9
Y - y-axis spatial coordinate within the Montesinho park map: 2 to 9
Month - month of the year: 'jan' to 'dec'
Day - day of the week: 'mon' to 'sun'
FFMC - FFMC index from the FWI system: 18.7 to 96.20
DMC - DMC index from the FWI system: 1.1 to 291.3
DC - DC index from the FWI system: 7.9 to 860.6
ISI - ISI index from the FWI system: 0.0 to 56.10
Temp - temperature in Celsius degrees: 2.2 to 33.30
RH - relative humidity in %: 15.0 to 100
Wind - wind speed in km/h: 0.40 to 9.40
Rain - outside rain in mm/m2: 0.0 to 6.4
Area - the burned area of the forest (in ha): 0.00 to 1090.84

Table 7.1: Attributes Description

(The output variable 'area' is very skewed towards 0.0, thus it may make sense to model it with a logarithm transform.) [166]
In this work, algorithms such as Decision Tree with Bagging (Section 4.2.8), Random Forest (Section 4.2.8), Decision Tree (CART) with pruning (Chapter 4), Neural Network (Section 4.3.4) and SVM with a radial basis function kernel (Section 4.2.7) are implemented and compared with each other. The R code corresponding to these models is listed in the source code chapter (Appendix A). Moreover, the SVM (radial basis function) model is implemented in RapidMiner Studio (Chapter 6) in order to illustrate how this tool performs and analyzes such tasks. Some rows of the Forest Fires data set are shown in Table 7.2:
X  Y  Month  Day  FFMC    DMC     DC       ISI    Temp    RH  Wind   Rain   Area
7  5  mar    fri  86.200  26.200  94.300   5.100  8.600   51  6.700  0      0
7  4  oct    tue  90.600  35.400  669.100  6.700  200     33  0.900  0      0
7  4  oct    sat  90.600  43.700  686.900  6.700  14.600  33  1.300  0      0
8  6  mar    fri  91.700  33.300  77.500   9      8.300   97  4      0.200  0
8  6  mar    sun  89.300  51.300  102.200  9.600  11.400  99  1.800  0      0

Table 7.2: Forest Fires Data Set
7.2.2 SVM Model
In this project, an SVM (Chapter 4) with a radial kernel (Section 4.2.7) is implemented, with the attribute area set as the label (role) of the SVM model. Cross-validation is then used to evaluate the model; it is executed on the basis of area, and the performance on the validation set is evaluated with respect to the rain attribute. The model uses only the meteorological features. First, it is necessary to look at the statistics of the data set in order to detect issues such as missing values and to normalize the data. The statistics of the Forest Fires data set are shown in Figure 7.2 and Figure 7.3.
Figure 7.2: Forest Fires Statistics
Figure 7.3: Forest Fires Statistics
These figures show that there are no missing values in the data set, so it is ready to use. The implementation of the SVM model on the Forest Fires data set is shown in Figure 7.4, which also indicates the SVM model performance.
Figure 7.4: SVM model
[167]
Parameters of the Model
The parameters of the SVM operator are defined as follows:
• Kernel type: the type of kernel function used in the model. The following types are supported:
• dot kernel: defined by k(x, y) = x · y
• radial kernel: defined by k(x, y) = exp(−g ‖x − y‖²), where g is the gamma parameter, which plays a major role in the performance of the kernel.
• polynomial kernel: defined by k(x, y) = (x · y + 1)^d, where d is the degree of the polynomial; it is suitable when all the training data is normalized.
• neural kernel: defined by a two-layer neural net tanh(a x · y + b), where a is alpha and b is the intercept constant. A common value for alpha is 1/N, where N is the data dimension.
• anova kernel: defined by the summation of exp(−g(x − y)) raised to the power d, where g is gamma and d is the degree.
• Epanechnikov kernel: defined as (3/4)(1 − u²) for u between -1 and 1 and zero for u outside that range. It has two adjustable parameters, kernel sigma 1 and kernel degree.
• Gaussian-combination kernel: it has the adjustable parameters kernel sigma 1, kernel sigma 2 and kernel sigma 3.
• multiquadric kernel: defined by the square root of ‖x − y‖² + c². It also has the adjustable parameters kernel sigma 1 and kernel sigma shift. [78]
In this work the kernel type is set to radial, because it gives less error and more accuracy on the Forest Fires data set than the other kernel types. The further parameters of the SVM operator are the following. Gamma greatly influences the performance of the model: if gamma is too large, the radius of the area of influence of the support vectors only includes the support vector itself, and no amount of regularization with C will be able to prevent overfitting. The kernel cache parameter specifies the size of the cache for kernel evaluations in megabytes. The C parameter is the SVM complexity constant, which sets the tolerance for misclassification: higher C values allow softer boundaries and lower C values create harder boundaries; a complexity constant that is too large can lead to over-fitting, while values that are too small may result in over-generalization. Convergence epsilon specifies when to stop the iterations by setting the precision on the KKT conditions, which express the equality constraints of the underlying non-linear optimization problem. Scale is a global parameter that determines whether the example values are scaled and the scaling is stored for application to a test set. L pos is part of the loss function and relates the SVM complexity constant to positive examples, while L neg is a factor for the SVM complexity constant for negative examples. Epsilon specifies the insensitivity constant: no loss is incurred if the prediction lies this close to the true value; it is part of the loss function, and a higher epsilon requires more memory and time but yields a more precise solution. Epsilon plus specifies epsilon for positive deviations only, and epsilon minus specifies epsilon
for negative deviations only; both are part of the loss function. Balanced cost adapts Cpos and Cneg to the relative sizes of the classes. Quadratic loss pos uses a quadratic loss for positive deviations (its range is Boolean; default: false), and quadratic loss neg uses a quadratic loss for negative deviations [78]. It should be mentioned that L pos, L neg, epsilon, epsilon plus and epsilon minus are parameters of the loss function, which the SVM tries to minimize in order to classify the data more accurately. For classification, the hinge loss is defined as

hinge loss = max(0, 1 − y_n (w^T x_n + b))

where w and b are the parameters of the hyperplane, x_n is the input, y_n is the true class label, and w^T x_n + b is the raw output of the classifier rather than the predicted class label [168].
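The RapidMiner setup itself is configured in the graphical interface, but a rough R equivalent with e1071 (the package loaded in the listing at the end of this chapter) can expose the same kinds of parameters; the file name and the chosen parameter values here are assumptions, not the tuned values of the experiment:

library(e1071)

fires <- read.csv("forestfires.csv")                 # hypothetical path to the data set

svm_fit <- svm(area ~ temp + RH + wind + rain,       # meteorological inputs only
               data    = fires,
               type    = "eps-regression",
               kernel  = "radial",
               gamma   = 0.1,                        # radial kernel width; too large risks over-fitting
               cost    = 1,                          # complexity constant C
               epsilon = 0.1,                        # insensitivity zone of the loss function
               scale   = TRUE)                       # scale features before training

pred <- predict(svm_fit, fires)
sqrt(mean((pred - fires$area)^2))                    # RMSE on the training data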
Validation
Secondly, the validation part of the model is examined, where the cross-validation method (Chapter 4) is applied to evaluate the accuracy of the model. It is shown in Figure 7.5.
Figure 7.5: SVM Validation
The operators used in this model are defined as follows. Validation operator: different parameters can be used, such as split on batch attribute, which uses the special attribute 'batch' to partition the data instead of splitting it randomly, and Leave One Out, which uses a single example from the original Example Set as the test data (in the testing sub-process) and the remaining examples as the training data (in the training sub-process) [78]. Number of folds: the number of subsets. Sampling type: cross-validation can use different types of sampling for generating the subsets. The options are: linear sampling, which splits the Example Set into partitions without changing the order of the examples, so subsets with consecutive examples are created; stratified sampling, which builds random subsets and ensures that the class distribution in the subsets is the same as in the whole Example Set (for example, in the case of a binominal classification, stratified sampling builds random subsets such that each subset contains roughly the same proportions of the two class label values); shuffled sampling, which builds random subsets of the Example Set; and automatic, which uses stratified sampling by default and falls back to shuffled sampling if there is no nominal label [78]. Set Role operator: it is used to change the role of attributes; here the attributes name and shift differential are dropped because standby-pay is also given the label role. The role of an attribute reflects the part played by that attribute in an Example Set, and the operator is used to set the right roles for the current process [78]. Nominal to Numerical operator: it changes the type of the selected non-numeric attributes to a numeric type [78]. In this work, the number of folds is set to 10 and the sampling type is set to shuffled, because the data type has been changed from nominal to numerical, so automatic and stratified sampling do not work [78].
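A rough R counterpart of this validation setup can be written with caret (also loaded in the R listing at the end of this chapter); the folds created by trainControl are assigned randomly, which corresponds to shuffled sampling, and the svmRadial method additionally requires the kernlab package. File name and formula are assumptions:

library(caret)

fires <- read.csv("forestfires.csv")                  # hypothetical path to the data set

ctrl <- trainControl(method = "cv", number = 10)      # 10 randomly assigned folds

cv_model <- train(area ~ temp + RH + wind + rain,
                  data       = fires,
                  method     = "svmRadial",           # radial-kernel SVM via kernlab
                  trControl  = ctrl,
                  preProcess = c("center", "scale"))

cv_model$results                                      # cross-validated RMSE per parameter setting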
SVM Performance
The performance of the SVM model is shown in Figure 7.6, which reports the RMSE (root mean squared error), the squared error and further performance measures. These indicate the accuracy of the SVM model and how well the model is fitted. RMSE is the square root of the mean of the squared errors; a lower RMSE value
implies that the predictions are close to the actual values, indicating better predictive accuracy. Spearman's rho is a non-parametric measure of rank correlation; it captures how well the relationship between two variables can be described by a monotonic function. If no data values are repeated, a perfect Spearman correlation of +1 or -1 occurs when each variable is a perfect monotone function of the other, and the correlation is high when observations have similar ranks. (A monotonic function between ordered sets is one that preserves or reverses the order.) Kendall's tau is a rank correlation measure used in statistics to quantify the ordinal association between two measured quantities; it is high when the values have similar ranks.
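A minimal R sketch of these performance measures, for two short hypothetical vectors of actual and predicted values:

actual    <- c(0.0, 1.2, 3.4, 0.5, 7.8)
predicted <- c(0.3, 1.0, 2.9, 0.9, 6.5)

rmse <- sqrt(mean((predicted - actual)^2))              # root mean squared error
mse  <- mean((predicted - actual)^2)                    # (mean) squared error
rho  <- cor(actual, predicted, method = "spearman")     # Spearman's rho (rank correlation)
tau  <- cor(actual, predicted, method = "kendall")      # Kendall's tau (ordinal association)

c(RMSE = rmse, squared_error = mse, spearman_rho = rho, kendall_tau = tau)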
Figure 7.6: SVM Performance
SVM scatter plot
To illustrate the result of the SVM model graphically, the scatter plot of the model is shown in Figure 7.7.
Figure 7.7: SVM scatter plot
The scatter plot shows the observations produced by the SVM model. The amount of rain is encoded by color: blue points indicate the areas with the lowest amount of rain, other points mark the highest amount of rain in the corresponding area, and green points refer to areas with a medium amount of rain. It must be considered that areas with a lower probability of rainfall are more at risk of fire than regions with a higher probability of rainfall.
SVM in R
The implementation of the SVM model (Chapter 4) in R is given below:

# SVM model with radial kernel:
library(caret)
library(e1071)
library(lattice)
library(ggplot2)

# loading data:
myfile = read.csv("path of file\\filename.csv")
head(myfile)
set.seed(300)

# check for missing values
anyNA(myfile)
summary(myfile)

# SVM model that trains the data based on the rain feature of the data set
svmmodel