Notes on Machine Learning Algorithms


ITV Applied Computing Group Sergio Viademonte, PhD.


October 2017

Roadmap
•  Artificial Intelligence
•  Machine Learning
•  Kinds of Problems
•  Types
•  ML Models
•  Linear Regression
•  Decision Trees → Classification
•  Association Rules → Association Pattern Mining, Classification
•  Bayesian Networks
•  Evaluation
•  Exercises


Machine Learning
Machine learning allows us to tackle problems (tasks) that have no exact algorithmic solution, e.g. recommendations, predictions, clustering. ML models tend to become better with more data.

[Diagram, after Flach [1]: domain objects are described by features to produce data; a learning algorithm uses training data to solve the learning problem and output a model; the model is applied to the task to produce the output (knowledge).]

Machine Learning Top Level Problems:

•  Clustering Given a data matrix D, partition its records into sets S1…Sn such that records in each cluster are similar to each other.


[2] Charu Aggarwal
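A minimal Python sketch (not part of the original notes), assuming scikit-learn is available, illustrating the clustering task: the rows of a data matrix D are partitioned into k sets so that records in the same cluster are close to each other.

```python
# Partition the records of a toy data matrix D into two clusters with k-means.
import numpy as np
from sklearn.cluster import KMeans

D = np.array([[1.0, 1.1], [0.9, 1.0], [8.0, 8.2], [7.9, 8.1]])   # toy data matrix

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(D)
print(kmeans.labels_)   # e.g. [0 0 1 1]: cluster assignment of each record
```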

Machine Learning Top Level Problems:

•  Classification Learning the structure of a dataset of examples that is already partitioned into groups, referred to as categories or classes.


[2] Charu Aggarwal

Machine Learning Top Level Problems:

•  Association Pattern Mining (Discovery) Given a binary n x d data matrix D, determine all subsets of columns such that all the values in these columns take on the value of 1 for at least a fraction s of the rows in the matrix. The relative frequency of a pattern is referred to as its support.


[2] Charu Aggarwal
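A minimal brute-force Python sketch (not part of the original notes) of the definition above: over a toy binary n × d matrix D, report every subset of columns whose values are all 1 in at least a fraction s of the rows. The matrix and threshold are made up.

```python
# Enumerate column subsets of a binary matrix whose support is at least s.
from itertools import combinations
import numpy as np

D = np.array([[1, 1, 0],
              [1, 1, 1],
              [0, 1, 1],
              [1, 1, 1]])          # toy binary data matrix, columns = items
s = 0.5                            # minimum support threshold

n, d = D.shape
for size in range(1, d + 1):
    for cols in combinations(range(d), size):
        support = np.mean(D[:, list(cols)].all(axis=1))   # fraction of rows that are all 1s
        if support >= s:
            print(cols, support)
```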

Machine Learning Top Level Problems:

•  Outlier Detection Given a data matrix D, determine the records of D that are very different from the remaining records in D.


[2] Charu Aggarwal

Machine Learning Top Level Problems:

•  Regression Classification when the class variables are numeric rather than categorical: learn a real-valued function. We need to determine a function (or set of functions) in which the function value depends linearly on some numerical features, y = f(x).

[Diagram: inputs x1 and x2 feed a function y = f(x1, x2) = x1 + x2; for the example inputs shown, the unknown output "?" evaluates to 2.]

Machine Learning The Learning approaches:

•  Inductive •  Learn from repeated observations

•  Deductive •  Build new concepts from existing concepts


Machine Learning Types of machine learning approaches:

•  Supervised
•  Labelled data
Given a set of N cases of the form {(x1, y1), ..., (xN, yN)}, where xi is the feature vector of the i-th case and yi is its label (i.e., class), a learning algorithm seeks a function g: X → Y, where X is the input space and Y is the output space. The function g is an element of some space of possible functions G, usually called the hypothesis space. It is sometimes convenient to represent g using a scoring function f: X × Y → R, such that g is defined as returning the y value that gives the highest score: g(x) = arg max_y f(x, y). Let F denote the space of scoring functions.
Ex: Classification, Regression
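A minimal Python sketch (not part of the original notes) of the scoring-function view: a made-up scoring function f(x, y) over a two-label output space and the classifier g(x) = arg max_y f(x, y). The word lists and scores are illustrative assumptions only.

```python
# Hypothetical scoring function f(x, y) and classifier g(x) = argmax_y f(x, y).
Y = ["spam", "not_spam"]                      # output space

def f(x, y):
    # toy score: count words of x associated with the label y (made-up rule)
    spam_words = {"lottery", "ticket"}
    hits = sum(1 for w in x if w in spam_words)
    return hits if y == "spam" else len(x) - hits

def g(x):
    # return the label with the highest score
    return max(Y, key=lambda y: f(x, y))

print(g(["win", "lottery", "ticket"]))        # -> 'spam'
```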


Machine Learning Types of machine learning approaches:

•  Unsupervised
•  Unlabelled data
There is input data (X), the feature vectors, but no corresponding output variables. The goal of unsupervised learning is to model the underlying structure, or distribution, of the data, i.e. to infer a function that describes hidden structure from "unlabelled" data.
Ex: Clustering, Association rule learning


Machine Learning
•  Linear Classifier
•  Decision Trees → Classification
•  Association Rules → Association Pattern Mining, Classification
•  Bayesian Networks


Machine Learning
•  Linear Classifier

[Plot: a straight line y = b0 + b1·x fitted through points in the (x, y) plane.]

Machine Learning
•  Linear Classifier
Simple linear regression, multivariate regression, logistic regression.
Models the relationship between two sets of variables. Assumes the classes are linearly separable, i.e. there is a linear decision boundary separating the classes. The result is a linear regression equation that can be used to make predictions about the data.

Linear regression: y = b0 + b1·x → solve the equation for b0 and b1.
Multivariate case: y = b0 + b1·x1 + b2·x2 + ... + bn·xn

where b0 is the y-intercept, b1 is the slope of the line, y is the dependent variable and x is the independent variable.

b0 = [(Σy)(Σx²) − (Σx)(Σxy)] / [n(Σx²) − (Σx)²]
b1 = [n(Σxy) − (Σx)(Σy)] / [n(Σx²) − (Σx)²]


Machine Learning
•  Linear Classifier
Does age have anything to do with performance?

y = b0 + b1·x

 n | Age x | Performance y |   x·y |    x² |    y²
 1 |    43 |            65 |  2795 |  1849 |  4225
 2 |    21 |            80 |  1680 |   441 |  6400
 3 |    25 |            79 |  1975 |   625 |  6241
 4 |    42 |            70 |  2940 |  1764 |  4900
 5 |    57 |            62 |  3534 |  3249 |  3844
 6 |    59 |            60 |  3540 |  3481 |  3600
 Σ |   247 |           416 | 16464 | 11409 | 29210

Find b0: b0 = ((416 × 11409) − (247 × 16464)) / ((6 × 11409) − 247²) = 91.274
Find b1: b1 = ((6 × 16464) − (247 × 416)) / ((6 × 11409) − 247²) = −0.532975 ≈ −0.533
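A minimal Python sketch (not part of the original notes) that reproduces this worked example from the summation formulas above; the variable names are illustrative.

```python
# Least-squares intercept b0 and slope b1 from the age/performance table.
x = [43, 21, 25, 42, 57, 59]          # age
y = [65, 80, 79, 70, 62, 60]          # performance
n = len(x)

sx, sy = sum(x), sum(y)
sxy = sum(xi * yi for xi, yi in zip(x, y))
sx2 = sum(xi * xi for xi in x)

b0 = (sy * sx2 - sx * sxy) / (n * sx2 - sx ** 2)   # intercept, ~91.274
b1 = (n * sxy - sx * sy) / (n * sx2 - sx ** 2)     # slope, ~-0.533

print(b0, b1)
```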

Machine Learning
•  Linear Classifier

[Scatter plot of performance (y) versus age (x) with the fitted regression line y = −0.533x + 91.274, R² = 0.95955.]

y' = 91.274 − 0.533x

Machine Learning
Decision Trees
•  The classification process is modeled as a set of hierarchical decisions on the feature variables, organized in a tree structure.
•  Building trees: top-down tree construction, bottom-up tree pruning.
•  Splitting criteria are supervised by the class label.
•  Univariate and multivariate splits.


Machine Learning
•  Non-linear Classifier

[Plot: data in the (x, y) plane for which no straight line separates the classes.]

Machine Learning
Decision Trees Algorithm (Data Set D)

begin
  Create root node containing D;
  loop
    Select an eligible node in the tree;
    Split the node into two or more nodes based on the split criterion;
  until no more nodes are eligible for splitting;
  Prune overfitting nodes;
  Label each leaf node with its dominant class;
end


Decision Trees
•  Ex: decide whether to stay home or play outside, based on weather conditions.
•  Label: ToDo: play / home
•  Features: Weather (Sunny / Overcast / Rainy), Temperature (Warm / Cold)

1. An internal node is a test on an attribute.
2. A branch represents an outcome of the test, e.g. Weather = Sunny.
3. A leaf node represents a class label or class distribution.
4. At each node, one attribute is chosen to split the training examples into distinct classes.
5. A new case is classified by following a matching path to a leaf node.

Example tree (a scikit-learn sketch of this example follows below):

weather?
  sunny    → temperature?
               warm → play
               cold → home
  overcast → play
  rainy    → home
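A minimal Python sketch (not part of the original notes) fitting this toy example with scikit-learn; the encoded training rows are assumptions made up to match the tree above.

```python
# Fit the weather/temperature toy example and classify a new case.
from sklearn.tree import DecisionTreeClassifier, export_text

# encoding: weather sunny=0, overcast=1, rainy=2; temperature warm=0, cold=1
X = [[0, 0], [0, 1], [1, 0], [1, 1], [2, 0], [2, 1]]
y = ["play", "home", "play", "play", "home", "home"]

clf = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)

# a new case is classified by following a matching path to a leaf node
print(clf.predict([[0, 0]]))                                   # sunny + warm -> ['play']
print(export_text(clf, feature_names=["weather", "temperature"]))
```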

Decision Trees
•  Splitting attribute

A goodness function is used to evaluate attributes for splitting. Typical goodness functions:

•  Error rate
Let p be the fraction of instances in a set of data points S that belong to the dominant class label; the error rate is ER = 1 − p. Lower values of the error rate are better. Compute the weighted average of the error rates of the individual attribute values Si.


Decision Trees
•  Splitting attribute

•  Gini index (CART / IBM Intelligent Miner)
•  Developed by Corrado Gini, published in his 1912 paper "Variability and Mutability".
•  Measures the discriminative power of a particular feature.
•  Measures how often a randomly chosen record would be incorrectly classified.
•  The lower the Gini index, the better.

G(Si) = 1 − Σj pj²
G(S ⇒ S1, ..., Sr) = Σ_{i=1..r} (|Si| / |S|) · G(Si)

Calculate the overall Gini index based on the target attribute, G(Stg); calculate the Gini index for each individual attribute/value, G(Si); calculate the gain for attribute Si as G(Stg) − G(Si), and choose the attribute with the largest gain.


Decision Trees
•  Splitting attribute

•  Information gain, entropy (ID3 / C4.5):
Let pj be the fraction of data points in class j for the attribute value vi (the posterior class distribution given vi); then the class entropy E(vi) is defined as:

E(vi) = − Σ_{j=1..k} pj log2(pj)

Lower values of entropy are better (a value of 0 implies a perfect separation); E(vi) lies in the interval [0, log2(k)].
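A minimal Python sketch (not part of the original notes) of this entropy measure; the class counts in the usage lines are made up.

```python
# Class entropy E(v_i) = -sum_j p_j log2(p_j) over the class distribution.
from math import log2

def entropy(class_counts):
    total = sum(class_counts)
    probs = [c / total for c in class_counts if c > 0]
    return sum(p * log2(1 / p) for p in probs)   # equivalent to -sum(p * log2(p))

print(entropy([5, 5]))    # 1.0: uniform over k=2 classes (= log2(2))
print(entropy([10, 0]))   # 0.0: perfect separation
```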


Decision Trees
•  Splitting attribute

•  ReliefF / Fisher linear discriminant
Let μj and δj be the mean and standard deviation of the data points belonging to class j on feature n, and let pj be the fraction of data points belonging to class j. Let μ be the global mean of the data on feature n. The Fisher score F for feature n is defined as:

Fn = Σ_{j=1..k} pj (μj − μ)² / Σ_{j=1..k} pj δj²

The numerator quantifies the average interclass separation, and the denominator quantifies the average intraclass separation. Attributes with higher values of the Fisher score may be selected as predictors for classification algorithms.
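A minimal Python sketch (not part of the original notes) of the Fisher score for a single feature; the feature values and labels are made-up toy data.

```python
# Fisher score: weighted interclass separation / weighted intraclass variance.
import numpy as np

def fisher_score(values, labels):
    values, labels = np.asarray(values, dtype=float), np.asarray(labels)
    mu = values.mean()                        # global mean of the feature
    num = den = 0.0
    for j in np.unique(labels):
        vj = values[labels == j]
        pj = len(vj) / len(values)            # fraction of points in class j
        num += pj * (vj.mean() - mu) ** 2     # interclass separation
        den += pj * vj.std() ** 2             # intraclass spread (variance)
    return num / den

# a feature whose values separate the two classes well gets a large score
print(fisher_score([1.0, 1.2, 0.9, 5.0, 5.1, 4.8], [0, 0, 0, 1, 1, 1]))
```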


Decision Trees
•  Splitting attribute: Gini index

[Example data table from http://dataaspirant.com/]

Decision Trees • 

Splitting attribute: Gini index worked example (data from dataaspirant.com)

Gini index for Var A:
Var A has value >= 5 for 12 records out of 16 and value < 5 for 4 records.
n(A >= 5) = 12: 5 fall in E+, 7 fall in E−.
For Var A >= 5 & class == positive: 5/12; for Var A >= 5 & class == negative: 7/12.
gini(5,7) = 1 − ((5/12)² + (7/12)²) = 0.4860
n(A < 5) = 4, ...

Gini index for Var C:
n(C >= 4.2) = 6: 0 fall in E+, 6 fall in E−.
For Var C >= 4.2 & class == negative: 6/6.
gini(0,6) = 1 − ((0/6)² + (6/6)²) = 0
n(C < 4.2) = 10: 8 fall in E+, 2 fall in E−.
For Var C < 4.2 & class == positive: 8/10; for Var C < 4.2 & class == negative: 2/10.
gini(8,2) = 1 − ((8/10)² + (2/10)²) = 0.32
gini(Target, C) = (6/16) × 0 + (10/16) × 0.32 = 0.2
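A minimal Python sketch (not part of the original notes) that re-checks the Var C numbers above: the Gini index of each branch and the weighted Gini index of the split.

```python
# Gini index of a branch from its class counts, then the weighted split value.
def gini(counts):
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

g_ge = gini([0, 6])                         # C >= 4.2: 0 positive, 6 negative -> 0.0
g_lt = gini([8, 2])                         # C <  4.2: 8 positive, 2 negative -> 0.32
weighted = (6 / 16) * g_ge + (10 / 16) * g_lt
print(g_ge, g_lt, weighted)                 # 0.0 0.32 0.2
```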


http://dataaspirant.com/2017/01/30/

Decision Trees

[Worked-example figures from http://dataaspirant.com/2017/01/30/how-decision-tree-algorithm-works/]

Decision Trees
•  Algorithms

•  ID3, Iterative Dichotomiser 3 (Ross Quinlan)

•  CHAID, CHi-squared Automatic Interaction Detection (Gordon V. Kass)

•  C4.5, extension of the basic ID3 algorithm (Ross Quinlan); addresses some issues not dealt with by ID3:
   - Overfitting the data: set how deeply to grow a decision tree; reduced-error pruning.
   - Handling continuous attributes.
   - Handling training data with missing attribute values.

•  C4.8 (J4.8 in Weka)

•  C5.0 (Ross Quinlan)

•  CART, Classification and Regression Trees (Breiman, Friedman, Olshen, Stone, 1984)


Association Rules An association rule is an implication expression of the form X → Y, where X and Y are disjoint itemsets, i.e. X ∩ Y = ∅

I is a set of n binary attributes called items, I = {I1, I2, ..., In}. Given X, Y ⊂ I, with X ∩ Y = ∅. D = {T1, T2, ..., Tn} is a set of distinct transactions, where each transaction Ti = {Ii1, Ii2, ..., Iik} is a set of items, with Iij ∈ I and Ti ⊆ I. X is called the antecedent (left-hand side, LHS); Y is called the consequent (right-hand side, RHS).


Association Rules
The strength of an association rule can be measured in terms of its support s and confidence c.

Support of an itemset is defined as the proportion of transactions in the database which contain the itemset: how often a rule is applicable to a given data set. The rule X ⇒ Y has support S if S% of the transactions in D contain X ∪ Y.
Support: s(X → Y) = σ(X ∪ Y) / N, where σ(X ∪ Y) is the number of transactions containing X ∪ Y and N is the total number of transactions.

Confidence is the ratio of the number of transactions that contain X ∪ Y to the number of transactions that contain X:
C = support(X ∪ Y) / support(X)
It measures how frequently items in Y appear in transactions that contain X.


Association Rules

TID | Bread | Milk | Coffee | Beer | Eggs
 1  |   1   |  1   |   0    |  0   |  0
 2  |   1   |  0   |   1    |  1   |  1
 3  |   0   |  1   |   1    |  1   |  0
 4  |   1   |  1   |   1    |  1   |  0
 5  |   1   |  1   |   1    |  0   |  0

Itemset: {Milk, Coffee, Beer}
Rule: {Milk, Coffee} → {Beer}
S({Milk, Coffee, Beer}) = (transactions containing the itemset) / n = 2/5 = 0.4
C({Milk, Coffee} → {Beer}) = 2/3 ≈ 0.67
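A minimal Python sketch (not part of the original notes) that recomputes the support and confidence of {Milk, Coffee} → {Beer} from the transaction table above.

```python
# Support and confidence of the rule X -> Y over the five toy transactions.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Coffee", "Beer", "Eggs"},
    {"Milk", "Coffee", "Beer"},
    {"Bread", "Milk", "Coffee", "Beer"},
    {"Bread", "Milk", "Coffee"},
]

X, Y = {"Milk", "Coffee"}, {"Beer"}
n = len(transactions)

count_xy = sum(1 for t in transactions if (X | Y) <= t)   # transactions containing X U Y
count_x = sum(1 for t in transactions if X <= t)          # transactions containing X

support = count_xy / n           # 2/5 = 0.4
confidence = count_xy / count_x  # 2/3 ~= 0.67
print(support, confidence)
```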


Association Rules
A common strategy adopted by many association rule mining algorithms is to decompose the problem into two major subtasks:

1. Frequent Itemset Generation, whose objective is to find all the itemsets that satisfy the minsup threshold. These itemsets are called frequent itemsets.
2. Rule Generation, whose objective is to extract all the high-confidence rules from the frequent itemsets found in the previous step. These rules are called strong rules.

Some AR generator algorithms:
•  AIS (Agrawal, Imielinski and Swami, 1993)
•  SETM (Houtsma and Swami, 1993)
•  DHP (Park, Chen and Yu, 1995)
•  AprioriTid (Agrawal et al., 1998)
•  Apriori (Agrawal et al., 1996)


Machine Learning
Bayesian Networks

Bayes Rule: for any two events A and B,
P(A|B) = P(B|A) × P(A) / P(B)
where 'P(A)' reads "the probability of A" and 'P(A|B)' reads "the probability of A given that B has occurred".

In Belém it rains 50% of the time. It is cloudy 80% of the time (sometimes it is cloudy without rain). You know, of course, that 100% of the time, if it is raining, then it is also cloudy. What do you think the chances are of rain, given that it is cloudy?

P(Rain | Cloudy) = P(Rain) × P(Cloudy | Rain) / P(Cloudy) = 0.5 × 1.0 / 0.8 = 0.625 = 5/8

5/8 of the time, in Belém, if it is cloudy, then it is rainy.


Machine Learning
Bayesian Networks

P(Y | X), where Y = spam email and X is the feature vector with two boolean variables: ticket and lottery.

ticket | lottery | P(Y = spam | ticket, lottery) | P(Y = not spam | ticket, lottery)
   0   |    0    |             0.31              |              0.69
   0   |    1    |             0.65              |              0.35
   1   |    0    |             0.80              |              0.20
   1   |    1    |             0.40              |              0.60

(posterior probabilities)

Decision rule: classify as spam if P(Y = spam | ticket, lottery) > 0.5
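A minimal Python sketch (not part of the original notes) that applies the decision rule to the posterior table above.

```python
# Posterior P(Y = spam | ticket, lottery) keyed by the two boolean features.
p_spam = {
    (0, 0): 0.31,
    (0, 1): 0.65,
    (1, 0): 0.80,
    (1, 1): 0.40,
}

def classify(ticket, lottery):
    # decision rule: spam if the posterior probability of spam exceeds 0.5
    return "spam" if p_spam[(ticket, lottery)] > 0.5 else "not spam"

print(classify(1, 0))   # 0.80 > 0.5 -> 'spam'
print(classify(1, 1))   # 0.40 <= 0.5 -> 'not spam'
```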


Machine Learning
Bayesian Networks

•  50% of your patients smoke
•  1% have TB
•  5.5% have lung cancer
•  45% have some form of mild or chronic bronchitis

Source: https://www.norsys.com/tutorials/netica/tut_a1.htm

Machine Learning Bayesian Networks


Source: https://www.norsys.com/tutorials/netica/tut_a1.htm

Machine Learning
Evaluation

Confusion Matrix:

                  | Predicted Negative | Predicted Positive
Actual Negative   |         TN         |         FP
Actual Positive   |         FN         |         TP

TN = True Negative: negative correctly classified as negative
FN = False Negative: positive misclassified as negative
FP = False Positive: negative misclassified as positive
TP = True Positive: positive correctly classified as positive


Machine Learning
Evaluation

- Classification accuracy is the proportion of correctly classified examples (number of correctly classified instances / total number of instances):
  Accuracy = (TP + TN) / (TP + FP + TN + FN)
  Error rate = 1 − Accuracy

- Sensitivity (also called true positive rate (TPR), hit rate and recall) is the proportion of detected positive examples among all positive examples, e.g. the proportion of sick people correctly classified as sick:
  Sensitivity = TP / (TP + FN)

- Specificity is the proportion of detected negative examples among all negative examples:
  Specificity = TN / (TN + FP)

- Precision is the proportion of positive examples among all examples classified as positive:
  Precision = TP / (TP + FP)
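A minimal Python sketch (not part of the original notes) computing these measures from confusion-matrix counts; the counts themselves are made up.

```python
# Evaluation measures from confusion-matrix counts.
TP, TN, FP, FN = 40, 45, 5, 10

accuracy = (TP + TN) / (TP + FP + TN + FN)   # 0.85
error_rate = 1 - accuracy                    # 0.15
sensitivity = TP / (TP + FN)                 # recall / TPR: 0.8
specificity = TN / (TN + FP)                 # 0.9
precision = TP / (TP + FP)                   # ~0.889

print(accuracy, error_rate, sensitivity, specificity, precision)
```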


Machine Learning
Summary of ML algorithms and the types of problems they are suited to*:

Type of problems: Clustering; Classification / Regression; Association Patterns; Outlier Detection.

Type of ML algorithms (as covered in these notes):
•  Linear classifiers (linear regression, multivariate and logistic regression) → Classification / Regression
•  Decision trees → Classification
•  Association rules → Association Pattern Mining, Classification
•  Bayesian networks → Classification
•  Representative (distance-based) algorithms → Clustering
•  Hierarchical clustering algorithms → Clustering

* This is not an exhaustive list of algorithms nor of problem types. Different types of problems may be presented differently in the literature, and the same holds for the algorithms.


References: [1],[2].

Bibliography
[1] Flach, P. (2012). Machine Learning: The Art and Science of Algorithms that Make Sense of Data.
[2] Aggarwal, C. (2015). Data Mining: The Textbook.
[3] Bishop, C. (2006). Pattern Recognition and Machine Learning.
[4] Hastie, T., Tibshirani, R. and Friedman, J. (2009). The Elements of Statistical Learning: Data Mining, Inference and Prediction, 2nd ed.
[5] Cristianini, N. and Shawe-Taylor, J. (2000). An Introduction to Support Vector Machines and Kernel-based Learning Methods.
[6] Duda, R., Hart, P. and Stork, D. (2001). Pattern Classification.
[7] Blum, A., Hopcroft, J. and Kannan, R. (2016). Foundations of Data Science.
[8] Viademonte, S. and Burstein, F. (2005). From Knowledge Discovery to Computational Intelligence: A Framework for Intelligent Decision Support Systems. In: Intelligent Decision-Making Support Systems (I-DMSS): Foundations, Applications and Challenges. Eds. Jatinder Gupta, Guisseppi Forgionne, Manuel Mora. Springer-Verlag, Decision Engineering Series, UK.
[9] Viademonte, S. (2004). A Hybrid Model for Intelligent Decision Support: Combining Data Mining and Artificial Neural Networks. PhD thesis, Faculty of Information Technology, Monash University, Melbourne, Australia. URL: http://arrow.monash.edu/vital/access/manager/Repository/monash:6409
