Training: A Gradient Step
[Figure: loss as a function of the value of weight w, showing a starting point, the (negative) gradient at that point, and the next point reached after one gradient step.]
Models trained this way tend to converge.
Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course
Goals of This Class ● Learn to take a real-life problem and apply machine learning to make predictions. ● Learn to implement machine learning solutions using TensorFlow ● Learn how to evaluate the quality of your solution ● Machine learning is a very broad field -- we only just touch upon some of the most common machine learning algorithms
Some Sample Applications of Machine Learning
Sample Applications of Machine Learning
● Medical applications such as disease prediction
● Speech recognition and understanding
● Recommendation systems
● Malware and spam detection
● Image understanding and annotation
● AI for games
● Translating between languages
● Predicting likelihood of earthquakes
● Matching resumes with jobs
Google Products Using Machine Learning
Google Assistant
Google Photos: Searching Images via Text
Gmail: Smart Reply
Google Play Music: Recommending Music
Game Playing: Alpha Go
Combined Vision and Translation
What is Machine Learning?
What is Machine Learning (ML)? There are many ways to define ML. ● ML systems learn how to combine inputs to produce useful predictions on never-before-seen data.

Categorical Columns and Crosses in TensorFlow
# Sample of creating a categorical column with keys
# (gender is created the same way with sparse_column_with_keys)
race = tf.contrib.layers.sparse_column_with_keys(
    column_name="race",
    keys=["White", "Asian-Pac-Islander", "Amer-Indian-Eskimo", "Other", "Black"])
# Sample of creating a categorical column with a hash bucket
education = tf.contrib.layers.sparse_column_with_hash_bucket(
    "education", hash_bucket_size=50)
# Sample of creating a cross
gender_x_education_x_race = tf.contrib.layers.crossed_column(
    [gender, education, race], hash_bucket_size=1000)
Overfitting If we make a very complex model, it can perfectly (or near perfectly) fit the training data, but then we have just memorized the training data rather than achieved our goal of generalizing. Remember, our goal is to build a system that deals well with new data!
Setting Aside Validation Data
● Train model on Training Data
● Evaluate model on Validation Data
● Select features, learning rate, batch size, ... according to results on Validation Data
● Pick the model that does best on Validation Data
● Check for generalization ability on Test Data
Ensure Validation Data is Representative This is an example of what happens if you partition data without first randomizing it. Validation data is NOT representative and thus not a good estimate of the classifier/regressor’s performance.
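A minimal sketch of randomizing before partitioning, assuming the labeled examples sit in a pandas DataFrame; the DataFrame contents and split sizes here are illustrative:

import numpy as np
import pandas as pd

# Illustrative stand-in for a DataFrame of labeled examples loaded earlier in a lab.
df = pd.DataFrame({"x": np.arange(10), "label": np.arange(10) % 2})

# Shuffle the rows first so no slice is biased by the original ordering.
df = df.reindex(np.random.permutation(df.index))

# Then partition into training, validation, and test sets.
train = df[:6]
validation = df[6:8]
test = df[8:]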
Things You Need to Decide
● Learning Rate
○ Very important. Typically change it by powers of 10 until the model is training reasonably well, then fine tune.
● Number of Steps to Train
○ Training time is proportional to this (for a fixed set of features), so you want it as small as you can, but it is still important that you don't undertrain.
● Batch Size
○ Not that sensitive; this can be the last thing you vary.
● What features to use, feature normalization, and when to introduce buckets and crosses (a sketch of where these knobs appear follows below)
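A hedged sketch of where these knobs appear with the tf.contrib.learn API used in this course; the synthetic data, feature inference helper, and specific values (learning rate 0.01, 500 steps, batch size 10) are illustrative assumptions, not recommended settings:

import numpy as np
import tensorflow as tf

# Illustrative data: one real-valued feature and a real-valued target.
x = np.random.rand(1000, 1).astype(np.float32)
y = (3.0 * x[:, 0] + 1.0).astype(np.float32)

feature_columns = tf.contrib.learn.infer_real_valued_columns_from_input(x)

# Learning rate: usually tuned first, by powers of 10.
sgd_optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.01)

linear_regressor = tf.contrib.learn.LinearRegressor(
    feature_columns=feature_columns,
    optimizer=sgd_optimizer,
    gradient_clip_norm=5.0)

# Number of steps and batch size: each step processes one batch of examples.
linear_regressor.fit(x=x, y=y, steps=500, batch_size=10)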
Learning Rate Too High
Learning Rate Way Too Low
Learning Rate Could Still Be Higher
Good Learning Rate NOTE: This model is still training and not yet overfitting, so increase the number of steps!
Training Curve Showing Overfitting A model with the same data and learning rate, trained for 500 (versus 50) iterations. Now we see it overfitting.
Things You Need to Decide ● Learning Rate, Steps to Train, Batch Size ● What features to use, feature normalization, when to introduce buckets and crosses ● When the model is more complex you also need to introduce ways to prevent overfitting. ○ Early Stopping, L2-regularization, or dropout ● Ways to Reduce Model Size ○ Smaller Buckets, Fewer Features, L1 Regularization
Linear Classifier
[Figure: training examples plotted by loan amount (x1) and income (x2), with a linear decision boundary separating the two classes (labels 1 and 0); the model produces a probability output.]
Convert the real-valued score to a probability using the sigmoid of the log-odds:
p = 1 / (1 + e^-(w^T x + b)), where w^T x + b is the log-odds.
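A tiny numeric check of this conversion; the weights, bias, and feature values below are made up for illustration:

import numpy as np

# Made-up weights, bias, and one example with features x1 = loan amount, x2 = income.
w = np.array([0.8, -0.5])
b = 0.1
x = np.array([2.0, 3.0])

log_odds = np.dot(w, x) + b                     # w^T x + b
probability = 1.0 / (1.0 + np.exp(-log_odds))   # sigmoid squashes the log-odds into (0, 1)
print(log_odds, probability)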
LinearClassifier vs LinearRegressor
● Use a Regressor to predict a real-valued target (minimize RMSE)

linear_regressor = tf.contrib.learn.LinearRegressor(
    feature_columns=[age, education_num, age_buckets, capital_gain_buckets,
                     capital_loss_buckets, gender, race, education, occupation,
                     native_country, workclass, education_x_age_buckets,
                     gender_x_education_x_race],
    optimizer=SGDoptimizer,
    gradient_clip_norm=5.0)

● Use a Classifier to predict a True (1) / False (0) label (minimize log loss)

linear_classifier = tf.contrib.learn.LinearClassifier(
    feature_columns=[age, education_num, age_buckets, capital_gain_buckets,
                     capital_loss_buckets, gender, race, education, occupation,
                     native_country, workclass, education_x_age_buckets,
                     gender_x_education_x_race],
    optimizer=SGDoptimizer,
    gradient_clip_norm=5.0)
LinearClassifier vs LinearRegressor
● Choices in feature engineering are the same
● The process of selecting and using validation (and test) data is the same
● Tuning learning rate, number of steps, batch size, and regularization is the same
● The evaluation metrics change
● Instead of RMSE we are interested in things like accuracy, the ROC curve (the trade-off between false positive and false negative rates), and AUC (area under the ROC curve)
● AUC gives the probability that a random positive example is predicted with a higher probability than a random negative example, so 0.5 is a random guess and 1.0 is a perfect model (a small sketch follows below).
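A small sketch of the AUC interpretation above, computed directly as the fraction of positive/negative pairs that the model ranks correctly; the scores and labels are invented:

import numpy as np

# Invented predicted probabilities and true labels (1 = positive, 0 = negative).
scores = np.array([0.9, 0.8, 0.35, 0.6, 0.3, 0.1])
labels = np.array([1, 1, 1, 0, 0, 0])

pos = scores[labels == 1]
neg = scores[labels == 0]

# AUC = probability a random positive example outscores a random negative example
# (ties count as half).
pairs = [(p, n) for p in pos for n in neg]
auc = np.mean([1.0 if p > n else 0.5 if p == n else 0.0 for p, n in pairs])
print(auc)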
Sample ROC Curve
[Figure: examples scored from 0.0 (Not Spam) to 1.0 (Spam); each classification threshold yields one point on the ROC curve, e.g. one threshold gives FPR = 1/19, TPR = 3/7 and a lower threshold gives FPR = 7/19, TPR = 6/7.]
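Each point on the ROC curve comes from one classification threshold. A small sketch with invented spam scores (not the counts from the figure) shows how FPR and TPR are computed at a single threshold:

import numpy as np

# Invented spam scores and labels (1 = spam, 0 = not spam).
scores = np.array([0.95, 0.9, 0.7, 0.65, 0.4, 0.3, 0.2, 0.1])
labels = np.array([1, 1, 0, 1, 0, 0, 0, 0])

threshold = 0.5
predicted_spam = scores >= threshold

tp = np.sum(predicted_spam & (labels == 1))
fp = np.sum(predicted_spam & (labels == 0))
fpr = float(fp) / np.sum(labels == 0)   # false positive rate at this threshold
tpr = float(tp) / np.sum(labels == 1)   # true positive rate at this threshold
print(fpr, tpr)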
ROC Curves for Models from Lab 3
Model sizes:
● original: 533
● no reg: 429
● l2: 429
● l1, l2: 119
● l1 strong, l2: 70
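A hedged sketch of how such smaller models are commonly obtained with this API: replacing the plain SGD optimizer with an FtrlOptimizer that applies L1 (and L2) regularization, which drives many weights to exactly zero. The feature columns and regularization strengths below are illustrative placeholders, not the Lab 3 settings:

import tensorflow as tf

# Illustrative feature columns; the real lab defines richer ones.
feature_columns = [
    tf.contrib.layers.real_valued_column("age"),
    tf.contrib.layers.sparse_column_with_hash_bucket("education", hash_bucket_size=50),
]

# FTRL supports L1 regularization, which produces sparse (smaller) linear models.
ftrl_optimizer = tf.train.FtrlOptimizer(
    learning_rate=0.1,
    l1_regularization_strength=0.5,
    l2_regularization_strength=1.0)

linear_classifier = tf.contrib.learn.LinearClassifier(
    feature_columns=feature_columns,
    optimizer=ftrl_optimizer,
    gradient_clip_norm=5.0)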
LinearClassifier With > 2 Classes
● Example using LinearClassifier to learn 10 classes (digits 0, …, 9)

linear_classifier = tf.contrib.learn.LinearClassifier(
    feature_columns=feature_columns,
    n_classes=10,
    optimizer=SGDoptimizer,
    gradient_clip_norm=5.0)
● Here the labels must be 0, …, 9 (or a sparse feature with 10 values).
● We now optimize the softmax loss, which is a generalization of log loss to a probability distribution over more than two values (see the sketch below).
● Again we need to modify the visualizations a bit, looking at a confusion matrix instead of an ROC curve.
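A minimal numeric sketch of the softmax loss; the logits (scores) for the 10 classes are made up:

import numpy as np

# Made-up scores (logits) for the 10 digit classes for one example.
logits = np.array([1.2, 0.3, 0.3, 3.1, 0.0, -1.0, 0.5, 0.2, 0.1, 0.4])

# Softmax turns the scores into a probability distribution over the 10 classes.
probs = np.exp(logits - np.max(logits))
probs /= probs.sum()

# Softmax (cross-entropy) loss for a true label of 3; with 2 classes this reduces to log loss.
true_label = 3
loss = -np.log(probs[true_label])
print(probs, loss)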
Confusion Matrix
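A small sketch of building such a confusion matrix with numpy; the true labels and predictions are invented:

import numpy as np

num_classes = 10
# Invented true digit labels and model predictions.
true_labels = np.array([0, 1, 2, 2, 3, 5, 5, 7, 8, 9])
predictions = np.array([0, 1, 2, 3, 3, 5, 6, 7, 8, 9])

# confusion[i, j] counts examples of true class i predicted as class j;
# a perfect model puts all its counts on the diagonal.
confusion = np.zeros((num_classes, num_classes), dtype=int)
for t, p in zip(true_labels, predictions):
    confusion[t, p] += 1
print(confusion)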
DNN: Add a Non-Linearity
[Figure: Input → Hidden Layer (Linear) → Non-Linear Transformation Layer (a.k.a. Activation Function) → Output]
By convention we combine the activation function into the hidden layer, making it a non-linear layer. So the network above is drawn as: Input → Hidden Layer (non-linear) → Output.
Non-linearity in a DNN lets it learn to do this If you want to predict city-mpg from compression-ratio, a single linear function would not fit well, but you can get a pretty good fit by dividing compression-ratio into two buckets and then learning a linear model for each bucket (a sketch follows below).
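A small numpy sketch of this bucketing idea; the compression-ratio and city-mpg values below are synthetic stand-ins, not the actual autos data:

import numpy as np

# Synthetic stand-in for the autos data: mpg behaves differently below/above ratio 12.
compression_ratio = np.array([7.0, 8.0, 9.0, 10.0, 11.0, 15.0, 18.0, 21.0, 22.0, 23.0])
city_mpg = np.array([24.0, 25.0, 27.0, 28.0, 30.0, 37.0, 38.0, 40.0, 41.0, 41.0])

boundary = 12.0
low = compression_ratio < boundary

# Fit one least-squares line per bucket instead of a single line over all the data.
low_fit = np.polyfit(compression_ratio[low], city_mpg[low], deg=1)
high_fit = np.polyfit(compression_ratio[~low], city_mpg[~low], deg=1)
print(low_fit, high_fit)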
Deep Neural Networks -- Add Layers
● Training is done via the BackProp algorithm, which is an extension of SGD
● The hidden layers closer to the output capture higher level features (since they learn over the features from the previous layer)
● For this network in TensorFlow you'd have:
○ hidden_units=[4, 3]
[Figure: Input → Hidden1 (4 units) → Hidden2 (3 units) → Output]
DNN Classifier or Regressor in TF

dnn_classifier = tf.contrib.learn.DNNClassifier(
    feature_columns=feature_columns,
    n_classes=10,
    hidden_units=[50, 25, 10],
    optimizer=optimizer,
    gradient_clip_norm=5.0)

dnn_regressor = tf.contrib.learn.DNNRegressor(
    feature_columns=feature_columns,
    hidden_units=[50, 25, 10],
    optimizer=optimizer,
    gradient_clip_norm=5.0)
DNN Reduces Feature Engineering
● It can learn to bucketize real-valued features
● It can learn crosses
● As you add model weights, it takes more data and time to train, and overfitting becomes more of an issue
● Use L2 regularization or dropout to control overfitting (see the sketch below)
● Along with the other hyperparameters (e.g. learning rate, number of steps) you need to pick the DNN configuration (how many hidden layers and how many units in each layer).
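A hedged sketch of adding dropout when constructing the DNN with this API (DNNClassifier accepts a dropout probability directly); the feature column, hidden_units, dropout value, and learning rate are illustrative:

import tensorflow as tf

# Illustrative feature column; the real labs define richer ones.
feature_columns = [tf.contrib.layers.real_valued_column("pixels", dimension=784)]

dnn_classifier = tf.contrib.learn.DNNClassifier(
    feature_columns=feature_columns,
    n_classes=10,
    hidden_units=[50, 25, 10],
    dropout=0.2,  # drop 20% of hidden-unit activations during training to fight overfitting
    optimizer=tf.train.GradientDescentOptimizer(learning_rate=0.05),
    gradient_clip_norm=5.0)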
Embeddings as a Tool
● Embeddings map items (e.g. movies, text, ...) to low-dimensional real vectors in a way that similar items are close to each other
● Embeddings can also be applied to dense data (e.g. audio) to create a meaningful similarity metric
● Jointly embedding diverse data types (e.g. text, images, audio, …) defines a similarity between them
An Embedding Layer in a DNNRegressor
Regression problem to predict home sale prices:
[Figure: a sparse vector encoding of the words in a real estate ad feeds a 3-dimensional embedding layer; the embedding, together with latitude and longitude features, is the input to a DNNRegressor that predicts the sale price.]
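A hedged sketch of how the architecture in the figure could be declared with the feature-column API used in this course; the column names, hash bucket size, and hidden_units are illustrative assumptions, while the 3-dimensional embedding matches the figure:

import tensorflow as tf

# Sparse column over the words appearing in a real estate ad (illustrative hash bucket size).
ad_words = tf.contrib.layers.sparse_column_with_hash_bucket(
    "ad_words", hash_bucket_size=10000)

# Map the sparse word ids into a dense 3-dimensional embedding, learned jointly with the model.
ad_words_embedding = tf.contrib.layers.embedding_column(ad_words, dimension=3)

dnn_regressor = tf.contrib.learn.DNNRegressor(
    feature_columns=[ad_words_embedding,
                     tf.contrib.layers.real_valued_column("latitude"),
                     tf.contrib.layers.real_valued_column("longitude")],
    hidden_units=[50, 25])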
An Embedding Layer in a DNNClassifier
Multiclass classification to predict a handwritten digit:
[Figure: the raw bitmap of the hand-drawn digit is fed as a sparse vector encoding into a 3-dimensional embedding layer; the embedding, together with other features, is the input to a DNNClassifier that outputs a predicted probability for each of the 10 classes (0, 1, ..., 9); the "one-hot" (sparse) target probability distribution comes from the target class label.]