Training: A Gradient Step
[Figure: loss as a function of the value of weight w, showing a starting point, the (negative) gradient at that point, and the next point reached after one gradient step.]
Models trained this way tend to converge.
Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course
Goals of This Class ● Learn to take a real-life problem and apply machine learning to make predictions. ● Learn to implement machine learning solutions using TensorFlow ● Learn how to evaluate the quality of your solution ● Machine learning is a very broad field -- we only just touch upon some of the most common machine learning algorithms
Some Sample Applications of Machine Learning
Sample Applications of Machine Learning
● Medical applications such as disease prediction
● Speech recognition and understanding
● Recommendation systems
● Malware and spam detection
● Image understanding and annotation
● AI for games
● Translating between languages
● Predicting likelihood of earthquakes
● Matching resumes with jobs
Google Products Using Machine Learning
Google Assistant
Google Photos: Searching Images via Text
Gmail: Smart Reply
Google Play Music: Recommending Music
Game Playing: Alpha Go
Combined Vision and Translation
What is Machine Learning?
What is Machine Learning (ML)? There are many ways to define ML. ● ML systems learn how to combine inputs to produce useful predictions on never-before-seen data.

Categorical Columns and Crosses in TensorFlow
# Sample of creating a categorical column with keys
# (gender is created the same way with sparse_column_with_keys)
race = tf.contrib.layers.sparse_column_with_keys(
    column_name="race",
    keys=["White", "Asian-Pac-Islander", "Amer-Indian-Eskimo", "Other", "Black"])
# Sample of creating a categorical column with a hash bucket
education = tf.contrib.layers.sparse_column_with_hash_bucket(
    "education", hash_bucket_size=50)
# Sample of creating a cross
gender_x_education_x_race = tf.contrib.layers.crossed_column(
    [gender, education, race], hash_bucket_size=1000)
Overfitting If we make a very complex model, it can perfectly (or near perfectly) fit the training data, but then we have just memorized the training data rather than achieved our goal of generalizing. Remember, our goal is to build a system that deals well with new data!
Setting Aside Validation Data
● Train model on Training Data
● Evaluate model on Validation Data
● Select features, learning rate, batch size, ... according to results on Validation Data
● Pick the model that does best on Validation Data
● Check for generalization ability on Test Data
Ensure Validation Data is Representative This is an example of what happens if you partition data without first randomizing it. Validation data is NOT representative and thus not a good estimate of the classifier/regressor’s performance.
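A minimal sketch of randomizing before partitioning, assuming the labeled examples sit in a pandas DataFrame; the DataFrame contents and split sizes here are illustrative:

import numpy as np
import pandas as pd

# Illustrative stand-in for a DataFrame of labeled examples loaded earlier in a lab.
df = pd.DataFrame({"x": np.arange(10), "label": np.arange(10) % 2})

# Shuffle the rows first so no slice is biased by the original ordering.
df = df.reindex(np.random.permutation(df.index))

# Then partition into training, validation, and test sets.
train = df[:6]
validation = df[6:8]
test = df[8:]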
Things You Need to Decide
● Learning Rate
○ Very important. Typically change it by powers of 10 until the model is training reasonably well, then fine tune.
● Number of Steps to Train
○ Training time is proportional to this (for a fixed set of features), so you want it as small as you can, but it is still important that you don't undertrain.
● Batch Size
○ Not that sensitive; this can be the last thing you vary.
● What features to use, feature normalization, and when to introduce buckets and crosses (a sketch of where these knobs appear follows below)
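A hedged sketch of where these knobs appear with the tf.contrib.learn API used in this course; the synthetic data, feature inference helper, and specific values (learning rate 0.01, 500 steps, batch size 10) are illustrative assumptions, not recommended settings:

import numpy as np
import tensorflow as tf

# Illustrative data: one real-valued feature and a real-valued target.
x = np.random.rand(1000, 1).astype(np.float32)
y = (3.0 * x[:, 0] + 1.0).astype(np.float32)

feature_columns = tf.contrib.learn.infer_real_valued_columns_from_input(x)

# Learning rate: usually tuned first, by powers of 10.
sgd_optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.01)

linear_regressor = tf.contrib.learn.LinearRegressor(
    feature_columns=feature_columns,
    optimizer=sgd_optimizer,
    gradient_clip_norm=5.0)

# Number of steps and batch size: each step processes one batch of examples.
linear_regressor.fit(x=x, y=y, steps=500, batch_size=10)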
Learning Rate Too High
Learning Rate Way Too Low
Learning Rate Could Still Be Higher
Good Learning Rate NOTE: This model is still training and not yet overfitting, so increase the number of steps!
Training Curve Showing Overfitting A model with the same data and learning rate, trained for 500 (versus 50) iterations. Now we see it overfitting.
Things You Need to Decide ● Learning Rate, Steps to Train, Batch Size ● What features to use, feature normalization, when to introduce buckets and crosses ● When the model is more complex you also need to introduce ways to prevent overfitting. ○ Early Stopping, L2-regularization, or dropout ● Ways to Reduce Model Size ○ Smaller Buckets, Fewer Features, L1 Regularization
Linear Classifier
[Figure: training examples plotted by loan amount (x1) and income (x2), with a linear decision boundary separating the two classes (labels 1 and 0); the model produces a probability output.]
Convert the real-valued score to a probability using the sigmoid of the log-odds:
p = 1 / (1 + e^-(w^T x + b)), where w^T x + b is the log-odds.
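A tiny numeric check of this conversion; the weights, bias, and feature values below are made up for illustration:

import numpy as np

# Made-up weights, bias, and one example with features x1 = loan amount, x2 = income.
w = np.array([0.8, -0.5])
b = 0.1
x = np.array([2.0, 3.0])

log_odds = np.dot(w, x) + b                     # w^T x + b
probability = 1.0 / (1.0 + np.exp(-log_odds))   # sigmoid squashes the log-odds into (0, 1)
print(log_odds, probability)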
LinearClassifier vs LinearRegressor
● Use a Regressor to predict a real-valued target (minimize RMSE)

linear_regressor = tf.contrib.learn.LinearRegressor(
    feature_columns=[age, education_num, age_buckets, capital_gain_buckets,
                     capital_loss_buckets, gender, race, education, occupation,
                     native_country, workclass, education_x_age_buckets,
                     gender_x_education_x_race],
    optimizer=SGDoptimizer,
    gradient_clip_norm=5.0)

● Use a Classifier to predict a True (1) / False (0) label (minimize log loss)

linear_classifier = tf.contrib.learn.LinearClassifier(
    feature_columns=[age, education_num, age_buckets, capital_gain_buckets,
                     capital_loss_buckets, gender, race, education, occupation,
                     native_country, workclass, education_x_age_buckets,
                     gender_x_education_x_race],
    optimizer=SGDoptimizer,
    gradient_clip_norm=5.0)
LinearClassifier vs LinearRegressor
● Choices in feature engineering are the same
● The process of selecting and using validation (and test) data is the same
● Tuning learning rate, number of steps, batch size, and regularization is the same
● The evaluation metrics change
● Instead of RMSE we are interested in things like accuracy, the ROC curve (the trade-off between false positive and false negative rates), and AUC (area under the ROC curve)
● AUC gives the probability that a random positive example is predicted with a higher probability than a random negative example, so 0.5 is a random guess and 1.0 is a perfect model (a small sketch follows below).
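A small sketch of the AUC interpretation above, computed directly as the fraction of positive/negative pairs that the model ranks correctly; the scores and labels are invented:

import numpy as np

# Invented predicted probabilities and true labels (1 = positive, 0 = negative).
scores = np.array([0.9, 0.8, 0.35, 0.6, 0.3, 0.1])
labels = np.array([1, 1, 1, 0, 0, 0])

pos = scores[labels == 1]
neg = scores[labels == 0]

# AUC = probability a random positive example outscores a random negative example
# (ties count as half).
pairs = [(p, n) for p in pos for n in neg]
auc = np.mean([1.0 if p > n else 0.5 if p == n else 0.0 for p, n in pairs])
print(auc)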
Sample ROC Curve
[Figure: examples scored from 0.0 (Not Spam) to 1.0 (Spam); each classification threshold yields one point on the ROC curve, e.g. one threshold gives FPR = 1/19, TPR = 3/7 and a lower threshold gives FPR = 7/19, TPR = 6/7.]
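Each point on the ROC curve comes from one classification threshold. A small sketch with invented spam scores (not the counts from the figure) shows how FPR and TPR are computed at a single threshold:

import numpy as np

# Invented spam scores and labels (1 = spam, 0 = not spam).
scores = np.array([0.95, 0.9, 0.7, 0.65, 0.4, 0.3, 0.2, 0.1])
labels = np.array([1, 1, 0, 1, 0, 0, 0, 0])

threshold = 0.5
predicted_spam = scores >= threshold

tp = np.sum(predicted_spam & (labels == 1))
fp = np.sum(predicted_spam & (labels == 0))
fpr = float(fp) / np.sum(labels == 0)   # false positive rate at this threshold
tpr = float(tp) / np.sum(labels == 1)   # true positive rate at this threshold
print(fpr, tpr)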
ROC Curves for Models from Lab 3
Model sizes:
● original: 533
● no reg: 429
● l2: 429
● l1, l2: 119
● l1 strong, l2: 70
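A hedged sketch of how such smaller models are commonly obtained with this API: replacing the plain SGD optimizer with an FtrlOptimizer that applies L1 (and L2) regularization, which drives many weights to exactly zero. The feature columns and regularization strengths below are illustrative placeholders, not the Lab 3 settings:

import tensorflow as tf

# Illustrative feature columns; the real lab defines richer ones.
feature_columns = [
    tf.contrib.layers.real_valued_column("age"),
    tf.contrib.layers.sparse_column_with_hash_bucket("education", hash_bucket_size=50),
]

# FTRL supports L1 regularization, which produces sparse (smaller) linear models.
ftrl_optimizer = tf.train.FtrlOptimizer(
    learning_rate=0.1,
    l1_regularization_strength=0.5,
    l2_regularization_strength=1.0)

linear_classifier = tf.contrib.learn.LinearClassifier(
    feature_columns=feature_columns,
    optimizer=ftrl_optimizer,
    gradient_clip_norm=5.0)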
LinearClassifier With > 2 Classes
● Example using LinearClassifier to learn 10 classes (digits 0, …, 9)

linear_classifier = tf.contrib.learn.LinearClassifier(
    feature_columns=feature_columns,
    n_classes=10,
    optimizer=SGDoptimizer,
    gradient_clip_norm=5.0)
● Here the labels must be 0, …, 9 (or a sparse feature with 10 values).
● We now optimize the softmax loss, which is a generalization of log loss to a probability distribution over more than two values (see the sketch below).
● Again we need to modify the visualizations a bit, looking at a confusion matrix instead of an ROC curve.
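A minimal numeric sketch of the softmax loss; the logits (scores) for the 10 classes are made up:

import numpy as np

# Made-up scores (logits) for the 10 digit classes for one example.
logits = np.array([1.2, 0.3, 0.3, 3.1, 0.0, -1.0, 0.5, 0.2, 0.1, 0.4])

# Softmax turns the scores into a probability distribution over the 10 classes.
probs = np.exp(logits - np.max(logits))
probs /= probs.sum()

# Softmax (cross-entropy) loss for a true label of 3; with 2 classes this reduces to log loss.
true_label = 3
loss = -np.log(probs[true_label])
print(probs, loss)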
Confusion Matrix
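A small sketch of building such a confusion matrix with numpy; the true labels and predictions are invented:

import numpy as np

num_classes = 10
# Invented true digit labels and model predictions.
true_labels = np.array([0, 1, 2, 2, 3, 5, 5, 7, 8, 9])
predictions = np.array([0, 1, 2, 3, 3, 5, 6, 7, 8, 9])

# confusion[i, j] counts examples of true class i predicted as class j;
# a perfect model puts all its counts on the diagonal.
confusion = np.zeros((num_classes, num_classes), dtype=int)
for t, p in zip(true_labels, predictions):
    confusion[t, p] += 1
print(confusion)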
DNN: Add a Non-Linearity
[Figure: Input → Hidden Layer (Linear) → Non-Linear Transformation Layer (a.k.a. Activation Function) → Output]
By convention we combine the activation function into the hidden layer, making it a non-linear layer. So the network above is drawn as: Input → Hidden Layer (non-linear) → Output.
Non-linearity in a DNN lets it learn to do this If you want to predict city-mpg from compression-ratio, a single linear function would not fit well, but you can get a pretty good fit by dividing compression-ratio into two buckets and then learning a linear model for each bucket (a sketch follows below).
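A small numpy sketch of this bucketing idea; the compression-ratio and city-mpg values below are synthetic stand-ins, not the actual autos data:

import numpy as np

# Synthetic stand-in for the autos data: mpg behaves differently below/above ratio 12.
compression_ratio = np.array([7.0, 8.0, 9.0, 10.0, 11.0, 15.0, 18.0, 21.0, 22.0, 23.0])
city_mpg = np.array([24.0, 25.0, 27.0, 28.0, 30.0, 37.0, 38.0, 40.0, 41.0, 41.0])

boundary = 12.0
low = compression_ratio < boundary

# Fit one least-squares line per bucket instead of a single line over all the data.
low_fit = np.polyfit(compression_ratio[low], city_mpg[low], deg=1)
high_fit = np.polyfit(compression_ratio[~low], city_mpg[~low], deg=1)
print(low_fit, high_fit)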
Deep Neural Networks -- Add Layers
● Training is done via the BackProp algorithm, which is an extension of SGD
● The hidden layers closer to the output capture higher level features (since they learn over the features from the previous layer)
● For this network in TensorFlow you'd have:
○ hidden_units=[4, 3]
[Figure: Input → Hidden1 (4 units) → Hidden2 (3 units) → Output]
DNN Classifier or Regressor in TF

dnn_classifier = tf.contrib.learn.DNNClassifier(
    feature_columns=feature_columns,
    n_classes=10,
    hidden_units=[50, 25, 10],
    optimizer=optimizer,
    gradient_clip_norm=5.0)

dnn_regressor = tf.contrib.learn.DNNRegressor(
    feature_columns=feature_columns,
    hidden_units=[50, 25, 10],
    optimizer=optimizer,
    gradient_clip_norm=5.0)
DNN Reduces Feature Engineering
● It can learn to bucketize real-valued features
● It can learn crosses
● As you add model weights, it takes more data and time to train, and overfitting becomes more of an issue
● Use L2 regularization or dropout to control overfitting (see the sketch below)
● Along with the other hyperparameters (e.g. learning rate, number of steps) you need to pick the DNN configuration (how many hidden layers and how many units in each layer).
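A hedged sketch of adding dropout when constructing the DNN with this API (DNNClassifier accepts a dropout probability directly); the feature column, hidden_units, dropout value, and learning rate are illustrative:

import tensorflow as tf

# Illustrative feature column; the real labs define richer ones.
feature_columns = [tf.contrib.layers.real_valued_column("pixels", dimension=784)]

dnn_classifier = tf.contrib.learn.DNNClassifier(
    feature_columns=feature_columns,
    n_classes=10,
    hidden_units=[50, 25, 10],
    dropout=0.2,  # drop 20% of hidden-unit activations during training to fight overfitting
    optimizer=tf.train.GradientDescentOptimizer(learning_rate=0.05),
    gradient_clip_norm=5.0)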
Embeddings as a Tool
● Embeddings map items (e.g. movies, text, ...) to low-dimensional real vectors in a way that similar items are close to each other
● Embeddings can also be applied to dense data (e.g. audio) to create a meaningful similarity metric
● Jointly embedding diverse data types (e.g. text, images, audio, …) defines a similarity between them
An Embedding Layer in a DNNRegressor
Regression problem to predict home sale prices:
[Figure: a sparse vector encoding of the words in a real estate ad feeds a 3-dimensional embedding layer; the embedding, together with latitude and longitude features, is the input to a DNNRegressor that predicts the sale price.]
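A hedged sketch of how the architecture in the figure could be declared with the feature-column API used in this course; the column names, hash bucket size, and hidden_units are illustrative assumptions, while the 3-dimensional embedding matches the figure:

import tensorflow as tf

# Sparse column over the words appearing in a real estate ad (illustrative hash bucket size).
ad_words = tf.contrib.layers.sparse_column_with_hash_bucket(
    "ad_words", hash_bucket_size=10000)

# Map the sparse word ids into a dense 3-dimensional embedding, learned jointly with the model.
ad_words_embedding = tf.contrib.layers.embedding_column(ad_words, dimension=3)

dnn_regressor = tf.contrib.learn.DNNRegressor(
    feature_columns=[ad_words_embedding,
                     tf.contrib.layers.real_valued_column("latitude"),
                     tf.contrib.layers.real_valued_column("longitude")],
    hidden_units=[50, 25])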
An Embedding Layer in a DNNClassifier
Multiclass classification to predict a handwritten digit:
[Figure: the raw bitmap of the hand-drawn digit is fed as a sparse vector encoding into a 3-dimensional embedding layer; the embedding, together with other features, is the input to a DNNClassifier that outputs a predicted probability for each of the 10 classes (0, 1, ..., 9); the "one-hot" (sparse) target probability distribution comes from the target class label.]