Online Embedding Compression Using Low Rank Matrix Factorization

Anish Acharya*, Rahul Goel, Angeliki Metallinou, Inderjit Dhillon
{achanish, goerahul, ametalli}@amazon.com, [email protected]

Deep NLP Model Compression
• The embedding layer is the largest memory bottleneck in NLP models, so compressing it yields large memory savings.
• Reduce precision: Quantization – store network weights in lower precision.
• Reduce parameters: Hashing, Bucketing, Pruning, Binary Networks.
• These methods achieve a low memory footprint and low inference latency, but introduce significant accuracy loss.
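For concreteness, the reduced-precision idea above can be sketched as symmetric per-tensor int8 quantization; this is a generic illustration of the technique, not the specific scheme the slides refer to:

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Symmetric per-tensor int8 quantization: w ~= scale * q."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

# A float32 embedding stored as int8 uses 4x less memory,
# at the cost of rounding error in the recovered weights.
w = np.random.randn(100, 64).astype(np.float32)
q, scale = quantize_int8(w)
print(np.abs(w - dequantize(q, scale)).max())
```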

Quantized Distillation
• Uses a large unquantized model (teacher) to train a quantized model (student) via teacher-student distillation.
• Low memory footprint and low inference latency.
• Accuracy preserving.
• Extremely expensive training.
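A minimal sketch of the teacher-student objective typically used in such distillation setups; the temperature and alpha hyperparameters here are illustrative defaults, not values from this work:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    """Blend of soft-target KL divergence (teacher) and hard-label
    cross-entropy (ground truth), as in standard distillation."""
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```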

Our Method
• Reduces the memory footprint by a configurable amount: factoring an m×n embedding into an m×k and a k×n layer leaves k(m+n) parameters, so to achieve a p-fraction parameter reduction, k should satisfy k ≤ (1−p)·mn/(m+n) (see the worked example below).
• Accuracy preserving: more than 90% compression with under 1% accuracy loss.
• No additional training cost: regains accuracy within a few epochs of finetuning.
• Low inference latency: comparable to quantization-based methods.
• Provable numerical bounds.
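A quick numeric check of the rank formula above; the vocabulary size, embedding dimension, and target reduction are made-up values for illustration:

```python
m, n, p = 100_000, 512, 0.9          # vocab size, embedding dim, target reduction
k = int((1 - p) * m * n / (m + n))   # k(m + n) <= (1 - p) * m * n
original = m * n
factored = k * (m + n)
print(k, factored / original)        # k = 50, ~0.098 of the original parameters
```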

Our Approach
• Train the model in full precision.
• Take the SVD of the learned embedding layer and split it into two low-rank layers, as sketched below.
• Continue training (finetuning) with the low-rank layers.
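In PyTorch, the SVD-and-split step might look like the following sketch; the helper name and initialization details are assumptions for illustration, not the authors' released code:

```python
import torch

def factorize_embedding(embedding: torch.nn.Embedding, k: int):
    """Replace an m x n embedding with an m x k embedding followed by
    a k x n linear projection, initialized from the truncated SVD."""
    W = embedding.weight.data                    # (m, n)
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    # Keep the top-k singular values: W ~= (U_k * S_k) @ Vh_k.
    low_rank = torch.nn.Embedding(W.shape[0], k)
    low_rank.weight.data = U[:, :k] * S[:k]
    proj = torch.nn.Linear(k, W.shape[1], bias=False)
    proj.weight.data = Vh[:k].T.contiguous()     # Linear stores (out, in) = (n, k)
    return low_rank, proj
```

After swapping these two layers in for the original embedding, training simply continues, so the factors can adjust during the few epochs of finetuning mentioned above.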
