Online Embedding Compression Using Low Rank Matrix Factorization Anish Acharya*, Rahul Goel, Angeliki Metallinou, Inderjit Dhillon {achanish,goerahul,ametalli}@amazon.com,
[email protected]
Deep NLP Model Compression
• The embedding layer is the largest memory bottleneck in NLP models.
• Compressing the embedding layer gives large memory savings.
• Reduce precision: Quantization – save network weights in lower precision.
• Reduce parameters: Hashing, Bucketing, Pruning, Binary Networks.
• These methods give a low memory footprint and low inference latency, but introduce significant accuracy loss.
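As an illustration of the precision-reduction idea above, here is a minimal sketch of symmetric int8 weight quantization; the scale scheme and function names are our own simplification, not the poster's method:

```python
import numpy as np

def quantize_int8(weights):
    """Symmetric int8 quantization: store weights in 8 bits plus one float scale."""
    scale = np.abs(weights).max() / 127.0          # map [-max, max] onto [-127, 127]
    q = np.round(weights / scale).astype(np.int8)  # low-precision storage
    return q, scale

def dequantize(q, scale):
    """Approximate float reconstruction used at inference time."""
    return q.astype(np.float32) * scale

W = np.random.randn(50_000, 300).astype(np.float32)  # e.g. a vocab x dim embedding
q, s = quantize_int8(W)                               # ~4x smaller than float32 storage
```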
Quantized Distillation
• Uses teacher-student (distillation) training to train a quantized model (student) from a large unquantized model (teacher).
• Low memory footprint and inference latency.
• Accuracy preserving.
• Extremely expensive training.
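For concreteness, a generic distillation objective of the kind such teacher-student training uses is sketched below; this is our illustrative loss (temperature and weighting are assumed), not the cited method's exact formulation:

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Blend of soft cross-entropy against the teacher's tempered predictions
    and hard cross-entropy against the gold labels."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```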
Our Method
• Memory footprint reduction is controlled by a configurable rank k. To achieve a p-fraction parameter reduction of an m×n embedding (mn parameters) with two factor layers of k(m + n) parameters, k should satisfy k(m + n) ≤ (1 − p)·m·n, i.e., k ≤ (1 − p)·m·n / (m + n) (see the sketch after this list).
• Accuracy preserving: more than 90% compression with under 1% accuracy loss.
• No additional training cost: regains accuracy within a few epochs of finetuning.
• Low inference latency: comparable to quantization-based methods.
• Provable numerical bounds.
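A small arithmetic check of the rank bound, under our reading of the reduction target (function and variable names are ours):

```python
def target_rank(m: int, n: int, p: float) -> int:
    """Largest rank k such that the two factor layers (m*k + k*n parameters)
    use at most a (1 - p) fraction of the original m*n parameters."""
    return int((1 - p) * m * n / (m + n))

# Example: 50k vocabulary, 300-dim embedding, 90% parameter reduction.
print(target_rank(50_000, 300, 0.90))  # -> 29; the factor layers hold ~10% of the original params
```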
Our Approach
• Train the model in full precision.
• Take the SVD of the learned embedding layer and split it into two low-rank layers (a sketch follows this list).
• Continue training (finetuning) with the low-rank layers.
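A minimal PyTorch sketch of the SVD step, assuming the trained embedding is a standard nn.Embedding that gets replaced by a smaller embedding plus a linear projection; the layer names and the way the singular values are split are our assumptions, not necessarily the paper's exact implementation:

```python
import torch
import torch.nn as nn

def factorize_embedding(emb: nn.Embedding, k: int):
    """Replace an m x n embedding with an m x k embedding followed by a k -> n linear map,
    initialized from the truncated SVD of the trained weights."""
    W = emb.weight.data                                   # (m, n) trained embedding matrix
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    U_k, S_k, Vh_k = U[:, :k], S[:k], Vh[:k, :]

    low_rank_emb = nn.Embedding(W.shape[0], k)
    low_rank_emb.weight.data = U_k * S_k.sqrt()           # (m, k), absorbs sqrt of singular values

    proj = nn.Linear(k, W.shape[1], bias=False)
    proj.weight.data = (S_k.sqrt().unsqueeze(1) * Vh_k).T # (n, k); proj(x) ~= x @ diag(sqrt(S_k)) @ Vh_k
    return low_rank_emb, proj

# Usage: factorize after full-precision training, then finetune both layers.
# emb_k, proj = factorize_embedding(model.embedding, k=29)
# token_vectors = proj(emb_k(token_ids))  # approximates the original embedding lookup
```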