Streaming for large scale NLP: Language Modeling

Amit Goyal, Hal Daumé III and Suresh Venkatasubramanian
School of Computing, University of Utah
email: {amitg,hal,suresh}@cs.utah.edu
web: http://www.cs.utah.edu/~amitg/

NAACL HLT 2009, 3rd June 2009
Overview

Problem
- Large amounts of data in many NLP problems
- Many such problems require relative frequency estimation
- Computationally expensive on huge corpora

Canonical Task
- Language Modeling: large-scale frequency estimation

Proposed Solution
- Trades off memory usage against accuracy of counts using streaming
- Employs a small memory footprint to approximate n-gram counts

Findings
- Scales to billion-word corpora on a conventional 8 GB machine
- SMT experiments show that these counts are effective
Large Scale Language Modeling

Goal: Build higher-order language models (LMs) on huge data sets

Difficulty: Increasing n ⇒ more unique n-grams ⇒ more memory usage

[Figure: number of unique n-grams (millions) vs. number of words (millions), for 1-grams through 5-grams]

Example
- 1500 machines were used for a day to compute 300 million unique n-grams from terabytes of web data [Brants et al. (2007)]
Related Work

- Prefix trees to store LM probabilities efficiently [Federico and Bertoldi; SMT workshop at ACL 2006]
- Bloom and Bloomier filters: compressed n-gram representation [Talbot and Osborne; ACL 2007] [Talbot and Brants; 2008]
- Distributed word clustering for class-based LMs [Uszkoreit and Brants; ACL 2008]
Zipf's law

Phenomena
- The number of unique n-grams is large
- Low-frequency n-grams contribute most to LM size

[Figure: frequency of n-grams vs. rank of sorted n-grams, both on log scales, for 1-grams through 5-grams]

Key Idea: Throw away rare n-grams
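As a side note, the rank-frequency data behind such a plot is easy to compute; a minimal sketch (mine, not from the talk), where the token list is a hypothetical stand-in for a real corpus:

```python
from collections import Counter

def rank_frequency(tokens, n=1):
    """Count n-grams and return their frequencies sorted by rank."""
    toks = list(tokens)
    ngrams = zip(*(toks[i:] for i in range(n)))
    counts = Counter(ngrams)
    # Rank 1 = most frequent n-gram; plotting rank vs. frequency on
    # log-log axes yields the near-linear Zipf curve described above.
    return sorted(counts.values(), reverse=True)

freqs = rank_frequency("the cat sat on the mat and the dog sat".split())
print(freqs)  # [3, 2, 1, 1, 1, 1, 1] -- few frequent items, long tail of rare ones
```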
n-gram Pruning Methods

Count pruning
- Discards all n-grams whose count is below a pre-defined threshold

Entropy pruning
- Discards n-grams that change perplexity by less than a threshold

SMT experiments with a 5-gram LM on large data:

Model             Size     BLEU
Exact             367.6m   28.7
100 count cutoff  1.1m     28.0
5e-7 entropy      28.5m    28.1

Count pruning loses 0.7 BLEU points relative to the exact model, but the pruned model is roughly 300 times smaller.
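Count pruning is simple to state in code; a minimal sketch, assuming `counts` is a dict from n-gram to frequency (entropy pruning, which needs perplexity computations, is omitted). The threshold of 100 mirrors the cutoff in the table above:

```python
def count_prune(counts, threshold=100):
    """Keep only n-grams whose count meets the pre-defined threshold."""
    return {ng: c for ng, c in counts.items() if c >= threshold}

counts = {("of", "the"): 1500, ("purple", "idea"): 3}
print(count_prune(counts))  # {('of', 'the'): 1500}
```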
Problem

Difficulties with scaling pruning methods to large-scale LMs:
- Computation time and memory needed to compute all counts are tremendous
- Enormous initial disk storage is required for the n-grams
Proposed Solution

- Assume that multiple-GB models are infeasible
- Goal: Directly estimate a small model instead of first estimating a large model and then compressing it
- Employ a deterministic streaming algorithm [Manku and Motwani, 2002]
Streaming

Given: a stream of n-grams of length N
Running example: n = 5 and N = 10^6

- The algorithm can only read left to right, without going backwards
- It stores only parts of the input or other intermediate values
- Typical working storage size: O(log^k N)
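A minimal sketch of this access model: n-grams are produced one at a time from a token source and can only be consumed in a single left-to-right pass (function and variable names are my own):

```python
from collections import deque

def ngram_stream(tokens, n=5):
    """Yield n-grams one at a time; nothing upstream is retained,
    so the consumer sees a single left-to-right pass."""
    window = deque(maxlen=n)  # only O(n) tokens ever buffered
    for tok in tokens:
        window.append(tok)
        if len(window) == n:
            yield tuple(window)

# Each n-gram is emitted exactly once, in stream order.
for ng in ngram_stream("to be or not to be".split(), n=3):
    print(ng)
```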
Algorithm: Lossy Counting [Manku and Motwani, 2002]

Step 1: Divide the stream into windows using ε ∈ (0, 1)
- Window size = 1/ε; total εN windows
- Running example: set ε = 0.001 and N = 10^6, giving window size 10^3 and 10^3 windows in total
Algorithm in Action

[Figures: step-by-step illustration of Lossy Counting processing the example stream window by window]

Algorithm continued

[Figures: the illustration continues over the subsequent windows]
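The figures above walk through the procedure; here is a minimal sketch in Python, assuming the simplified windowed form of Lossy Counting (increment a counter on arrival, decrement all counters at each window boundary, drop zeroed entries). Function and variable names are my own, not from the talk:

```python
from collections import defaultdict

def lossy_count(stream, epsilon):
    """Approximate item frequencies over a one-pass stream.

    Simplified Lossy Counting: at each window boundary every counter is
    decremented by one and zeroed entries are dropped, so each reported
    count undercounts the true count by at most epsilon * N.
    """
    counts = defaultdict(int)
    window_size = int(1 / epsilon)
    for i, item in enumerate(stream, start=1):
        counts[item] += 1
        if i % window_size == 0:  # window boundary
            for key in list(counts):
                counts[key] -= 1
                if counts[key] == 0:
                    del counts[key]
    return dict(counts)

def frequent_items(items, epsilon, s):
    """Report items whose approximate count clears the (s - epsilon) * N bar."""
    N = len(items)
    counts = lossy_count(iter(items), epsilon)
    return {k: c for k, c in counts.items() if c >= (s - epsilon) * N}
```

Setting s = ε, as the talk later recommends for retaining all generated counts, makes the reporting threshold zero, so every surviving counter is returned.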
Algorithm Guarantees

s ∈ (0, 1) is the support threshold; in practice, s = 10ε
Running example: ε = 0.001, s = 0.01

- All n-grams with actual count > sN (10^4) are output
- No n-gram with actual count < (s − ε)N (9000) is returned
- Reported counts undercount actual counts by at most εN (1000)
- Space used by the algorithm: O((1/ε) log(εN))

In practice, set s = ε to retain all generated counts: the appearance of an n-gram is more valuable than its exact count
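The running example's numbers can be checked directly; a small worked computation with the values from the slide (ε = 0.001, s = 0.01, N = 10^6):

```python
eps, s, N = 0.001, 0.01, 10**6

print(int(1 / eps))        # window size: 1000
print(int(eps * N))        # number of windows, and the max undercount: 1000
print(int(s * N))          # output guaranteed above this actual count: 10000
print(int((s - eps) * N))  # nothing returned below this actual count: 9000
```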
Evaluating stream n-gram counts

Data
- English side of Europarl (EP): 38 million words
- Portions of Gigaword (afe and nyt) + EP (EAN): 1.4 billion words

Accuracy: fraction of the top-K sorted stream n-grams found among the top-K sorted true n-grams (higher is better)

ε        5-grams produced   Acc
50e-8    245k               0.29
20e-8    726k               0.33
10e-8    1655k              0.35
5e-8     4018k              0.36

Table: Quality of 5-gram stream counts for different settings of ε on the EAN corpus

Top K     100k   500k   1000k   2000k   4018k
Accuracy  0.99   0.93   0.72    0.50    0.36

Table: Top-K sorted 5-gram stream counts for ε = 5e-8 on the EAN corpus
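A minimal sketch of the accuracy metric as defined above: the fraction of the top-K stream n-grams (by count) that also appear among the top-K true n-grams. The dict arguments are hypothetical toy data:

```python
def topk_accuracy(stream_counts, true_counts, k):
    """Fraction of top-k stream n-grams found among the top-k true n-grams."""
    top = lambda counts: set(sorted(counts, key=counts.get, reverse=True)[:k])
    return len(top(stream_counts) & top(true_counts)) / k

stream = {"a b": 90, "c d": 50, "e f": 10}
true   = {"a b": 100, "c d": 60, "g h": 40}
print(topk_accuracy(stream, true, k=2))  # 1.0 -- both top-2 sets are {a b, c d}
```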
SMT Experimental Setup

- Training set: Europarl (EP) French-English parallel corpus (a million sentences)
- Language model data: EP, and afe + nyt + EP (EAN)
- Development and test sets: news corpora of 1057 and 3071 sentences
- Evaluation on the uncased test set using the BLEU metric (higher is better)

Models compared
- 4 baseline LMs (3- and 5-gram, on EP and EAN)
- Count- and entropy-pruned 5-gram LMs
- Stream-count LMs computed with two values of ε (5e-8 and 10e-8) on the EAN corpus
Results

Baselines: large LMs are effective

Stream count findings:
- As effective as the pruning methods
- 0.7 BLEU worse than the exact model
- Memory efficient
- 7-gram and 9-gram LMs are also possible

[Table: BLEU scores for the baseline and stream-count LMs]
Discussion

Take Home Message
- Directly estimate a small model
- Memory efficient
- Counts are effective

Future Directions
- Use these LMs for speech recognition, information extraction, etc.
- Apply streaming to other NLP applications
- Build streaming class-based and skip n-gram LMs

Thanks! Questions?