Smoothing § Need better estimators than MLE for rare events § Approach – Somewhat decrease the probability of previously seen events, so that there is a little bit of probability mass left over for previously unseen events » Smoothing » Discounting methods
Add-one smoothing § Add one to all of the counts before normalizing into probabilities
§ MLE unigram probabilities: P(wx) = count(wx) / N, where N is the corpus length in word tokens
§ Smoothed unigram probabilities: P(wx) = (count(wx) + 1) / (N + V), where V is the vocabulary size (# word types)
§ Adjusted counts (unigrams): ci* = (ci + 1) · N / (N + V)
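A minimal Python sketch of these unigram estimates; the toy corpus, variable names, and helper functions below are illustrative, not from the slides:

    from collections import Counter

    # Toy corpus; in practice this would be a large stream of word tokens.
    tokens = "the cat sat on the mat the cat slept".split()

    counts = Counter(tokens)
    N = len(tokens)      # corpus length in word tokens
    V = len(counts)      # vocabulary size (# word types)

    def p_mle(w):
        # MLE unigram probability: count(w) / N (zero for unseen words)
        return counts[w] / N

    def p_add_one(w):
        # Add-one smoothed probability: (count(w) + 1) / (N + V)
        # An unseen word gets probability 1 / (N + V).
        return (counts[w] + 1) / (N + V)

    def adjusted_count(w):
        # Adjusted count c* = (c + 1) * N / (N + V)
        return (counts[w] + 1) * N / (N + V)

    print(p_mle("cat"), p_add_one("cat"), p_add_one("dog"))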
Too much probability mass is moved § Estimated bigram frequencies § AP data, 44 million words – Church and Gale (1991)
§ In general, add-one smoothing is a poor method of smoothing § Often much worse than other methods in predicting the actual probability for unseen bigrams
r = f_MLE   f_emp      f_add-1
0           0.000027   0.000137
1           0.448      0.000274
2           1.25       0.000411
3           2.24       0.000548
4           3.23       0.000685
5           4.21       0.000822
6           5.23       0.000959
7           6.21       0.00109
8           7.21       0.00123
9           8.26       0.00137
Methodology: Options § Divide data into training set and test set – Train the statistical parameters on the training set; use them to compute probabilities on the test set – Test set: 5%-20% of the total data, but large enough for reliable results
§ Divide training into training and validation set » Validation set might be ~10% of original training set » Obtain counts from training set » Tune smoothing parameters on the validation set
§ Divide test set into development and final test set – Do all algorithm development by testing on the dev set – Save the final test set for the very end…use for reported results
Don't train on the test corpus! Report results on the test data, not the training data.
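A small sketch of the split described above; the proportions and the sentence-level granularity are illustrative choices, not prescribed by the slides:

    import random

    def split_corpus(sentences, train_frac=0.8, dev_frac=0.1, seed=0):
        # Shuffle and split a list of sentences into train / dev (validation) / test.
        rng = random.Random(seed)
        sents = list(sentences)
        rng.shuffle(sents)
        n = len(sents)
        n_train = int(n * train_frac)
        n_dev = int(n * dev_frac)
        train = sents[:n_train]
        dev = sents[n_train:n_train + n_dev]
        test = sents[n_train + n_dev:]   # held out until the very end
        return train, dev, test

    # Counts come from `train`, smoothing parameters are tuned on `dev`,
    # and `test` is used only once, for the reported results.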
Good-Turing discounting § Re-estimates the amount of probability mass to assign to N-grams with zero or low counts by looking at the number of N-grams with higher counts. § Let Nc be the number of N-grams that occur c times. – For bigrams, N0 is the number of bigrams of count 0, N1 is the number of bigrams with count 1, etc.
§ Revised counts: c* = (c + 1) · Nc+1 / Nc   (where Nc+1 is the number of N-grams with count c + 1)
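A minimal sketch of the revised-count formula, assuming we already have a dictionary of N-gram counts; the cutoff below mirrors the "unreliable counts" remark on the next slide:

    from collections import Counter

    def good_turing_counts(ngram_counts, max_c=5):
        # Revised counts c* = (c + 1) * N_{c+1} / N_c, applied only to small c;
        # larger (reliable) counts are left unchanged.
        freq_of_freq = Counter(ngram_counts.values())   # N_c for each c
        revised = {}
        for ngram, c in ngram_counts.items():
            if c < max_c and freq_of_freq[c + 1] > 0:
                revised[ngram] = (c + 1) * freq_of_freq[c + 1] / freq_of_freq[c]
            else:
                revised[ngram] = c
        return revised

    # The total probability mass reserved for unseen N-grams (count 0) is N_1 / N,
    # where N is the total number of N-gram tokens observed.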
Good-Turing discounting results § Works very well in practice § Usually, the GT discounted estimate c* is used only for unreliable counts (e.g. < 5) § As with other discounting methods, it is the norm to treat N-grams with low counts (e.g. counts of 1) as if the count were 0
Unknown words § Closed vocabulary – Vocabulary is known in advance – Test set will contain only these words
§ Open vocabulary – Unknown, out-of-vocabulary words can occur – Add a pseudo-word (e.g. <UNK>) to the vocabulary
§ Training the unknown word model???
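One common way to train the unknown-word model is to map rare training tokens to the <UNK> pseudo-word; a sketch under that assumption (the cutoff value is arbitrary):

    from collections import Counter

    UNK = "<UNK>"

    def replace_rare_words(tokens, min_count=2):
        # Map every token seen fewer than min_count times to <UNK>, so <UNK>
        # gets real counts; unseen test words are then mapped to <UNK> as well.
        counts = Counter(tokens)
        return [w if counts[w] >= min_count else UNK for w in tokens]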
Evaluating n-gram models § Best way: extrinsic evaluation – Embed in an application and measure the total performance of the application – End-to-end evaluation
§ Intrinsic evaluation – Measure quality of the model independent of any application – Perplexity » Intuition: the better model is the one that has a tighter fit to the test data or that better predicts the test data
Perplexity § For a test set W = w1 w2 … wN: PP(W) = P(w1 w2 … wN)^(−1/N), i.e. the Nth root of 1 / P(w1 w2 … wN)
The higher the (estimated) probability of the word sequence, the lower the perplexity. Must be computed with models that have no knowledge of the test set.
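A sketch of the perplexity computation in log space; `log_prob` stands for whatever model is being evaluated and is an assumed interface, not something defined in the slides:

    import math

    def perplexity(test_tokens, log_prob):
        # PP(W) = P(w1 ... wN)^(-1/N), computed in log space for numerical stability.
        # log_prob(i, tokens) should return log P(tokens[i] | its history).
        N = len(test_tokens)
        total_log_p = sum(log_prob(i, test_tokens) for i in range(N))
        return math.exp(-total_log_p / N)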
Combining estimators § Smoothing methods – Provide the same estimate for all unseen (or rare) n-grams with the same prefix – Make use only of the raw frequency of an n-gram
§ But there is an additional source of knowledge we can draw on --- the n-gram hierarchy – If there are no examples of a particular trigram, wn−2 wn−1 wn, to compute P(wn | wn−2 wn−1), we can estimate its probability by using the bigram probability P(wn | wn−1). – If there are no examples of the bigram to compute P(wn | wn−1), we can use the unigram probability P(wn).
§ For n-gram models, suitably combining various models of different orders is the secret to success.
Simple linear interpolation § Construct a linear combination of the multiple probability estimates. – Weight each contribution so that the result is another probability function.
P̂(wn | wn−2 wn−1) = λ3 P(wn | wn−2 wn−1) + λ2 P(wn | wn−1) + λ1 P(wn) – The λs sum to 1.
§ Also known as (finite) mixture models § Deleted interpolation – Each lambda is a function of the most discriminating context
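A minimal sketch of simple linear interpolation with fixed weights; the helper functions p_tri, p_bi, p_uni and the 0.6/0.3/0.1 lambdas are illustrative assumptions:

    def interpolated_prob(w, w1, w2, p_tri, p_bi, p_uni, lambdas=(0.6, 0.3, 0.1)):
        # Estimate P(w | w2 w1) as a weighted sum of trigram, bigram, and unigram
        # estimates, where p_tri(w, w1, w2) = P(w | w2 w1), p_bi(w, w1) = P(w | w1),
        # and p_uni(w) = P(w).  The lambdas must sum to 1 so the result is again
        # a probability.
        l3, l2, l1 = lambdas
        return l3 * p_tri(w, w1, w2) + l2 * p_bi(w, w1) + l1 * p_uni(w)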
Backoff (Katz 1987) § Non-linear method § The estimate for an n-gram is allowed to back off through progressively shorter histories. § The most detailed model that can provide sufficiently reliable information about the current context is used. § Trigram version (high-level):
P̂(wi | wi−2 wi−1) =
  P(wi | wi−2 wi−1),   if C(wi−2 wi−1 wi) > 0
  α1 P(wi | wi−1),     if C(wi−2 wi−1 wi) = 0 and C(wi−1 wi) > 0
  α2 P(wi),            otherwise
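A sketch of the backoff rule above, using raw MLE estimates from count tables (assumed to behave like collections.Counter, so missing keys count as 0) and fixed placeholder alphas; proper Katz backoff instead derives the alphas from the discounted probability mass so that the distribution still sums to 1:

    def backoff_trigram_prob(w, w1, w2, tri_counts, bi_counts, uni_counts,
                             alpha1=0.4, alpha2=0.4):
        # tri_counts[(w2, w1, w)], bi_counts[(w1, w)], uni_counts[w] are counts
        # from the same training corpus, so the denominators below are nonzero
        # whenever the corresponding branch is taken.
        if tri_counts[(w2, w1, w)] > 0:
            return tri_counts[(w2, w1, w)] / bi_counts[(w2, w1)]     # P(w | w2 w1)
        if bi_counts[(w1, w)] > 0:
            return alpha1 * bi_counts[(w1, w)] / uni_counts[w1]      # back off to bigram
        return alpha2 * uni_counts[w] / sum(uni_counts.values())     # back off to unigram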
Final words… § Problems with backoff? – Probability estimates can change suddenly on adding more data when the back-off algorithm selects a different order of n-gram model on which to base the estimate. – Works well in practice in combination with smoothing.
§ Good option: simple linear interpolation with MLE n-gram estimates plus some allowance for unseen words (e.g. Good-Turing discounting)