Finding Near Duplicates in Short Text Messages in Singlish Using MapReduce and Phonetic Matching

Jophia Yi Wen Soh
School of Information Technology, Nanyang Polytechnic, 180 Ang Mo Kio Avenue 8, Singapore 569830
Email: [email protected]

Kenny Zhuo Ming Lu
School of Information Technology, Nanyang Polytechnic, 180 Ang Mo Kio Avenue 8, Singapore 569830
Email: kenny [email protected]
Abstract—We consider the near-duplicate detection problem in the context of English tweets localized with Singapore dialects. We study and implement a detection algorithm that combines the MinHash and SimHash hashing algorithms in the MapReduce framework. To handle localized terms (a.k.a. Singlish), we extend the algorithm by applying phonetic-based matching to out-of-vocabulary terms. The empirical results show that this approach increases accuracy.
I. INTRODUCTION

Twitter is one of the major social media sources and is indispensable for many analytic applications. Many analyses of Twitter data require tweets to be categorized, which improves the accuracy of the analysis and the clarity of the data visualization. In this work, we target the detection of near-duplicate tweets found in the Singapore cyberspace. This subset of tweets exhibits interesting features thanks to Singlish, a localized dialect of English with mixed vocabulary and grammar structures influenced by Mandarin Chinese, Chinese dialects, Bahasa Malay and Indian Tamil.

Near-duplicate detection is a well-studied topic [12], [8], [10]. It has been applied in the domains of document crawling, indexing, retrieval, classification, plagiarism detection, etc. There have been many works addressing duplicate detection [12], [8], [10]; to the best of our knowledge, however, none addresses localized English variants such as Singlish.

Scalability is one of the main concerns of our project. Our implementation is able to scale up to terabytes of data with a MapReduce [5] oriented design. In this paper, we specifically exclude duplicates caused by retweeting, which are indicated via the "RT" keyword in the text or the corresponding Twitter API result attribute. We focus on identifying repetitive tweets generated by third-party apps (such as games and mobile advertisement apps) and by human spamming. Our contributions are as follows:
1) We formulate a near-duplicate detection algorithm in MapReduce using MinHash and SimHash.
2) We develop a localized dictionary to handle Singlish phrases.
3) We propose a novel phonetic-based matching algorithm for Singlish phrases based on regular expression patterns.

The paper is organized as follows. In Section II, we go through a set of challenging examples which motivate the need for the algorithm and its extensions. In Section III, we present a concise formulation of the MinHash-SimHash algorithm in the MapReduce framework. In Section IV, we extend the algorithm to handle the Singlish terms arising from the tweets. In Section V, we present the experiment results. In
Section VI, we discuss related work. We conclude in Section VII.

II. EXAMPLES

To motivate the key ideas, let's consider some examples which are regarded as near duplicates. The following set of tweets was generated by a third-party game application:

1 has reached level 25 in Race or Die Online for the iPhone! Check it out: http://bit.ly/2drIz6
2 has reached level 250 in Race or Die Online for the iPhone! Check it out: http://bit.ly/2drIz6
3 has reached level 5 in Race or Die Online for the iPhone! Check it out: http://bit.ly/2drIz6
There are merely minor differences among them, as they are the results of a third-party game application reporting players' progress. This information is not interesting to many of the followers, nor does it generate much business analytic value (except for the game publisher and developers). Standard algorithms such as MinHash [3] and SimHash [4] are able to identify the similarity among the above tweet snippets. In essence, both algorithms operate by converting the input into a set of features; the similarity comparison is performed over the feature set. Since it is a set, the order of the features is not taken into consideration. Suppose we map each word into a feature; the algorithms are then unable to distinguish contrived cases like the following:

1 Before I die, I'm going to bang a black guy in a bear costume in the library. Not too much I'm asking for.
2 I'm going to bang a black guy in bear costume in the library before I die. I'm not asking for too much.
To eliminate such false positives, we consider taking bigram or trigram features when we apply the algorithm. A more detailed discussion comes shortly in Section III-C. The task becomes subtler when we consider tweets that contain Singlish phrases. Consider the following:

1 [English] The soup was fantastic!
2 [Singlish] The soup was shiok!
3 [English] That place is so isolated that you could hardly see anyone there.
4 [Singlish] That place is so ulu that you could hardly see anyone there.
5 [English] Excuse me, do you think it would be possible for me to enter through this door to the HR department?
6 [Singlish] Scuse me, is it possible for me to pass this door to the HR department?
7 [English] The Malaysian government concludes that no one is still alive on MH370.
8 [Singlish] The Msian gahment concludes that no one is still alive on MH370.
where the odd rows are tweets in standard English and the even rows are their Singlish counterparts. It is obviously challenging for the "vanilla" MinHash-SimHash algorithm to handle these Singlish phrases. One may suggest that such false-negative instances can be fixed either by adjusting the threshold of the similarity comparison or by adopting a pre-processing phase in which the dialect terms are statically translated to their English equivalents. However, adjusting the threshold will introduce more false positives than true positives and lead to a drop in accuracy. Using a static dictionary for translation is limited by the fixed size of its vocabulary and the time lag in the introduction of new phrases. Our implementation incorporates not only a pre-processing phase that normalizes known Singlish terms to their English equivalents, but also a phonetic-based post-processing matching phase for the out-of-vocabulary Singlish terms which can be derived phonetically from their English counterparts. For instance, terms such as "Scuse", "Msian" and "gahment" will be handled without the need for entries in the static dictionary. The gist of the idea is to use regular expression patterns to model the pronunciation of these words and to compare them via regular expression matching. The matching result is then used to moderate the duplicate detection scores generated by the MinHash-SimHash algorithm.
III. MAPREDUCE-BASED NEAR DUPLICATE DETECTION USING MINHASH AND SIMHASH

In this section, we consider a MapReduce-based algorithm for near-duplicate detection. The main idea is to use MinHash for clustering, and SimHash for similarity calculation. Naively, directly applying SimHash [4] to calculate the distance between any two tweets requires n² comparison operations given n tweets. We follow Manku [8] and Seshasai [12]'s works to develop the near-duplicate detection algorithm in the MapReduce framework. For scalability, we apply MinHash [3] to cluster tweets sharing some common features into groups. As a result, we only need to perform the comparison between any two tweets falling into the same cluster. Let k be the number of clusters; the total time required is reduced to n + k ∗ (n/k)². Before we go into the details of the MapReduce-based algorithm, let's walk through the MinHash and SimHash algorithms.

A. MinHash

We describe the standard MinHash algorithm in Figure 1. {·} denotes a set. Function min takes a set of integers and returns the minimum value. The MinHash algorithm approximates the Jaccard distance by applying a set of randomly generated hash functions to the feature sets. The idea is to simulate different orderings among the features. The minimum values of the orderings are selected as representative shingles to compute the final hash result, which is obtained by applying a bit-wise exclusive-or operation (⊕) to all shingles. As a known result of the MinHash algorithm, the more hash functions used in the computation, the better the approximation to the actual Jaccard distance. In the context of checking near duplicates among tweets, we set the number of hash functions used in the MinHash clustering to 10. This is based on the assumption that the average length of English words is 5.1 characters [2]. With the maximum length of a tweet being 140 characters, and taking preceding and trailing spaces into account, we deduce that the maximum number of words (features) in a tweet is around 20; hence 10 hash functions suffice to restrict the tweets falling into the same cluster to those with only minor differences.

B. SimHash

In Figure 2, we describe the SimHash algorithm in ML-style pseudo code. let x = e′ in e denotes a let-binding expression, where x = e′ stands for one or more bindings. For each binding x = e′, if the evaluation of e′ leads to a value, the value is bound to the variable x, which can be used in the evaluation context in scope. [·] denotes a list whose elements are indexed by integers. v[i] returns the i-th element of the list v. [ei | i ∈ {0, ..., 63}] constructs a list of size 64 where the i-th element is ei. We overload lists containing 64 bit elements as 64-bit integers. Function hash64 takes a feature as input and returns a 64-bit integer as the output digest. Function sum takes a set of integers and returns the sum of all elements. The simhash function applies the hash function hash64 to all the features. The resulting intermediate digests are aggregated into a final digest as follows. Let i ∈ {0, ..., 63}; the i-th bit in the final digest is determined by counting the numbers of 0s and 1s at the i-th bit position of the intermediate digests. The i-th bit in the final digest is 1 if there are more 1s than 0s; otherwise the bit is 0. One of the important results of SimHash is that, given two similar feature sets f v1 and f v2 and a hashing function h that hashes a feature into a 64-bit digest, simhash(f v1) and simhash(f v2) differ in only a few bits.

C. Feature Selection

As mentioned in Section II, MinHash and SimHash do not take into account the ordering among the features. For instance, let f1 and f2 be features; we have

minhash([f1, f2]) = minhash([f2, f1])
simhash([f1, f2]) = simhash([f2, f1])

As a result, using the individual words in a tweet as features will cause two tweets with the same vocabulary set but different structures to share the same digest; see the examples in Section II. In the context of MinHash clustering, this is still acceptable, as the final result is decided by the pair-wise SimHash comparison. In the context of SimHash, feature selection needs to be considered with extra care. We consider multiple options, i.e. single-word features, bigram features, and n-gram features. From this point onwards, the string-to-feature-set conversions in minhash(·) and simhash(·) are left implicit. Strings are converted into lists of words before being applied to minhash, and into lists of bigrams before being applied to simhash.
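As an illustration, the bigram conversion could be sketched in Scala as follows. The name toFeature matches the helper invoked in the Scalding pipeline of Figure 10, but the body shown here is a minimal sketch for exposition; the actual implementation may differ.

object Features {
  // Sketch: convert a list of words into word-bigram features.
  def toFeature(words: List[String]): List[String] =
    words match {
      case Nil      => Nil
      case _ :: Nil => words  // a single word is its own feature
      case _        => words.zip(words.tail).map { case (a, b) => a + " " + b }
    }
}

// Example: toFeature(List("the", "soup", "was", "shiok"))
// yields List("the soup", "soup was", "was shiok")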
D. Hamming Distance

The Hamming distance function compares the two given digests by counting the number of differing bits. The definition is straightforward, hence we omit the details here.

E. MinHash and SimHash in MapReduce

MapReduce was first proposed by [5]. It became one of the most popular programming design patterns thanks to its high-level parallelization. We present a simplified formulation in Figure 3 to illustrate the gist of MapReduce. A task can be implemented using the MapReduce framework if the input consists of a set of homogeneous elements {i1, ..., in} and there is a uniform operation map which transforms all the inputs into a set of homogeneous intermediate outputs {(k1, v1), ..., (kn, vn)}, where ki is the key used for grouping in the latter step and vi is the value. The final output is computed by aggregating {vi1, ..., vim}, the values sharing the same key ki.
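Although its definition is omitted in Section III-D, the Hamming distance over 64-bit digests admits a one-line realization; the following is a minimal Scala sketch using the JDK's population count (the actual implementation may differ):

object Hamming {
  // Number of differing bits between two 64-bit digests:
  // XOR the digests, then count the set bits.
  def hammingDist(d1: Long, d2: Long): Int =
    java.lang.Long.bitCount(d1 ^ d2)
}

// Example: Hamming.hammingDist(0x0FL, 0x0EL) == 1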
Fig. 1. MinHash Algorithm

minhash(features)  =  ⊕ { min({ g(f) | f ∈ features }) | g ∈ hash_functions }
hash_functions     =  { g | g is a distinctive hash function mapping features to integers }

Fig. 2. SimHash Algorithm

simhash(features)  =  let codes   = { hash64(f) | f ∈ features }
                          bits    = { bit(c) | c ∈ codes }
                          vectors = vector(bits)
                      in  [ if (vectors[i] < 0) 0 else 1 | i ∈ {0, ..., 63} ]

bit(c)        =  [ if (c[i] == 1) 1 else (−1) | i ∈ {0, ..., 63} ]
vector(bits)  =  [ sum({ b[i] | b ∈ bits }) | i ∈ {0, ..., 63} ]

Fig. 3. A task implemented in MapReduce (simplified)

(k1, v1) = map(i1)  ...  (kn, vn) = map(in)
reduce(ki)([vi1, ..., vim])  for i ∈ {1, ..., n}
The MinHash-SimHash combo naturally fits into the MapReduce framework. In Figure 4, we sketch a near-duplicate detection algorithm in MapReduce using MinHash and SimHash. The map function computes the MinHash and SimHash digests from the text of the tweet. The MinHash digest is used as the key in the reducing step. The reduce function aggregates all SimHash digests that share the same MinHash digest. The aggregation performs a pair-wise Hamming distance comparison among the SimHash digests and keeps those pairs whose distance falls below the predefined threshold.
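As an illustration of the reduce-side aggregation, the following Scala sketch enumerates the near-duplicate pairs within one MinHash cluster. The name findNearDup matches the helper used in Figure 10; the threshold value mirrors the one used in Section V, and the pair representation is an assumption of this sketch.

object NearDup {
  val threshold = 6  // assumed here; see Section V

  def hammingDist(d1: Long, d2: Long): Int =
    java.lang.Long.bitCount(d1 ^ d2)

  // All pairs of (tweet id, SimHash digest) within one MinHash cluster
  // whose digests differ in at most `threshold` bits.
  def findNearDup(idHashes: List[(String, Long)]): List[((String, Long), (String, Long))] =
    idHashes.tails.toList.flatMap {
      case p1 :: rest =>
        rest.collect { case p2 if hammingDist(p1._2, p2._2) <= threshold => (p1, p2) }
      case Nil => Nil
    }
}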
Fig. 4. MinHash-SimHash in MapReduce

(k1, v1) = map(tweet1)  ...  (kn, vn) = map(tweetn)
reduce(ki)([vi1, ..., vim])  for i ∈ {1, ..., n}

map(tweet) = (minhash(tweet.text), (tweet.id, simhash(tweet.text)))

reduce(key)(values) = { (tid1, tid2) | (tid1, simhash1) ∈ values ∧ (tid2, simhash2) ∈ values
                        ∧ tid1 ≠ tid2 ∧ hammingDist(simhash1, simhash2) ≤ threshold }

IV. EXTENSIONS

A. Preprocessing

As a standard practice, we normalize the input before submitting it to the algorithm's main pipeline. The steps include removing punctuation, hash-tags and URLs, and converting the words into lower case. In addition, we replace Singlish terms by their English equivalents (if a definite translation exists) by looking them up in a pre-compiled static dictionary. As motivated in the earlier sections, a pre-compiled static dictionary is never sufficient to catch the growing set of Singlish terms. To push the limit a bit further, we consider a post-processing phase which allows us to apply a phonetic matching step to cover the unknown Singlish terms that are derived phonetically.

B. Postprocessing

To improve the accuracy, we post-process the tweet pairs whose Hamming distances marginally exceed the pre-defined threshold. The objective is to recover the false negatives introduced by the limitation of the pre-compiled static dictionary used in the pre-processing phase. The idea is to capture newly invented Singlish terms by searching for the out-of-vocabulary (OOV) terms in the text.

1) OOV: A word w is considered an out-of-vocabulary (OOV) term if
• w's inverse document frequency (IDF) is equal to or greater than some pre-defined threshold, and
• w is tagged as a noun or an adjective by the POS tagger.

We apply a phonetic-based matching algorithm to the OOVs and their English counterparts. Given an OOV term whose preceding word is wp and trailing word is wt, we search for its English counterpart in the compared tweet by looking for words that are surrounded by wp and wt. If the phonetic matching score between the OOV term and the English counterpart falls below some pre-defined threshold, they are considered the same and the overall Hamming distance score is reduced.

2) Phonetic-based Matching: Now let's walk through our novel phonetic matching algorithm based on regular expression matching. In general, there exist phonetic-based algorithms that measure the similarity of terms based on pronunciation; our phonetic-based matching algorithm is a special instance of this kind. Given an English term and a Singlish term, the matching algorithm determines their similarity by
1) normalizing the inputs into sequences of sound symbols, i.e. consonants and vowels;
2) assuming the Singlish term possesses fewer syllables than its English equivalent, constructing a regular expression pattern associated with confidence levels from the English term;
3) computing the final confidence score by running the regular expression pattern over the Singlish syllables.

For instance, consider the English term "government" and its Singlish counterpart "gahmen". The latter is a short-hand due to lazy pronunciation among Singaporeans.
The algorithm first normalizes the terms into ['g', 'o', 'v', 'er', 'n', 'm', 'e', 'n', 't'] and ['g', 'ah', 'm', 'e', 'n'], where 'g' and 'v' are consonants and 'o' and 'er' are vowels. Note that, for simplicity, we use ASCII symbols to represent the syllables instead of the IPA phonetic symbols. For every sound symbol in the English sequence, we derive the set of similar sounds. This is implemented as a simple lookup in a SoundMap, which assigns a confidence of similarity between two sounds. The confidence score ranges from 0 to 1; e.g. the confidence score assigned to 'g' and 'g' is 1, and the score assigned to 'g' and 'k' is 0.6. In the current prototype, the SoundMap is produced by manual collection and examination. There is an opportunity to apply statistical techniques such as SVM, which we consider future work. For instance, the similar sound set of 'g' is

{g : 1, gh : 0.9, j : 0.4, q : 0.4, k : 0.6}    (1)

where l : n denotes a sound literal l associated with confidence level n. When n = 0, l : n is simplified as l.
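A minimal Scala sketch of such a SoundMap, seeded with the 'g' entries of equation (1); the data structure and all entries beyond those stated above are assumptions of this sketch:

object SoundMapSketch {
  // Maps a sound symbol to its set of similar sounds with confidence scores.
  // Only the 'g' row from equation (1) is reproduced; a real table would
  // cover every consonant and vowel.
  val soundMap: Map[String, Map[String, Double]] = Map(
    "g" -> Map("g" -> 1.0, "gh" -> 0.9, "j" -> 0.4, "q" -> 0.4, "k" -> 0.6)
  )

  // Confidence that sound `b` may stand in for sound `a` (0.0 if unknown).
  def confidence(a: String, b: String): Double =
    soundMap.getOrElse(a, Map.empty).getOrElse(b, 0.0)
}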
We construct a regular expression pattern by running through the following steps (a sketch of one possible pattern representation follows this discussion):

1) Turn every set of similar sounds with positive confidence into a choice pattern, with each sound becoming a literal associated with its confidence score. E.g. the set in (1) is converted into

(g : 1 | gh : 0.9 | j : 0.4 | q : 0.4 | k : 0.6)    (2)

Note that we use POSIX-style regular expression syntax, in which | denotes the choice operator.

2) Include the empty sequence ǫ in the choice pattern, so that (2) is augmented as follows:

(g : 1 | gh : 0.9 | j : 0.4 | q : 0.4 | k : 0.6 | ǫ)    (3)

Let's use rg to denote the above choice pattern (3). We apply the same step to obtain ro, rv, rer, rm, re, rn and rt, which are defined in equations (4) to (10):

ro  = (o : 1 | u : 0.6 | al : 0.8 | aw : 0.6 | or : 0.8 | ol : 0.9 | ow : 0.9 | ul : 0.4 | ough : 0.9 | ǫ)    (4)
rv  = (f : 0.8 | ph : 0.8 | v : 1 | w : 0.2 | wh : 0.2 | ǫ)    (5)
rer = (er : 1 | el : 0.8 | ir : 0.9 | ur : 0.9 | ǫ)    (6)
rn  = (n : 1 | kn : 0.9 | ǫ)    (7)
re  = (a : 0.9 | e : 1 | i : 0.9 | y : 0.6 | er : 0.9 | ir : 0.9 | ǫ)    (8)
rm  = (m : 1 | ǫ)    (9)
rt  = (d : 0.6 | t : 1 | th : 0.8 | ǫ)    (10)

3) Concatenate all the choice patterns, interleaving them with the wild card pattern (. : −1)∗, a zero-or-more repetition of the wild card literal . associated with the negative score −1. We will explain shortly why it is given a negative confidence. For brevity, we write (−1)∗ as a short hand for (. : −1)∗. For instance,

rg (−1)∗ ro (−1)∗ rv (−1)∗ rer (−1)∗ rn (−1)∗ rm (−1)∗ re (−1)∗ rn (−1)∗ rt    (11)

Note that we use r1 r2 to denote the concatenation of r1 and r2. The formal definition of r will be given shortly.

As the next step, we match the sound sequence of the Singlish term against the above pattern generated from the English word. Before we dive into the details, let's consider the ordinary regular expression matching problem. Regular expressions have been used for sequence validation and extraction in many real-world applications such as awk, grep and sed. In Figure 5, we define the syntax of regular expressions. The matching problem is, given a word w and a regular expression r, to validate whether w matches r. Let match(w)(r) be a regular expression matching algorithm which yields true or false given the input word w and the regular expression r. For example, matching the string "abaac" with the regular expression

(a | ab)(baa | a)(ac | c)

yields true, whereas matching a word outside the language of this pattern triggers a failure. Regular expression matching is ambiguous if we consider the sub-matches, i.e. it may yield more than one valid match. For instance, reusing the above regular expression, let i ∈ {1, 2, 3} refer to the sub-expressions (a | ab), (baa | a) and (ac | c), and let (i, wi) denote the sub-matches. The running example actually has two possible matches: {(1, ab), (2, a), (3, ac)} and {(1, a), (2, baa), (3, c)}.
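As promised above, the following Scala sketch shows one possible representation of the scored patterns built by steps 1) to 3). The constructors mirror the grammar of Figure 6; the helper names fromSimilarSounds and phoneticPattern are illustrative and do not denote the implementation of Appendix B.

sealed trait RE
case class Choice(rs: List[RE])      extends RE  // r | r | ...
case class Cat(rs: List[RE])         extends RE  // r r
case class Star(r: RE)               extends RE  // r∗
case object Eps                      extends RE  // ǫ
case class Sym(l: String, n: Double) extends RE  // sound symbol with confidence, l : n

object PhoneticPattern {
  val wildcard: RE = Star(Sym(".", -1.0))  // (. : −1)∗

  // Steps 1 and 2: a similar-sound set becomes a choice pattern including ǫ.
  def fromSimilarSounds(sounds: Map[String, Double]): RE =
    Choice(sounds.toList.map { case (l, n) => Sym(l, n) } :+ Eps)

  // Step 3: concatenate the choice patterns, interleaved with the wild card,
  // yielding the shape of pattern (11).
  def phoneticPattern(groups: List[Map[String, Double]]): RE =
    Cat(groups.map(fromSimilarSounds).flatMap(g => List(g, wildcard)).dropRight(1))
}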
Fig. 5. Regular Expression Syntax

Words:
w ::= ǫ          Empty word
   |  l ∈ Σ      Letters
   |  l w        Concatenation

Regular expressions:
r ::= r | r      Choice
   |  r r        Concatenation
   |  r∗         Kleene star
   |  ǫ          Empty word
   |  φ          Empty language
   |  l ∈ Σ      Letters

Fig. 6. Regular Expression Phonetic Pattern

Sound Symbols:
l ::= c          Consonants
   |  v          Vowels

Sound Sequences:
s ::= ǫ          Empty sequence
   |  l          Sound symbol
   |  l s        Concatenation

Regular expressions:
r ::= r | r      Choice
   |  r r        Concatenation
   |  r∗         Kleene star
   |  ǫ          Empty sequence
   |  φ          Empty language
   |  l : n      Sound symbol with confidence score

Score: n ∈ {−1.0} ∪ [0.0, 1.0]
Now let's take one step further and extend the regular expression matching algorithm to handle sound symbols with confidence scores. In Figure 6, we extend the regular expression pattern with support for phonetic symbols and confidence scores. The extension is minimal: the input word becomes a sound sequence whose elements are either consonants or vowels, and the literal pattern becomes l : n, a sound symbol l associated with a confidence score n. Recall the running example: g : 1 is the sound symbol g associated with score 1.0. The wild card pattern . : −1 is used to capture sound symbols that cannot be matched by any of the similar sound sets generated from the target English term; its confidence score of −1 penalizes these "unmatched" sounds. The matching algorithm for sound patterns outputs an additional result, namely the cumulative score of the match. Recall our running example: we would like to match ['g', 'ah', 'm', 'e', 'n'] with

rg (−1)∗ ro (−1)∗ rv (−1)∗ rer (−1)∗ rn (−1)∗ rm (−1)∗ re (−1)∗ rn (−1)∗ rt    (12)

Since

rg = (g : 1 | gh : 0.9 | j : 0.4 | q : 0.4 | k : 0.6 | ǫ)    (13)

'g' matches rg, yielding 1.0 as the cumulative score. Note that the pattern is ambiguous: because rg possesses ǫ, we can also match 'g' with the subsequent wild card pattern (. : −1)∗, which yields −1.0 as the cumulative result. We carry on matching the rest of the sound sequence ['ah', 'm', 'e', 'n'] with the remaining pattern. Our algorithm finds all the possible matches; the match yielding the highest cumulative score is selected as the final result. Running through the remaining sound sequence with the above pattern, we obtain 3.0 as the highest match score, since 'ah' has to match the wild card pattern. We divide the final score by the size of the Singlish sequence to obtain 0.6 as the score.

3) Integration: In Figure 7, we present the MinHash-SimHash algorithm with the pre- and post-processing steps integrated in MapReduce. The norm function used in the map step performs the normalization described in Section IV-A. In the reduce step, we introduce two threshold values. thresholdl is the strict threshold, which is the same as the original threshold of Figure 4. thresholdh is the hypothesized threshold, where thresholdh ≥ thresholdl. We collect tweet pairs whose Hamming distances fall below thresholdl. In addition, we further examine the tweet pairs whose Hamming distances are greater than thresholdl but still below thresholdh. The extended examination applies the function phoneticMatch to the tweet pairs; the function searches for OOV words in the tweet pairs. The resulting score is a value ranging between 0 and 1: the lower the score, the higher the phonetic similarity between the OOVs and their English counterparts. We multiply the phonetic matching result with the Hamming distance to obtain the final score. Implementation details can be found in Appendix B.
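The search for the maximal-score match can be realized in several ways; our implementation details are in Appendix B. Purely as an illustration, for patterns of the exact shape of (11), i.e. choice groups separated by wild cards, the best cumulative score can also be computed by a small dynamic program. The sketch below is for exposition only (it additionally permits wild cards at both ends of the pattern) and is not the Appendix B implementation:

object ScoredMatch {
  // A pattern of shape (11): choice groups (similar-sound sets, ǫ allowed)
  // separated by the wild card (. : −1)∗.
  type Group = Map[String, Double]

  // Best cumulative score of matching `sounds` against the groups.
  // dp(i)(j): best score using the first i groups and the first j sounds.
  def bestScore(groups: List[Group], sounds: List[String]): Double = {
    val g = groups.toArray
    val s = sounds.toArray
    val NegInf = Double.NegativeInfinity
    val dp = Array.fill(g.length + 1, s.length + 1)(NegInf)
    dp(0)(0) = 0.0
    for (i <- 0 to g.length; j <- 0 to s.length if dp(i)(j) > NegInf) {
      // skip group i via its ǫ alternative
      if (i < g.length)
        dp(i + 1)(j) = dp(i + 1)(j) max dp(i)(j)
      // consume sound j with a wild card, paying −1
      if (j < s.length)
        dp(i)(j + 1) = dp(i)(j + 1) max (dp(i)(j) - 1.0)
      // match sound j against group i, gaining its confidence
      if (i < g.length && j < s.length)
        g(i).get(s(j)).foreach { c =>
          dp(i + 1)(j + 1) = dp(i + 1)(j + 1) max (dp(i)(j) + c)
        }
    }
    dp(g.length)(s.length)
  }
}

// On the running example, the best score for matching [g, ah, m, e, n]
// against the groups of "government" is 3.0, i.e. 3.0 / 5 = 0.6 normalized.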
V. EXPERIMENT

In Table I, we evaluate the performance of the proposed algorithm by measuring sensitivity, precision and accuracy. We benchmark the vanilla MinHash-SimHash algorithm against the other configurations consisting of the extensions. The data set used in the experiment is a tweet corpus captured between December 2013 and April 2014, specifically on the topic of the GCE O' Level results 2014 in Singapore. The data set consists of 8191 tweets. The sensitivity, precision and accuracy are computed by benchmarking the set of potential near duplicates generated by the algorithm against the set generated by manual identification. In this experiment, we only consider using the bigram feature set during the SimHash computation. Based on some preliminary tests, we set the threshold of the Hamming distance results to 6. In addition, the phonetic matching extension re-verifies those potential duplicates whose Hamming distance scores fall into the range between 6 and 8.

TABLE I. EXPERIMENT RESULTS

Configuration             Sensitivity (%)   Precision (%)   Accuracy (%)
Vanilla                   96.0688474        97.1505984      99.9659533
Static dict               96.0688474        97.1505984      99.9659533
Phonetic                  96.141404         97.1210778      99.9662932
Static dict & Phonetic    96.141404         97.1210778      99.9662932
Fig. 7. Integrating Phonetic Matching into the MinHash-SimHash algorithm in MapReduce

(k1, v1) = map(tweet1)  ...  (kn, vn) = map(tweetn)
reduce(ki)([vi1, ..., vim])  for i ∈ {1, ..., n}

map(tweet) = (minhash(norm(tweet.text)), (tweet, simhash(norm(tweet.text))))

reduce(key)(values) = { (t1.id, t2.id) | (t1, simhash1) ∈ values ∧ (t2, simhash2) ∈ values
                        ∧ t1.id ≠ t2.id ∧ hDist = hammingDist(simhash1, simhash2)
                        ∧ ( (hDist ≤ thresholdl)
                          ∨ (hDist ≤ thresholdh ∧ phoneticMatch(t1, t2) ∗ hDist ≤ thresholdl) ) }
As reported in Table I, the static dictionary pre-processing extension alone does not contribute any improvement to the algorithm. With the phonetic matching extension enabled, the algorithm is able to identify more true positives, hence the sensitivity increases. This also leads to a drop in precision, owing to the false positives it introduces. Overall, the accuracy increases, as more true positives are introduced than false positives. Note that the increase appears marginal due to the high proportion of true negatives. We plan to include more data sets in the benchmark in the near future.
VI. RELATED WORK

Seshasai's work [12] gives a clear comparison among different techniques for checking document similarity. In addition, he studied a number of techniques for optimizing SimHash performance, such as feature selection and weight settings. We are considering incorporating the weight-assigning technique as a possible future extension. Han and Baldwin [7] address the issue of unparsable texts found in SMS and Twitter by creating a pre-processor to normalize these unparsable tokens (a.k.a. ill-formed words) into their original counterparts. In [11], Pinto et al. evaluate the effectiveness of using phonetic algorithms such as Soundex in information retrieval using SMS as queries; their targeted languages are English and Spanish. We are planning to apply our phonetic matching approach in the IR domain. Soundex [6] and Metaphone [9] are two classic techniques for phonetic matching and searching of names. The gist of their idea is to build an index of words based on the pronunciation of the consonants of the words. Our phonetic matching is inspired by them; while focusing on comparison instead of searching, we also take the vowels into account. We use regular expressions as a domain-specific language to denote the sound patterns, and the sound sequence matching follows naturally from the regular expression matching algorithm.

VII. CONCLUSION

We applied the MinHash and SimHash algorithms to search for near-duplicate tweets in Singlish using MapReduce. The experimental results confirm that this algorithm combo is very accurate. We further improved the algorithm by incorporating a phonetic matching post-processing step; empirical results suggest an effective improvement. As future work, we plan to look into incorporating statistical modeling techniques such as SVM to generate an "organic" sound map database for our phonetic matching algorithm.

REFERENCES

[1] Apache Hadoop. http://hadoop.apache.org/.
[2] Average word length in English. http://www.wolframalpha.com/input/?i=average+english+word+length.
[3] Andrei Z. Broder, Moses Charikar, Alan M. Frieze, and Michael Mitzenmacher. Min-wise independent permutations (extended abstract). In Proceedings of the Thirtieth Annual ACM Symposium on Theory of Computing, STOC '98, pages 327–336, New York, NY, USA, 1998. ACM.
[4] Moses S. Charikar. Similarity estimation techniques from rounding algorithms. In Proceedings of the Thirty-fourth Annual ACM Symposium on Theory of Computing, STOC '02, pages 380–388, New York, NY, USA, 2002. ACM.
[5] Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified data processing on large clusters. Commun. ACM, 51(1):107–113, January 2008.
[6] Patrick A. V. Hall and Geoff R. Dowling. Approximate string matching. ACM Comput. Surv., 12(4):381–402, December 1980.
[7] Bo Han and Timothy Baldwin. Lexical normalisation of short text messages: Makn sens a #twitter. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1, HLT '11, pages 368–378, Stroudsburg, PA, USA, 2011. Association for Computational Linguistics.
[8] Gurmeet Singh Manku, Arvind Jain, and Anish Das Sarma. Detecting near-duplicates for web crawling. In Proceedings of the 16th International Conference on World Wide Web, WWW '07, pages 141–150, New York, NY, USA, 2007. ACM.
[9] L. Philips. Hanging on the metaphone. Computer Language Magazine, 7(12):38–44, 1990.
[10] Bingfeng Pi, Shunkai Fu, Weilei Wang, and Song Han. SimHash-based effective and efficient detecting of near-duplicate short messages. In Proceedings of the Second Symposium International Computer Science and Computational Technology, ISCSCT '09, pages 20–25, 2009.
[11] David Pinto, Darnes Vilariño, Yuridiana Alemán, Helena Gómez, and Nahun Loya. The Soundex phonetic algorithm revisited for SMS-based information retrieval. In Proceedings of the II Spanish Conference on Information Retrieval, CERI 2012, 2012.
[12] Shreyes Seshasai. Efficient near duplicate document detection for specialized corpora. Master's thesis, 2008.
[13] Martin Sulzmann and Kenny Zhuo Ming Lu. Regular expression sub-matching using partial derivatives. In Danny De Schreye, Gerda Janssens, and Andy King, editors, PPDP, pages 79–90. ACM, 2012.
[14] Twitter Scalding. https://github.com/twitter/scalding.
Fig. 8. MinHash algorithm in Scala

import scala.util.Random

object MinHash {
  val random = new Random(2147483647)
  // An infinite stream of random coefficients ...
  val coefs = for { _ <- Stream.from(1) } yield random.nextInt()
  // ... paired up to form the family of hash functions g(x) = a + b*x.
  val hash_funs = for { (a, b) <- coefs.zip(coefs.drop(1)) } yield ((x: Int) => a + b * x)

  // Minimum of f applied over xs.
  def hash_min(f: Int => Int)(xs: List[Int]): Int = {
    def hash_second_min(a: Int, b: Int): Int = a.min(f(b))
    xs.foldLeft(Int.MaxValue)(hash_second_min)
  }

  // One representative shingle per hash function.
  def shingle(n: Int)(nums: List[Int]): List[Int] =
    for { f <- hash_funs.take(n).toList } yield hash_min(f)(nums)

  // Final MinHash digest: XOR of all shingles.
  def minhash[A](fv: List[A], num_of_hash: Int): Int = {
    val is = fv.map(_.hashCode())
    val sh = shingle(num_of_hash)(is)
    sh.foldLeft(0)((x, y) => x ^ y)
  }
}
Fig. 9. SimHash algorithm in Scala

object SimHash {
  def simhash[A](fv: List[A])(hash: A => Long): Long = {
    val hashCodes = fv.map(hash)
    // Per digest: +1 where bit i is set, −1 where it is clear.
    val bitSets = hashCodes.map(x => (0 to 63).map(z => if (testBit(x, z)) 1 else -1).toList)
    val zeros = (0 to 63).map(_ => 0).toList
    // Column-wise sums of the +1/−1 vectors.
    val wtVector = bitSets.fold(zeros)(_.zip(_).map(xy => xy._1 + xy._2))
    (0 to 63).zip(wtVector).foldLeft(0L)((x, bv) => sign(bv._2)(x, bv._1))
  }

  def testBit(num: Long, i: Int): Boolean = (num & (1L << i)) != 0L

  // The original listing is truncated at this point; sign is reconstructed
  // from Figure 2: bit i of the digest is 1 when the column sum is non-negative.
  def sign(weight: Int)(acc: Long, i: Int): Long =
    if (weight < 0) acc else acc | (1L << i)
}

Fig. 10. Near duplicate detection in MapReduce using Scalding (simplified)

Tsv("input.txt").read
  .map('text -> 'ntext) { text: String => text.split("\\s+").toList }
  .map('ntext -> 'minhash) { ls: List[String] =>
    MinHash.minhash(ls, 10)
  }
  .map('ntext -> 'simhash) { ls: List[String] =>
    val fv = toFeature(ls)
    SimHash.simhash(fv)(hashcode64)
  }
  .groupBy('minhash) {
    _.toList[(String, Long)](('rowid, 'simhash) -> 'id_hash)
  }
  .map('id_hash -> 'id_hash) {
    id_hashes: List[(String, Long)] => findNearDup(id_hashes)
  }
  .flatMap('id_hash -> ('orig_id, 'dup_id)) {
    pairs: List[((String, Long), (String, Long))] =>
      pairs.map { case ((id, _), (dupId, _)) => (id, dupId) }
  }
  .project('orig_id, 'dup_id)
  .write(Tsv("output"))