2010 International Conference on Pattern Recognition
Detect Visual Spoofing in Unicode-based Text*
Bite Qiu, Ning Fang, Liu Wenyin
Department of Computer Science, City University of Hong Kong
“а” is mapped to the string “xn--80a”. Restriction techniques are usually deployed at the domain name registration level. A top-level-domain registry may apply policies [8] to restrict the use of homoglyphs in IDNs and specify methods to monitor homographic domains. In addition, Liu et al. [5] proposed coloring suspicious characters with a fixed or an adaptive color palette when mixed scripts are found within a single word. Finally, browser vendors may integrate the above defenses into their browsers and provide options for users to customize the desired security level [7].

The above defenses take security actions on an IDN without knowing whether it is a real attack. Even though homoglyphs can be exploited for deception, it is a mistake to conclude that all homoglyphs are malicious. For example, a homoglyph of the character ‘l’ should be considered spoofing if it appears in the context of “PayPal”, but not necessarily in a context without such meaning, such as “letter-l”. Both transformation and restriction techniques inconvenience end users and domain name owners. Moreover, these IDN defenses leave plain text in web content unprotected. Therefore, we propose a context-aware method to detect malicious homoglyphs, which allows browsers to take smarter security actions that tackle real attacks and minimize the disturbance to end users.

A Unicode visual spoofing string is usually produced by replacing one or more characters of the legitimate string with homoglyphs from the characters’ general context. We assume, as prior knowledge, that a legitimate string occurs more frequently on the Web than its spoofing strings. Taking this prior knowledge into consideration, we employ a Bayesian framework to detect a suspicious string. The framework also incorporates the similarities of homoglyphs, which are adopted from Fu et al. [6].
Through a series of evaluations, the model is shown to be effective in identifying whether suspicious characters are spoofing or not.
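The Punycode transformation mentioned in the introduction can be reproduced with Python's built-in codecs. This is only an illustration of the transformation technique, not part of the proposed method; the variable names are ours.

```python
# Illustrative only: Python's built-in codecs, not the paper's prototype.
CYRILLIC_A = "\u0430"  # Cyrillic small letter a, a homoglyph of Latin "a"
LATIN_A = "\u0061"     # Latin small letter a

# Raw Punycode of the single non-ASCII character.
raw = CYRILLIC_A.encode("punycode")

# The "idna" codec adds the ACE prefix "xn--" used in domain labels.
ace = CYRILLIC_A.encode("idna")

print(raw)  # b'80a'
print(ace)  # b'xn--80a'

# The transformation is reversible, and a pure-ASCII label is left unchanged.
assert ace.decode("idna") == CYRILLIC_A
assert LATIN_A.encode("idna") == b"a"
```

The two visually identical characters end up as clearly different ASCII strings, which is what makes the transformation useful as a defense and also what makes legitimate non-ASCII names harder to read.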
Abstract Visual spoofing in Unicode-based text is anticipated to become a severe web security problem in the near future as more and more Unicode-based web documents come into use. In this paper, to detect whether a suspicious Unicode character in a word is visual spoofing or not, the context of the suspicious character is utilized by employing a Bayesian framework. Specifically, two contexts are taken into consideration: simple context and general context. The simple context of a suspicious character is the word in which the character appears, while the general context consists of all homoglyphs of the character within the Universal Character Set (UCS). Three decision rules are designed and used jointly for judging a suspicious character. Preliminary evaluations and a user study show that the proposed approach can detect Unicode-based visual spoofing with high effectiveness and efficiency.
1. Introduction There are many similar-looking characters in the Universal Character Set (UCS), which can cause severe web security problems. Unicode-based web homoglyph fraud (a.k.a. the homograph attack [1]) is one such example. A homoglyph is one of two or more characters with shapes that are either identical or cannot be differentiated by instant visual inspection [2]. Homoglyphs are therefore useful in many content-based attacks, such as phishing and spam. In a real case, a fake “paypаl.com”, in which the second ‘а’ is the Cyrillic letter (U+0430) instead of the Latin ‘a’ (U+0061), was successfully registered in 2005. We classify visual spoofing defenses for internationalized domain names (IDNs) into two categories: transformation and restriction. Punycode, as discussed in the Unicode Consortium reports [3][4], is a widely adopted transformation technique. It allows non-ASCII characters to be transformed uniquely and reversibly into ASCII characters. For example, Cyrillic
* The work described in this paper was fully supported by a grant from the Research Grants Council of the Hong Kong Special Administrative Region, China [Project No. CityU 117907].
1051-4651/10 $26.00 © 2010 IEEE DOI 10.1109/ICPR.2010.480
2. The approach
calculated; if the resulting probability of SC[i], denoted as P(SC[i]|SC), is larger than that of every homoglyph in its general context, SC[i] is considered a legitimate character; otherwise it is a spoofing character:

$$
SC[i] = \begin{cases} \text{legitimate} & \forall c_g \in GC(SC[i]),\ P(SC[i] \mid SC) > P(c_g \mid SC) \\ \text{spoofing} & \text{else} \end{cases} \tag{5}
$$

where P(SC[i]|SC) and P(cg|SC) can be derived with the following formula:

$$
P(x \mid SC) = \frac{P(SC \mid x) \cdot P(x)}{P(SC)} = A \cdot P(SC \mid x) \cdot P(x), \tag{6}
$$

where x ∈ GC(SC[i]), i denotes the position of the suspicious character, and A is the constant 1/P(SC). P(SC|x) can be derived as:

$$
P(SC \mid x) = \frac{f(w_{w[i]=x}) \cdot Sim(x, SC[i])}{\sum_{c_g \in GC(SC[i])} f(w_{w[i]=c_g}) \cdot Sim(c_g, SC[i])} \tag{7}
$$
2.1. Definitions The proposed approach takes full advantage of the context of a suspicious character/string with a probabilistic Bayesian model, and it can mark (e.g., color) spoofing characters in a given string to prevent end users from being deceived. Specifically, we define and distinguish two types of context: simple context and general context. The simple context of a character is defined as the set of other Unicode characters in the word that contains the character. The general context of a character is defined as the set of homoglyphs of the character in UCS. They are denoted as follows:

$$
SC(c) := \{ c_s \mid c_s \in w,\ c \in w,\ c_s \neq c \}, \tag{1}
$$

$$
GC(c) := \{ c_g \mid c_g \sim c,\ c_g \in UCS,\ c \in UCS \}, \tag{2}
$$
where c denotes a suspicious character; w denotes the word in which character c appears; cs denotes a neighboring character of c in w, and cg denotes a homoglyph of c; SC(c) denotes the set of neighboring characters of c in w; GC(c) is the set of homoglyphs of c. The symbol ‘∼’ denotes the visually similar relation under a specified threshold. The simple context and the general context of a suspicious character are used as prior knowledge to calculate the prior distribution. The prior distribution of c is the proportion of the occurrence frequency of the original word w (in which c appears) in the total occurrence frequency of all the similar words (referred to as new words in the rest of this paper), in which c is replaced by its homoglyphs.
$$
\text{New words} := \{ w_{w[i]=c_g} \mid \forall c_g \in GC(c) \}, \tag{3}
$$

$$
\text{Prior distribution} := p\big( f(w),\ f(w_{w[i]=c_g}) \big), \tag{4}
$$
In Eq. (7), the sum in the denominator runs over all cg ∈ GC(SC[i]); $f(w_{w[i]=x})$ is the frequency of $w_{w[i]=x}$, taken as the number of results returned by Google with $w_{w[i]=x}$ as the query; and Sim(x, cg) is the visual similarity between x and cg, derived with an application proposed by Fu et al. [6]. The value of P(x) can be obtained as follows:

$$
P(x) = \frac{f(x)}{\sum_{c \in UCS} f(c)}. \tag{8}
$$
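The inference in Eqs. (5)-(8) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the word counts are the Google result counts quoted in this paper for “paypal” and “paypa1”, while `toy_sim`, `CHAR_FREQ`, and all function names are hypothetical. The constant A is dropped and the sum in Eq. (8) is restricted to the candidates, since neither changes which candidate has the largest posterior.

```python
from typing import Callable, Dict

def posterior_scores(observed: str,
                     word_freq: Dict[str, int],
                     char_freq: Dict[str, int],
                     sim: Callable[[str, str], float]) -> Dict[str, float]:
    """Scores proportional to P(x | SC) for every candidate x (Eqs. 6-8)."""
    # Numerator terms of Eq. (7): word frequency weighted by visual similarity.
    weights = {x: word_freq[x] * sim(x, observed) for x in word_freq}
    weight_sum = sum(weights.values())
    # Denominator of Eq. (8), restricted to the candidates (assumption).
    char_sum = sum(char_freq[x] for x in word_freq)
    return {x: (weights[x] / weight_sum) * (char_freq[x] / char_sum)
            for x in word_freq}

def judge(observed: str,
          word_freq: Dict[str, int],
          char_freq: Dict[str, int],
          sim: Callable[[str, str], float]) -> str:
    """Eq. (5): legitimate only if the observed character beats every rival."""
    scores = posterior_scores(observed, word_freq, char_freq, sim)
    rivals = [s for x, s in scores.items() if x != observed]
    return "legitimate" if all(scores[observed] > s for s in rivals) else "spoofing"

def toy_sim(a: str, b: str) -> float:
    """Hypothetical similarity: 1.0 for identical characters, 0.9 otherwise."""
    return 1.0 if a == b else 0.9

# Keys are the candidate characters at position 5 of "paypa?".
WORD_FREQ = {"l": 326_000_000, "1": 2_480}  # counts for "paypal" and "paypa1"
CHAR_FREQ = {"l": 40, "1": 10}              # hypothetical character frequencies

verdict_fake = judge("1", WORD_FREQ, CHAR_FREQ, toy_sim)  # last char of "paypa1"
verdict_real = judge("l", WORD_FREQ, CHAR_FREQ, toy_sim)  # last char of "paypal"
print(verdict_fake, verdict_real)  # spoofing legitimate
```

Because the word frequencies differ by several orders of magnitude, the verdict here is insensitive to the exact similarity and character-frequency values chosen for the toy example.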
2.3. Decision rules based on context In practice, it is unnecessary to examine all possible words exhaustively. Hence, three heuristic decision rules are developed to prune away the unnecessary computations. A regular text (e.g., a URL, webpage, or e-mail) generally includes only one or a few languages; its simple context is thus limited to a small set of Unicode groups/subgroups. For example, English usually uses the Latin script, and Chinese uses CJK scripts. Therefore, we define the first rule for verifying a suspicious character as follows:

Rule 1: A legitimate string tends to involve a limited number of Unicode groups/subgroups in UCS, usually only one. If the Unicode group of a character is different from that of its neighboring characters in its simple context, the character is judged as spoofing; otherwise it is judged as legitimate. The rule is denoted as follows:

$$
c = \begin{cases} \text{legitimate} & UG(c) = UG(c_s) \\ \text{spoofing} & \text{else} \end{cases} \tag{9}
$$
In Eqs. (3) and (4), $w_{w[i]=c_g}$ denotes a new word in which the ith character of w is replaced with cg; f(·) denotes the occurrence frequency of the corresponding word; and p(·) is the probability density function, that is, the normalized occurrence frequency.
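The construction of new words and of the normalized prior can be illustrated as follows. This is a sketch under assumptions: the general context of ‘l’ is restricted to the two characters ‘l’ and ‘1’ for brevity, the counts are the Google result counts quoted in this paper for “paypal” and “paypa1”, and the function names are ours.

```python
def new_words(word: str, i: int, general_context: list[str]) -> list[str]:
    """Variants of `word` whose i-th character is replaced by each homoglyph."""
    return [word[:i] + cg + word[i + 1:] for cg in general_context]

def prior_distribution(freq: dict[str, int]) -> dict[str, float]:
    """Normalized occurrence frequency of each word among its variants."""
    total = sum(freq.values())
    return {w: n / total for w, n in freq.items()}

# Hypothetical general context of 'l', restricted to two characters.
gc_l = ["l", "1"]
variants = new_words("paypal", 5, gc_l)

# Result counts reported in this paper for the two queries.
counts = {"paypal": 326_000_000, "paypa1": 2_480}
prior = prior_distribution({w: counts[w] for w in variants})

print(variants)  # ['paypal', 'paypa1']
for w, p in prior.items():
    print(w, p)
```

In a real deployment the counts would come from a search engine or a corpus; here they are fixed so the example is reproducible.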
2.2. Bayesian inference For a given suspicious word, we iteratively check each character in the word. If a character is determined to be spoofing, it is highlighted to warn users. Each character SC[i] (i.e., the ith character in the simple context/word) is examined in the following way: for each homoglyph cg in the general context of SC[i], the probability of cg conditioned on the simple context of SC[i], represented as P(cg|SC), is
where the function UG(c) denotes the Unicode group/subgroup of c. The above rule is not sufficient to detect spoofing, because there are many legitimate uses of mixed scripts. In particular, it is quite common to mix English words (in Latin characters) with other languages, including languages using non-Latin scripts. Even in English, legitimate product/organization names may contain non-Latin characters, such as Ωmega, Teχ, Toys-Я-Us, and HλLF-LIFE. Moreover, English also adopts some words from other languages, e.g., “résumé” and “naïve”. Based on the above analysis, we define the second rule as follows, which complements Rule 1. Rule 2: Even if a suspicious character belongs to a different Unicode group from its neighboring characters, it is judged as legitimate if the neighbors’ group contains no character visually similar to it; otherwise it is judged as spoofing. The rule can be described as follows.
Unicode visual spoofing detection. Moreover, new rules can be created and added to this approach according to the user’s experience and knowledge. In addition, a larger simple context is sometimes necessary, such as the phrase or the whole sentence in which the suspicious character appears.
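One possible way to combine the three rules is sketched below. It is an illustration under several assumptions, not the prototype's code: the Unicode group is approximated by the first word of the character's Unicode name (with all ASCII treated as one group), the homoglyph table and frequency table are tiny hand-made stand-ins, and the Bayesian test of Rule 3 is reduced to a plain frequency comparison. Words are assumed to have at least two characters.

```python
import unicodedata
from collections import Counter

def unicode_group(ch: str) -> str:
    """Crude stand-in for UG(c): first word of the Unicode character name."""
    if ord(ch) < 128:
        return "LATIN"  # assumption: treat all of ASCII as a single group
    return unicodedata.name(ch, "UNKNOWN").split()[0]

# Hypothetical homoglyph table (general contexts) and word frequencies.
HOMOGLYPHS = {"\u0430": ["a"], "l": ["1"], "1": ["l"]}
FREQ = {"paypal": 326_000_000, "paypa1": 2_480}

def wins_on_frequency(word: str, i: int) -> bool:
    """Simplified Rule 3: the word must be more frequent than every variant."""
    rivals = [word[:i] + g + word[i + 1:] for g in HOMOGLYPHS.get(word[i], ())]
    return all(FREQ.get(word, 0) > FREQ.get(r, 0) for r in rivals)

def judge_char(word: str, i: int) -> str:
    c = word[i]
    neighbours = [ch for j, ch in enumerate(word) if j != i]
    majority = Counter(unicode_group(n) for n in neighbours).most_common(1)[0][0]
    if unicode_group(c) != majority:
        # Rules 1 and 2: a foreign character is spoofing only if it has a
        # look-alike in the neighbours' group.
        lookalikes = [g for g in HOMOGLYPHS.get(c, ())
                      if unicode_group(g) == majority]
        return "spoofing" if lookalikes else "legitimate"
    # Rule 3: same group, so fall back to the frequency-based test.
    return "legitimate" if wins_on_frequency(word, i) else "spoofing"

print(judge_char("payp\u0430l", 4))  # Cyrillic a inside a Latin word
print(judge_char("\u03a9mega", 0))   # Greek capital omega, no Latin look-alike
print(judge_char("paypa1", 5))       # digit one replacing letter l
print(judge_char("paypal", 5))       # the genuine word
```

The ordering of the checks mirrors the levels used in the evaluation: the script comparison is cheap and runs first, and the frequency lookup is only needed when the suspicious character shares a group with its neighbors.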
3. Experiments and Evaluation In this paper, we frequently need to know the visual similarity of two readable characters. This can be computed by an image similarity assessment algorithm. As UCS is quite large, consisting of tens of thousands of readable characters, and considering overall performance (less than 1 second to process a short paragraph; not discussed further in this paper due to space limits), we adopt the simple but fast and effective pixel-overlapping algorithm proposed by Fu et al. [6]. We set the visual similarity threshold to 0.9. That is, two Unicode characters with visual similarity over 0.9 are considered homoglyphs of each other, and thus each is a member of the other’s general context. For example, under the threshold of 0.9, the Latin character ‘a’ (U+0061) has four members in its general context: ‘a’ (U+0061), ‘а’ (U+0430), ‘ａ’ (U+FF41) and ‘ạ’ (U+1EA1).

Figures 1-3 show the detection results of our prototype system on a sentence at levels 1, 2, and 3, respectively. We substitute certain characters in the sentence with five different Unicode visual spoofing characters. In the result column, the characters judged as spoofing are marked in red. Level 1 indicates that only Rule 1 is used to judge a suspicious character; level 2 indicates that both Rule 1 and Rule 2 are used; level 3 indicates that all three rules are used. In Figure 1, the spoofing characters ‘a’, ‘m’, ‘b’, and ‘е’ are detected correctly, and one spoofing character, ‘I’, is missed. Meanwhile, there are also a number of false alarms, i.e., “中国银行”. In Figure 2, three spoofing characters, i.e., ‘е’ (U+0435), ‘ｍ’ (U+FF4D), and ‘ｂ’ (U+FF42), are detected correctly, and one spoofing character, ‘I’ (U+0049), is missed. In Figure 3, all four spoofing characters are detected correctly.
The precision and the recall of level 3 are always higher than or equal to those of level 2, and those of level 2 are higher than or equal to those of level 1.
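The role of the 0.9 threshold in building a general context can be made concrete with a short sketch. The similarity scores below are invented purely for illustration (the real values come from the pixel-overlapping algorithm of Fu et al. [6]); only the threshold and the resulting four members for Latin ‘a’ are taken from the text.

```python
THRESHOLD = 0.9  # similarity threshold used in this paper

# Hypothetical similarity scores of candidate characters to Latin 'a' (U+0061).
SIMILARITY_TO_A = {
    "\u0061": 1.00,  # Latin small letter a
    "\u0430": 0.98,  # Cyrillic small letter a
    "\uff41": 0.93,  # fullwidth Latin small letter a
    "\u1ea1": 0.91,  # Latin small letter a with dot below
    "\u0251": 0.85,  # Latin small letter alpha (below threshold in this toy table)
    "o": 0.60,
}

def general_context(scores: dict[str, float],
                    threshold: float = THRESHOLD) -> list[str]:
    """Characters whose similarity to the reference exceeds the threshold."""
    return [ch for ch, s in scores.items() if s > threshold]

gc_a = general_context(SIMILARITY_TO_A)
print([f"U+{ord(ch):04X}" for ch in gc_a])
# ['U+0061', 'U+0430', 'U+FF41', 'U+1EA1']
```

Lowering the threshold enlarges every general context, which increases the number of new words to check and, in turn, the number of frequency lookups per suspicious character.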
$$
c = \begin{cases} \text{legitimate} & UG(c) \neq UG(c_s) \text{ and } \forall c' \in UG(c_s),\ c \propto c' \\ \text{spoofing} & \text{else} \end{cases} \tag{10}
$$
where c′ denotes a character in a Unicode group and ∝ represents a dissimilar relationship. Homoglyphs may also be found within the same Unicode group/subgroup. For example, the lowercase letter ‘l’ and the digit ‘1’ in ASCII are visually confusable, which makes it difficult to distinguish the legitimate “paypal” from the faked string “paypa1” (the letter ‘l’ is replaced by the digit ‘1’). However, if we search for the legitimate word “paypal” and its faked word “paypa1” in Google, the numbers of returned results are significantly different (326,000,000 and 2,480, respectively). Therefore, in this case, the faked word can be detected by the third rule, which is based on the Bayesian inference of Section 2.2:

Rule 3: Usually a legitimate Unicode-based word can be found in many more web pages than its visual spoofing words. For a suspicious character c in a word SC, if P(c|SC) is larger than a threshold or is the largest among all P(cg|SC), c is judged as legitimate; otherwise it is judged as spoofing. The rule is denoted as follows:

$$
c = \begin{cases} \text{legitimate} & UG(c) = UG(c_s) \text{ and } \forall c_g \in GC(c),\ P(c \mid SC) > P(c_g \mid SC) \\ \text{spoofing} & \text{else} \end{cases} \tag{11}
$$
where P(c|SC) denotes the posterior probability of the suspicious character c; UG(c)=UG(cs) indicates that the Unicode group/subgroup of the suspicious character c is the same as that of most neighboring characters cs in its simple context. To detect a suspicious word, the above three rules are used jointly to improve the performance of
Figure 1: Results of detection at level 1
notably improve effectiveness and efficiency in detecting text-based homoglyph attacks.
4. Conclusion In this paper, we first define the simple context and the general context of a suspicious character, based on which the probability of the character occurring in its context can be calculated. A Bayesian framework is then used to calculate the posterior distribution of the suspicious character. If the probability of the suspicious character is above a threshold or is the maximum among the probabilities of its homoglyphs, the character is judged as legitimate; otherwise it is judged as spoofing. We also use three decision rules to improve the performance of spoofing detection in a practical prototype system. The proposed context-based approach can easily be applied as a browser plug-in to benefit end users. It differs from existing solutions, which either impose restrictions on users, agents, programmers, and registrar organizations, or map the Unicode scripts into a uniform format at the cost of some of the original semantics of the characters. Our approach imposes no restriction on users and loses no semantics. Preliminary evaluations and a user study show that the proposed approach can improve the accuracy of Unicode visual spoofing detection and assist human judgment of Unicode visual spoofing effectively and efficiently.
Figure 2: Results of detection at level 2
Figure 3: Results of detection at level 3
In another case, we select the top 20 domain names from http://www.alexa.com/topsites. All words generated by homoglyph replacement are detected at levels 1, 2, and 3. The numbers of false alarms at each level are also recorded. Table 1 lists only 5 of the domain names.

Table 1: Number of false alarms for the top 5 domain names after replacement of all similar characters, by level

Domain name     Replacement number   Level 1   Level 2   Level 3
google.com      22000                2000      2000      0
yahoo.com       2000                 0         0         0
facebook.com    187500               0         0         0
youtube.com     16000                0         0         0
live.com        2310                 210       210       0

In Table 1, the replacement number denotes the number of words after replacement of all homoglyphs; Level X denotes the number of false alarms at the corresponding level. Since the letter ‘I’ (U+0049) is visually similar to the letter ‘l’ (U+006C) under the threshold of 0.9, and both letters belong to the same Unicode group, i.e., the Latin script, the two words “googIe” and “Iive” are judged as legitimate at both Level 1 and Level 2, but they are correctly judged as spoofing at Level 3.

In addition, we conducted a user study based on two datasets to examine how well our method helps users improve effectiveness and efficiency in visual spoofing detection. One dataset is preprocessed with coloring hints generated by our method. Seven participants were asked to find spoofing characters in both datasets. The results show 100% precision for both datasets; however, the average recall improved from 49% to 93% with the help of the colored hints. Based on the results of the user study, we can conclude that machine-generated hints can assist end users to
References
[1] E. Gabrilovich, A. Gontmakher. The Homograph Attack. Communications of the ACM, 45(2), 2002.
[2] http://en.wikipedia.org/wiki/Homoglyph
[3] http://www.unicode.org/reports/tr39/
[4] http://www.unicode.org/reports/tr36/
[5] W. Liu, Y. Fu, X. Deng. Expose Homograph Obfuscation Intentions by Coloring Unicode Strings. APWeb 2008, pp. 275-286.
[6] Y. Fu, X. Deng, W. Liu. REGAP: A Tool for Unicode-based Web Identity Fraud Detection. Journal of Digital Forensic Practice (JDFP), Vol. 1, No. 2, 2006.
[7] http://en.wikipedia.org/wiki/IDN_homograph_attack
[8] http://www.faqs.org/rfcs/rfc3743.html