Understanding Passwords of Chinese Users


Understanding Passwords of Chinese Users: Characteristics, Security and Implications

Ding Wang, Student Member, IEEE, Haibo Cheng, Qianchen Gu, Ping Wang, Senior Member, IEEE

Abstract: While a lot has changed in computer security over the last twenty years, textual passwords remain the dominant mechanism of authentication over computer systems and are likely to persist in the foreseeable future. Though much attention has been paid to passwords chosen by English users, relatively little is known about passwords selected by non-English users, especially by those whose native languages use hieroglyphic characters. In this work, we initiate a systematic study into the fundamental properties that characterize passwords of Chinese users, the largest Internet population in the world. We, for the first time, uncover several striking findings on the basis of a corpus of 100 million real-life Chinese web passwords (as well as 30 million English ones), the largest corpus of user-generated passwords ever studied. We further conduct a series of experiments on these datasets by employing two state-of-the-art password cracking techniques, i.e., PCFG-based ones and Markov-Chain-based ones. Remarkably, our results reveal a “reversal principle”: when the guess number allowed is small, Chinese web passwords are much weaker than their English counterparts, yet this relationship is reversed when the guess number is large. This implies that at some point these two groups are of similar security strength, which reconciles two conflicting claims about the strength of Chinese web passwords made by Bonneau in IEEE S&P’12 and Li et al. in USENIX SEC’14, respectively. At ten million guesses, the success rate of our improved PCFG-based attack (using Duowan as the training set) against the remaining five Chinese datasets is from 33.2% to 49.8%, which means that our improved attack can crack 92% to 188% more passwords than the best record reported by Li et al. in 2014. We believe that our work contributes to a much better understanding of the characteristics and security of Chinese passwords, and will serve as groundwork for security administrators world-wide and Chinese individual users to secure their service accounts in a more informed way.

Index Terms: User authentication, Password security, Semantic pattern, Probabilistic context-free grammar, Markov model.

1 INTRODUCTION

Growth begins only when we begin to accept our own weakness.
– Jean Vanier, Befriending the Stranger

Password authentication is the dominant form of access control in computer systems. Though the security pitfalls of textual passwords were revealed as early as four decades ago [1] and various alternative authentication methods (e.g., graphical passwords [2] and multi-factor authentication [3]) have been proposed since then, passwords stubbornly survive and reproduce with every new computer system, even as Internet technologies have advanced by leaps and bounds in other areas. One reason is that textual passwords offer advantages, such as low deployment cost, easy recovery and remarkable simplicity, which cannot always be matched by their alternatives [4], [5]. The matter is further complicated by the shortage of effective methods to quantify the obscure costs of replacing passwords [6], while marginal gains are often insufficient to reach the activation energy that is necessary to overcome the significant transition costs. Thus, passwords remain the primary method for user identification and are likely to persist in the foreseeable future.

Despite its ubiquity, password authentication is stuck with an inherent tension [7]: truly random passwords are hard for users to memorize, while user-memorable passwords may be highly predictable. To deal with this notorious “security-usability” dilemma, researchers have devoted significant efforts [8]–[11] to the following two types of studies.

Type-1 research aims at evaluating the strength of a password dataset by gauging its Shannon entropy as in NIST-800-63 [12], or by measuring its “guessability”. The latter notion characterizes the fraction of passwords that, at a given number of guesses, can be cracked by password cracking algorithms like the probabilistic context-free grammars (PCFG) [13] and Markov-Chain-based attacks [9].

Type-2 research attempts to reduce the use of weak passwords, and mainly two approaches have been utilized: proactive password checking [14] and password strength meters [15]. The former checks the user-selected passwords and only accepts the ones that comply with the system policy (e.g., at least 8 characters long). The latter typically gives visual feedback on password strength, often presented as a colored bar, to help users create stronger passwords [16]. Most of today’s leading sites employ a combination of these two approaches to restrict users from choosing weak passwords. In this work, though we mainly focus on type-1 research, our results are also helpful for type-2 research.

• D. Wang, H. Chen, Q. Gu and P. Wang are with the School of Electronics Engineering and Computer Science, Peking University, Beijing 100871, China. Email: {wangdingg, hbchen, cqgu, pwang}@pku.edu.cn
Manuscript received 27 April 2015; revised .

1.1 Motivations

Existing literature mainly focuses on passwords chosen by English users, or more precisely, by netizens that use the Latin alphabet, yet little attention has been paid to the characteristics and strength of passwords generated by those who use other native languages. For instance, the password “wanglei123” is currently deemed “Strong” by many high-profile password strength meters such as those of AOL, Google, IEEE, and Sina weibo (known as “Chinese Twitter”). However, there is no doubt that this password is highly prone to guessing, for “wanglei” is an extremely common Chinese Pinyin (i.e., the Chinese phonetic alphabet) name and not a random string of length seven. Failing to catch this amounts to overlooking the weaknesses of Chinese passwords, putting the corresponding accounts at high risk.
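The “wanglei123” example can be made concrete with a small sketch. The rating rule below (length plus number of character classes) is an illustrative assumption and not the actual meter of any site named above, and `COMMON_PINYIN_NAMES` is a hypothetical two-entry sample built from the names quoted in this paper; a realistic list would be far larger.

```python
import string

# Hypothetical, tiny sample of common Pinyin names; the two entries are the
# examples quoted in the text. A realistic dictionary would be far larger.
COMMON_PINYIN_NAMES = {"wanglei", "zhangwei"}


def naive_rating(password: str) -> str:
    """Rate a password by length and character classes only (illustrative rule)."""
    classes = sum([
        any(c in string.ascii_lowercase for c in password),
        any(c in string.ascii_uppercase for c in password),
        any(c in string.digits for c in password),
        any(c in string.punctuation for c in password),
    ])
    if len(password) >= 8 and classes >= 2:
        return "Strong"
    return "Weak"


def name_aware_rating(password: str) -> str:
    """Same rule, but downgrade passwords that embed a listed Pinyin name."""
    lowered = password.lower()
    if any(name in lowered for name in COMMON_PINYIN_NAMES):
        return "Weak"
    return naive_rating(password)


print(naive_rating("wanglei123"))       # Strong
print(name_aware_rating("wanglei123"))  # Weak
```

A rule that only looks at length and character classes cannot tell a common name followed by digits from a random string of the same shape, which is the weakness described above.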


Not surprisingly, so far there has been no satisfactory answer to the following fundamental questions: Are there any characteristics that differentiate Chinese web passwords from English web passwords? What is the security strength of Chinese web passwords? Are they weaker or stronger than English web passwords? As there were 649 million Chinese Internet users by the end of 2014 [17], who account for more than a quarter (and also the largest proportion) of the world’s Internet population, and this number is still growing, it is of great importance to answer the aforementioned questions to provide both security administrators and Chinese users with necessary security guidance. For instance, if the answer to the first question is affirmative, then it strongly suggests that the traditional password creation policies and strength meters originally designed for English users cannot be readily applied to Chinese users.

To the best of our knowledge, Li et al.’s work [18] may be the closest to what we investigate in the current paper. However, as we will show in this work, its data cleansing renders two of its five Chinese datasets at best useless and at worst counterproductive. As a result, the reliability of the results reported in [18] is greatly impaired. Besides this, many fundamental characteristics, such as length distribution, frequency distribution and Pinyin-name-based semantic patterns, have not been explored in their work. What is more, they proposed an improved PCFG-based cracking algorithm, and at 10 billion guesses their best success rate is about 17.3%, which is significantly lower than what we report for our improved PCFG-based algorithm in this work. Compared with our attacking results based on the Markov-based algorithm, this gap is even more prominent. Most importantly, Li et al. invalidated the remark made in [8] that Chinese passwords are among the hardest to guess, harder than those of many other populations such as English users, and further concluded that “the strength of the passwords of Chinese and English users is similar”. However, we will demonstrate that Li et al.’s conclusion is still biased. Last but not least, no implications of the characteristics and strength of Chinese web passwords have been explored in [18], and we fill this gap in this work.

We also note that Ma et al.’s work [9] employed six password datasets, three of which are from Chinese websites, yet it mainly deals with the effectiveness of probabilistic password cracking models and pays little attention to the characteristics of Chinese web passwords.

1.2 Contributions

To answer the above three fundamental questions, we leverage about 100 million publicly available passwords from six popular Chinese sites and more than 30 million passwords from three English sites, the largest corpus of user-chosen passwords ever studied. Benefiting from the plain-text form of these 130 million passwords, we seek fundamental properties that characterize user-chosen passwords, and for the first time manage to identify several striking characteristics of Chinese web passwords in comparison to English web passwords. In particular, by investigating the password character distributions in terms of inversion number, we are able to provide a better understanding of the extent to which user passwords are influenced by their native languages.

As Chinese users are familiar with Pinyin and each has a Pinyin name (e.g., “wanglei” and “zhangwei”), is there a high probability that they insert their Pinyin names into their passwords? We quantify it in this work: one in every nine. Surprisingly, the proportion of English passwords that include an English name (at least five characters long) reaches a staggering 24.3%, i.e., one in four. To the best of our knowledge, we are the first to explore name patterns in a large-scale empirical password study. Further, we confirm that English users excessively employ common English words to build passwords. In contrast, being less comfortable with English words, Chinese users excessively employ digits, especially dates and simple digit patterns: one in every six users inserts a 6-digit birthdate into her password. Surprisingly, despite these notable differences in password composition, passwords chosen by these two distinct groups of users (i.e., in terms of language and location) have quite similar frequency and length distributions.

While there are distinctive characteristics between passwords of English users and Chinese users, an interesting question naturally arises: which group of passwords is generally more secure? In this work, we employ two state-of-the-art password-cracking algorithms (i.e., PCFG-based and Markov-Chain-based [9]) to measure the strength of Chinese web passwords. Based on the identified characteristics of Chinese passwords, we improve the original PCFG-based algorithm to more accurately capture user-generated passwords that are of a monotonically long structure (e.g., “1qa2ws3ed” and “1a2b3c4d”). We further adjust the PCFGs learned from the training set by adding about 98.6K Chinese Pinyin names and 24.4K six-digit birthdays. At ten million guesses, our improved approach is able to crack 92% to 188% more passwords than the results reported by Li et al. [18] in 2014.

Remarkably, our extensive experiments and comprehensive comparisons reveal a “reversal principle”: when the guess number allowed is small, Chinese passwords are much weaker than their English counterparts, yet this relationship is reversed when the guess number is large, thereby reconciling the contradictory claims made in [8], [18].

Furthermore, we highlight some critical implications that our above findings are likely to have for password cracking, password strength meters and password creation policies. As far as we know, we are the first to provide large-scale empirical evidence (i.e., on the basis of 6.43 million CSDN passwords and 16.26 million Dodonew passwords) supporting the hypothesis proposed in [16], [19] that users rationally choose more complex and secure passwords for accounts associated with higher value.
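The “92% to 188%” improvement quoted in this section can be checked with a short calculation. The sketch below assumes that the baseline is the best success rate of about 17.3% attributed to Li et al. [18], and that the 33.2% and 49.8% success rates of the improved PCFG-based attack are the two endpoints being compared.

```latex
\[
\frac{33.2\% - 17.3\%}{17.3\%} \approx 0.92 = 92\%,
\qquad
\frac{49.8\% - 17.3\%}{17.3\%} \approx 1.88 = 188\%.
\]
```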

2 RELATED WORK

It has long been an interesting (yet challenging) research area to analyze user-generated passwords, dating back at least to Morris and Thompson’s seminal work in 1979 [1]. They analyzed a corpus of 3,000 passwords and reported some basic password statistics like password lengths (e.g., 71.12% were no more than 6 characters long) and frequency of non-alphanumeric characters (e.g., 13.93% of passwords). This work has been followed by a great number of studies; notable ones include [8], [13], [15], [20], [21]. We present an overview of prior research on password characteristics and cracking, while other topics such as password usability and management [22], [23] are out of the scope of this work.


2.1 Password characteristics

The literature features a wealth of analysis of the characteristics of English web passwords, including statistics on structural patterns and much deeper semantic patterns.

Basic statistics. In 1990, Klein [20] collected 13,797 computer accounts from his friends and acquaintances around the US and UK, and observed that users tend to choose passwords that can be easily derived from dictionary words: a dictionary of 62,727 words is able to crack 24.21% of the collected accounts, and 51.70% of the cracked passwords are shorter than 6 characters. In 2004, Yan et al. [7] found that user-chosen passwords are likely to be dictionary words since users have difficulty in memorizing random strings, and that the lengths of passwords in their study (involving 288 participants) are on average between 7 and 8.

In 2012, Bonneau [8] conducted a systematic analysis of 70 million Yahoo passwords. This work examines dozens of subpopulations based on demographic factors (e.g., age, gender and language) and site usage characteristics (e.g., email and retail), and finds that “even seemingly distant language communities choose the same weak passwords”. In particular, Chinese passwords are among the most difficult ones to crack.

In 2014, however, Li et al. [18] argued that Bonneau’s dataset is not representative of general Chinese users, for Yahoo users are those who are familiar with English. Accordingly, Li et al. leveraged a corpus of five datasets from Chinese sites and observed that Chinese users like to use digits when creating passwords, as compared to English users who like to use letters to build passwords. However, as a basic defect, two of their Chinese datasets have not been cleaned properly (see Section 3.2), which might lead to inaccurate measures and biased comparisons. More importantly, several critical password properties (such as the most popular passwords, and the length and frequency distributions of passwords) remain to be explored.
In 2014, Ma et al. [9] investigated password characteristics about the length and the structure of six datasets, three of which are from Chinese websites. Nonetheless, this work mainly focuses on the effectiveness of probabilistic password cracking models and pays little attention to the deeper semantic patterns of passwords (e.g., no information is provided about the role of Pinyins, names or dates).

Semantic patterns. In 1989, Riddle et al. [24] found that birth dates, personal names, nicknames and celebrity names are common in user-chosen passwords. In 2004, Brown et al. [25] confirmed this by conducting a thorough survey that involved 218 participants and 1,783 passwords, and they reported that the most frequent entity in passwords is the self (accounting for 66.5%), followed by relatives (7.0%), lovers and friends; also, names (32.0%) were found to be the most common information used, followed by dates (7.2%). In 2012, Veras et al. [26] examined the 32 million RockYou dataset by employing visualization techniques and observed that 15.26% of passwords contain sequences of 5-8 consecutive digits, 38% of which could be further classified as dates. They also found that repeated days/months and holidays are popular, and when non-digits are paired with dates, they are most commonly single characters or names of months.

In 2014, Li et al. [18] noted that Chinese users tend to insert Pinyins and dates into their passwords. However, many other important semantic patterns (e.g., Pinyin name, place and mobile number) are left unexplored. In addition, as we will show in this work, due to imprudent data cleansing and inappropriate choices of password cracking algorithms, Li et al.’s measurement of the strength of Chinese passwords is largely biased.

2.2 Password cracking

A crucial research area in password cryptography is to study the security strength of user-chosen passwords. To avoid using the brute-force attack, earlier work (e.g., [20], [24]) uses a combination of ad hoc dictionaries and mangling rules to model the common password generation practice and see whether user passwords can be successfully rebuilt in a period of time. This technique greatly improves the cracking efficiency and has given rise to automated password cracking tools such as John the Ripper (JTR) [27].

Borrowing the idea of Shannon entropy, the NIST Electronic Authentication Guideline [12] attempts to use the concept of password entropy for estimating the strength of the password creation policy underlying a password system. Password entropy is calculated mainly according to the length of passwords. Florencio and Herley [28], and Egelman et al. [16] improved this approach by adding the size of the alphabet into the calculation and called the resulting value $\log_2\left((\text{alpha.size})^{\text{pass.len}}\right)$ the bit length of a password.

However, previous ad hoc metrics (e.g., password entropy and bit length) have been shown to be far from accurate by Weir et al. [29], who suggested that the approach based on simulating password cracking sessions is more promising. They also developed a novel method that first automatically derives word-mangling rules from password datasets, and then instantiates the derived grammars by using string segments from external input dictionaries (e.g., the famous Dic-0294 wordlists) to generate guesses in decreasing probability order [13].
This PCFG-based cracking approach is able to crack 28% to 129% more passwords than JTR [27] when given the same number of guesses. It has been considered one of the state-of-the-art password cracking techniques and has been used in a number of recent works [9], [11].

Different from the PCFG-based approach, Narayanan and Shmatikov [30] suggested a template-based model in which Markov-Chain theory is used for assigning probabilities to letter-based string segments, and it substantially reduces the password search space. This approach was tested in an experiment against 142 real-life user passwords and was able to break 67.6% of them. In 2014, by using various normalization and smoothing techniques from the natural language processing (NLP) domain, Ma et al. [9] systematically evaluated the Markov-based model and found that, in some cases, it performs significantly better than the PCFG-based model at large guess numbers (e.g., $2^{30}$) when parameterized appropriately.

In this work, we will perform extensive experiments by using both attacking models (combined with the identified characteristics of Chinese passwords) to evaluate the strength of Chinese passwords.
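The bit-length metric mentioned in this subsection, log2(alpha.size^pass.len), is easy to compute, which also makes its limitation easy to see. The sketch below is illustrative only: the way `alphabet_size` is derived (summing the sizes of the character classes present in the password) is an assumption, since the text does not spell out how the alphabet size is determined.

```python
import math
import string


def alphabet_size(password: str) -> int:
    """Sum the sizes of the character classes that appear in the password.

    Assumption: classes are lowercase (26), uppercase (26), digits (10)
    and the remaining printable ASCII symbols (33, including space).
    """
    size = 0
    if any(c in string.ascii_lowercase for c in password):
        size += 26
    if any(c in string.ascii_uppercase for c in password):
        size += 26
    if any(c in string.digits for c in password):
        size += 10
    if any(c in string.punctuation or c == " " for c in password):
        size += 33
    return size


def bit_length(password: str) -> float:
    """log2(alphabet_size ** length), computed as length * log2(alphabet_size)."""
    size = alphabet_size(password)
    if size == 0:
        return 0.0
    return len(password) * math.log2(size)


print(round(bit_length("wanglei123"), 1))  # 51.7
```

Under this metric, “wanglei123” scores exactly the same as any other 10-character string of lowercase letters and digits, which illustrates why such ad hoc metrics say little about how early a password is reached by a real guessing attack.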

3 CHARACTERISTICS OF CHINESE PASSWORDS

In this section, we first describe the six Chinese web password datasets and three English password datasets, a total corpus of 130 million passwords that will be used in our analysis and later experiments, and then reveal several important findings about Chinese password characteristics.


TABLE 1
Data cleansing of the nine datasets

| Dataset | Web service | Location | Language | Original | Miscellany | Length>30 | All removed | After cleansing | Unique PWs |
|---|---|---|---|---|---|---|---|---|---|
| Tianya | Social forum | China | Chinese | 31,761,424 | 860,178 | 5 | 2.71% | 30,901,241 | 12,898,437 |
| 7k7k | Gaming | China | Chinese | 19,138,452 | 13,705,087 | 10,078 | 71.66%∗ | 5,423,287 | 2,865,573 |
| Dodonew | Gaming&E-commerce | China | Chinese | 16,283,140 | 10,774 | 13,475 | 0.15% | 16,258,891 | 10,135,260 |
| 178 | Gaming | China | Chinese | 9,072,966 | 0 | 1 | 0.00% | 9,072,965 | 3,462,283 |
| CSDN | Programmer forum | China | Chinese | 6,428,632 | 355 | 0 | 0.01% | 6,428,277 | 4,037,605 |
| Duowan | Gaming | China | Chinese | 5,024,764 | 42,024 | 10 | 0.83% | 4,982,730 | 3,119,060 |
| Rockyou | Social forum | USA | English | 32,603,387 | 18,377 | 3,140 | 0.07% | 32,581,870 | 14,326,970 |
| Yahoo | Portal (e.g., E-commerce) | USA | English | 453,491 | 10,657 | 0 | 2.35% | 442,834 | 342,510 |
| Phpbb | Programmer forum | USA | English | 255,421 | 45 | 3 | 0.02% | 255,373 | 184,341 |

∗ We remove 13M duplicate accounts from 7k7k, because we identify that they are copied from Tianya, as we will detail in Section 3.2.

3.1 Ethics consideration and dataset descriptions

The nine datasets described here are different in terms of service, size, language and user localization (see Table 1). They were hacked by external attackers or disclosed by anonymous insiders, and were subsequently made public on the Internet. We realize that though publicly available, these datasets are private data. Therefore, we only report the aggregated statistical information, and treat each individual account as confidential such that using it in our research will not increase risk to the corresponding victim, i.e., no personally identifiable information can be learned. Furthermore, these datasets may be exploited by attackers as training sets or cracking dictionaries, while our use of them is beneficial both for the academic community to understand password choices of Chinese netizens and for security administrators to secure user accounts.

The first three datasets are all from the US. The Rockyou dataset [31] includes over 32M passwords and was hacked from the social application site rockyou.com in Dec. 2009 by an SQL injection attack. The Phpbb dataset contains around 255K passwords leaked from phpbb.com, a forum on the development of the PHP scripting language, in Jan. 2009. The Yahoo dataset [32] consists of about 442K passwords leaked by the hacker group named D33Ds in July 2012.

The following six datasets were all leaked from Chinese sites in Dec. 2011 when a series of security breaches happened [33], and we collected them at that time. The 6.42M CSDN passwords were hacked from csdn.net, a popular community for software developers in China. The 31.76M Tianya passwords were leaked from tianya.cn, an influential Chinese BBS forum. The remaining four datasets are all from popular Chinese gaming websites, of which the Dodonew dataset deserves attention, for it is from a site with e-commerce services and its accounts are generally perceived to be of important value. As expected, this dataset is the strongest one among all datasets (see Section 4).
Note that three pairs of datasets (i.e., Tianya vs. Rockyou, Dodonew vs. Yahoo, and CSDN vs. Phpbb) will be used for strength comparison in Section 4, for each pair is of the same service.

3.2 Data cleansing

Before examining the password characteristics, we perform data cleansing. We get rid of email addresses and user names from the original data. We note that the remaining data contains some strings whose lengths are over 100 (e.g., there are dozens of passwords in Rockyou with a length up to 128), which strongly indicates that they are junk information. We remove such strings and others that include characters beyond the 95 printable ASCII symbols. We further remove strings whose length is over 30, because after having manually scrutinized the original datasets, we find that these long strings do not seem to be generated by human beings, but more likely by password managers. Moreover, such unusually long passwords are often beyond the scope of attackers who specially care about cost-effectiveness. Though the fraction of excluded passwords is negligible, this step of cleansing unifies the input and largely simplifies the later data processing.

Of particular importance is our observation that the Tianya dataset and the 7k7k dataset largely overlap with each other. We were first puzzled by the fact that the password “111222tianya” originally lay in the top-10 most popular list of both datasets. We manually scrutinized the original datasets (i.e., before removing the email addresses and user names) and were surprised to find that there are around 3.91 million (actually 3.91*2 million due to a split representation of 7k7k accounts, as we will discuss later) joint accounts in both datasets, and we realized that someone has probably copied these joint accounts from one dataset to the other.

Now, a natural question arises: From which dataset have these joint accounts been copied? We believe that these joint accounts were copied from Tianya to 7k7k mainly for two reasons. Firstly, it is unreasonable for 0.34% of the users in 7k7k to insert the string “tianya” into their 7k7k passwords. The second reason is quite subtle yet convincing. In the original Tianya dataset, we find that the joint accounts are of the form {user name, email address, password}, while in the original 7k7k dataset such a joint account is divided into two parts: {user name, password} and {email address, password}. The password “111222tianya” occurs 64822 times in 7k7k and 48871 times in Tianya, and one gets that 64822/2 < 48871.
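The filtering rule and the count comparison described above can be sketched in a few lines. This is a minimal illustration, not the cleansing script used for Table 1: the function name `keep_password` is hypothetical, and rejecting empty strings is an added assumption.

```python
# Printable ASCII is the 95-character range from space (0x20) to tilde (0x7E).
MAX_LEN = 30


def keep_password(pw: str) -> bool:
    """Keep strings of 1 to 30 characters drawn from the 95 printable ASCII symbols.

    Assumption: empty strings are rejected as well.
    """
    return 0 < len(pw) <= MAX_LEN and all(0x20 <= ord(c) <= 0x7E for c in pw)


# Occurrence counts of "111222tianya" quoted in the text.
in_7k7k, in_tianya = 64822, 48871

# Each joint account appears twice in 7k7k (user-name form and email form).
per_form_7k7k = in_7k7k // 2
assert per_form_7k7k < in_tianya

# Tianya accounts with this password that have no counterpart in 7k7k.
unmatched_in_tianya = in_tianya - per_form_7k7k
print(per_form_7k7k, unmatched_in_tianya)  # 32411 16460
```

Since Tianya holds more accounts with this password than 7k7k does after de-duplicating the two 7k7k forms, the counts are consistent with a subset of Tianya accounts having been copied into 7k7k.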
Therefore, it is more plausible for someone to have copied some (i.e., 64822/2 of a total of 48871) accounts using “111222tianya” as the password from Tianya to 7k7k, rather than to have copied all the accounts (i.e., 64822/2) using “111222tianya” as the password from 7k7k to Tianya and then created 16460 (= 48871 − 64822/2) additional such accounts.

After removing 7.82 million joint accounts from 7k7k, we found that all of the passwords in the remaining 7k7k dataset occur an even number of times (at least twice). This is expected, for we observe that in 7k7k half of the accounts are of the form {user name, password}, while the rest are of the form {email address, password}, and it is likely that both forms are directly derived from the form {user name, email address, password}. For instance, both {wanglei, wanglei123} and {[email protected], wanglei123} are actually derived from the single account {wanglei, [email protected], wanglei123}. Consequently, we further divide 7k7k into two equal parts and discard one part. The detailed information on data cleansing is summarized in Table 1.

As far as we know, Li et al.’s work [18] is the only one that has exploited the datasets Tianya and 7k7k. However, contrary to what we have done in this work, Li et al. think that the 3.91M joint accounts are copied from 7k7k to Tianya. Their only reason is that, when dividing these two datasets into the reused passwords group (i.e., the joint accounts) and the not-reused passwords group, they find that “the proportions of various compositions are similar between the reused passwords and the 7k7k’s not-reused passwords, but different with Tianya’s not-reused passwords”. However, they have never explained what these “various compositions” are. Such vague statements also cannot answer the critical question: why are there so many 7k7k users using “111222tianya” as their passwords? Hence, we believe that they should have removed 3.91*2 million joint accounts from 7k7k rather than 3.91 million from Tianya. In addition, they fail to note that all the passwords in 7k7k occur an even number of times, which is extremely abnormal. As a result, Li et al. render two of their five Chinese datasets at best useless and at worst counterproductive, because contaminated data would very likely lead to inaccurate results and unreliable comparisons.

3.3 Password characteristics

If passwords were randomly generated, they would likely exhibit a uniform distribution of characters. However, random passwords are difficult for human beings to memorize, and thus common users rarely choose meaningless random strings. It is widely hypothesized that user-generated passwords are greatly influenced by users’ native languages (and cultural background), yet so far few relevant empirical results have been given.
To fill this gap, here we first illustrate the character distributions of the nine password datasets, and then measure the closeness of passwords with their native languages in terms of inversion number of the character distributions (in descending order).

[Figure 1: letter frequency (y-axis, 0% to 12%) for each letter (x-axis, ordered a i n e o h g l u w y s z q x c d j m b t r f k p v), with one series per dataset: Tianya, Dodonew, CSDN, 7k7k, 178, Duowan, Rockyou, Yahoo, Phpbb.]
Fig. 1. Letter distributions of user-chosen passwords

Fig. 1 shows that passwords from different language groups have significantly varied letter distributions. What is quite surprising is that, even though generated and used in vastly diversified web services, passwords within the same language group have quite similar letter distributions. This suggests that, when given a password dataset, one can largely determine the native language of its users by investigating its letter distribution. Arranged in descending order, the letter distribution of all Chinese passwords is aineohglwuyszxqcdjmbtfrkpv, while this distribution for all English passwords is aeionrlstmcdyhubkgpjvfwzxq. While some letters (e.g., ‘a’, ‘e’ and ‘i’) occur frequently in both groups, some letters (e.g., ‘q’ and ‘r’) only occur frequently in one group. Such information can be exploited by attackers to reduce the search space and optimize their cracking strategies. Note that all the percentages here are handled case-insensitively (e.g., the percentage of letter ‘a’ in Tianya is computed as $\frac{\#\,\text{of occurrences of lower/uppercase letter `a' in Tianya}}{\#\,\text{of total occurrences of all letters in Tianya}} = 11.29\%$).

While users’ passwords are greatly affected by their native languages, the letter frequencies in general language usage may not well represent the frequencies of letters that are used in passwords. According to Huang et al.’s work [34], the letter distribution of the Chinese language (i.e., written Chinese texts like literary work, newspapers and academic papers, converted into Chinese Pinyin) is inauhegoyszdjmxwqbctlpfrkv. This shows that some letters (e.g., ‘i’ and ‘u’), which are very popular in Chinese user passwords, appear much less frequently in written Chinese texts. A similar observation also holds for English passwords, where the letter distribution of the English language (i.e., etaoinshrdlcumwfgypbvkjxqz) is obtained from www.cryptograms.org/letter-frequencies.php.

To further explore the closeness of passwords with their native languages and with the passwords from other datasets, we measure the inversion number of the letter distribution sequence (in descending order) between two password datasets (as well as languages), and the results are summarized in Table 2. “Pinyin fullname” is a dictionary consisting of 2,426,841 unique Chinese full names (e.g., wanglei and zhangwei), “Pinyin word” is a dictionary consisting of 127,878 unique Chinese words (e.g., chang and cheng), and these two dictionaries will be detailed later.
Note that the inversion number of sequence A to sequence B is equal to that of B to A. For instance, the inversion number of inauh to aniuh is 3, which is equal to that of aniuh to inauh. Table 2 shows that the inversion number of letter distributions between passwords from the same language group is generally much smaller than that of passwords from different language groups. This value is also distinctly smaller than that of the letter distributions between passwords and their native language (see the bold values in Table 2). All these indicate that passwords from different languages are intrinsically different from each other in letter distributions, and that passwords are close to their native language yet the distinction is still noticeable (measurable).

Fig. 2 depicts the length distributions of passwords. Irrespective of the web service, language and culture differences, the most common password lengths of every dataset are between 6 and 10, among which length 6 or 8 takes the lead. Passwords of these five lengths alone account for more than 75% of every entire dataset, and this value rises to 90% if we consider passwords with lengths of 5 to 12. As expected, very few users prefer passwords longer than 15 characters. Notably, people seem to prefer even lengths over odd lengths. Another interesting observation is that CSDN exhibits only one peak in its length distribution curve and has much fewer passwords (i.e., only 2.16%) with lengths below 8. This is likely due to the fact that this site has enforced a minimum length-8 policy since an early stage.
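The two measurements behind Fig. 1 and Table 2 (the descending letter order of a dataset and the inversion number between two such orders) can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, ties in frequency are broken alphabetically (the text does not say how ties are handled), and the quadratic pair count is chosen for clarity since the sequences have only 26 letters.

```python
from collections import Counter
from itertools import combinations
import string


def letter_order(passwords):
    """Return the letters a-z sorted by descending case-insensitive frequency."""
    counts = Counter(
        c for pw in passwords for c in pw.lower() if c in string.ascii_lowercase
    )
    # Assumption: ties (and letters that never occur) fall back to alphabetical order.
    return "".join(sorted(string.ascii_lowercase, key=lambda c: (-counts[c], c)))


def inversion_number(a: str, b: str) -> int:
    """Count the letter pairs whose relative order differs between a and b."""
    if sorted(a) != sorted(b) or len(set(a)) != len(a):
        raise ValueError("both sequences must be permutations of the same distinct letters")
    rank_in_b = {c: i for i, c in enumerate(b)}
    ranks = [rank_in_b[c] for c in a]
    return sum(1 for x, y in combinations(ranks, 2) if x > y)


print(inversion_number("inauh", "aniuh"))  # 3
```

Because a pair of letters is either ordered the same way in both sequences or not, the count is the same whichever sequence is taken as the reference, which matches the symmetry noted in the text.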


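TABLE 2
Inversion number of the letter distributions (in descending order) between the datasets (“PWs” stands for passwords)

| | Tianya | 7k7k | Dodonew | 178 | CSDN | Duowan | All Chinese PWs | Chinese language | Pinyin fullname | Pinyin word | Rockyou | Yahoo | Phpbb | All English PWs | English language |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Tianya | 0 | 15 | 22 | 42 | 15 | 17 | 14 | 40 | 32 | 37 | 100 | 100 | 113 | 100 | 99 |
| 7k7k | 15 | 0 | 23 | 31 | 14 | 10 | 13 | 41 | 39 | 38 | 105 | 101 | 112 | 105 | 96 |
| Dodonew | 22 | 23 | 0 | 42 | 21 | 15 | 12 | 52 | 40 | 49 | 94 | 92 | 105 | 94 | 99 |
| 178 | 42 | 31 | 42 | 0 | 41 | 35 | 32 | 56 | 48 | 47 | 134 | 130 | 141 | 134 | 125 |
| CSDN | 15 | 14 | 21 | 41 | 0 | 12 | 15 | 45 | 39 | 42 | 95 | 95 | 106 | 95 | 96 |
| Duowan | 17 | 10 | 15 | 35 | 12 | 0 | 9 | 49 | 39 | 44 | 99 | 97 | 110 | 99 | 98 |
| All Chinese PWs | 14 | 13 | 12 | 32 | 15 | 9 | 0 | 44 | 34 | 43 | 104 | 102 | 115 | 104 | 101 |
| Chinese language | 40 | 41 | 52 | 56 | 45 | 49 | 44 | 0 | 38 | 27 | 118 | 114 | 123 | 118 | 113 |
| Pinyin fullname | 32 | 39 | 40 | 48 | 39 | 39 | 34 | 38 | 0 | 31 | 124 | 122 | 135 | 124 | 123 |
| Pinyin word | 37 | 38 | 49 | 47 | 42 | 44 | 43 | 27 | 31 | 0 | 115 | 113 | 124 | 115 | 112 |
| Rockyou | 100 | 105 | 94 | 134 | 95 | 99 | 104 | 118 | 124 | 115 | 0 | 12 | 23 | 0 | 47 |
| Yahoo | 100 | 101 | 92 | 130 | 95 | 97 | 102 | 114 | 122 | 113 | 12 | 0 | 15 | 12 | 39 |
| Phpbb | 113 | 112 | 105 | 141 | 106 | 110 | 115 | 123 | 135 | 124 | 23 | 15 | 0 | 23 | 44 |
| All English PWs | 100 | 105 | 94 | 134 | 95 | 99 | 104 | 118 | 124 | 115 | 0 | 12 | 23 | 0 | 47 |
| English language | 99 | 96 | 99 | 125 | 96 | 98 | 101 | 113 | 123 | 112 | 47 | 39 | 44 | 47 | 0 |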

TABLE 3
Top 10 most popular passwords of each dataset

| Rank | Tianya | 7k7k | Dodonew | 178 | CSDN | Duowan | Rockyou | Yahoo | Phpbb |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 123456 | 123456 | 123456 | 123456 | 123456789 | 123456 | 123456 | 123456 | 123456 |
| 2 | 111111 | 0 | a123456 | 111111 | 12345678 | 111111 | 12345 | password | password |
| 3 | 000000 | 111111 | 123456789 | zz12369 | 11111111 | 123456789 | 123456789 | welcome | phpbb |
| 4 | 123456789 | 123456789 | 111111 | qiulaobai | dearbook | 123123 | password | ninja | qwerty |
| 5 | 123123 | 123123 | 5201314 | 123456aa | 00000000 | 000000 | iloveyou | abc123 | 12345 |
| 6 | 123321 | 5201314 | 123123 | wmsxie123 | 123123123 | 5201314 | princess | 123456789 | 12345678 |
| 7 | 5201314 | 123 | a321654 | 123123 | 1234567890 | 123321 | 123321 | 12345678 | letmein |
| 8 | 12345678 | 12345678 | 12345 | 000000 | 88888888 | a123456 | rockyou | sunshine | 111111 |
| 9 | 666666 | 12345678 | 000000 | qq66666 | 111111111 | suibian | 12345678 | princess | 1234 |
| 10 | 111222tianya | wangyut2 | 123456a | w2w2w2 | 147258369 | 12345678 | abc123 | qwerty | 123456789 |
| Sum of top10 | 2,297,505 | 440,300 | 533,285 | 793,132 | 670,881 | 338,012 | 669,126 | 4,476 | 7,135 |
| Total accounts | 30,901,241 | 5,423,287 | 16,258,891 | 9,072,965 | 6,428,277 | 4,982,730 | 32,581,870 | 442,834 | 255,373 |
| Percent of top10 | 7.43% | 8.12% | 3.28% | 8.74% | 10.44% | 6.78% | 2.05% | 1.01% | 2.79% |

Fig. 2. Length distributions of passwords investigated (x-axis: password length; y-axis: percentage of passwords in each dataset).

Fig. 3 portrays the frequency vs. the rank of passwords from different datasets on a log-log scale. We first sort each dataset by password frequency in descending order, so that each individual password is associated with a frequency $f_r$ and a rank $r$. Interestingly, the curve for each dataset closely approximates a straight line, and this trend is more pronounced if we take all nine curves as a whole. This corroborates the Zipf theory [35]: $f_r$ and $r$ follow a relationship of the type $\log f_r = \log C - s \cdot \log r$, where $C$ and $s$ are constants. In particular, $s$ is the absolute value of the slope of the Zipf linear regression line and is slightly less than 1.0. The Zipf theory indicates that the popularity of user-generated passwords decreases polynomially with the increase of their rank. This further implies that a few passwords are overly popular, while the majority are sparsely scattered in the password space.
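As an illustration of the Zipf relationship, the constants $C$ and $s$ can be estimated by an ordinary least-squares fit on the log-log points. The sketch below is a minimal version under stated assumptions: `fit_zipf` is a hypothetical helper, $f_r$ is taken to be the relative frequency of the rank-$r$ password, and every distinct password is used as a regression point, which is not necessarily how the regression line in Fig. 3 was obtained.

```python
import math
from collections import Counter


def fit_zipf(passwords):
    """Least-squares fit of log10(f_r) = log10(C) - s * log10(r).

    Returns (C, s), where f_r is the relative frequency of the rank-r password.
    """
    counts = sorted(Counter(passwords).values(), reverse=True)
    if len(counts) < 2:
        raise ValueError("need at least two distinct passwords")
    total = sum(counts)
    xs = [math.log10(rank) for rank in range(1, len(counts) + 1)]
    ys = [math.log10(c / total) for c in counts]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # s is the absolute value of the (negative) slope of the regression line.
    return 10 ** intercept, -slope
```

On a real dataset the long tail of passwords that occur only once may dominate such a fit, so the estimate of $s$ depends on which ranks are included.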


Fig. 3. Frequency distributions of passwords investigated (log-log scale; x-axis: rank; y-axis: frequency).

Table 3 illustrates the top-10 most frequent passwords from different services. The most frequent password in every dataset is “123456”, with CSDN being the only exception, which is likely due to the CSDN password policy requiring that no password be shorter than 8 characters. “111111” follows closely behind. Other popular Chinese passwords include “123123” and “123321”, which are composed entirely of digits in simple patterns such as repetition and palindrome, while popular ones in English datasets tend to be meaningful letter strings (e.g., the eternal theme of love: frankly, “iloveyou”, or euphemistically, “princess”). Our results confirm the folklore that “back at the dawn of the Web, the most popular password was 12345. Today, it is one digit longer but hardly safer: 123456.”

It is worth noting that, for each digit, there are many Chinese characters with similar sounds, which makes it easy to obtain digit sequences that sound like meaningful phrases. Hence, it is unsurprising to see that “5201314”, which sounds like “I love you forever and ever”, ranks as the 7th and 6th most popular password in Tianya and Duowan, respectively. Moreover, Chinese passwords are highly concentrated: the top-10 most popular ones alone account for as much as 6.78%∼10.44% of each dataset, with Dodonew being the only exception. However, even Dodonew reaches 3.28%, while the English datasets are all below 2.80%.

Since digits are popular in the top-10 passwords of Chinese datasets, are they also popular in the whole datasets? To answer this question, we investigate the frequencies of password patterns that involve digits, and the results on the top-2 most frequent ones are shown in Table 4. More results (i.e., on the top-10 most frequent ones) can be found in the supplemental material. The first column of the table denotes the pattern of a password as in [13] (i.e., L denotes a lower-case letter sequence, D a digit sequence, U an upper-case letter sequence and S a symbol sequence; e.g., the structure pattern of password “Wanglei123” is ULD). We see that, on average, more than 50% of Chinese web passwords are composed only of digits, while this value for English datasets is only 15.77%. In contrast, English users prefer the pattern LD. Note that all the percentages hereafter in this work are taken by dividing by the corresponding total accounts. For example, the percentage at the upper-left corner of Table 4 is computed as

$$\frac{\#\,\text{of passwords with pattern D}}{\#\,\text{of total passwords in Tianya}} = \frac{19706174}{30901241} = 63.77\%.$$

TABLE 4
Password patterns with digits (The percentage is taken by dividing the corresponding total accounts.)

| Patterns | Tianya | 7k7k | Dodonew | 178 | CSDN | Duowan | Avg. Chinese | Rockyou | Yahoo | Phpbb | Avg. English |
|---|---|---|---|---|---|---|---|---|---|---|---|
| D | 63.77% | 59.62% | 30.76% | 48.07% | 45.01% | 52.84% | 52.93% | 15.94% | 5.89% | 12.06% | 15.77% |
| LD | 14.71% | 17.98% | 43.50% | 31.12% | 26.14% | 23.97% | 23.72% | 27.70% | 38.27% | 19.14% | 27.78% |
| Sum of top2 | 78.48% | 77.60% | 74.25% | 79.20% | 71.15% | 76.81% | 76.66% | 43.64% | 44.16% | 31.20% | 43.55% |

It is somewhat surprising to see that the sum of merely the two digit-related patterns (i.e., D and LD) accounts for over 70% of every Chinese dataset, indicating that Chinese users rely heavily on digits to build their passwords. This is probably due in large part to the fact that most Chinese users are unfamiliar with the English language (and Roman letters on the keyboard). If this is the case, is there any meaningful information underlying these digit sequences?

To gain insight into the underlying semantic patterns, we construct several dictionaries of different semantic categories and investigate their prevalence (see Table 5). “English word lower” is from http://www.mieliestronk.com/wordlist.html and contains about 58,000 popular lower-case English words. “English lastname” is a dictionary consisting of 18,839 last (family) names with over 0.001% frequency in the US population during the 1990 census, according to the US Census Bureau [36]. “English firstname” contains the 5,494 most common first names (1,219 male and 4,275 female names) in the US [36]. “English fullname” is a Cartesian product of “English firstname” and “English lastname”, consisting of about 1.04 million most common English full names. To get a Chinese full name dictionary, we make use of the 20 million hotel reservations dataset [37] leaked in Dec. 2013. The Chinese family name dictionary includes 504 family names which are officially recognized in China. Since the first names of Chinese users are widely distributed and can be almost any combination of Chinese words, we do not consider them in this work. As the names are originally in Chinese, we converted them into Pinyin without tones by using a Python procedure from https://pypinyin.readthedocs.org/en/latest/ and removed the duplicates. We call these two dictionaries “Pinyin fullname” and “Pinyin familyname”, respectively. “Pinyin word lower” is a Chinese word dictionary known as “SogouLabDic.dic”, and “Pinyin place” is a Chinese place dictionary. Both of them are from [38] and also originally in Chinese, and we convert them into Pinyin in the same way as we handle the name dictionaries.

“Mobile number” consists of all potential Chinese mobile numbers, which are 11-digit strings with the first seven digits conforming to pre-defined specific values and the last four digits being random. As for the birthday dictionaries, we use patterns to match digit strings that might be birthdays. For example, “YYYYMMDD” stands for a birthday pattern in which the first four digits indicate the year (from 1900 to 2014), the middle two represent the month (from 01 to 12) and the last two denote the day (from 01 to 31). “PW with a l+-letter substring” is a subset of the corresponding dataset (see Table 5) and consists of all passwords that include a letter substring no shorter than l, and similarly for “PW with a l+-digit substring”.

Table 5 shows the various semantic patterns existing in Chinese and English web passwords. We can see that a large fraction of English users tend to use raw English words as their password building blocks. More specifically, 25.88% of English users insert a 5-letter or longer (denoted by 5+-letter) word into their passwords, and this figure accounts for more than a third of the total passwords with a 5+-letter substring. In contrast, few Chinese users choose raw Pinyin words or English words to build passwords, yet they prefer Pinyin names, especially full names.
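Two of the measurements described above, the L/D/U/S structure pattern of a password and the “YYYYMMDD” birthday pattern, can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, treating every non-alphanumeric or non-ASCII character as a symbol is an assumption, and so is scanning every 8-character window of a password for a date (the text does not say whether dates are matched inside longer digit runs).

```python
import re


def structure_pattern(password):
    """Map a password to its L/U/D/S pattern, e.g. "Wanglei123" -> "ULD"."""
    pattern = []
    for ch in password:
        if ch.isascii() and ch.islower():
            cls = "L"
        elif ch.isascii() and ch.isupper():
            cls = "U"
        elif ch.isascii() and ch.isdigit():
            cls = "D"
        else:
            cls = "S"  # everything else is treated as a symbol (assumption)
        if not pattern or pattern[-1] != cls:
            pattern.append(cls)  # collapse a run of one class into one letter
    return "".join(pattern)


def is_yyyymmdd(digits):
    """True if an 8-digit string fits YYYYMMDD with the ranges given in the text."""
    if not re.fullmatch(r"[0-9]{8}", digits):
        return False
    year, month, day = int(digits[:4]), int(digits[4:6]), int(digits[6:])
    return 1900 <= year <= 2014 and 1 <= month <= 12 and 1 <= day <= 31


def contains_yyyymmdd(password):
    """True if any 8-character window of the password is a YYYYMMDD date."""
    return any(is_yyyymmdd(password[i:i + 8]) for i in range(len(password) - 7))
```

Like the pattern described in the text, `is_yyyymmdd` checks only the ranges of the three fields, so an impossible calendar date such as 19900231 is still accepted.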
Surprisingly, of all the Chinese passwords (22.42%) that include a 5+-letter substring, more than half (11.24%) include a 5+-letter Pinyin full name. There is even a non-negligible proportion (i.e., 4.10%) of English passwords that contain a 5+-letter Pinyin full name, and a reasonable explanation is that many Chinese users have created accounts on these English sites. For instance, the popular Chinese name “zhangwei” appears in both Rockyou and Yahoo. We also note that English names are widely used in English passwords, yet full names are less popular than last names and first names. As far as we know, this is the first time that name patterns have been explored in a large-scale empirical password study.

TABLE 5
The prevalence of various dictionary words in user passwords (The percentage is taken by dividing the total accounts.)

| Dictionary | Tianya | 7k7k | Dodonew | 178 | CSDN | Duowan | Avg. Chinese | Rockyou | Yahoo | Phpbb | Avg. English |
|---|---|---|---|---|---|---|---|---|---|---|---|
| English word lower (len ≥ 5) | 2.08% | 2.05% | 3.69% | 0.83% | 3.41% | 2.37% | 2.41% | 23.54% | 29.49% | 24.60% | 25.88% |
| English firstname (len ≥ 5) | 1.11% | 0.93% | 2.23% | 0.53% | 1.47% | 1.19% | 1.24% | 18.80% | 15.21% | 9.20% | 14.40% |
| English lastname (len ≥ 5) | 2.16% | 2.34% | 4.48% | 1.93% | 3.65% | 2.77% | 2.89% | 20.16% | 20.82% | 15.22% | 18.73% |
| English fullname (len ≥ 5) | 4.03% | 4.30% | 6.14% | 4.99% | 6.58% | 5.07% | 5.18% | 13.05% | 11.35% | 8.25% | 10.88% |
| English name any (len ≥ 5) | 4.60% | 4.65% | 6.32% | 5.20% | 6.87% | 5.18% | 5.35% | 27.67% | 26.51% | 18.71% | 24.30% |
| Pinyin word lower (len ≥ 5) | 7.34% | 8.56% | 10.82% | 10.24% | 11.51% | 9.92% | 9.73% | 3.33% | 2.99% | 2.50% | 2.94% |
| Pinyin familyname (len ≥ 5) | 1.35% | 1.64% | 2.34% | 2.24% | 2.47% | 1.88% | 1.99% | 0.05% | 0.07% | 0.07% | 0.06% |
| Pinyin fullname (len ≥ 5) | 8.39% | 9.87% | 12.91% | 11.81% | 13.14% | 11.29% | 11.24% | 4.79% | 4.17% | 3.35% | 4.10% |
| Pinyin name any (len ≥ 5) | 8.56% | 10.05% | 13.31% | 12.11% | 13.46% | 11.53% | 11.50% | 4.80% | 4.18% | 3.36% | 4.11% |
| Pinyin place (len ≥ 5) | 1.24% | 1.27% | 1.64% | 1.58% | 2.12% | 1.48% | 1.55% | 0.20% | 0.18% | 0.16% | 0.18% |
| PW with a 5+-letter substring | 18.51% | 19.99% | 26.95% | 19.38% | 28.03% | 21.70% | 22.42% | 71.69% | 75.93% | 68.66% | 72.09% |
| Date YYYY | 14.38% | 12.82% | 12.45% | 10.06% | 16.91% | 14.33% | 13.49% | 4.34% | 4.30% | 2.77% | 3.80% |
| Date YYYYMMDD | 6.06% | 5.42% | 3.93% | 3.94% | 8.78% | 6.17% | 5.72% | 0.10% | 0.05% | 0.09% | 0.08% |
| Date MMDD | 24.99% | 19.97% | 17.08% | 16.46% | 24.45% | 22.59% | 20.92% | 7.53% | 4.46% | 3.59% | 5.20% |
| Date YYMMDD | 21.29% | 15.89% | 12.70% | 13.09% | 20.67% | 18.28% | 16.99% | 3.24% | 1.23% | 1.55% | 2.01% |
| Date any above | 36.61% | 30.39% | 26.66% | 27.07% | 35.30% | 33.58% | 31.60% | 11.33% | 8.77% | 6.45% | 8.85% |
| PW with a digit | 89.49% | 88.42% | 88.52% | 90.76% | 87.10% | 89.26% | 88.93% | 54.04% | 64.74% | 46.14% | 54.97% |
| PW with a 4+-digit substring | 81.64% | 76.98% | 71.90% | 78.76% | 78.38% | 80.60% | 78.04% | 24.72% | 21.85% | 19.33% | 21.97% |
| PW with a 6+-digit substring | 75.59% | 68.32% | 61.16% | 70.02% | 69.87% | 73.10% | 69.68% | 17.77% | 8.48% | 11.28% | 12.51% |
| PW with a 8+-digit substring | 28.04% | 27.56% | 26.53% | 26.37% | 49.73% | 31.03% | 31.54% | 6.88% | 2.50% | 3.73% | 4.37% |
| Mobile Phone Number (11-digit) | 2.90% | 1.76% | 2.63% | 3.97% | 3.75% | 2.44% | 2.91% | 0.07% | 0.01% | 0.02% | 0.03% |
| PW with a 11+-digit substring | 4.71% | 2.09% | 3.39% | 5.08% | 7.57% | 3.35% | 4.36% | 0.75% | 0.17% | 0.18% | 0.37% |

Equally surprisingly, we find that, on average, 16.99% of Chinese users simply insert a six-digit birthday into their passwords. Besides, about 30.89% of Chinese users employ a 4+-digit date as a password building block, which is 3.59 times that of English users (i.e., 8.61%); 13.49% of Chinese users insert a four-digit year into their passwords, which is about 3.55 times that of English users (3.80%, which is comparable to the results reported in [39]). We note that there might be some overestimates, for there is no way to tell whether some digit sequences are dates or not, e.g., 010101 and 520520. These two sequences may be dates, yet they may also carry other semantic meanings (e.g., 520520 sounds like “I love you...”). Nevertheless, this does not affect our conclusion that birthdays play a vital role in Chinese user passwords.

Another interesting observation is that about 3% of Chinese users just use their 11-digit mobile numbers as passwords, making up 39.59% of all passwords with an 11+-digit substring. While there are few passwords longer than 10, if an attacker can determine that the victim uses a long password, she is likely to succeed by just trying the victim’s 11-digit mobile number. This reveals a practical attacking strategy against long Chinese passwords.

Note that there are some unavoidable ambiguities when determining whether a text/digit sequence belongs to a specific dictionary, and improper resolution of these ambiguities would lead to an overestimation or underestimation of human choices. Here we take the dictionary “YYMMDD” for illustration. For example, both 111111 and 520521 fall into “YYMMDD” and are excessively popular, yet it is more likely that users choose them simply because they are easily memorable repetition numbers or meaningful strings, and considering them as ordinary dates would lead to an overestimation. Nevertheless, they can really be dates (e.g., 111111 stands for “Nov. 11th, 2011” and 520521 stands for “May 21st, 1952”), and completely excluding them from “YYMMDD” would result in an underestimation of the usage of dates. Thus, assuming that user birthdays are randomly distributed, we assign the expectation of the frequency of dates (denoted by E), instead of zero, to the frequency of these abnormal dates. We manually identify 17 abnormal dates, each of which originally has a frequency greater than 10E and appears in every top-1000 list of the six Chinese datasets. In this way, the dilemma can be largely resolved. We similarly handle 21 abnormal items in the dictionary “MMDD”. As for the other 19 dictionaries in Table 5, few abnormal items can be identified, and thus they are processed as usual.

We conjecture that the two characteristics (i.e., excessively high usage of long Pinyin names and birthdays) of Chinese web passwords give an attacker the potential to greatly reduce her search space, for birthdays and popular names are generally drawn from a much smaller space than random strings. If this is the case, then it is fair to say that these two characteristics are two serious weaknesses of Chinese passwords. In the following, we will establish this conjecture by a series of experiments.
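The treatment of abnormal dates described earlier in this section (replacing the frequency of an abnormal date by the expectation E) can be illustrated with the sketch below. It rests on several assumptions that are not stated in the text: E is taken as the total number of date matches divided by the number of possible dates of the pattern, the helper names are hypothetical, and the automatic filter only proposes candidates, whereas the abnormal dates in this work were identified manually.

```python
from collections import Counter


def expected_frequency(date_counts, n_possible_dates):
    """E: mean frequency per date if dates were uniformly distributed (assumption)."""
    return sum(date_counts.values()) / n_possible_dates


def abnormal_candidates(date_counts, top1000_lists, n_possible_dates, factor=10):
    """Dates with frequency > factor * E that occur in every top-1000 list."""
    e = expected_frequency(date_counts, n_possible_dates)
    return {
        date
        for date, count in date_counts.items()
        if count > factor * e and all(date in top for top in top1000_lists)
    }


def adjusted_total(date_counts, abnormal, n_possible_dates):
    """Total date occurrences after replacing each abnormal date's count by E."""
    e = expected_frequency(date_counts, n_possible_dates)
    return sum(e if date in abnormal else count for date, count in date_counts.items())
```

For “YYMMDD”, `n_possible_dates` could be set to 100 × 12 × 31 = 37,200 under the range-only matching described for the birthday patterns; this value is likewise an assumption of the sketch.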

4 S TRENGTH OF C HINESE PASSWORDS In this section, we employ the state-of-the-art password attacking algorithms (i.e., PCFG-based and Markov-Chainbased [9]) to evaluate the strength of Chinese passwords, for the traditional entropy-based metric has been demonstrated far from accurate [29]. We further investigate whether the characteristics identified in Sec. 3.3 can be practically exploited to reduce password space and facilitate guessing. We note that probability-threshold graphs can provide cracking results on the full spectrum of passwords, yet their theoretical basis is left as “interesting future research” [9]. Moreover, they only approximate the likelihood of passwords and thus cannot yield precise results. For these reasons, as did in [10], [29], [39], we herein employ guess-number graphs. 4.1 PCFG-based attacks PCFG-based model was first introduced by Weir et al. [13], and it has been revealed to be one of state-of-the-art password cracking algorithms by recent research (e.g., [9], [11]). Unlike JTR [27] that uses ad hoc mangling rules, PCFGbased approach learns the way users create their passwords and automatically derives mangling rules from a training set. Firstly, it divides all the passwords in a training set into segments of similar character sequences and obtains the corresponding base structures and their associated probabilities of occurrence. For example, password “wanglei@123” is divided into the L segment “wanglei”, S segment “@” and D segment “123”, and its base structure is L7 S1 D3 . The L7 S1 D3 probability of L7 S1 D3 is #of#of base structures . Such information is used to generate the probabilistic context-free grammar. Then, one can derive password guesses in decreasing order of probability, where the probability of any guess is the product of the probabilities of the productions used in its

9

60%

50%

à ì ò ô ç

40% 30% 20% 10% æ ô ç ì à

0% ò 100

Dodonew 178 CSDN 7k7k Duowan_rest

ô ô ô ì ô ì ô ô ì ô ô ì ôì ì ô ì ô ì ô ô ô ì ô æ ì æ ô æ ì ô æ ì ì æ ô æ æ ô ì æ æ ô ì æ ô ì æ ô æ ì æ ô æ ì ô æ ì æ ô ç ô ì æ ç ì ô æ ç ô ì æ ç æ ô ì æ ç æ ô ì ç æ ì ô æ ç æ ì ç æ æ ç ôô ì æ ç ì ô æ æ ç ì ô æ ô ç æ ì ô ç æ ç ì æ ô ç æ ô ç æ ì ô ç æ æ ô ç æ ì ô ç æ ô ì ç æ ì ç æ ô ç æ ì æ ç ô æ ç ì ô æ ç ô æ ì ç ô ç ì æ ç ì ì æ ç ì ç ô æ ç ì ô ç æ ç ô ì ç ç ô æ ç ô ç ç æ ç ç ô ç ç ç æ ô ç ç ì ç æ ì ç ô ç ì ç ì æ ç ì ç æ ì ç æ ô ç æ ì æ ç æ æ æ ì ç æ ç ô ì ç æ ì ô ç ç ô æ ì ç ô ô ì æ ô ç ô ç ô æ ì ô ô ç ô ô ç ì æ æ ç ô ææ ç ô ç ì æ ç ææ æ ò ô ç òò ç æææææ ò ì ò ç æ ò ô ææ ò ç ò æ ò ô ò ô ç ò ì æ ô ò ò ô ò æ ô ò ò ç æ ò ô æ ò æ ô ò ì ô ò ç æ ô ò à ì ò ô ò à ò ô ô æ ò ç à ò à à òòò à ôôôôôôôç ì òò çì æô ô ò à ô ô ì ç òò à ç æ ò ç ô ì à ç ì òò ç æ à ì ç ôôôô à ì òòò ì ç ò à ò ô ææ ç æ òò ç ì ç æ æ àà ç ç à æ à ç òòò æ ç ì à ç ôôô ç æ ç à ç ô òò ç æ ç ì ò à ç ô ç ç ò à æ ç ô à ç ìì òò çç à ôô ç æ ò à ç ô ò à ç ìì æ ôô à ìì ò ç ìì ì ì ôôô æ ò ì ç àà ì ô ç æ ì à ò ç æ à ç ì ì ç ì æ à ôô ì ò ç à ô æ à ì ç à ô ç ì ô æ ç à ò ô ç à ì ô ç à ì æ ç ô ì ì à ì ç òò ç ô à ì ç æ ì à ç à ì ô æ ç ì ò à ô æ ç ì à ç òò ì ò ô æ ò ì ò à ì ç òò à à ì ô ç à ì æ òòò ç à ì æ ç à æ ç à ì ç æ ì à ç ì àà òòòòòòòòòòòòò ç ì à òòò æ ô ç ì à ì ô æ ç ì òòòòò à ô ç æ ì ò à ì æ ô ç ì à òòòò ç ì ô ò æ ì à ò ç ô ò à ì æ ç ì ô ò ò æ à ç ô ì æ à ô ç ì æ ô ç à òòòòò ô ç ì æ ò à ì ô ç ò æ ì ô æ à ç ì òòò ç ô æ ì ô ò ç æ ì à ò ç ô æ ç ô æ òò ì ô à ç æ ò ì ô ç ì æ ô ç òò à ì æ ô ç ì æ ò ô ì ç ì ò æ ô à ì ç ô ì æ ç ì ò ì ô ò ç æ à ò ç æ ô ì æ ç ô ì ò ì ò ç ò à ô æ ò ç æ ì ô ò ì ç æ ò ô ò ì æ ç ô à æ ç ò ì ô æ ì ò ò ç æ ì ô à ô æ ç ò ì ô æ ì ç ò ì æ ò ô ç ì à æ ô ì ò ç ì æ à ô ì ò ò ç æ ì à ô æ ò ç ì àà ô à æ à ì ç à ò ô à ì æ ò ç ì ô æ à ì ò ç ô æ à ç æ ò ô ò ì à ç æ ì ô ò à æ ç æ ô ì ò ì àà ç æ ô àà ì ò ì ç æ ô ì ò ç æ ààààààà ô æ ò ç àà ô ì æ ç ì ò ô à ì æ ò ç ì àà ô ì æ ò ç ô ì æ ò à ç æ ì ô ò æ ç ì ô æ ì ò ç æ ô ì ò æ ò ì ç ô ò æ à ô ç ì ò æ ô ì ò æ ç ì à ô æ ò ç ô ì ò æ à ì ç ô à ò à æ ç ô ì ò æ à ì ç ô ì 
à æ ò ç à ô æ ò à ç ô æ ò à ç æ ò ô ç æ ò ô ì ç ààà æ ì à ô ì ç æ ò ô ç ì æ ò à ô æ ç ô æ ç ò ô æ ì à ç ò ô æ ç ô ç ò à æ ô ò ç æ ò ô ç àà æ ô ò ç æ à ò ô ç æ à ò ô ç ô æ ò ì ç ô à æ ò ç æ ò ô ò ç æ ò à ì ô ç ò æ ô à ç ò æ ò à ì æ ô ç ì ô æ ò ò ç ì ô æ ì ò ç à ì ô ò ç ò æ à ò ç æ ò ò æ ì ç à ô æ ò ì ç ô à æ ò æ ò ç ì ò à ô æ ò ç ô ì æ à ò ç ô æ ç à æ ô ì æ ì ò ç æ ô æ ì ç à ò ô æ ò ç à ô ò æ ì ç à ô æ ò ç ì à æ ì ô ç æ ô ò ì à ç æ ò ì ô ç æ ì ò à ç ô æ ò ô æ à ç ç ô ç à ò ô ò æ ò ç à ô ò ò ç ì ô æ ç ì ò ô æ ô ç ì ç àà ô ì ò ô æ ç ò à æ ô ç ì ò æ ô ç ô à ç òò ô ì à ô æ ç ô à æ ì ç à ô æ ì ç ì ò ô æ ì à ç ô à ì ç ç ô à æ ç ô à æ ç ô à æ ç ô à æ ç ô ì ç à ò ò ô ç æ à ô ç ì à ç æ ì ç ô à æ ì ô ò à æ ì ç æ à ç ô æ ô ì æ à ç à ì æ ô ì ç à æ ò ç ô æ à æ ç ô à ô ç ì à ô ì æ ì à ô ç ô ì ç à æ ç à ç ô æ à ô æ à ì ô ì ç ô ç à ì æ ç ô à ô ç ì æ à ç ì ô à æ ç à ì ô ç à ì ô ç à ì ôì æ ì ì ôì ç àà ç ôì æ òç ç àà ôç à ç ôì æç ì ç ààà ôì ì ì ç ì ô ì ì ààà à ç ôì ì à ì à ç à ôì àà æ àà à à à à ôì à ç ì à æ à à ì à àà ç à à ôì à àà æ à ç àà à àà à à ô ì à ààà àà ç àà àà ì à ààà ààà à àà ò àà à à à àààààà à à ààààààà à à ààààààà

101

102

103

104

105

106

50% 40% 30% 20% 10% 0% ìæà 100

107

60%

æ Rockyou_rest æ æ æ æ æ æ æ æ æ æ æ æ æ æ æ æ æ æ æ æ æ æ æ æ æ æ æ æ æ æ æ æ æ æ æ æ æ æ æ æ æ æ æ æ æ æ æ æ æ æ æ æ æ æ æ æ æ æ æ æ ì ì ì ì æ ì ì ì ì ì ì ì æ ì ì ì ì ì ì æ ì ì ì ì ì æ ì ì ì æ ì ì ì ì ì ì æ ì à à à ì à ì à ì à æ ì à ì à ì à ì à ì æ à à ì à ì à æ à ì à ì æ à ì ì à ì æ à ì à æ à ì ì à à æ ì à ì à æ à ì à ì à æ ì à æ à ì æ à ì æ à à æ ì æ à æ à ì à à ì à ì à ææ à ì æ à à æ ì à æ à æ æ ì à æ æ æ à æ æ ì à æ à à ì ææææ à à æ à ì à æ à ì à æ ì à æ à ì à æ à ì æ à æ ì æ ì æ ì æ ì àà æ ì æ à ì ì æ ì à æ ì ì æ ì æ à ì ì æ ì ìì ì æ ì ì ì æ ì ì à ì ì ì æ æ à æ à ì ì æ ì à æ ì æ ì à æ æ ì æ ì à æ ì ì æ ì à æ ì à æ ì ì ì à æ ì à æ ì ì æ à ì æ à à æ ì æ ì à ì æ à æ ì æ àà ì æ à æ ì à à æ ì ì æ àà ì à æ à ì à æ ì à ì à æ à æ ààà ì à à æ ì æ àààà ì æ à ì æ ì à æ ì à æ ì à æ à ì æ à ì æ à ì ì æ à ì æ à ì à æ ì æ à à à ì æ à à ì æ à à ì æ à ì æ à ì à æ ì ì æ ì à ì æ à ì æ à ì æ à ì æ à æ àà ì æ à ì à æ ì à æ à ì æ ì à æ à ì æ à ì à æ ì æ àà ì æ ì à æ à à ì æ à ì æ à ì à æ ì à æ ì à ì à æ à ì æ à ì æ ì à æ à ì à æ ì à ì æ à ì à æ ì æ à ì æ à ì à æ ì à æ ì à æ ì à æ ì à ì æ à æ ì à æ à ì æ à ì æ à à æ ì à æ à ì æ à ì æ à ì æ à æ ì à ì à æ ì à æ ì à æ ì à æ ì à æ ì à æ à ì à æ ì à ì æ à ì æ à ì æ à ì à æ ì ì à æ ì æ à ì à æ ì æ à ì à æ ì æ à ì æ à æì à ì à æ à æ ì à æ ì æ à ì à æ ì à ì æ à æ ì à æ æ ì æ ì æ ì àà ì æ à ì æ à ì æ à ì æ ì à ì æ à ì æ à ì æ à ì æ à ì æ ì à æ ì æ à ì æ à æ ì à à ì æ ì æ à ì ì æ à ì æ à æ ì ì æ à ì à æ ì æ ì à ì æ ì æ ì à æ ì à æ æ ì à æ ì à æ ì à æ ì ì æ à æ à ì æ æ ì àà æ à æ ì ì æ ì æ àà ì æ à ì æ ì æ ì æ ì ì æì àà æ à æ ì æ à ì æ à æ à à ææ ì à ææì æì ì à ì à æì ì à ì æì à àà ààà à à ìì àà ææææì àààà æ ààà ààà ì àà àà à à àààààààààà

à Yahoo ì Phpbb

101

102

Search space size

103

104

105

106

Fraction of cracked passwords

æ Tianya



60%

æ Tianya

50%

à ì ò ô ç

40% 30% 20% 10% æ ô ç ì à

0% ò 100

107


ì ô ì æ ô ì æ ô æ ô ì æ ô ì æ ô ì æ ô æ ì ô ì æ ô ì æ ì ô æ ì ô æ ì ô æ ì ô ì æ ì ô ì æ ô æ æ ô ì æ æ ô æ ì ô æ ì ô æ ì ô æ ì ô æ ì ô ì æ ô ô ì æ ì ô æ ì æ ô ô ì æ æ ì ô ç æ ì ô ç æ ô ç ì ç æ ô ì æ ç ç ô æ ì ç æ ç ì æ ô ì ç æ ô ç æ ì ç ô ì æ ô ç æ ì ç ô ç æ ç ì ç ô ì æ ô ç ì ç æ ô ô ç æ ç ì ô ì ç ô ì æ çç ì ô ç ç ô ì ç ì ô æ ç ç ì ç ô ç æ ç ô æ ô çç ç æ ô ç æ ç ô çç æ ì ç ç ô ì ì çç ô ì æ ç ç ì ç ô ì ç ç ì æ ç ç ô ì ç ô ç ì æ ç ç ô ç ô ì æ ç ô ç ì ô æ ç ì ç ô æ ç ì ô ç æ ç ô ì æ ì ç ô æ æ æ æ æ ôç æ ì ò ô ç ò ææ ì òà ô ò ææ ò ì æ ò ô ç òò ô æ ì ô òòò æææææ ô ææ æ ô òòòò æ ò ô ò æ ì ç ò ò æ ôô àà æ ç ôô à ç òòòòòò ì à ò ò ôôô æ ç à à òò ç ò à ç ì æ à ôôô ç à ç òòòòò à ì ç æ ôôôôôôôô à ç òòòò æ ò à ç ì ç à æ æ òò ç ôôôôô æ à ç à æ ç ô à ì æ ç ì æ ç à ì òòòò ç æ ô ì à ò æ ì ç à æ ô ì ò ì ç ç ì à ô æ ç ç òò ì ç à ç ô ç ì æ ò ç ì ç à ì ç ô ì ç ì ç ò ì à ç ô ç ì ç æ ò ì à ôô à çç ò ì ç à ì ì æ à ì ôô ç ì ô à ò ô æ ì ç ì æ à ì ç æ à ò à ì ç ôôôô æ à ì ô æ ò ç ô ç àà ô òò ç ì æ à ô ç ç ì ò ç ô à à ì ç æ ô ì ç æ ô ò à ç ô ç à ì ì ç æ ì à ì ç ô ç æ ì à ô à ì ç ì æ ì à ì ç ì ô òò òòòò à ò ì ç ô òò æ ì à ç ò ô ò à ì æ ç à ì æ ò ç à à ç æ ì òòòò à ç ì æ ç ì à òòòòòòò à òòò ç ì æ ò ì ç ò ì à ô ç æ ì à ô ì æ ç ì ô à ç æ ì ô ì à ç òòòòòò æ ì ç ô ò à ô ò ç æ ò ì à ò ô ç æ ò ò ô ç æ ì à ô ç ì òòò æ ì à ô ç ì ò æ ç ô ò æ à ç ô ì ò ì æ ò ç ô ì ô à æ ò ç ç ô æ ì æ ç ô ì òòò ç à æ ô ì ç ô ì æ ô ç òòò æ ì à æ ô ç ò ç ì ô òò æ ì ç ò ô à ì æ ç ì ô ì ô ò ç æ ì ò ò ô ì æ ç à ì ò ô ç æ ç æ ì à ò ô ò ì æ ç ô æ à ò ç ô ç ì ò ô æ ò ì æ ç à ô ì æ ì ò ô à ç ò æ ì ò ç æ ô ì à ò ì ç æ ô ì ç ì ò æ ô ì ò ç æ ô ì à ç ò æ ì ô ò ì ç æ à ô ì ò æ ç ô à ì æ ç ì à ò ô ò æ ì ç ì ò ô æ ì à à ç æ ô ò à ì ò æ ç ò ì àà ô à æ à à ç ò ô æ ç ì ô ò æ ì ò ô ì ç ì æ ì ô ç æ ò ì ò ç ô æ ì ì ç æ ò ô æ àààààààààààà ò ç ì ô æ à ç ò ì ô æ à ç ì ò ì ô æ ç ì ò æ ì ò ô à ç æ ì ì ô ò æ ì ò ç ô æ ç ô ì ì ò æ ç ì ò ô à ì æ ò ç ô æ ò à ç æ ô à ò æ ç à ô æ à ì ò ç ô æ ò à ô æ ç ì ò ô æ ì ò ç æ ààà ô ò ç æ ç à ò ô à æ ò à ç ì æ à à 
ò ô ç æ ò ô ç æ à ô æ ò ç ô à æ ô ç æ ò à ô ç ò ô ç à æ ò ô à ç ò ô æ ì ç æ à ô ç ò æ à ì ô ç ò æ ì ô ò à ç ô ò à æ ç ì ô æ ì ç ì à ô ò ì æ ç ò ô ò æ ì ç ò ô ò à æ ì ç ô ò à æ ç ô ì ò ç æ à ô ç æ ò ô ç æ ò ô ò ç ô ò ì à ç æ ò à ò ç æ ç æ ò ì à ò ç ô ì ò æ ò à ô æ ç ì ò ô ç æ à æ ì ô ç ò æ ô ç ì ò à æ ç ì ô ò æ ç ì à ô ò æ ì ò ç ô æ à ò æ ô ç ì ç æ ô à ò ì ç ô æ ç à ò æ ô ç æ ò à ô ç æ ò ò à ô æ ç æ ò ô ì ç ò à ô ì ç æ ì òò ç ô à ò æ ç ô à ç ò ô ì ç à æ ò ô ç æ ì à ç ô ò ì æ ò ç ô à ô æ ç ò à ç æ ô ì æ ô òæ ô æ à ç ô ç æ ì à ô ç òò ì ç ô à ç ô à æ ô ç ì ô à æ òæ ç ì æ ô à ç ô ç ô à ç ô æ ô ç à ô ç ç à ô æ ô ì æ ç à ì ô ç à ô ò ì ò ç ô æ à ì ô ç ì ç à ì æ ç à ô ç æ ì ò æ à ç ô ì æ à ì ô æ à ô ç æ ç ô æ à ç ô à æ ì òæ ç æ à ô ì ç æ à ì ô ç à ô ô ç à æ ô ç à ì ç à ç ô ì æ à ô à æ ô ç à ô ç à ì æ ô à ì ô à ç ì ô à ç æ ì ç ô ì ç àà ô ç ì à æ à ì ç ôì ì à ç ôì ç ôì àà æ òæ à ç ôì ç ôì ç æç ì àà ç ôì ì à ì à ç ì ô ì ç àà à ì ç à ôì ì à ì à à ç ôì à à æ àà à à à à ôì ç à ì à æ à à ì à à à àà ç à ôì à à à æ à à ç ààà à ô ì àà à à à à à à ç ààà àà ì à àààà à à à à à ààà ò à à ààààà àà à à àà ààààà à à ààààààà

101

102

Search space size

103

104

105

106

(a) PCFG-based attacks on six Chinese datasets (using

(b) PCFG-based attacks on three English datasets

(c) Improved PCFG-based attacks on Chinese dataset-

1 million (1M) Duowan passwords as the training set)

(using 1M Rockyou passwords as the training set)

s (using 1M Duowan passwords as the training set)

60%

40% 30% 20% 10% æ ô ì à 0% ò

100

ôôôôôôôô ôôôô ô ô ô ô ô ôôôôô ô ô ôô ô ô æ ô æì ææ æ æ ô æ æ æ æ æ æ ô æ æ æ ì æ æ ô ì æ ì æ ì ì æ ì æ ì ô ì æ æ ìì ô ì æ ì ì ì æ ô ì ì æ ì ì ô æ ì ì æ ì ô æ ì ì æ ì æ ô æì æ ì æ æ ô ì æ ô æ ô æ ô ì æ ô ì ô æ ô ô æ æ ô ì ô æ æ æ ô æ æ æ ô æ æ ì æææ ô æ æ ô æ ææ ì æ ô æ ì ô æ ô ì æ æ ô æ æ æ ì ô æ ô ô æ ì ô ì ì ôô ô æô ì ì ôô æ ô æ æ æ æ æ ì æææ æ ôôô ì ì æ ô æ ì ô ì ò ææ ì ò ì ô ì ò ì ò ô æ ìì ì ì ô ìì ì ì òòòò ì ì ææ ôô ì æ ì òòòòò à æ ò ô à ì æ ò æ ò à ô æ ò ô ì à æ ò ôô à ò à æ ì ò à ôôô ì ô òò æ ì à ò ôô ì æ ì ò à æ ò à æ ì ôô æ ò à ì ô ò ô à æ ò ì à ôô à æ ò ì ô à æ ì ì ì ô æ ì ò à ìì æ à ôôô ô ò à ì æ ô æ ì à ô à æ ô ò ì ô à æ ò ô ô ì ì à æ ì ô ò æ ô à ô ì æ ì ò à æ ôô ìì à æ ô ì à æ ôô ì à æ ô òòòò ò ô æ ì ô à ì ò æ ô ì æ ì òòò ì ò ì ôôô æ ì ò ò æ àà ò ò ôô æ ô ì òòò ò æ à ì ô ì òò æ ô ò ì ò ô æ à ì òò ò ô ò ì æ ò ô ì æ ò ì à ô æ ò ì ô ì æ ò ô ì òò ò ò ì ô à ò æ ò ô ò æ ô ò ì ò ò ô æ à ì ò ò ì ì ì ôôì òò æ ì ô ì òòòò æ ì ô ì òò æ à ô òò æ ò ô æ à òòò æ ò ìì ò à æ ò ì òò æ ì ò ì æ à ì ô ì òòò æ ô ì æ ô à ì ô òòòò ì æ à ô ì òò æ à ô à òò ì æ ô à ì à æ ô ì à ì à òòò ò ô æ à ì ô æ à ì ô æ à ì òòòò æ ô à ì à à òò ì ô æ à ò ô ààà æ à òò ì à à ò ô à æ à ì ò àà ô àà æ ì òò æ ì à ô òò ì æ ô ì ò æ ò ô ààà ì ò ì æ ô ò ì ò ô æ ò à ô ì ò æ ì ò ô à ì æ ò ì ô ì à ò æ à à ò ô à à æ ò à ò à ô æ ì ò à ô ò æ à ô ì ò àà æ àà ò æ ì àà ô ò æ ô ò à ì æ à ò ô ì æ ò ì à æ ì ô ò à æ ì ò ô àà ì æ ò à ô ì à ò æ ì ò ô àà ì æ à ò ì ô æ ì ò ô à ì æ ò ì ì ô æ ò ì ô àà ò ì æ à æ ô ò ì à æ à ò ì ô æ ì ò à æ ô ò æ ì à ô ì ò æ à ì ô à æ ì ò ì ô à æ ì ò ô æ ò à ì æ ô ì ò à æ ì ô à ò æ ì à ô ò ì æ à ô ò ì æ ò ì à ò ô æ à ì æ ò ô ì à ò æ ô ò à ô ì æ ò à ì ô ò æ ì ò à ô æ ì ò æ ò ô à ì ò æ ô ò ì æ ô ò ì æ ò ô à ì ò æ ô ò à æ ô ò æ à æ ò ô æ à ì ô æ ò ì æ à ô ò æ à ô ò æ æ ò ô à ì æ ò ô à ì æ ôò ò ò à æ ô ô ò à æ ò ô æ à ò ô æ à ò æ ô ô ò à ô æ ì æ à ì ôò æ ô ò ì à æ ò ò ô æ ô ì à ò ô æ ô ò ì ô ì æ ì à ò ôò æ ò ì ô à æ ò ò à æ æ ò à ô æ ì ò æ ò ô à ò æ ò ô ì ò à æ ò æ ô ì à ô æ ì ò ô æ à ì æ æ ò à 
ôô æ ô æ ô à ì æ ì ò ô æ à ò æ ô ì ô ì ò à æ ì ì ò æ ô à ì ô æ à æ ò ô ò ô à æ ô ô à ò ò ô ì ô æ ì à æ ô ò ì à ô æ ì ô ò à ì ô ò æ à ô ò òæ ì æ à ô òæ ô à ô ô à ì æ ô ì æ ì ô ì ô àà ì æ ô à à ô æ ò à ôì æ à æ ô à ô ò æ à ô à ì ì ô à æ æ ì ô à æ ô ò à æ à ô ì æ à æ ì ô à æ ì ô à òæ æ ô à æ ô ì à ì à ô ì ô à æ ì à ô æ à ô à æ ô ì à ô ì æ ô ì æ ô ì ààà à ô à ì ôì ì à ô æ à ôì ì æ ì ì ô ôì òì ôì ôì àààààà ææ à ô ì ì ì ôì ì ôì ààààà ì à à ôì ì æ ààà à àà ôì à à æ à à ì à à à àà à ôì à à à æ à à à à à à à ô ì à à à à à à à à à à à ì à à à ò àààààà à à à ààààà à àà àààààà à à ààààààà ò

à ì ò ô

Dodonew 178 CSDN 7k7k

101

102

103

104

105

106

Search space size

107

50% 40% 30% 20% 10% 0% ìæà 100

60% æ æ æ æ æ æ æ æ æ æ æ æ æ æ æ æ æ æ æ æ æ æ æ æ æ æ æ æ æ æ æ æ æ æ æ æ æ æ æ æ æ ì ì ì ì ì æ ì ì ì æ ì ì ì æ ì ì ì æ ì ì æ ì æ ì æ ì ì æ ì ì æ ì æ ì ì æ æ æ ì ì ì æ æ æ ì ì æ à à ì à æ à à æ ì æ à æ ì æ à æ à æ ì à æ æ à ì à ææ à ææ ì à æ à ì à æ à ì æ æ àà ì æ à à ì æ æ à ì à æ à æ à ì à æ à ì à æ ì à æ ì à æ æ à ì æ ì à à æ ì æ ì à æ ì ì æ æ ìì æ àà ì ì æ à ì ì æ à ì ì æ ì ì ì æ à ìì æ ìì æ ì à æ ì æ ì à æ ì à ì æ ì æ à ì æ æ à ì æ ì ì æ à ì ì ì æ ì à æ ì à ì æ à ì æ à ì æ à ì à æ ì æ à ì æ à ì à æ æ ì ì àà ì æ ì æ ì àà æ à ì à æ ì à à æ à ì à à æ à ì æ àààà ì æ ì àà ì æ à ì æ æ ì à à ì æ à ì æ à ì æ à ì à æ ì à æ ì æ à ì æ à ì à æ ì à à ì æ à æ ì à æ ì à ì æ à ì æ à ì æ ì à æ ì à æ ì à æ à ì à æ ì à æ ì à à æ ì à æ ì ì à æ à ì æ ì æ ì ì à ì æ à ì æ à à æ ì ì à æ à æ ì à æ ì à æ ì à ì æ à ì æ à ì æ ì à ì à æ à ì æ ì à ì æ à æ à ì à ì æ æ à ì æ à ì à æ ì à ì æ à ì æ à ì æ à ì æ à ì à æ à ì æ æ à ì æ à æ ì æ ì àà æ ì à æ ì à ì æ à æ à ì à æ ì à æ ì à æ à ì æ à ì æ à ì æ à æ à ì à ì æ à æ ì à ì æ à æ ì à æ à ì à æ ì à æ ì à æ ì à æ ì à æ à ì æ à ì æ à ì æ à ì à æ ì à æ ì à æ ì à ì à æ ì à æ ì à à æ ì à ì æ à æ æ àà ì æ à ì æ à ì æ à ì æ à ì à æ ì à æ ì à æ ì à æ à ì æ ì æ ì àà æ ì à à ì æ à æ ì à ì æ à ì æ à ì æ à ì æ ì à ì æ à æ ì æ à ì æ à æ à ì æ à ì æ à ì æ à æ ì à æ æ à ì æ ì ì æ ì àà æ ì æ à ì æ à ì æ à ì ì æ à à æ ì à æ ì à ì æ ì à æ ì à æ ì à æ ì ì æ à æ à æ ìì à ì à æ ì ì æ à æ ì à æ à ì ì æ à ì à æ à æ ì à æ ì à æ ì æ ì ì æ àà ì æ à ì ì æ æ ì à æ ì à æ ì æ ì æ ì àà æ ì à æ ì à à ì à ì à ì æ ì à ì à ææ ì à ì æ ì æ ì æ ìæ ææ ì àààà ææ æì à æ ì æ à æ ì à æ ì æ à ì æ à æ à ì æ à æ ìì ì à à æ à à à ìì ì à à à à ììì ææ à à à æ ì àààààààà ààà à à à àààààà

æ Rockyou_rest

à Yahoo ì Phpbb

101

102

103

104

105

106

Search space size

107


æ Tianya 50%



60%

107

Search space size

æ Tianya 50%

[Figure 4, panels (d)-(f): line plots. Vertical axis: percentage (ticks 0% to 40% survive); horizontal axis: search space size, 10^0 to 10^7. Legend entries recovered: Dodonew, 178, CSDN, 7k7k.]
(d) PCFG-based attacks on six Chinese datasets (using all the Duowan passwords as the training set)
(e) PCFG-based attacks on three English datasets (using 4M Rockyou passwords as the training set)
(f) Improved PCFG-based attacks on Chinese datasets (using all the Duowan passwords as the training set)

Fig. 4. General and improved PCFG-based attacks on different groups of datasets

derivation. For instance, the probability of “liwei@123” is computed as $P(\text{liwei@123}) = P(L_5 S_1 D_3) \cdot P(L_5 \to \text{liwei}) \cdot P(S_1 \to \text{@}) \cdot P(D_3 \to 123)$. In Weir et al.’s proposal [13], the probabilities of digit and symbol segments are learned from the training set by counting, whereas letter segments are handled either by learning from the training set as well or by using an external input dictionary. According to Ma et al.’s recent work [9] and our own cracking experience, a PCFG-based attack that instantiates the letter segments of guesses with a letter-segment dictionary learned directly from the training set generally performs better than one that uses an external input dictionary. This is largely because there is so far no effective method for measuring the quality of an external input dictionary, and an inappropriate input dictionary would greatly reduce the accuracy of guess generation. Moreover, choosing the input dictionary heuristically may defeat the purpose of using PCFG in the first place, namely to avoid heuristics when generating guesses. We therefore instantiate the PCFG letter segments of password guesses by using a dictionary learned directly from the training set.

We divide the nine datasets into two groups according to language and user location. For the Chinese group of test sets, we randomly select 1 million passwords from the Duowan dataset [33] as the PCFG training set (denoted by “Duowan 1M”); for the English group of test sets, we similarly select 1M passwords from Rockyou [31] as the PCFG training set. The rationale for these choices is that passwords in Duowan and Rockyou exhibit more variety in composition than those of the other datasets in the same group, as can be seen from Tables 3 to 5.
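The probability computation above can be illustrated with a minimal Python sketch. It assumes that a password is split into maximal runs of letters (L), digits (D) and other characters (S), and that all probabilities are plain relative frequencies counted from the training set; the class and function names (`PCFG`, `split_segments`) and the toy training list are hypothetical and are not taken from the authors' implementation.

```python
import re
from collections import Counter

# Assumption: L = ASCII letters, D = digits, S = everything else.
_SEGMENT_RE = re.compile(r"(?P<L>[A-Za-z]+)|(?P<D>[0-9]+)|(?P<S>[^A-Za-z0-9]+)")


def split_segments(password):
    """Return a list of (tag, text) pairs, e.g. [("L5", "liwei"), ("S1", "@"), ("D3", "123")]."""
    return [(f"{m.lastgroup}{len(m.group())}", m.group())
            for m in _SEGMENT_RE.finditer(password)]


class PCFG:
    """Toy grammar: base-structure counts plus per-tag segment counts."""

    def __init__(self):
        self.structures = Counter()   # e.g. ("L5", "S1", "D3") -> count
        self.segments = {}            # e.g. "D3" -> Counter({"123": 3, ...})

    def train(self, passwords):
        for pw in passwords:
            segs = split_segments(pw)
            self.structures[tuple(tag for tag, _ in segs)] += 1
            for tag, text in segs:
                self.segments.setdefault(tag, Counter())[text] += 1

    def prob(self, password):
        """P(structure) times the product of P(tag -> text) over all segments."""
        segs = split_segments(password)
        total = sum(self.structures.values())
        if total == 0:
            return 0.0
        p = self.structures[tuple(tag for tag, _ in segs)] / total
        for tag, text in segs:
            table = self.segments.get(tag)
            if not table:
                return 0.0
            p *= table[text] / sum(table.values())
        return p


grammar = PCFG()
grammar.train(["liwei@123", "liwei@123", "wang@456", "abc123"])
print(grammar.prob("liwei@123"))   # 0.5 * 1.0 * 1.0 * 0.75 = 0.375
print(grammar.prob("liwei@456"))   # unseen password, still gets 0.125
```

The second call shows why the model generalizes: a password that never occurs in the training list still receives a non-zero probability as long as its structure and each of its segments were observed.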
Since we have only used part of Duowan and Rockyou, their remaining passwords as well as the other seven datasets are used as the test sets. That is, no external input dictionary is needed in our general PCFG-based attacks.

The attacking results for the Chinese group and the English group are depicted in Fig. 4(a) and Fig. 4(b), respectively. These two figures show that, when the guess number (i.e., search space size) allowed is below about 3,000, Chinese web passwords are generally much weaker than English web passwords from the same application domain (i.e., Tianya vs. Rockyou, Dodonew vs. Yahoo, and CSDN vs. phpbb). For example, at 100 guesses, the success rates against Tianya, Dodonew and CSDN are 10.2%, 4.3% and 9.7%, respectively, while those against their English counterparts are 4.6%, 1.9% and 3.7%, respectively. However, when the search space size is above 10,000, Chinese web passwords are generally much stronger than their English counterparts. For example, at 10 million guesses, the success rates against Tianya, Dodonew and CSDN are 37.5%, 28.8% and 29.9%, respectively, while those against their English counterparts are 49.7%, 39.0% and 41.4%, respectively. The strength gap between these two groups of datasets widens further as the guess number increases. This reveals that a reversal has occurred: Chinese passwords are more vulnerable to online guessing attacks (i.e., when the guess number allowed is small), especially trawling attacks, while English passwords are more prone to offline guessing attacks, in which the attacker is not restricted in the number of guesses. This “reversal principle” reconciles the two apparently conflicting claims (see Section 1.1) about the security strength of Chinese web passwords.

We observe that the original PCFG-based algorithm [9], [13] inherently gives extremely low probabilities to password guesses (e.g., “1q2w3e4r” and “1a2b3c4d”) that have a monotonically long structure (e.g., $D_1L_1D_1L_1D_1L_1D_1L_1$, or $(D_1L_1)^4$ for short).
For example, $P(\text{1q2w3e4r}) = P((D_1L_1)^4) \cdot P(D_1 \to 1) \cdot P(L_1 \to q) \cdot P(D_1 \to 2) \cdot P(L_1 \to w) \cdot P(D_1 \to 3) \cdot P(L_1 \to e) \cdot P(D_1 \to 4) \cdot P(L_1 \to r)$ can hardly be larger than $10^{-9}$, for it is a product of nine probabilities. As a result, some guesses (e.g., “1q2w3e4r”) will never appear in the top $10^7$ guesses generated by the original PCFG-based algorithm, even though they are popular (e.g., “1q2w3e4r” appears in the top-200 list of every dataset). The essential reason is that the PCFG-based algorithm simply assumes that the segments in a structure are independent of one another. Unfortunately, in many situations this is not the case.

Algorithm 1: Improved PCFG-based guesses generation

Input: A training set S; a name list nameList; a date list dateList; a parameter k indicating the desired size of the password guess list that will be generated (e.g., k = 10^7)
Output: A password guess list L with the k highest ranked items

Training:
for password ∈ S do
    for segment ∈ splitToSegments(password) do
        segmentSet.insert(segment)
    baseStructure ← getBaseStructure(password)
    if monotonicallyLong(baseStructure) then
        transformStructureSet.insert(baseStructure)
        baseStructure ← convertToShort(baseStructure)
    baseStructureSet.insert(baseStructure)
    trainingSet.insert(password)

Append name and date lists to the learned segment list:
for name ∈ nameList do
    correctedCount ← totalOverlapNameInSegmentSet ∗ nameList.getCount(name) / totalOverlapNameInNameList
    if name ∉ segmentSet and correctedCount ≥ 1 then
        segmentSet.insert(name, correctedCount)
for date ∈ dateList do
    if date ∉ segmentSet then
        segmentSet.insert(date)

Produce k guesses:
function guess.calculateProbability()
    guess.probability ← baseStructureSet.getProbability(guess.baseStructure)
    if monotonicallyLong(guess.baseStructure) then
        baseStructure ← guess.baseStructure
        guess.probability ← guess.probability ∗ transformStructureSet.getProbability(baseStructure)
        guess.baseStructure ← convertToShort(baseStructure)
    for segment ∈ splitToSegments(guess.password) do
        guess.probability ← guess.probability ∗ segmentSet.getProbability(segment)

Initialize heap:
for baseStructure ∈ baseStructureSet do
    for segmentType ∈ baseStructure.segmentTypeSet do
        guess.password += segmentSet.getFirstSegment(segmentType)
    guess.calculateProbability()
    guess.segmentChangedPosition ← 1
    heap.insert(guess)

while guessCount ≤ k do
    guess ← heap.pop()
    L.insert(guess)
    guessCount++
    for i ← guess.segmentChangedPosition to guess.baseStructure.length do
        guessNew.password ← guess.changeToNextHighestSegment(i)
        if guessNew.password = null then
            continue
        guessNew.segmentChangedPosition ← i
        guessNew.calculateProbability()
        heap.insert(guessNew)

TABLE 6
Changes caused to the probabilistic context-free grammars

Training set   Base structures   L segments       D segments       S segments
Duowan 1M      8905+0            155693+24416     465157+20341     865+0
Duowan All     20961+0           559017+98654     1824404+9744     2417+0

For instance, the four $D_1$ segments and $L_1$ segments in the

structure $(D_1L_1)^4$ of the password “1q2w3e4r” are evidently interrelated. To address this problem, we specifically handle a few structures that are long but simple alternations of short segments by treating them as short structures, e.g., $(D_1L_1)^4$ and $(D_1L_2)^3$ are converted to $D_4L_4$ and $D_3L_6$, respectively. In this way, the probability of “1q2w3e4r” is now computed as $P(\text{1q2w3e4r}) = P((D_1L_1)^4) \cdot P((D_1L_1)^4 \to D_4L_4) \cdot P(D_4 \to 1234) \cdot P(L_4 \to \text{qwer})$. Note that our approach is language-independent and constitutes a general improvement over the state-of-the-art PCFG-based algorithm [9].

To further exploit the characteristics of Chinese web passwords, we insert the “Pinyin name any” dictionary and the six-digit date dictionary (see Section 3.3) into the L-segment dictionary and the D-segment dictionary, respectively, both of which are learned from the 1 million Duowan passwords using PCFG. Note that each segment in the L- and D-segment dictionaries is associated with a frequency. As there may already be some Pinyin names in the original L-segment dictionary, and the total frequency of these names (denoted by $n_1$) largely reflects the tendency of users to insert Pinyin names into their passwords, we insert a name (its frequency denoted by $f_r$) from the “Pinyin name any” dictionary into the L-segment dictionary only if: (1) it is not in the original L-segment dictionary and (2) $\frac{n_1 \cdot f_r}{n_2} \geq 1$, where $n_2$ is the total frequency, within the name dictionary, of the names that fall into the intersection of the “Pinyin name any” dictionary and the L-segment dictionary (cf. totalOverlapNameInNameList in Algorithm 1). In this way, we insert only a few of the most frequent names from the ocean of 2.4M unique names in our name dictionary. On the other hand, as there are only 27.2K items in our six-digit date dictionary, we take all of them into account. More specifically, any six-digit date that is not in the original D-segment dictionary is assigned a frequency of 1 and then inserted into the D-segment dictionary.
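The two adjustments described above, collapsing a long alternating structure into a short one and inserting external names with a corrected frequency, can be sketched in Python as follows. This is an illustration under stated assumptions rather than the authors' code: the function names are hypothetical, the threshold `min_repeats` for what counts as "long" is a guess (the text gives no cut-off), and $n_2$ is counted inside the name dictionary, following the variable names of Algorithm 1.

```python
def collapse_alternation(segments, min_repeats=3):
    """Merge a structure like (D1 L1)^4 into D4 L4.

    segments: list of (kind, text) pairs, e.g. [("D", "1"), ("L", "q"), ...].
    min_repeats is a hypothetical threshold; the paper does not state one.
    Returns the input unchanged if it is not a pure two-segment alternation.
    """
    if len(segments) < 2 * min_repeats or len(segments) % 2 != 0:
        return segments
    unit = [(kind, len(text)) for kind, text in segments[:2]]
    if unit[0][0] == unit[1][0]:
        return segments
    for i, (kind, text) in enumerate(segments):
        if (kind, len(text)) != unit[i % 2]:
            return segments
    first = "".join(text for _, text in segments[0::2])
    second = "".join(text for _, text in segments[1::2])
    return [(unit[0][0], first), (unit[1][0], second)]


def names_to_insert(l_segments, name_counts):
    """Apply the rule n1 * f_r / n2 >= 1 to names missing from the L-segment dictionary.

    l_segments:  {segment: frequency} learned from the training set.
    name_counts: {name: frequency} from the external name dictionary.
    Returns {name: corrected_count} for the names that qualify.
    """
    overlap = set(l_segments) & set(name_counts)
    n1 = sum(l_segments[name] for name in overlap)    # overlap mass in the L-segment dictionary
    n2 = sum(name_counts[name] for name in overlap)   # overlap mass in the name dictionary
    if n2 == 0:
        return {}
    inserted = {}
    for name, f_r in name_counts.items():
        corrected = n1 * f_r / n2
        if name not in l_segments and corrected >= 1:
            inserted[name] = corrected
    return inserted


pairs = [("D", "1"), ("L", "q"), ("D", "2"), ("L", "w"),
         ("D", "3"), ("L", "e"), ("D", "4"), ("L", "r")]
print(collapse_alternation(pairs))   # [('D', '1234'), ('L', 'qwer')]

learned = {"wang": 50, "li": 30, "qwer": 100}
names = {"wang": 1000, "li": 600, "zhang": 800, "ouyang": 10}
print(names_to_insert(learned, names))   # {'zhang': 40.0}
```

In the toy data the scaling factor is $n_1/n_2 = 80/1600$, so a name needs an external frequency of at least 20 to be inserted; this is how the rule keeps only the most frequent names.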
The resulting changes to the PCFGs learned from the training sets are summarized in Table 6. Our improved algorithm for generating password guesses is given as Algorithm 1. In our implementation, we use max-heaps to store the interim results (i.e., the various segments learned), which ensures that the element with the highest probability in the heap is always at the root node. Since the guess number k (i.e., the number of password guesses to be produced) is generally large (e.g., $k = 10^7$), to maintain efficiency we employ a tree structure to store each guess that is successively popped from the heap. To our surprise, we find that a balanced binary tree is more effective than a prefix tree in reducing both space and time overheads. After the k password guesses have been generated from the training set, they are used to crack the test sets, and the remaining process is routine.

In our improved PCFG-based attack (see Fig. 4(c)), when the guess number is small (e.g., less than $10^3$), there is little improvement in success rate; as the guess number grows, the improvement increases substantially. For example, at $10^5$ guesses, the improvement in success rate is 0.09%∼0.85%; at 1 million guesses, it is 1.32%∼4.32%; at 10 million guesses, it reaches 1.70%∼4.29%. This indicates that the excessive use of Pinyin names and birthdays helps an attacker reduce the search space, and that this vulnerability is especially serious when a large number of guesses is allowed.

[Figure 6: diagram with labels Duowan, Tianya and Pinyin name, and values 60.59%, 13.75% and 2.88%.]
Fig. 6. Coverage of L-segments in the test set Tianya (using Duowan as the training set and Pinyin name as an extra input dictionary when performing PCFG-based guess generation)

In 2014, Li et al. [18] reported that, using 2 million Dodonew passwords as the training set and at 10 billion guesses, their best cracking record is about 17.30% (see Fig. 5 of [18]). In contrast, our improved attack, which uses only 1 million passwords as the training set and merely 10 million guesses, achieves success rates from 29.41% to 39.47%. This means that our improved attack can crack 70% to 128% more passwords than Li et al.’s best record. Such alarmingly high success rates highlight the urgency of developing effective countermeasures (e.g., more practical password creation policies and more accurate password meters).

Since the effectiveness of PCFG-based attacks depends on the size of the training set, we next increase the size of each training set used in the above three experiments to see whether the observations made above still hold. As shown in Fig. 4(d), at 10M guesses, the success rate increases by 3.89 to 10.78 percentage points (6.34 on average) when we enlarge the training set (i.e., Duowan) from 1M to 4.98M. Similarly, Fig. 4(e) shows that the success rate increases by 4.67 to 6.44 percentage points (5.84 on average) when we quadruple the training set (i.e., Rockyou). As for our improved PCFG-based attack, Fig. 4(f) shows that at 10M guesses the success rate ranges from 33.20% to 49.86% when the training set size reaches 4.98M, an increase of 3.79 to 10.39 percentage points (5.90 on average). This suggests that our improved attack, with far fewer guesses, is able to crack 92% to 188% more passwords than Li et al.’s best record (i.e., 17.3%, see Fig. 5 of [18]). Remarkably, the “reversal principle” still holds in this series of experiments.
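As a quick check of the relative improvements quoted above, each success rate can be divided by Li et al.’s 17.30% record, with one subtracted from the ratio. The following worked arithmetic uses only figures stated in the text:

```latex
\begin{align*}
\text{1M training set:}    \quad & \frac{29.41}{17.30} - 1 \approx 0.70, & \frac{39.47}{17.30} - 1 &\approx 1.28,\\
\text{4.98M training set:} \quad & \frac{33.20}{17.30} - 1 \approx 0.92, & \frac{49.86}{17.30} - 1 &\approx 1.88.
\end{align*}
```

These ratios correspond to the stated ranges of 70% to 128% and 92% to 188% more cracked passwords, respectively.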
In our improved PCFG-based attacks, external name segments are added into the PCFG L-segment dictionary during training, and we obtain welcome increases in success rates over the general PCFG-based attacks (see Figs. 4(c) and 4(f)). However, these improvements are still modest compared with the prevalence of names in Chinese passwords. To explain this paradox, we scrutinize the internal process of PCFG-based guess generation and identify its crux. Here we take the improved PCFG-based attack against Tianya (using Duowan as the training set) as an example. During training, we added 98K name segments (see Table 6) into the L-segment dictionary. However, as shown in Fig. 6, these 98K name segments cover only 2.88% of the total L segments of the Tianya test set, while the original L segments trained from Duowan cover 3.77 (= 13.75/2.88 − 1) times more of the name segments, as well as 60.59% of the non-name L segments in the Tianya test set. This suggests that the training set Duowan already covers the name segments in the test set Tianya well, and thus adding extra names yields limited gains. This observation also holds for the other eight test sets, and the detailed results are presented in the supplemental data. Note that this observation does not contradict our finding that Pinyin names are prevalent in Chinese passwords; rather, it suggests that when the training set is selected properly, the name segments in passwords can be guessed well. Still, when no proper training set is available, our improved attack would show its advantages. Moreover, although our improved PCFG-based algorithm might not be optimal, its cracking results represent a new benchmark that any future algorithm should aim to decisively clear.

4.2 Markov-Chain-based attacks

In our Markov-Chain-based password cracking experiments, as recommended in [9], we consider two smoothing techniques (i.e., Laplace Smoothing and Good-Turing Smoothing) to deal with the data sparsity problem, and two normalization techniques (i.e., distribution-based and end-symbol-based) to deal with the unbalanced length distribution of passwords. This gives rise to the four attacking scenarios listed in Table 7. In each scenario we consider three Markov orders (i.e., order-5, 4 and 3) to investigate which order performs best. We note that another scenario (i.e., backoff with end-symbol normalization) performs “slightly better” than the aforementioned four scenarios, yet it is “approximately 11 times slower, both for guess generation and for probability estimation” [9]. Attackers who care about cost-effectiveness are therefore unlikely to exploit this scenario. Consequently, we mainly consider the four attacking scenarios in Table 7.
Due to space constraints, the detailed password guess generation procedure for Scenario 1 is given as Algorithm 1 in the supplemental material; the generation procedures for the other scenarios are quite similar.

TABLE 7
Four Markov-based attacking scenarios

Attacking scenario   Smoothing     Normalization   Markov order
#1                   Laplace       End-symbol      3/4/5
#2                   Laplace       Distribution    3/4/5
#3                   Good-Turing   End-symbol      3/4/5
#4                   Good-Turing   Distribution    3/4/5
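To make Scenario #1 of Table 7 (Laplace smoothing with end-symbol normalization) concrete, the following Python sketch scores a password under an order-n Markov model. It is a simplified illustration, not the procedure from the supplemental material: the class name `LaplaceMarkov` and the `END` marker are hypothetical, the first characters of a password are conditioned on a shorter context instead of padding symbols (an assumption), and the default `delta = 0.01` is the smoothing constant this section reports using.

```python
from collections import defaultdict

END = "\x00"  # hypothetical marker standing in for the end-symbol
# 95 printable ASCII characters plus the end-symbol, 96 symbols in total.
ALPHABET = [chr(code) for code in range(32, 127)] + [END]


class LaplaceMarkov:
    """Order-n character model with add-delta (Laplace) smoothing."""

    def __init__(self, order=3, delta=0.01):
        self.order = order
        self.delta = delta
        # context string -> {next character -> count}
        self.counts = defaultdict(lambda: defaultdict(int))

    def _context(self, seq, i):
        # Assumption: near the start, use whatever shorter prefix is available.
        return seq[max(0, i - self.order):i]

    def train(self, passwords):
        for pw in passwords:
            seq = pw + END  # end-symbol normalization: the end is a predicted event
            for i, ch in enumerate(seq):
                self.counts[self._context(seq, i)][ch] += 1

    def cond_prob(self, context, ch):
        following = self.counts.get(context, {})
        total = sum(following.values())
        return (following.get(ch, 0) + self.delta) / (total + self.delta * len(ALPHABET))

    def prob(self, password):
        seq = password + END
        p = 1.0
        for i, ch in enumerate(seq):
            p *= self.cond_prob(self._context(seq, i), ch)
        return p


model = LaplaceMarkov(order=3)
model.train(["123456", "123456", "12345", "password"])
print(model.prob("123456") > model.prob("654321"))  # True
```

Because the end-symbol is predicted like any other character, the probabilities of all finite strings sum to at most one without a separate length distribution, which is the point of end-symbol normalization.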

As with the PCFG-based attacks, our implementation uses a max-heap to store the interim results to maintain efficiency. To produce $k = 10^7$ guesses, we first set a lower bound (i.e., $10^{-9}$) on the probability of the guesses generated, then sort all the guesses and finally select the top k ones. In this way, we are able to reduce the time overheads by 170% at the cost of about 110% increase in storage overheads, as compared to the strategy of producing exactly k guesses. Laplace Smoothing requires adding δ to the count of each substring, and we set δ = 0.01 as suggested by Ma et al. [9]. The cracking results for Scenario 1 are shown in Fig. 5. Due to space constraints, the experiments for Scenarios 2, 3 and 4 are presented in the supplemental material.

There is a subtlety to be noted when implementing the Good-Turing (GT) smoothing technique. We denote by f the frequency of a password, and by $N_f$ the frequency of frequency f. According to the basic GT smoothing formula,

[Figure 5, panels (a)-(f): line plots. Vertical axis: percentage (ticks 0% to 60%); horizontal axis: search space size, 10^0 to 10^7. Legend entries recovered: Tianya for the Chinese panels; Rockyou_rest, Yahoo and Phpbb for the English panels.]
(a) Order-5 Markov-based attack on Chinese datasets
(b) Order-4 Markov-based attack on Chinese datasets
(c) Order-3 Markov-based attack on Chinese datasets
(d) Order-5 Markov-based attack on English datasets
(e) Order-4 Markov-based attack on English datasets
(f) Order-3 Markov-based attack on English datasets

Fig. 5. Markov-chain-based attacks on different groups of datasets (using Laplace Smoothing and End-Symbol Normalization). Attacks (a)∼(c) use 1 million Duowan passwords as the training set, while attacks (d)∼(f) use 1 million Rockyou passwords as the training set.

the probability of a string “$c_1 c_2 \cdots c_l$” in a Markov model of order $n$ is denoted by

$$P(\text{``}c_1 c_2 \cdots c_{l-1} c_l\text{''}) = \prod_{i=1}^{l} P(\text{``}c_i \mid c_{i-n} c_{i-(n-1)} \cdots c_{i-1}\text{''}), \tag{1}$$

where the individual probabilities in the product are computed empirically by using the training sets. More specifically, each empirical probability is given by

$$P(\text{``}c_i \mid c_{i-n} \cdots c_{i-1}\text{''}) = \frac{S(\mathrm{count}(c_{i-n} \cdots c_{i-1} c_i))}{\sum_{c \in \Sigma} S(\mathrm{count}(c_{i-n} \cdots c_{i-1} c))}, \tag{2}$$

where the alphabet $\Sigma$ includes 95 printable characters on the keyboard plus one special end-symbol (i.e., $c_E$) that denotes the end of a password, and $S(\cdot)$ is defined as

$$S(f) = (f + 1)\,\frac{N_{f+1}}{N_f}, \tag{3}$$

where $N_f$ denotes the number of distinct strings that occur exactly $f$ times.
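As a minimal sketch of Eq. (3), the following Python snippet computes the adjusted count S(f) from a table of observed string frequencies. The function name and the toy counts are illustrative assumptions, not the authors' code or data. It also shows that S(f) evaluates to 0 whenever no string occurs exactly f + 1 times, which always holds for the most frequent string.

```python
from collections import Counter

def good_turing_adjusted(counts):
    """Return {f: S(f)} with S(f) = (f + 1) * N_{f+1} / N_f, as in Eq. (3).

    `counts` maps each string to its observed frequency; N_f is the number
    of distinct strings observed exactly f times.
    """
    freq_of_freq = Counter(counts.values())  # N_f for every observed f
    return {
        f: (f + 1) * freq_of_freq.get(f + 1, 0) / n_f
        for f, n_f in freq_of_freq.items()
    }

# Toy counts, made up for illustration (not taken from any dataset).
toy_counts = {"12345": 9, "abcde": 2, "qwert": 2,
              "zxcvb": 1, "hello": 1, "world": 1}
adjusted = good_turing_adjusted(toy_counts)

# N_1 = 3 and N_2 = 2, so S(1) = 2 * 2 / 3.
# N_10 = 0, so the most frequent string gets S(9) = 0.
print(adjusted[1], adjusted[9])
```

In this toy table S(2) is also 0 because N_3 = 0, so the gap problem appears wherever the frequency-of-frequency table is sparse, not only at the maximum.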

It can be confirmed that this kind of smoothing works well when f is small, yet it fails for strings with a high frequency because the estimates for S(f) are not smooth. For instance, 12345 is the most popular 5-character string in Rockyou, occurring f = 490,044 times. Since no 5-character string occurs 490,045 times, N_490045 is zero, and the basic GT estimator therefore assigns a probability of 0 to P(“12345”). Various improvements have been suggested in linguistics to cope with this problem, among which is Gale and Sampson’s “simple Good-Turing smoothing” [40]. This improvement (denoted by SGT) is known for its simplicity and accuracy, and we adopt it in this work. SGT takes two steps of smoothing, and the details can be found in the supplemental material. In 2014, Ma et al. [9] introduced GT smoothing into Markov-based attacks to facilitate more accurate generation of password guesses, yet little attention has been paid to the unsoundness of GT for high-frequency events as illustrated above. To the best of our knowledge, we are the first to explicate the combined use of GT and SGT in Markov-based password cracking.

From the experiments we observe that, for both Chinese and English test sets: (1) at large guess numbers (i.e., no fewer than 2 × 10^6), the order-4 Markov chain evidently performs better than the other two orders, while at small guess numbers (i.e., fewer than 10^6), the larger the order, the better the performance; (2) there is little difference in performance between Laplace and Good-Turing smoothing at small guess numbers, while the advantage of Laplace smoothing grows as the guess number increases; (3) end-symbol-based normalization always performs better than the distribution-based approach, and its advantage is more obvious at small guess numbers. This suggests that at large guess numbers, the attacks using order-4, Laplace smoothing and end-symbol-based normalization (see Fig. 5(b) and Fig. 5(e)) perform the best among all the Markov-chain-based attacks; at small guess numbers (e.g., fewer than 10^6), the attacks using order-5, Laplace smoothing and end-symbol-based normalization (see Fig. 5(a) and Fig. 5(d)) perform best.

It is worth noting that the “reversal principle” also applies in all the Markov-based experiments. For example, in the order-4 Markov-based experiments (see Figs. 5(b) and 5(e)), when the guess number is below about 7000, Chinese passwords are generally much weaker than their English counterparts: at 1000 guesses, the success rates against Tianya, Dodonew and CSDN are 11.8%, 6.3% and 11.6%, respectively, while those against their English counterparts (i.e., Rockyou, Yahoo and Phpbb) are merely 8.1%, 4.3% and 7.1%, respectively. However, when the guess number allowed is over 10^4, Chinese passwords are generally stronger than their English counterparts: at 1 million guesses, the success rates against Tianya, Dodonew and CSDN are 38.7%, 21.2% and 25.9%, respectively, while those against their English counterparts are 39.2%, 26.1% and 33.3%, respectively.
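The configuration reported above as performing best (an order-n Markov model with Laplace smoothing and end-symbol normalization) can be sketched in Python as follows. This is an illustration under stated assumptions rather than the authors' implementation: the smoothing constant `delta = 0.01`, the use of a newline as the end-symbol, and the use of shortened contexts for the first n characters are all choices made here for the example.

```python
import math
from collections import Counter, defaultdict

END = "\n"  # stand-in for the end-symbol c_E (assumed absent from passwords)
# 95 printable ASCII characters (space through '~') plus the end-symbol.
ALPHABET = [chr(c) for c in range(32, 127)] + [END]

def train(passwords, order):
    """Count, for every context of up to `order` characters, what follows it."""
    counts = defaultdict(Counter)
    for pw in passwords:
        seq = pw + END
        for i, ch in enumerate(seq):
            # Assumption: the first characters use a shortened context.
            context = seq[max(0, i - order):i]
            counts[context][ch] += 1
    return counts

def cond_prob(counts, context, ch, delta=0.01):
    """Eq. (2) with additive (Laplace) smoothing, i.e. S(f) = f + delta."""
    followers = counts.get(context, {})
    total = sum(followers.values()) + delta * len(ALPHABET)
    return (followers.get(ch, 0) + delta) / total

def password_prob(counts, pw, order, delta=0.01):
    """Eq. (1), with the end-symbol appended so that length is modelled too."""
    seq = pw + END
    log_p = 0.0
    for i, ch in enumerate(seq):
        context = seq[max(0, i - order):i]
        log_p += math.log(cond_prob(counts, context, ch, delta))
    return math.exp(log_p)

# Toy training set, for illustration only.
model = train(["123456", "123456", "password"], order=4)
p_seen = password_prob(model, "123456", 4)
p_unseen = password_prob(model, "654321", 4)
```

A guess generator would enumerate candidate strings in decreasing order of `password_prob`; that enumeration step is omitted here.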

5 SOME CRITICAL IMPLICATIONS

We now discuss some important implications that the findings of the previous sections are likely to carry.

5.1 Implications for password cracking

Password cracking algorithms are not only necessary tools for security administrators to obtain a realistic picture of the security provided by user-generated passwords, but can also be used to facilitate information forensics (e.g., for law enforcement agencies to recover encrypted data). Our results have three main implications for password cracking. Firstly, our findings in Sec. 3 show that Chinese passwords have vastly different letter distributions, structures and semantic patterns as compared to English passwords, and thus it is crucial for cracking algorithms to be trained on relevant Chinese datasets when targeting Chinese passwords. Secondly, when using PCFG-based attacks, it is better to mainly employ the L-segments directly learned from the training set, and additionally include some external special dictionaries, to instantiate the L-segments of password guesses. This implication accords with the findings in [9]. Its validity is well established by the fact that, given the same guess numbers and against the same test sets, our PCFG-based attacks obtain much higher success rates (see Section 4.1) than the PCFG-based attacks suggested in [13], [18], where external dictionaries were mainly used to instantiate the L-segments of password guesses. Thirdly, compared to Markov-based attacks, PCFG-based ones are simpler to implement (in terms of both computation and memory cost), and they perform equally well (even better, see Figs. 4 and 5) when the guess number is small (e.g., a few thousand). For large guess numbers, order-4 Markov-based attacks are the best choice.
Note that we have only shown the Markov-based cracking results for guess numbers below 10^7; it is possible that order-3 Markov-based attacks will outperform order-4 and order-5 ones at larger guess numbers (e.g., 10^14).

5.2 Implications for password strength meters

In Section 1.1 we have shown that the password strength meters (PSMs) of four Internet-scale service providers are highly inconsistent at assessing the security of (weak) Chinese passwords. Failing to provide coherent feedback on user password choices would have great negative effects such as user confusion, frustration and distrust. Carnavalet and Mannan [15] suggested that PSMs “can simplify challenges by limiting their primary goal only to detecting weak passwords, instead of trying to distinguish a good, very good, or great password.” It follows that the essential step of a PSM would be to identify the characteristics of weak passwords. From our findings in Section 3.3 and Section 4.1, it is evident that for passwords of Chinese users, the incorporation of long Pinyin words or full/family names is adequate evidence for a “weak” decision. Other signs of weak passwords are the incorporation of birthdates and simple patterns like repetition and palindromes. The “reversal principle” revealed in Section 4 shows that Chinese passwords are more vulnerable to online guessing attacks, which is largely due to the fact that Chinese passwords are more concentrated (i.e., some passwords are overly popular, see Table 3). Thus, a special blacklist that includes a moderate number (e.g., 50K as suggested in [29]) of the most common Chinese passwords (e.g., learned from various leaked Chinese datasets) would be highly helpful for Chinese users to avoid trivial online guessing attacks. Any password falling into this list shall be deemed weak. However, it is well known that if some popular passwords are banned, new popular ones will arise. These new popular passwords may be complex and subtle to detect. Hence, whenever possible, PSMs shall further employ state-of-the-art cracking algorithms with consideration of local password characteristics (e.g., service and language) to assess the real strength that user-generated passwords can provide.

5.3 Implications for password creation policies

Password creation policies, or so-called password rule requirements, are generally used along with PSMs to nudge users towards better passwords. While password creation policies tell a user what constitutes a good (or an acceptable) password, PSMs give feedback to a user on how her submitted password performs (weak or not?). It is interesting to see that CSDN enforces a minimum length-8 policy (as shown in Fig. 2 and [35], 97.83% of the passwords in CSDN are of length 8+), while Dodonew enforces no apparent rule (i.e., neither a minimum length nor a character set requirement), as is evident from Table 3. However, Figs. 4 and 5 indicate that passwords from CSDN are significantly weaker than passwords from Dodonew given any guess number below 10^7. A plausible reason is that Dodonew provides e-commerce services, and most of its users perceive it as important. As a result, users rationally choose more complex passwords for it. In 2012, Bonneau [8] cast doubt on the hypothesis that users will rationally select more secure passwords to protect their more important accounts. However, in 2013 Egelman et al. [16] initiated a field study involving 51 students and confirmed this hypothesis.
In 2014, Stobert and Biddle [19] interviewed 27 participants to investigate user behaviour in managing passwords, and their results also corroborate this hypothesis. As far as we know, we are the first to provide large-scale empirical evidence (i.e., on the basis of 6.43 million CSDN passwords and 16.26 million Dodonew passwords) that supports this hypothesis. We also note that though the overall security of Dodonew passwords is higher than that of passwords from the five other Chinese sites, many popular passwords in Dodonew also appear in other less sensitive sites (see Table 3 for a concrete example). This might be because users inadvertently choose popular passwords and because many users reuse the same password across multiple sites (43%∼51% of users reuse passwords as reported in [39]). What is even more dangerous is that many users fail to recognize different categories of accounts, for doing so is not an easy task [41], [42]. Further considering the “over-constrained nature of authentication” on the Web [43] and the “finite-effort user” [23], we suggest that when designing password policies, instead of merely insisting on stringent rules, security administrators should put more effort into helping users gain more accurate perceptions of the importance of the accounts to be protected and into guiding users towards a better ability to recognize different categories of accounts. Both efforts are essential for common users to responsibly allocate passwords (i.e., to select one candidate from their limited pool of memorized passwords [41]).
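Returning to the weak-password signals listed in Section 5.2, a PSM front-end check for three of them (blacklist membership, repetition and palindromes) might look like the Python sketch below. The blacklist entries are hypothetical placeholders; a deployed list would be learned from leaked datasets as the text suggests. Detecting Pinyin names or birthdates would additionally need the dictionaries and date patterns described in the supplemental file, so those checks are left out here.

```python
import re

# Hypothetical placeholder entries; the text suggests a moderate-size list
# (e.g., 50K) of the most common passwords learned from leaked datasets.
BLACKLIST = frozenset({"123456", "111111", "password"})

# A password made of one unit repeated two or more times, e.g. "520520".
REPEATED_UNIT = re.compile(r"(.+?)\1+")

def weak_signals(pw):
    """Return the names of the weak-password signals that `pw` triggers."""
    signals = []
    if pw in BLACKLIST:
        signals.append("blacklisted")
    if REPEATED_UNIT.fullmatch(pw):
        signals.append("repetition")
    if len(pw) > 1 and pw == pw[::-1]:
        signals.append("palindrome")
    return signals

for candidate in ["123456", "520520", "abccba", "T7#qLm2x"]:
    print(candidate, weak_signals(candidate))
```

An empty result only means that none of these three cheap checks fired; as the text notes, a meter should still fall back on a cracking-based strength estimate where possible.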

6 CONCLUSION

In this paper, we have conducted an extensive empirical study of 130 million real-life passwords, including 100 million from China and 30 million from the US, the largest password corpus ever studied. To the best of our knowledge, we are the first to explore several fundamental properties (e.g., the frequency distribution and the distance between different letter distributions) that characterize user-chosen passwords. By using a comparison approach, we have identified a number of interesting characteristics of Chinese passwords, such as the excessively common usage of long Pinyin names, birthdays and mobile numbers. We further performed a series of experiments by using state-of-the-art password cracking algorithms as well as our improved PCFG-based algorithm to evaluate password strength. Our results show that the identified characteristics can be exploited by a guessing attacker to largely reduce her search space. Of particular interest is our observation that the “reversal principle” applies: when the guess number allowed is small (e.g., less than 10^4), Chinese passwords are much weaker than their English counterparts, yet this relationship is reversed when the guess number is large (e.g., larger than 10^5). This indicates that Chinese passwords are more susceptible to online, trawling attacks, while English ones are more vulnerable to offline guessing attacks, which for the first time reconciles two conflicting claims [8], [18]. Considering the comprehensiveness of our exploration (and the ampleness of our corpus), we believe this work constitutes an important step forward in understanding Chinese passwords and provides substantial “ground truth” for better design of future password policies, meters, etc. While we have shown that both groups of users choose passwords that remarkably follow Zipf’s law, the underlying mechanism that gives rise to this law is left as an open issue.
Another line of interesting future work would be to test the effectiveness of our methodologies and observations on large-scale password datasets from other non-English-speaking populations.

REFERENCES

[1] R. Morris and K. Thompson, “Password security: A case history,” Comm. of the ACM, vol. 22, no. 11, pp. 594–597, 1979.
[2] B. Zhu, J. Yan, G. Bao, M. Mao, and N. Xu, “Captcha as graphical passwords–a new security primitive based on hard AI problems,” IEEE Trans. Inform. Forensics Security, vol. 9, no. 6, pp. 891–904, 2014.
[3] X. Huang, Y. Xiang, E. Bertino, J. Zhou, and L. Xu, “Robust multi-factor authentication for fragile communications,” IEEE Trans. Depend. Secur. Comput., vol. 11, no. 6, pp. 568–581, 2014.
[4] J. Bonneau, C. Herley, P. Oorschot, and F. Stajano, “The quest to replace passwords: A framework for comparative evaluation of web authentication schemes,” in Proc. IEEE S&P 2012. IEEE, pp. 553–567.
[5] D. Wang, D. He, P. Wang, and C.-H. Chu, “Anonymous two-factor authentication in distributed systems: Certain goals are beyond attainment,” IEEE Trans. Depend. Secur. Comput., 2014, http://dx.doi.org/10.1109/TDSC.2014.2355850.
[6] C. Herley and P. Van Oorschot, “A research agenda acknowledging the persistence of passwords,” IEEE Security & Privacy, vol. 10, no. 1, pp. 28–36, 2012.
[7] J. Yan, A. F. Blackwell, R. J. Anderson, and A. Grant, “Password memorability and security: Empirical results,” IEEE Security & Privacy, vol. 2, no. 5, pp. 25–31, 2004.
[8] J. Bonneau, “The science of guessing: Analyzing an anonymized corpus of 70 million passwords,” in Proc. IEEE S&P 2012, pp. 1–15.
[9] J. Ma, W. Yang, M. Luo, and N. Li, “A study of probabilistic password models,” in Proc. IEEE S&P 2014. IEEE, 2014, pp. 689–704.
[10] R. Veras, C. Collins, and J. Thorpe, “On the semantic patterns of passwords and their security impact,” in Proc. NDSS 2014, pp. 1–16.
[11] M. L. Mazurek, S. Komanduri, T. Vidas, L. F. Cranor, P. G. Kelley, R. Shay, and B. Ur, “Measuring password guessability for an entire university,” in Proc. CCS 2013. ACM, Nov. 4–8 2013, pp. 173–186.
[12] W. Burr, D. Dodson, R. Perlner, S. Gupta, and E. Nabbus, “NIST SP800-63-2: Electronic authentication guideline,” National Institute of Standards and Technology, Reston, VA, Tech. Rep., Aug. 2013.
[13] M. Weir, S. Aggarwal, B. de Medeiros, and B. Glodek, “Password cracking using probabilistic context-free grammars,” in Proc. 30th IEEE Symp. on Security and Privacy. IEEE, 2009, pp. 391–405.
[14] F. Bergadano, B. Crispo, and G. Ruffo, “High dictionary compression for proactive password checking,” ACM Trans. on Information and System Security, vol. 1, no. 1, pp. 3–25, 1998.
[15] X. Carnavalet and M. Mannan, “From very weak to very strong: Analyzing password-strength meters,” in Proc. NDSS 2014, pp. 1–16.
[16] S. Egelman, A. Sotirakopoulos, K. Beznosov, and C. Herley, “Does my password go up to eleven?: the impact of password meters on password selection,” in Proc. CHI 2013. ACM, pp. 2379–2388.
[17] CNNIC Released the 35th Statistical Report on Internet Development in China, CNNIC, Feb. 2015, http://www.apira.org/news.php?id=1732.
[18] Z. Li, W. Han, and W. Xu, “A large-scale empirical analysis on chinese web passwords,” in Proc. USENIX Security 2014, Aug., pp. 559–574.
[19] E. Stobert and R. Biddle, “The password life cycle: user behaviour in managing passwords,” in Proc. SOUPS 2014, 2014, pp. 243–255.
[20] D. V. Klein, “Foiling the cracker: A survey of, and improvements to, password security,” in Proc. of USENIX Security, 1990, pp. 5–14.
[21] M. Dell’Amico, P. Michiardi, and Y. Roudier, “Password strength: an empirical analysis,” in Proc. INFOCOM 2010. IEEE, 2010, pp. 1–9.
[22] I. Erguler, “Achieving flatness: Selecting the honeywords from existing user passwords,” IEEE Trans. Depend. Secur. Comput., 2015, doi: 10.1109/TDSC.2015.2406707.
[23] D. Florêncio, C. Herley, and P. C. Van Oorschot, “Password portfolios and the finite-effort user: Sustainably managing large numbers of accounts,” in Proc. USENIX Security 2014, Aug. 2014, pp. 575–590.
[24] B. L. Riddle, M. S. Miron, and J. A. Semo, “Passwords in use in a university timesharing environment,” Computers & Security, vol. 8, no. 7, pp. 569–579, 1989.
[25] A. S. Brown, E. Bracken, S. Zoccoli, and K. Douglas, “Generating and remembering passwords,” Applied Cognitive Psychology, vol. 18, no. 6, pp. 641–651, 2004.
[26] R. Veras, J. Thorpe, and C. Collins, “Visualizing semantics in passwords: The role of dates,” in Proc. VizSec 2012. ACM, pp. 88–95.
[27] S. Designer, John the Ripper password cracker, Feb. 1996, http://www.openwall.com/john/.
[28] D. Florencio and C. Herley, “A large-scale study of web password habits,” in Proc. WWW 2007. ACM, 2007, pp. 657–666.
[29] M. Weir, S. Aggarwal, M. Collins, and H. Stern, “Testing metrics for password creation policies by attacking large sets of revealed passwords,” in Proc. CCS 2010. ACM, 2010, pp. 162–175.
[30] A. Narayanan and V. Shmatikov, “Fast dictionary attacks on passwords using time-space tradeoff,” in Proc. CCS 2005. ACM, pp. 364–372.
[31] C. Allan, 32 million Rockyou passwords stolen, Dec. 2009, http://www.hardwareheaven.com/news.php?newsid=526.
[32] V. Katalov, Yahoo!, Dropbox and Battle.net Hacked: Stopping the Chain Reaction, Feb. 2013, http://blog.crackpassword.com/tag/yahoo/.
[33] R. Martin, Amid Widespread Data Breaches in China, Dec. 2011, http://www.techinasia.com/alipay-hack/.
[34] J. Huang, H. Jin, F. Wang, and B. Chen, “Research on keyboard layout for chinese pinyin ime,” Journal of Chinese Information Processing, vol. 24, no. 6, pp. 108–113, 2010.
[35] D. Wang, G. Jian, X. Huang, and P. Wang, “Zipf’s law in passwords,” Cryptology ePrint Archive, Report 2014/631, pp. 1–24, 2014, http://eprint.iacr.org/2014/631.pdf.
[36] R. A. Butler, List of the Most Common Names in the U.S., Jan. 2014, http://names.mongabay.com/most_common_surnames.htm.
[37] J. Goldman, Chinese Hackers Publish 20 Million Hotel Reservations, Dec. 2013, http://www.esecurityplanet.com/hackers/chinese-hackers-publish-20-million-hotel-reservations.html.
[38] Sogou Internet thesaurus, Sogou Labs, April 17 2014, http://www.sogou.com/labs/dl/w.html.
[39] A. Das, J. Bonneau, M. Caesar, N. Borisov, and X. Wang, “The tangled web of password reuse,” in Proc. NDSS 2014, 2014, pp. 1–15.
[40] W. Gale and G. Sampson, “Good-turing smoothing without tears,” Journal of Quantitative Linguistics, vol. 2, no. 3, pp. 217–237, 1995.
[41] R. Nithyanand and R. Johnson, “The password allocation problem: strategies for reusing passwords effectively,” in Proc. WPES 2013. ACM, 2013, pp. 255–260.
[42] D. Florêncio, C. Herley, and P. van Oorschot, “An administrator’s guide to internet password research,” in Proc. USENIX LISA 2014, pp. 44–61.
[43] J. Bonneau, C. Herley, P. C. van Oorschot, and F. Stajano, “The past, present, and future of password-based authentication on the web,” Comm. of the ACM, 2015, in press.


Understanding Passwords of Chinese Users: Characteristics, Security and Implications (Supplemental File)

Abstract—This supplementary file is composed of six sections. Section 1 shows the top-10 most popular password patterns with digits in Chinese web passwords. Section 2 deals with the semantic patterns regarding Pinyin names and dates that are at least 4 characters long. Section 3 illustrates the coverage of name segments in test sets. Section 4 describes the Markov-based algorithm for the generation of password guesses. Section 5 explicates a subtlety about Good-Turing smoothing in the Markov-based password cracking algorithm. Section 6 demonstrates the Markov-based cracking results under three attacking scenarios.

Index Terms—Password authentication, Password structure, Semantic pattern, Markov model.

1 TOP-10 MOST POPULAR PASSWORD PATTERNS WITH DIGITS

As we have seen, digits are popular in the top-10 passwords of Chinese datasets; are they also popular across the whole datasets? To answer this question, we investigate the frequencies of password patterns that involve digits, and the results (for the top-10 most frequent patterns) can be found in Table 1. The first column of the table denotes the pattern of a password as in [1] (L denotes a lower-case sequence, D a digit sequence, U an upper-case sequence, and S a symbol sequence). We see that, on average, more than 50% of Chinese web passwords are composed only of digits, while the corresponding value for English datasets is only 15.77%. In contrast, English users prefer the pattern LD. Note that all the percentages hereafter in this work are obtained by dividing by the corresponding total number of accounts; e.g., the percentage at the upper-left corner of Table 1 is computed as

$$\frac{\#\text{ of passwords with pattern D}}{\#\text{ of total passwords in Tianya}} = \frac{19706174}{30901241} = 63.77\%.$$

It is surprising to see that the first two patterns D and LD alone account for over 70% of every Chinese dataset, which indicates that Chinese users excessively employ digits to build their passwords. This may be due in large part to the fact that most Chinese users are unfamiliar with the English language (and Roman letters on the keyboard). If this is the case, is there any meaningful information underlying these digit sequences?
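The pattern notation used above (L, D, U, S) can be illustrated with a short Python sketch. The function names are chosen here for illustration and the sample passwords are made up; the sketch collapses each run of same-class characters into one letter and then reports each pattern's share of the total, mirroring how the percentages in Table 1 are defined.

```python
from collections import Counter
from itertools import groupby

def char_class(ch):
    """Map a character to L (lower-case), U (upper-case), D (digit) or S (symbol)."""
    if ch.islower():
        return "L"
    if ch.isupper():
        return "U"
    if ch.isdigit():
        return "D"
    return "S"

def pattern(pw):
    """Collapse runs of the same character class, e.g. 'wang1988' -> 'LD'."""
    return "".join(cls for cls, _ in groupby(pw, key=char_class))

def pattern_shares(passwords):
    """Share of each pattern, dividing by the total number of passwords."""
    total = len(passwords)
    counts = Counter(pattern(pw) for pw in passwords)
    return {p: c / total for p, c in counts.most_common()}

# Made-up sample, for illustration only.
shares = pattern_shares(["123456", "abc123", "111111", "a1b2"])
print(shares)
```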

2 SEMANTIC PATTERNS IN USER-CHOSEN PASSWORDS

To gain an insight into the underlying semantic patterns, we construct several dictionaries of different semantic categories and investigate their prevalence (see Table 2). “English word lower” is from http://www.mieliestronk.com/wordlist.html and contains about 58,000 popular lower-case English words. “English lastname” is a dictionary consisting of 18,839 last (family) names with over 0.001% frequency in the US population during the 1990 census according to the US Census Bureau [2]. “English firstname” contains the 5,494 most common first names (1,219 male and 4,275 female names) in the US [2]. “English fullname” is a Cartesian product of “English firstname” and “English lastname”, containing about 1.04 million most common English full names. To get a Chinese full name dictionary, we make use of the 20 million hotel reservations dataset [3] leaked in Dec. 2013. The Chinese family name dictionary includes 504 family names (see http://en.wikipedia.org/wiki/Hundred_Family_Surnames) which are officially recognized in China. Since the first names of Chinese users are widely distributed and can be almost any combination of Chinese words, we do not consider them in this work. As the names are originally in Chinese, we converted them into Pinyin without tones by using a Python package from https://pypinyin.readthedocs.org/en/latest/ and removed the duplicates. We call these two name dictionaries “Pinyin fullname” and “Pinyin familyname”, respectively. “Pinyin word lower” is a Chinese word dictionary known as “SogouLabDic.dic”, and “Pinyin place” is a Chinese place dictionary. Both of them are from [4] and are also originally in Chinese, and we convert them into Pinyin in the same way as the name dictionaries. “Mobile number” consists of all Chinese mobile numbers, which are 11-digit strings whose first three digits belong to a small specified collection (e.g., 130, 139, 150, 185 and so on). As for the birthday dictionaries, we use patterns to match digit strings that might be birthdays. For example, “YYYYMMDD” stands for a birthday pattern in which the first four digits indicate the year (from 1900 to 2014), the middle two represent the month (from 01 to 12) and the last two denote the day (from 01 to 31). “PW with a l+-letter substring” is a subset of the corresponding dataset in each column (see Table 2) and consists of all passwords that include a letter substring no shorter than l, and similarly for “PW with a l+-digit substring”. Table 2 shows the various semantic patterns existing in Chinese and English web passwords.
We can see that a large fraction of English users tend to use raw English words as their password building blocks. More specifically, 43.33% of English users insert a 4-letter or longer (denoted by 4+-letter) word into their passwords, and this figure accounts for more than half of the total passwords with a 4+-letter substring; 25.88% of English users insert a 5+-letter word into


TABLE 1
Password patterns with digits (The percentage is taken by dividing the corresponding total accounts.)

Patterns      Tianya   7k7k     Dodonew  178      CSDN     Duowan   Avg. Chinese  Rockyou  Yahoo    Phpbb    Avg. English
D             63.77%   59.62%   30.76%   48.07%   45.01%   52.84%   52.93%        15.94%   5.89%    12.06%   15.77%
LD            14.71%   17.98%   43.50%   31.12%   26.14%   23.97%   23.72%        27.70%   38.27%   19.14%   27.78%
Sum of top2   78.48%   77.60%   74.25%   79.20%   71.15%   76.81%   76.66%        43.64%   44.16%   31.20%   43.55%
DL            4.12%    3.91%    7.55%    6.25%    5.88%    5.83%    5.25%         2.54%    5.31%    2.03%    2.57%
LDL           1.17%    0.89%    1.46%    1.27%    1.64%    1.52%    1.24%         1.62%    3.30%    3.64%    1.66%
UD            0.51%    0.34%    1.90%    0.62%    1.62%    0.37%    0.83%         1.35%    0.56%    0.37%    1.33%
ULD           0.27%    0.12%    0.27%    0.05%    0.50%    0.31%    0.25%         0.94%    2.48%    1.03%    0.96%
DLD           0.43%    0.31%    0.39%    0.38%    0.52%    0.45%    0.41%         0.42%    0.94%    0.78%    0.43%
LSD           0.31%    0.06%    0.36%    0.08%    0.66%    0.37%    0.30%         0.50%    0.39%    0.17%    0.50%
LDLD          0.28%    0.22%    0.28%    0.47%    0.47%    0.34%    0.31%         0.42%    0.97%    1.02%    0.43%
LDS           0.23%    0.07%    0.11%    0.10%    0.54%    0.51%    0.21%         0.21%    0.26%    0.07%    0.21%
Sum of top10  85.82%   83.52%   86.56%   88.41%   82.97%   86.51%   85.45%        51.64%   58.37%   40.31%   51.64%

TABLE 2
The prevalence of various dictionary words in user passwords (The percentage is taken by dividing the total accounts.)

Dictionary                      Tianya   7k7k     Dodonew  178      CSDN     Duowan   Avg. Chinese  Rockyou  Yahoo    Phpbb    Avg. English
English word lower (len ≥ 4)    5.42%    5.57%    9.33%    4.16%    9.75%    6.68%    6.82%         41.99%   47.55%   40.44%   43.33%
English firstname (len ≥ 4)     5.15%    5.01%    8.51%    5.09%    7.68%    6.16%    6.27%         30.19%   25.78%   17.51%   24.49%
English lastname (len ≥ 4)      7.52%    8.23%    13.25%   8.25%    13.32%   9.69%    10.04%        38.01%   38.96%   31.15%   36.04%
English fullname (len ≥ 4)      5.80%    6.16%    9.04%    7.10%    9.31%    7.36%    7.46%         16.72%   14.59%   11.32%   14.21%
English name any (len ≥ 4)      9.00%    9.46%    15.42%   9.48%    15.02%   11.23%   11.60%        46.24%   45.93%   36.30%   42.82%
Pinyin word lower (len ≥ 4)     9.18%    10.67%   14.61%   12.51%   14.20%   12.55%   12.29%        13.61%   12.01%   10.56%   12.06%
Pinyin familyname (len ≥ 4)     6.34%    7.14%    10.04%   9.21%    10.35%   8.44%    8.59%         1.45%    1.32%    1.20%    1.33%
Pinyin fullname (len ≥ 4)       9.87%    11.42%   15.90%   13.27%   15.32%   13.42%   13.20%        15.22%   13.43%   11.78%   13.48%
Pinyin name any (len ≥ 4)       10.91%   12.42%   18.06%   14.92%   17.18%   14.81%   14.72%        15.83%   14.04%   12.34%   14.07%
Pinyin place (len ≥ 4)          1.95%    2.27%    2.87%    2.50%    3.33%    2.62%    2.59%         1.27%    1.03%    0.89%    1.06%
PW with a 4+-letter substring   22.03%   23.01%   32.54%   23.73%   33.23%   25.80%   26.72%        77.84%   84.00%   76.92%   79.59%
English word lower (len ≥ 5)    2.08%    2.05%    3.69%    0.83%    3.41%    2.37%    2.41%         23.54%   29.49%   24.60%   25.88%
English firstname (len ≥ 5)     1.11%    0.93%    2.23%    0.53%    1.47%    1.19%    1.24%         18.80%   15.21%   9.20%    14.40%
English lastname (len ≥ 5)      2.16%    2.34%    4.48%    1.93%    3.65%    2.77%    2.89%         20.16%   20.82%   15.22%   18.73%
English fullname (len ≥ 5)      4.03%    4.30%    6.14%    4.99%    6.58%    5.07%    5.18%         13.05%   11.35%   8.25%    10.88%
English name any (len ≥ 5)      4.60%    4.65%    6.32%    5.20%    6.87%    5.18%    5.35%         27.67%   26.51%   18.71%   24.30%
Pinyin word lower (len ≥ 5)     7.34%    8.56%    10.82%   10.24%   11.51%   9.92%    9.73%         3.33%    2.99%    2.50%    2.94%
Pinyin familyname (len ≥ 5)     1.35%    1.64%    2.34%    2.24%    2.47%    1.88%    1.99%         0.05%    0.07%    0.07%    0.06%
Pinyin fullname (len ≥ 5)       8.39%    9.87%    12.91%   11.81%   13.14%   11.29%   11.24%        4.79%    4.17%    3.35%    4.10%
Pinyin name any (len ≥ 5)       8.56%    10.05%   13.31%   12.11%   13.46%   11.53%   11.50%        4.80%    4.18%    3.36%    4.11%
Pinyin place (len ≥ 5)          1.24%    1.27%    1.64%    1.58%    2.12%    1.48%    1.55%         0.20%    0.18%    0.16%    0.18%
PW with a 5+-letter substring   18.51%   19.99%   26.95%   19.38%   28.03%   21.70%   22.42%        71.69%   75.93%   68.66%   72.09%
Date YYYY                       14.38%   12.82%   12.45%   10.06%   16.91%   14.33%   13.49%        4.34%    4.30%    2.77%    3.80%
Date YYYYMMDD                   6.06%    5.42%    3.93%    3.94%    8.78%    6.17%    5.72%         0.10%    0.05%    0.09%    0.08%
Date MMDD                       24.99%   19.97%   17.08%   16.46%   24.45%   22.59%   20.92%        7.53%    4.46%    3.59%    5.20%
Date YYMMDD                     21.29%   15.89%   12.70%   13.09%   20.67%   18.28%   16.99%        3.24%    1.23%    1.55%    2.01%
Date any above                  36.61%   30.39%   26.66%   27.07%   35.30%   33.58%   31.60%        11.33%   8.77%    6.45%    8.85%
PW with a digit                 89.49%   88.42%   88.52%   90.76%   87.10%   89.26%   88.93%        54.04%   64.74%   46.14%   54.97%
PW with a 4+-digit substring    81.64%   76.98%   71.90%   78.76%   78.38%   80.60%   78.04%        24.72%   21.85%   19.33%   21.97%
PW with a 6+-digit substring    75.59%   68.32%   61.16%   70.02%   69.87%   73.10%   69.68%        17.77%   8.48%    11.28%   12.51%
PW with a 8+-digit substring    28.04%   27.56%   26.53%   26.37%   49.73%   31.03%   31.54%        6.88%    2.50%    3.73%    4.37%
Mobile Phone Number (11-digit)  2.90%    1.76%    2.63%    3.97%    3.75%    2.44%    2.91%         0.07%    0.01%    0.02%    0.03%
PW with a 11+-digit substring   4.71%    2.09%    3.39%    5.08%    7.57%    3.35%    4.36%         0.75%    0.17%    0.18%    0.37%

their passwords, and this figure accounts for more than one third of the total passwords with a 5+-letter substring. In contrast, few Chinese users choose raw Pinyin words or English words to build passwords, yet they prefer Pinyin names, especially full names. It is surprising to see that, of all the Chinese passwords (22.42%) that include a 5+-letter substring, more than half (11.24%) include a 5+-letter Pinyin full name, and this tendency is more pronounced when considering Chinese passwords with a 4+-letter full name. There is even a non-negligible proportion (i.e., 4.10%) of passwords in English datasets that contain a 5+-letter Pinyin full name, and a reasonable explanation for this observation may be that many Chinese users have created accounts on these English sites. We also note that English names are widely used in English passwords, yet full names are less popular than last names and first names. As far as we know, we are the first to explore the name

patterns in a large-scale empirical password study. Equally surprisingly, we find that, on average, about 20% of Chinese users simply insert a six-digit birthday into their passwords. Besides, about 30.89% of Chinese users employ a 4+-digit date as a password building block, which is 3.59 times as high as that of English users (i.e., 8.61%); 13.49% of Chinese users insert a four-digit year into their passwords, which is about 3.55 times as high as that of English users (3.80%, which is comparable to the results reported in [5]). We note that there might be some overestimates, for there is no way to tell whether some digit sequences are dates or not, e.g., 010101 and 520520. These two sequences may be dates, yet they are also likely to carry other semantic meanings (e.g., 520520 can stand for “I love you I love you”). Nevertheless, this does not affect our conclusion that birthdays play a vital part when Chinese users build their passwords. Another interesting observation is that about 3% of Chinese users just use their 11-digit mobile numbers as passwords, making up 39.59% of all passwords with a 11+-digit substring. While there are few passwords longer than 10, if an attacker can determine that the victim uses a long password, she is likely to succeed by just trying the victim’s 11-digit mobile number. This reveals a practical attacking strategy against long Chinese passwords.
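The date and mobile-number matching described in this section can be sketched with regular expressions. This is an illustration under assumptions, not the authors' matcher: the prefix tuple holds only the four example prefixes named in the text (the full collection is not given), and a match anywhere inside the password counts as a hit.

```python
import re

# YYYYMMDD: year 1900-2014, month 01-12, day 01-31, as described in the text.
DATE_YYYYMMDD = re.compile(
    r"(?:19\d\d|200\d|201[0-4])(?:0[1-9]|1[0-2])(?:0[1-9]|[12]\d|3[01])"
)

# Illustrative subset only: the text names 130, 139, 150 and 185 as examples.
MOBILE_PREFIXES = ("130", "139", "150", "185")
MOBILE_11_DIGIT = re.compile(r"(?:%s)\d{8}" % "|".join(MOBILE_PREFIXES))

def has_birthday(pw):
    """True if `pw` contains a digit string matching the YYYYMMDD pattern."""
    return DATE_YYYYMMDD.search(pw) is not None

def has_mobile_number(pw):
    """True if `pw` contains 11 digits starting with a listed prefix."""
    return MOBILE_11_DIGIT.search(pw) is not None

print(has_birthday("lily19900312"), has_mobile_number("13912345678"))
```

As the text cautions, such pattern matching can overestimate: a digit string that fits a date pattern is not necessarily a date.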

3 SEMANTIC PATTERNS IN PASSWORDS

As we have shown in Sec. 4.1 of the main text, the training set (i.e., Duowan) covers the name segments in the test set (i.e., Tianya) well, and thus adding extra names yields limited gains. This observation also holds for the other eight test sets, and the detailed results are summarized in Table 3, where “Duowan1M” is short for Duowan 1M and “PY name” is short for Pinyin name. The fraction of L-segments in the test set y that can be covered by the set x is denoted by CoL(x). Table 3 shows that CoL(x) is at least 11 times larger than CoL(Pinyin name)−CoL(x), and CoL(Pinyin name)∩CoL(x) is at least 1.9 times larger than CoL(Pinyin name)−CoL(x), whether x = Duowan or Duowan 1M. As a result, adding extra names to the PCFG L-segments during training yields limited gains. Note that this does not contradict our observation that Pinyin names are prevalent in Chinese web passwords; rather, it suggests that when the training set is selected properly, the name segments in passwords can be well covered (guessed). Still, when no proper training set is available, our improved attack would demonstrate its advantages. Moreover, although our improved PCFG-based algorithm might not be optimal, its cracking results represent a new benchmark that any future algorithm should aim to decisively clear.
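The coverage measure CoL(x) can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the measurement code behind Table 3: an L-segment is taken to be a maximal run of ASCII letters, segments are lower-cased, and coverage is weighted by segment occurrences; the helper names `l_segments` and `coverage` are hypothetical.

```python
import re
from collections import Counter


def l_segments(passwords):
    """Count every maximal run of letters (an L-segment) in the passwords."""
    segments = Counter()
    for pw in passwords:
        # Lower-casing is an assumption made for this sketch.
        segments.update(s.lower() for s in re.findall(r"[A-Za-z]+", pw))
    return segments


def coverage(test_passwords, cover_set):
    """Fraction of L-segment occurrences in the test set found in cover_set."""
    segments = l_segments(test_passwords)
    total = sum(segments.values())
    covered = sum(n for s, n in segments.items() if s in cover_set)
    return covered / total if total else 0.0


test_set = ["wanglei123", "zhang1987", "abc123", "lili520"]
x = set(l_segments(["wanglei88", "abc"]))   # segments learned from training
names = {"wanglei", "zhang", "lili"}        # toy Pinyin-name list

print(coverage(test_set, x))          # CoL(x)
print(coverage(test_set, names & x))  # CoL(PY name) ∩ CoL(x)
print(coverage(test_set, names - x))  # CoL(PY name) − CoL(x)
```

The last quantity is the extra coverage that a name list adds beyond the training set, which is the gain the paragraph above argues is small when the training set is well chosen.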

4 ALGORITHM FOR MARKOV-BASED GUESS GENERATION

The Markov-Chain-based password cracking model is inspired by the Markov-Chain models widely used in the natural language processing (NLP) domain. It was first introduced in [6] and later explored in [7] to reduce the password search space, yet no advanced NLP techniques (e.g., smoothing and normalization) were employed to cope with the data sparsity and overfitting problems. In 2014, Ma et al. [8] investigated these issues and reported that, “when using a Markov-Chain of an order that is high enough, but not too high, and with some ways to deal with overfitting, would perform reasonably well” and in some cases even perform significantly better than the PCFG-based cracking model. Consequently, here we further conduct a series of Markov-Chain-based attacks on the nine real-world password datasets. In our experiments, as recommended in [8], we consider two smoothing techniques (i.e., Laplace Smoothing and Good-Turing Smoothing) to deal with the data sparsity problem, and two normalization techniques (i.e., distribution-based and end-symbol-based) to deal with the unbalanced length distribution of passwords. This brings about four attacking scenarios, as listed in Table 7 of the main text. In each scenario we consider three Markov orders (i.e., order-5, order-4 and order-3) to investigate which order performs best.

Due to space constraints, here we only illustrate the detailed password guess generation procedure for experiments using Laplace smoothing and end-symbol-based normalization (i.e., Scenario 1) in Algorithm 1; the generation procedures for the three other scenarios (see Table 7 of the main text) are quite similar. As with PCFG-based attacks, in our implementation we use a max-heap to store the interim results to maintain efficiency. To produce $k = 10^{7}$ guesses, we employ the strategy of first setting a lower bound (i.e., $10^{-9}$) for the probability of generated guesses, then sorting all the guesses and finally selecting the top k ones. In this way, we manage to reduce the time overheads by 170% at the cost of about a 110% increase in storage overheads, as compared to the strategy of producing exactly k guesses. Laplace Smoothing requires adding δ to the count of each substring, and we set δ = 0.01 as suggested by Ma et al. [8].

Algorithm 1: Markov-based password guess generation using Laplace smoothing and end-symbol normalization

Input: A training set TS; the max password length maxLen; an estimated lower bound lowProb on the probability of generated guesses; the Markov-chain order mkOrder; a parameter k indicating the desired size of the guess list
Output: A password guess list L with the k highest ranked items

Training:
    for password ∈ TS do
        for i ← 1 to length(password) do
            preStr ← subStr(password, max(0, i − mkOrder), i − 1)
            nextChar ← getChar(password, i)
            trainingResult.insert(preStr, nextChar)
        preStr ← tailStr(password, mkOrder)
        nextChar ← end-symbol
        trainingResult.insert(preStr, nextChar)

Laplace smoothing:
    function trainingResult.getProbability(preStr, nextChar)
        count ← trainingResult.getCount(preStr, nextChar) + 0.01
        countSum ← trainingResult.getCount(preStr, charSet) + 0.01 ∗ trainingResult.charSet.size()
        return count / countSum

function ProduceGuess(password, probability)
    if probability ≥ lowProb then
        if getChar(password, length(password)) = end-symbol or length(password) ≥ maxLen then
            guessSet.insert(password, probability)
        else
            preStr ← tailStr(password, mkOrder)
            for char ∈ trainingResult.charSet(preStr) do
                newPassword ← password + char
                newProbability ← probability ∗ trainingResult.getProbability(preStr, char)
                if newProbability < lowProb then continue
                ProduceGuess(newPassword, newProbability)

Produce k guesses:
    ProduceGuess(null, 1)
    L ← guessSet.top(k)
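The procedure of Algorithm 1 can be sketched as runnable Python. This is a minimal sketch under stated assumptions rather than the implementation used in the experiments: the end-symbol is represented by a hypothetical `END` character, the order is assumed to be at least 1, candidate characters are drawn from the whole character set rather than only those seen after the current prefix, an explicit stack replaces recursion, and `heapq.nlargest` stands in for the max-heap.

```python
import heapq
from collections import defaultdict

END = "\x00"   # hypothetical end-symbol; assumed never to occur in a password
DELTA = 0.01   # Laplace constant delta, as in the text


def train(passwords, order):
    """Count (prefix, next character) pairs, with END appended to each password."""
    counts = defaultdict(lambda: defaultdict(int))
    charset = {END}
    for pw in passwords:
        seq = pw + END
        for i, ch in enumerate(seq):
            counts[seq[max(0, i - order):i]][ch] += 1
            charset.add(ch)
    return counts, charset


def probability(counts, charset, prefix, ch):
    """Laplace-smoothed estimate of P(ch | prefix)."""
    ctx = counts.get(prefix, {})
    return (ctx.get(ch, 0) + DELTA) / (sum(ctx.values()) + DELTA * len(charset))


def top_guesses(counts, charset, order, max_len, low_prob, k):
    """Enumerate all guesses with probability >= low_prob, then keep the top k."""
    found = []
    stack = [("", 1.0)]
    while stack:
        pw, p = stack.pop()
        prefix = pw[-order:]  # last `order` characters (requires order >= 1)
        for ch in charset:
            q = p * probability(counts, charset, prefix, ch)
            if q < low_prob:
                continue  # prune: any extension can only be less likely
            if ch == END:
                if pw:
                    found.append((q, pw))  # a complete guess
            elif len(pw) < max_len:
                stack.append((pw + ch, q))
    return heapq.nlargest(k, found)


counts, charset = train(["123456", "123456", "12345", "password"], order=3)
for q, pw in top_guesses(counts, charset, 3, max_len=8, low_prob=1e-4, k=3):
    print(pw, round(q, 4))
```

Thresholding first and selecting the top k afterwards mirrors the strategy described in the text: the enumeration may produce more than k candidates, which costs storage but avoids maintaining an exact top-k frontier during generation.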

TABLE 3
Coverage of L (CoL) segments in corresponding test sets (“PY” stands for Pinyin and “1M” stands for one million)

| Test set | CoL(PY name) | CoL(Duowan1M) | CoL(PY name) ∩ CoL(Duowan1M) | CoL(PY name) − CoL(Duowan1M) | CoL(Duowan1M) − CoL(PY name) | CoL(Duowan) | CoL(PY name) ∩ CoL(Duowan) | CoL(PY name) − CoL(Duowan) | CoL(Duowan) − CoL(PY name) |
|---|---|---|---|---|---|---|---|---|---|
| Tianya | 16.63% | 67.53% | 11.82% | 4.81% | 55.71% | 74.34% | 13.75% | 2.88% | 60.59% |
| 7k7k | 16.70% | 71.60% | 12.35% | 4.35% | 59.25% | 79.84% | 14.49% | 2.20% | 65.35% |
| Dodonew | 15.76% | 75.79% | 11.79% | 3.97% | 63.99% | 81.19% | 13.47% | 2.29% | 67.72% |
| 178 | 20.30% | 79.15% | 15.42% | 4.88% | 63.73% | 83.98% | 17.49% | 2.81% | 66.49% |
| CSDN | 17.26% | 65.64% | 11.35% | 5.90% | 54.28% | 72.70% | 13.43% | 3.83% | 59.27% |
| Duowan | 18.06% | 80.05% | 14.38% | 3.68% | 65.67% | 100.00% | 18.06% | 0.00% | 81.94% |
| Duowan rest | 18.07% | 75.03% | 13.46% | 4.61% | 61.57% | 100.00% | 18.07% | 0.00% | 81.93% |

4.1 A subtlety about Good-Turing smoothing on password cracking

There is a subtlety to be noted when implementing the Good-Turing (GT) smoothing technique. We denote $f$ to be the frequency of an event, and $N_f$ to be the frequency of frequency $f$. According to the basic GT smoothing formula, the probability of a string “$c_1 c_2 \cdots c_l$” in a Markov model of order $n$ is denoted by

$$P(\text{“}c_1 c_2 \cdots c_{l-1} c_l\text{”}) = \prod_{i=1}^{l} P(\text{“}c_i \mid c_{i-n} c_{i-(n-1)} \cdots c_{i-1}\text{”}), \tag{1}$$

where the individual probabilities in the product are computed empirically by using the training sets. More specifically, each empirical probability is given by

$$P(\text{“}c_i \mid c_{i-n} \cdots c_{i-1}\text{”}) = \frac{S(\mathrm{count}(c_{i-n} \cdots c_{i-1} c_i))}{\sum_{c \in \Sigma} S(\mathrm{count}(c_{i-n} \cdots c_{i-1} c))}, \tag{2}$$

where the alphabet $\Sigma$ includes 10 printable numbers on the keyboard plus one special end-symbol (i.e., $c_E$) that denotes the end of a password, and $S(\cdot)$ is defined as:

$$S(f) = (f + 1)\frac{N_{f+1}}{N_f}. \tag{3}$$

It can be confirmed that this kind of smoothing works well when $f$ is small, yet it fails for passwords with a high frequency because the estimates for $S(f)$ are not smooth. For instance, 12345 is the most common 5-character string in the Rockyou dataset and occurs $f = 490{,}044$ times. Since there is no 5-character string that occurs 490,045 times, $N_{490045}$ will be zero, implying that the basic GT estimator will give a probability of 0 for P(“12345”). A similar problem regarding the smoothing of password frequencies has been identified in [9]. Various improvements have been suggested in linguistics to cope with this problem, among which is Gale and Sampson’s “simple Good-Turing smoothing” [10]. This improvement (denoted by SGT) is well known for its simplicity and accuracy, and it takes two steps of smoothing. Firstly, SGT performs a smoothing for $N_f$:

$$SN(f) = \begin{cases} N(1) & \text{if } f = 1 \\ \dfrac{2N(f)}{f^{+} - f^{-}} & \text{if } 1 < f < \max(f) \\ \dfrac{2N(f)}{2f - f^{-}} & \text{if } f = \max(f) \end{cases} \tag{4}$$

where $f^{+}$ and $f^{-}$ stand for the next-largest and next-smallest values of $f$ for which $N_f > 0$. Then, SGT performs a linear regression for all values $SN(f)$ and obtains a Zipf distribution: $Z(f) = C \cdot f^{s}$, where $C$ and $s$ are constants resulting from the regression. Finally, SGT conducts a second smoothing by replacing the raw count $N_f$ from Eq. 3 with $Z(f)$:

$$S(f) = \begin{cases} (f + 1)\dfrac{N_{f+1}}{N_f} & \text{if } 0 \le f < f_0 \\ (f + 1)\dfrac{Z(f+1)}{Z(f)} & \text{if } f_0 \le f \end{cases} \tag{5}$$

where $t(f) = \left| (f + 1)\dfrac{N_{f+1}}{N_f} - (f + 1)\dfrac{Z(f+1)}{Z(f)} \right|$ and

$$f_0 = \min\left\{ f \in \mathbb{Z} \;\middle|\; N_f > 0,\ t(f) > 1.65\sqrt{(f + 1)^2 \frac{N_{f+1}}{N_f^{2}}\left(1 + \frac{N_{f+1}}{N_f}\right)} \right\}.$$
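The failure of the basic estimator and the effect of the Zipf fit can be seen in a small Python sketch. It is a sketch under stated assumptions, not the implementation used in our experiments: the frequency-of-frequency table `N` is toy data, the regression is an ordinary least-squares fit in log-log space, the helper names are hypothetical, and the switch point $f_0$ of Eq. 5 is deliberately left out so that only the two branches are compared.

```python
import math


def basic_gt(f, N):
    """Eq. (3): S(f) = (f + 1) * N_{f+1} / N_f, where N maps f to N_f."""
    return (f + 1) * N.get(f + 1, 0) / N[f]


def smooth_counts(N):
    """Eq. (4): average each N_f over the gap to its observed neighbours."""
    fs = sorted(N)
    assert fs[0] == 1, "this sketch assumes frequency 1 is observed"
    SN = {}
    for i, f in enumerate(fs):
        if f == 1:
            SN[f] = float(N[f])
        elif i == len(fs) - 1:
            SN[f] = 2 * N[f] / (2 * f - fs[i - 1])
        else:
            SN[f] = 2 * N[f] / (fs[i + 1] - fs[i - 1])
    return SN


def fit_zipf(SN):
    """Least-squares fit of log SN(f) = log C + s * log f; returns (C, s)."""
    xs = [math.log(f) for f in SN]
    ys = [math.log(v) for v in SN.values()]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    s = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return math.exp(my - s * mx), s


def smoothed_gt(f, C, s):
    """Second branch of Eq. (5): (f + 1) * Z(f + 1) / Z(f), Z(f) = C * f**s."""
    return (f + 1) * (C * (f + 1) ** s) / (C * f ** s)


N = {1: 100, 2: 40, 3: 20, 5: 5, 10: 1}   # toy frequency-of-frequency table
C, s = fit_zipf(smooth_counts(N))
print(basic_gt(10, N))        # the raw estimator collapses: N_11 is missing
print(smoothed_gt(10, C, s))  # the Zipf-based estimate stays positive
```

In the toy table the most frequent event plays the role of the string 12345 in the Rockyou example: no event has frequency $f + 1$, so the raw estimator assigns it zero mass while the regression-based branch does not.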

In 2014, Ma et al. [8] introduced GT smoothing into Markov-based attacks to facilitate more accurate generation of password guesses, yet little attention has been paid to the unsoundness of GT for high-frequency events, as illustrated above. To the best of our knowledge, we are the first to explicate the combined use of GT and SGT in Markov-based password cracking.

5 MARKOV-CHAIN-BASED CRACKING RESULTS

The cracking results for attacking Scenario 1 have been included in Fig. 5 of the main text, and the experiments for the three remaining scenarios are depicted in Fig. 1, 3 and 2, respectively. From Fig. 5 of the main text and Figs. 1, 2 and 3, one can see that for both Chinese and English test sets: (1) At large guess numbers (i.e., no less than $2 \times 10^{6}$), the order-4 Markov chain evidently performs better than the other two orders, while at small guess numbers (i.e., less than $10^{6}$), the larger the order, the better the performance; (2) There is not much difference in performance between Laplace Smoothing and Good-Turing Smoothing at small guess numbers, while the advantage of Laplace Smoothing grows as the guess number increases; (3) End-symbol-based normalization always performs better than the distribution-based approach, and its advantage is more obvious at small guess numbers. This suggests that at large guess numbers, the attacks using order-4, Laplace Smoothing and end-symbol-based normalization perform the best among all the Markov-chain-based attacks, while at small guess numbers (e.g., less than $10^{6}$), the attacks using order-5, Laplace Smoothing and end-symbol-based normalization perform the best.

It is worth noting that the “reversal principle” also applies in all the Markov-chain-based experiments. For example, in the order-4 Markov-chain-based experiments (see Fig. 2(b) and Fig. 2(e)), when the guess number is below about 7000, Chinese web passwords are generally much weaker than their English counterparts: at 1000 guesses, the success rates against Tianya, Dodonew and CSDN are 11.8%, 6.3% and 11.6%, respectively, while those against their English counterparts (i.e., Rockyou, Yahoo and Phpbb) are merely 8.1%, 4.3% and 7.1%, respectively. However, when the guess number is allowed to be over $10^{4}$, Chinese web passwords are generally stronger than their English counterparts: at 1 million guesses, the success rates against Tianya, Dodonew and CSDN are 38.2%, 20.4% and 25.4%, respectively, while those against their English counterparts are 38.6%, 24.8% and 32.3%, respectively.

[Figure 1 plots: x-axis “Search space size” ($10^{0}$ to $10^{7}$); y-axis “Fraction of cracked passwords” (0% to 60%). Recoverable panel titles: (a) Order-5 Markov-based attack on Chinese datasets; (c) Order-3 Markov-based attack on Chinese datasets. Recoverable legend entries: Tianya; Rockyou_rest, Yahoo, Phpbb.]

Fig. 1. Markov-chain-based attacks on different groups of datasets (using Laplace Smoothing and Distribution-based Normalization). Attacks (a)∼(c) use 1M Duowan passwords as the training set, while attacks (d)∼(f) use 1M Rockyou passwords as the training set.

[Figure plots: x-axis “Search space size” ($10^{0}$ to $10^{7}$); y-axis ticks from 0% to 50%.]

101

102

103

104

æ æ æ æ æ æ æ æ æ æ æ æ æ æ æ æ æ æ æ æ æ æ æ æ æ æ æ æ ô æ ô æ ô æ ô æ ô æ ì ô æ ì ô æ ì æ ô ì æ ô ì æ ô ì æ ô ì æ ô æ ô ì æ ì ô ì æ ô ô ì æ ìç ô æ ì ç ô ì æ ç ô æ ì ô ì æ ô ì æ ô ì çç ì æ ô ç ì ô æ ç ô ì æ ç ô ì æ ç ô ì ç æ ô ì ç ô ì æ ç ô ì ç æ ô ì ç ô ì æ ç ô ì æ ô ç ì æ ô ç ì ô æ ç ì ô ç ì æ ô ç ì ô æ ì ç ô æ ç ô ì æ ç ô ì ì ç ô ç ì ô æ ô ì ç ô ì æ ç ô ì æ ô ì ô ì æ ô ì çç æ ì ç ì ç ô æ ì ç ô ì æ ç ô ì ç æ ô ì ô ç æ ì ç ô ç ì æ ô ç ô ì ç æ ô ì ç æ ç ô ì ç ô æ ì ç ô ì ç ô æ ç ô ì ç æ ô ç ì ô ç æ ì ô ç ì æ ô ç ô ç æ ô ç ç ô æ ô ç ì ô ç æ ì ç ô æ ì ç ô ò ì ç æ ô ò ì ç ò ô æ ò ç ì ô ò ç ì æ ò ô ì ò ç ò ì æ ô ç à ò ò ô ç ì æ ò à ç ô ò ì à ò ô æ ç ò à ì ô ç ò æ ò ç ì ô à ò æ ç ì ô à òò ô à ò ç ò ô ç ò à æ ì ò ç ô ì à ò æ ò ì ô ç à ò ò ì ô ç à æ ò ì ç ò à ò æ ì ô ò ç à ì ô æ òò à ç ì ò ô æ ò à ç ò ì ô à ò ç æ ò ô ì à ò ç ì ò æ ô à ò ç ì ò ô à ç ò ò æ ì ô à ç ò ò ç ì à ô ò æ ò ç à ò ôì ç ò à ìç æ ò ì ô ç ò à ç ò ô à ì ò æ ç ò ô à ì ò ç ò ì à ô æ ò à ô òò æ à ì ò ô çç ò à ò æ ç à ò ì ô ò à ç ò æ ô à ò ç ì à ò æ ì ç ô à ò ò ç à æ ì ò ô ç à ò ì ò ô ç æ ì ò ò à ç ô ò à ç æ à ç ô ò ì à ò æ ç à ì ò ô ç à ò ì ò æ à ô ç ò à ì æ ô ç òò à ì ò à ò ç ô ì ò æ à ò ì ç ò à ô ò ç ì æ ò à ò ì ô à ç ò æ ì à ò ç ô à ò ì æ à ì ò ô ç à ò ì ç æ ò ô à ò à æ çò ò à ô ò à ç æ ò ô à ò æ ç à ò ì ô à ì æ ç ò à ô à ì ò æ ì ç à ô ì æ à ç ì òò ô à æ ò ì à ç ô ì à ò æ ô ç à ì ò à æ ô ç ì ò à ì æ ô à ò ì ç ì à æ ô ò ì à ì ç ô æ ò ì à ô æ ç ì ò à à ò ô ì æ ò à ì ç æ ô ì ò à ì ç à ô ì æ ò à ì ç ò æ ô à ì ò ç ô ì à æ ì ò à ô æ ç à ò ô ì æ ì ç à ì ò ô æ ì à ç ò ô à æ ì ò ô ç à æ à ç ô ò æ à ô ç ì à æ ò ô ì à ô ç ì ò æ ì à ç ò ô æ ô ì à ò ç æ ì ô à ò æ ç ô à ò æ ô à ç ò ô æ æ à ô ç ò ì æ ô à ì ò ç ô ì à æ ò ô ì æ ç à ò ô æ ì à ì ò æ ô ç ì à æ ò ì ç ô æ à ò ì ô æ ç à ò æ ô ì æ ô ç à ò æ ì ô ò ì à ç æ ô ò æ à ì ô ò ç ì ì à ô ò ç ò ô à ì æ ò ô ç ò æ ì à ô ò ç ì æ ò à ç ò æ ò æ ç à ì ì ò ç à ò æ ì ç ò à æ ò ç ì æ òò æ ò ç à ò òç ç ì à æ ç ì ò ç à ò ì æ ò ç æ ò à ì æ ò ç ò ì ç ò æ ò ç ô ô æ ç ò ç æ ì àà æ ò ì 
ô ò à ò ô ç à ò æ ç ò ô ì æ à ç ô ô ì ç à æ ç æ ô ç ì à ç ô ì à æ çç à ç æ ô ì à ô ò ç æ ò à ô ì ò æ ç ô ò ò ì à ç æ ò ì ç à æ ô ò ì ô à ì ì ò ç æ ô ò à æ ç ò ô ò ì ç æ à ô ô à æ ç ì æ à ì æ ç ô òò æ à ç òò ò æ ô æ ò à ì ô ò ç ì ç ô ì à ò æ ç æ à ô òò ç ì ò ô à æ ô òò ç ô à ô ò æ ç ò à æ ç à ô ì ô ç ç à æ ôçç æ à ô çç ç à ç ç ôô æ ì à ò ô ç à ò ì à ô ò ææ æ à ô ç ôì à æ æì ô æ ç ôææ æ à ç ô à ç à ç à à ç ôô à æ çç à ç æ à ôôôô ç à ì à ôô ì à à ì à ççç æç à à ç à ç òô æ ôôô à çç à ìç à ì à à à ì ç à à æ à à ììì à æ ô ì æ à ô à æ à çì ôì æ ô àà ôì à ô ì àà ç à ç ç ç ì ç ààààà à à ç à æ ì æ à à ì à ôì àà ôô æ ô àààà à àà àà æ à ààà ç à ô æ ôì òç àà ààà ààààààààààààà à àààà à à à à à ààà à à


à ì ò ô ç

40% 30% 20% 10% æ ô ç ì à

0% ò 100

107

101

102

105

106

107

50% 40% 30% 20% 10% æ 0% ìà

100

104

105

106

107

(c) Order-3 Markov-based attack on Chinese datasets 60%

æ Rockyou_rest æ æ æ æ æ æ æ æ æ æ æ æ æ æ æ æ æ æ æ æ æ æ æ æ æ æ æ æ æ æ æ æ æ æ æ ì æ ì æ ì æ ì ì æ ì æ ì æ ì æ ì æ ì æ ì ì æ ì æ ì æ ì ì æ ì æ ì ì æ ì æ ì ì æ ì æ ì æ ì ì æ ì æ ì æ ì ì æ ì æ ì ì æ ì æ ì æ ì æ ì æ ì ì æ ì æ ì æ ì à à ì à æ ì à ì æ à ì à æ ì à æ ì à æ ì à à ì æ à ì æ à ì à æ ì à æ ì à æ ì à à ì à æ à ì æ à ì à æ ì à æ ì à ì à æ ì à æ à ì æ à ì à æ ì à æ ì à ì à æ ì à æ ì ì à æ ì à æ ì à ì æ à ì à æ à æ ì à æ ì à ì æ à ì à æ ì à æ à ì æ à ì ì æ ì à ì æ à ì æ ì à æ à ì æ ì à æ à ì æ à ì à æ ì æ à ì æ à ì æ à ì æ à ì à ì æ à ì æ à ì æ à ì æ à ì æ ì à æ ì à æ ì à æ ì à æ ì à æ ì à æ ì à æ ì à æ ì à æ ì æ ì ì æ ì àà æ ì à æ ì à æ ì æ à ì æ à ì æ ì à æ ì à ì æ à æ ì à ì æ à ì æ à ì æ à ì æ ì à æ ì à æ ì à ì æ ì æ à ì æ à ì æ à ì æ ì à ì æ à ì æ ì à æ ì à ì æ à ì æ à ì æ à ì æ ì à ì æ à ì æ à ì æ à ì æ à æ ì à æ à ì à æì ì à æ ì à æ ì à æ ì à æ ì à æ ì à æ ì à æ à ì à æ ì à æ à ì æ à ì æ à ì æ à ì à æ à ì à æ ì à æ ì à à ì æ à ì æ à à æ ì à ì à æ ì à æ à ì æ à æ à ì à æ ì à ì à ì æ à à ì æ ì à æ à ì à æ ì à æ ì à à æ ì à ì æ à à æ ì æ àà ì æ à æ à ì æ à ì à æ à ì à æ ì à æ à ì æ à ì à æ æ ì à à æ ì à æ à ì à ì æ à ì æ à ì à æ ì à ì æ à æ ì à æ æ à ì æ à ì æ ì æ ì æ àà ì à æ à æ ì à à ì æ à ì æ à ì æ ì à æ ì à æ ì à ì æ à ì æ à æ ì à æ à ì æ à ì æ ì à ì æ à ì æ ì à æ ì æ à ì æ à æ ì ì à æ ì æ à æ ì à ì æ ì à æ à ì æ ì æ à ì æ à ì æ à ì æ ì àà æ ì à æ ì à æ ì à æ ì à æ ì à æ ì æ ì àà ì àà æ à ì ì æ à à æ ì æ à ì æ à ì à æ ì à æ ìæ à æ æ à æì ì æì ì à æ à ì ì à æ æ ì æ ì àà æ à æ ì æ àà æ ì ì æì à ì à ì æ à æ ì à ì ææ àà ì à ì ì à ææ æ à ìì à ì ææ à æ à æ æ àà ì æì ææ àà ì ææì à à ààà ìì æ ì àà à ææ àà àààà æ àà àààà à ì à àà àààà àà à à à àààà à ì

à Yahoo ì Phpbb

101

102

Search space size


103

Search space size

60% Fraction of cracked passwords


30%

105


60%

40%

104

æ Tianya

50%

Search space size

(a) Order-5 Markov-based attack on Chinese datasets 50%

103


40%

à ì ò ô ç

60%

æ Tianya

103

104

105

106

Search space size


107


50%

60%

æ Tianya



60%

50% 40%


30% 20% 10% 0% ìæà 100

æ ì æ æ ì à àà à ìì

101

æ æ æ æ æ æ æ æ æ æ æ æ æ æ æ æ æ æ æ æ æ æ æ æ æ æ æ ì æ ì æ ì ì æ ì æ ì æ ì æ ì ì æ ì æ ì æ ì æ ì æ ì ì æ ì æ ì ì æ ì ì æ ì æ æ ì æ ì ì æ ì æ ì æ ì ì æ ì æ ì æ ì æ ì æ ì ì æ ì æ ì æ ì æ ì æ ì æ ì à ì à æ ì à æ ì à æ à æ à ì à ì æ à ì æ à ì æ à ì æ à ì à æ ì à ì à æ ì à æ ì à ì æ à ì æ à ì à æ à ì æ à ì æ à ì à æ ì à æ ì à ì à æ à ì æ à ì à æ ì à ì æ à ì à æ ì à ì æ à ì à æ ì à æ ì à ì æ à ì æ à ì ì æ æ ì àà æ ì à ì æ à ì à æ ì à æ ì à ì æ à à æ ì à ì æ à ì æ à ì æ à ì æ à ì à ì æì à ì æ à ì æ à ì æ à ì æ à ì à æ ì à æ ì à æ ì à ì æ à ì æ à ì æ à ì æ ì ì æ àà ì æ à ì æ à ì æ à ì æ à ì à æ ì à ì æ ì à æ ì à æ ì à æ ì à æ ì à æ ì à æ ì à æ ì à æ ì ì æ ì æ àà ì æ ì à æ à ì æ à ì æ à ì æ æ ì à ì æ à ì æ à ì æ à ì ì æ à ì æ à æ ì æ à ì à ì æ à æ ì à ì æ ì à æ æ à ì ì à æ ì æ à ì æ à ì à æ æ ì à æ ì à à æ ì à æ ì à æ ì à ì æ à ì æ à æ à ì æ à æ ì à æ ì à æ ì à à ì æ à ì à æ à æ ì à æ ì à æ ì à æ à ì à æ ì à æ ì à æ ì à æ ì à æ à ì æ à ì à æ ì æ à ì à æ ì à æ ì à æ ì æ à à æ ì à ì à æ ì à æ ì à ì à ì æ à æ ì à à ì æ à æ ì à æ à ì æ à ì æ ì à æ à ì æ ì à æ ì æ à ì ì æ æ ì æ à ì à æ ì ì æ à ì æ à æ à ì æ æ à ì æ à ì à æ ì æ à æ à ì æ ì à æ ì ì à æ æ à ì à æ à à à æ æ à ì æ ì æ à ì à ì æ ì à ì æ à ì à æ ì æ ì à æ ì æ ì à æ ì à æ ì ì à ìæ à à ì æ ì æì æ æì àà ììì ì à æææ àà æ ìì à àà ìì ææ ì ììææ àà à æ à ì ææì ààà ìì à æææ ì æ ààààà à ààà à àà à àà à àà àààà à àà àà à ààààà à

102

103

104

105

106

107

Search space size


Fig. 2. Markov-Chain-based attacks on different groups of datasets (using Good-Turing Smoothing and End-Symbol Normalization). Attacks (a)∼(c) use 1 million Duowan passwords as the training set, while attacks (d)∼(f) use 1 million Rockyou passwords as the training set.
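To make the terms in the caption concrete, the following is a minimal sketch of an order-n Markov password model with end-symbol normalization: an explicit end symbol is appended to every training password, so the probability of a string includes the probability of stopping after it. Several things here are assumptions and not taken from the paper: the `START` and `END` marker characters, the function names, the `alphabet_size` parameter, and above all the smoothing. Simple additive (add-delta) smoothing stands in for the Good-Turing smoothing named in the caption, purely to keep the sketch short.

```python
from collections import defaultdict
import math

START = "\x02"  # hypothetical start-padding symbol (assumption)
END = "\x03"    # hypothetical end symbol (assumption)


def train(passwords, order):
    """Count context -> next-symbol transitions, with END appended to each password."""
    counts = defaultdict(lambda: defaultdict(int))
    for pw in passwords:
        padded = START * order + pw + END
        for i in range(order, len(padded)):
            counts[padded[i - order:i]][padded[i]] += 1
    return counts


def log_prob(pw, counts, order, alphabet_size=95, delta=0.01):
    """Log-probability of pw, including the transition to END.

    Additive smoothing is used here as a stand-in; it is NOT Good-Turing.
    alphabet_size is the assumed number of possible password characters.
    """
    padded = START * order + pw + END
    total = 0.0
    for i in range(order, len(padded)):
        nxt = counts.get(padded[i - order:i], {})
        numerator = nxt.get(padded[i], 0) + delta
        # +1 accounts for END, which is a possible next symbol in every context.
        denominator = sum(nxt.values()) + delta * (alphabet_size + 1)
        total += math.log(numerator / denominator)
    return total


model = train(["123456", "123456", "password"], order=2)
seen = log_prob("123456", model, order=2)
unseen = log_prob("654321", model, order=2)
```

In this toy run `seen` is larger than `unseen`, because every transition of "123456" was observed in training while "654321" relies on smoothed mass only. An attack of the kind plotted would enumerate candidate strings in decreasing order of this probability.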

REFERENCES

[1] M. Weir, S. Aggarwal, B. de Medeiros, and B. Glodek, “Password cracking using probabilistic context-free grammars,” in Proc. 30th IEEE Symp. on Security and Privacy. IEEE, 2009, pp. 391–405.
[2] R. A. Butler, List of the Most Common Names in the U.S., Jan. 2014, http://names.mongabay.com/most common surnames.htm.
[3] J. Goldman, Chinese Hackers Publish 20 Million Hotel Reservations, Dec. 2013, http://www.esecurityplanet.com/hackers/chinesehackers-publish-20-million-hotel-reservations.html.
[4] Sogou Internet thesaurus, Sogou Labs, April 17 2014, http://www.sogou.com/labs/dl/w.html.
[5] A. Das, J. Bonneau, M. Caesar, N. Borisov, and X. Wang, “The tangled web of password reuse,” in Proc. NDSS 2014, 2014, pp. 1–15.
[6] A. Narayanan and V. Shmatikov, “Fast dictionary attacks on passwords using time-space tradeoff,” in Proc. CCS 2005. ACM, 2005, pp. 364–372.
[7] M. Dell’Amico, P. Michiardi, and Y. Roudier, “Password strength: an empirical analysis,” in Proc. INFOCOM 2010. IEEE, 2010, pp. 1–9.
[8] J. Ma, W. Yang, M. Luo, and N. Li, “A study of probabilistic password models,” in Proc. IEEE S&P 2014. IEEE, 2014, pp. 538–552.
[9] J. Bonneau, “Guessing human-chosen secrets,” Ph.D. dissertation, University of Cambridge, 2012.
[10] W. Gale and G. Sampson, “Good-Turing smoothing without tears,” Journal of Quantitative Linguistics, vol. 2, no. 3, pp. 217–237, 1995.

[Figure 3, panels (a) to (f): line plots. Horizontal axis: search space size (10^0 to 10^7, logarithmic scale). Vertical axis: fraction of cracked passwords (0% to 60%). Recoverable panel titles: (a) Order-5 Markov-based attack on Chinese datasets; (c) Order-3 Markov-based attack on Chinese datasets. Recoverable legend entries: Tianya, Rockyou_rest, Yahoo, Phpbb.]

Fig. 3. Markov-Chain-based attacks on different groups of datasets (using Good-Turing Smoothing and Distribution-based Normalization). Attacks (a)∼(c) use 1 million Duowan passwords as the training set, while attacks (d)∼(f) use 1 million Rockyou passwords as the training set.
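As a reading aid for the axes of these plots (fraction of cracked passwords against search space size), the sketch below shows one way a point on such a curve could be computed from an ordered guess list and a test set. It is an illustration under stated assumptions, not the authors' evaluation code: the function name `cracked_fraction`, its parameters, and the toy data are hypothetical, checkpoints are assumed to be positive integers, and the test set is assumed to be non-empty.

```python
from collections import Counter


def cracked_fraction(guesses, test_passwords, checkpoints):
    """Return {n: fraction of test accounts cracked within the first n guesses}.

    guesses: iterable of candidate passwords, most probable first.
    test_passwords: list of target passwords, one entry per account.
    checkpoints: positive integers, e.g. 10, 100, ..., 10**7.
    """
    remaining = Counter(test_passwords)  # password -> number of accounts using it
    total = sum(remaining.values())
    wanted = sorted(checkpoints)
    cracked = 0
    idx = 0
    result = {}
    for n, guess in enumerate(guesses, start=1):
        # A correct guess cracks every account sharing that password, once.
        cracked += remaining.pop(guess, 0)
        while idx < len(wanted) and wanted[idx] <= n:
            result[wanted[idx]] = cracked / total
            idx += 1
    # Checkpoints beyond the end of the guess list keep the final value.
    for n in wanted[idx:]:
        result[n] = cracked / total
    return result


curve = cracked_fraction(
    ["123456", "password", "abc"],
    ["123456", "123456", "qwerty", "password"],
    [1, 2, 10],
)
# curve == {1: 0.5, 2: 0.75, 10: 0.75}
```

Counting accounts and not distinct passwords means that a popular password contributes its full frequency the moment it is guessed, which is why such curves tend to rise steeply at small search space sizes.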
