2010 IEEE Conference on Open Systems (ICOS 2010), December 5-7, 2010, Kuala Lumpur, Malaysia
Using Genetic Algorithm To Break A Mono-Alphabetic Substitution Cipher
S. S. Omran
A. S. Al-Khalid
College of Elec. & Electronic Techniques Foundation of Technical Education
[email protected]
College of Elec. & Electronic Techniques Foundation of Technical Education
[email protected]
Abstract- Genetic algorithms (GAs) are a class of optimization algorithms that attempt to solve problems by modeling a simplified version of genetic processes. There are many problems for which a genetic algorithm approach is useful. It is, however, undetermined whether cryptanalysis is such a problem. This work therefore explores the use of genetic algorithms in cryptography. The focus is on the substitution cipher, whose principles form the foundation of many modern cryptosystems. Letter frequency analysis is used as the essential factor in the objective function.
D. M. Al-Saady Foundation of Technical Education
[email protected]
978-1-4244-9192-6/10/$26.00 ©2010 IEEE

I. INTRODUCTION
The field of cryptology today represents the branch of information theory that deals with the security and confidentiality of information. Methods in cryptology may be subdivided into two classes: cryptography (methods applied by authorized information sharers to design and develop encryption schemes in order to ensure confidentiality of information) and cryptanalysis (mathematical and statistical attempts by unauthorized persons to break a cipher in order to reveal the meaning of the underlying protected data).

Symmetric cryptographic ciphers may be subclassified into block ciphers (in which blocks of data, known as plaintext, are transformed into ciphertext which appears unintelligible to unauthorized persons) and stream ciphers (which involve streams of typically binary operations and are well suited for efficient computer implementation). In this paper we focus on one kind of block cipher, the substitution cipher, as shown in Fig.1 [2,3,4].

Cryptographic ciphers
  Asymmetric cryptographic ciphers
  Symmetric cryptographic ciphers
    Block ciphers
      Substitution ciphers
      Transposition ciphers
      Product ciphers
    Stream ciphers
Fig.1 Schematic representation of cryptographic cipher classification

In 1993 Spillman et al. [5] presented, for the first time, a genetic algorithm approach to breaking a substitution cipher. They explored the possibility of a random type of search to discover the key (or key space) for a simple substitution cipher. In this paper different parameters of the genetic algorithm are tested, such as the population size and the time required to finish the algorithm for different numbers of generations.

II. SUBSTITUTION CIPHER
As a symmetric cryptographic cipher, the substitution cipher is classified into two types, mono-alphabetic and poly-alphabetic [2,3,6]. In the mono-alphabetic case, the ciphertext is created by choosing a permutation of the 26-character alphabet and using it to replace each letter in the plaintext message [1].

In substitution ciphers the value of a character or character string is changed when transforming the plaintext into ciphertext, but the position of the original string and that of its replacement correspond exactly in the plaintext and ciphertext [4,7,8,9,10,11]. For example, if we encrypt the plaintext "genetic algorithm in cryptography" using a single character substitution cipher with a certain key, the ciphertext will be as shown in Fig.2.

Text character: a b c d e f g h i j k l m n o p q r s t u v w x y z
Key (cipher):   H W U G C T V A E K D Y Q P B R J L F I X M S O Z N
Plaintext:      genetic algorithm in cryptography
Ciphertext:     VCPCIEUHYVBLEIAQEPULZRIBVLHRAZ
Fig.2 Example of a key of a single character substitution cipher
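The substitution described in this section can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names `encrypt` and `decrypt` are hypothetical, and the key is the one of Fig.2 under the assumption that the letter i maps to E, which is what the ciphertext of "genetic" and "in" implies and what makes the key a permutation of the alphabet.

```python
import string

ALPHABET = string.ascii_lowercase  # plaintext alphabet a..z

# Key of Fig.2, assuming i -> E so that the key is a permutation (assumption).
KEY = "HWUGCTVAEKDYQPBRJLFIXMSOZN"


def encrypt(plaintext: str, key: str = KEY) -> str:
    """Replace each letter by the key letter at the same alphabet position.

    Characters that are not letters (spaces, punctuation) are dropped.
    """
    table = dict(zip(ALPHABET, key))
    return "".join(table[ch] for ch in plaintext.lower() if ch in table)


def decrypt(ciphertext: str, key: str = KEY) -> str:
    """Invert the substitution by reading the key table backwards."""
    table = dict(zip(key, ALPHABET))
    return "".join(table[ch] for ch in ciphertext.upper() if ch in table)


if __name__ == "__main__":
    cipher = encrypt("genetic algorithm in cryptography")
    print(cipher)
    print(decrypt(cipher))
```

Because the key is a permutation, decryption is simply the inverse lookup; an attacker who does not know the key has to search among the 26! possible permutations.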
III. GENETIC ALGORITHMS
A genetic algorithm is a general method of solving problems for which no obvious, satisfactory solution exists. It is based on the idea of emulating the evolution of a species in nature, so the various components of the algorithm are roughly analogous to aspects of natural evolution [2,3,6,12,13,14]. The process begins by creating a random initial generation of individuals, sometimes called chromosomes, that in some way represent the problem being solved. Pairs of members of the current population are selected and “mated” with each other by means of a crossover operation to produce members for the succeeding generation. Randomly selected members of the current generation also undergo mutation, in which random portions of “genetic” material are exchanged. The fittest members of each generation, as determined by a fitness function, are then selected for the succeeding generation, as shown in Fig.3. The crossover and mutation operations are controlled by the crossover rate and mutation rate parameters, which determine the proportion of the population that undergoes these changes [1].
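The cycle described above (initialization, evaluation, selection, mating, mutation, survival of the fittest) can be sketched as a generic loop. The sketch below is an illustration under stated assumptions, not the algorithm of this paper: the function `evolve`, its parameter defaults, the truncation-style parent selection, and the toy bit-counting problem are all introduced here for the example.

```python
import random


def evolve(fitness, new_individual, crossover, mutate,
           pop_size=20, generations=40, mutation_rate=0.2, seed=0):
    """Generic GA loop: evaluate, select parents, mate, mutate, keep the fittest."""
    rng = random.Random(seed)
    population = [new_individual(rng) for _ in range(pop_size)]
    for _ in range(generations):
        # Selection (simplified assumption): only the fitter half may mate.
        parents = sorted(population, key=fitness, reverse=True)[: pop_size // 2]
        children = []
        while len(children) < pop_size:
            mother, father = rng.sample(parents, 2)
            child = crossover(mother, father, rng)
            if rng.random() < mutation_rate:
                child = mutate(child, rng)
            children.append(child)
        # Survival: the fittest of old and new individuals form the next generation.
        population = sorted(population + children, key=fitness, reverse=True)[:pop_size]
    return population[0]


# Toy problem (not from the paper): maximise the number of 1 bits in a list.
N_BITS = 16


def new_bits(rng):
    return [rng.randint(0, 1) for _ in range(N_BITS)]


def one_point(a, b, rng):
    cut = rng.randrange(1, N_BITS)
    return a[:cut] + b[cut:]


def flip_one(bits, rng):
    i = rng.randrange(N_BITS)
    return bits[:i] + [1 - bits[i]] + bits[i + 1:]


if __name__ == "__main__":
    best = evolve(sum, new_bits, one_point, flip_one)
    print(best, sum(best))
```

Because the survivors are drawn from the union of the old population and the children, the best fitness in the population can never decrease from one generation to the next.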
[Fig.3: flowchart with the stages Initialization, Evaluation, Selection, Mating, Mutation, Termination]
Fig.3 The basic genetic algorithm

IV. PROPOSED ALGORITHM
The following is an outline of the proposed algorithm:
1. Input the cipher text and the relative character frequencies to the algorithm.
2. Initialize the algorithm parameters: maximum number of generations (M).
3. Generate the initial population p(0) of keys randomly (for example 20 keys), each with a length of 26 letters.
4. For 1 to (M) do:
a. Decrypt the cipher text with each of the 20 generated keys.
b. Calculate the suitability of each key from its decrypted text using the fitness formula.
c. Sort the keys in order of increasing fitness value.
d. Keep the fittest 20% (2 pairs) of the current population for the next generation.
e. Use stochastic selection to choose 8 pairs (parents) from the 20 keys.
f. For 1 to 10 pairs do:
i. Apply crossover to get children (10 pairs).
ii. Apply mutation of 0.02% to the new children.
iii. Apply replacement to get a new population (20 keys).
g. Go to 4.
5. Output is the best solution.

V. FITNESS MEASURE
The technique used to compare candidate keys is to compare statistics of the decrypted message with those of the language. Letter frequency is used for attacks against cryptographic ciphers: the frequencies with which letters typically occur in natural languages are known and well documented [15, 16, 17]. Fig.4 shows the expected number of occurrences of each letter in English text of length 10000 characters. A natural choice as measure of fitness of a candidate key k for the cipher would be [3, 13, 18]

$$F(k) = \alpha \sum_{i \in A} \left| K^{u}_{(i)} - D^{u}_{(i)} \right| + \beta \sum_{i,j \in A} \left| K^{b}_{(i,j)} - D^{b}_{(i,j)} \right| + \gamma \sum_{i,j,l \in A} \left| K^{t}_{(i,j,l)} - D^{t}_{(i,j,l)} \right| \qquad (1)$$

Here, A denotes the language alphabet (i.e., for English, [A...Z]), K and D denote the known language statistics and the decrypted message statistics, respectively, and the indices u, b and t denote the unigram, bigram and trigram statistics, respectively. The values of α, β and γ allow different weights to be assigned to each of the three n-gram types [13].
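As an illustration of the fitness measure of Section V, the following Python sketch scores a decrypted text by a weighted sum of absolute differences between reference and observed n-gram frequencies. Several things here are assumptions made for the example rather than details taken from this paper: the function names, the absolute-difference form itself, the use of a short reference text in place of tabulated language statistics K, the default weights, and the omission of the trigram term. Under this form a lower score means a closer match to the language.

```python
from collections import Counter


def ngram_freqs(text: str, n: int) -> dict:
    """Relative frequencies of letter n-grams in text (letters only, case-folded)."""
    letters = [c for c in text.lower() if "a" <= c <= "z"]
    grams = ["".join(letters[i:i + n]) for i in range(len(letters) - n + 1)]
    counts = Counter(grams)
    total = len(grams)
    return {g: c / total for g, c in counts.items()} if total else {}


def distance(known: dict, observed: dict) -> float:
    """Sum of |K - D| over every n-gram that occurs in either table.

    N-grams absent from both tables contribute zero, so this equals the
    sum over the whole alphabet.
    """
    grams = set(known) | set(observed)
    return sum(abs(known.get(g, 0.0) - observed.get(g, 0.0)) for g in grams)


def cost(decrypted: str, reference: str, alpha: float = 1.0, beta: float = 1.0) -> float:
    """Weighted unigram + bigram mismatch (trigram term omitted, i.e. gamma = 0)."""
    unigram = distance(ngram_freqs(reference, 1), ngram_freqs(decrypted, 1))
    bigram = distance(ngram_freqs(reference, 2), ngram_freqs(decrypted, 2))
    return alpha * unigram + beta * bigram


if __name__ == "__main__":
    reference = "the quick brown fox jumps over the lazy dog and then the dog sleeps"
    print(cost(reference, reference))      # identical statistics
    print(cost("zzzz qqqq", reference))    # very different statistics
```

In practice the reference statistics would come from a large corpus of the target language, and a candidate key would be scored by first decrypting the ciphertext with it and then passing the result to `cost`.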
Fig.4 Relative frequency of letters in English text

VI. IMPLEMENTING THE ATTACK FOR MONO-ALPHABETIC SUBSTITUTION CIPHER
The attack is implemented by generating an initial candidate key pool p(0) of even cardinality, consisting of permutations of the set {a, b, c, ..., z}. The first generation is generated randomly using a simple uniform random generator. Thereafter, the cipher text is decrypted using each permutation as a key, which enables a measure of fitness to be assigned to each candidate key by using equation (1). Pairs of candidate keys are then stochastically selected to produce offspring by applying a method of crossover to each pair.

The stochastic selection method chooses pairs from among the candidates with the best fitness. Stochastic Universal Sampling works by making a single spin of the roulette wheel, which provides a starting position and the first selected individual. The selection process then proceeds by advancing all the way around the wheel in equal sized steps, where the step size is determined by the number of individuals to be selected. So if we are selecting n individuals we advance by 1/n x 360 degrees for each selection. Note that this does not mean that every candidate on the wheel will be selected. Some weak individuals will have very thin slices of the wheel and these might be stepped over completely depending on the random starting position, as shown in Fig.5 [12].

Fig.5 Stochastic selection

Crossover is the process of taking two parent solutions and producing from them a child. After the selection process, the population is enriched with better individuals. Crossover is a recombination operator that proceeds in three steps:
i. The reproduction operator selects at random a pair of individual keys for mating.
ii. A cross site is selected at random along the key length.
iii. Finally, the position values are swapped between the two keys following the cross site.
The traditional genetic algorithm uses single point crossover (which is used in this paper), in which the two mating chromosomes are cut once at corresponding points and the sections after the cuts are exchanged. Here, a cross site or crossover point is selected randomly along the length of the mated keys and the letters after the cross site are exchanged. If any of the exchanged letters already appears in the child, its position is left blank; the blanks are then filled with the letters that do not yet appear in the child. The outcome of this procedure is shown in Fig.6 [12, 18, 19].

                        (crossover point after letter 15)
P1                      PKXAMLTIUBESJFG | HCORQYDVWNZ
P2                      XRPUWZANMGOVSCQ | TDFBJEHLKYI
Intermediate child 1    PKXAMLTIUBESJFG | TDFBJEHLKYI
Intermediate child 2    XRPUWZANMGOVSCQ | HCORQYDVWNZ
Intermediate child 11   PKXAMLTIUBESJFG | □D□□□□H□□Y□
Intermediate child 22   XRPUWZANMGOVSCQ | H□□□□YD□□□□
Child 1                 PKXAMLTIUBESJFG | CDNOQRHVWYZ
Child 2                 XRPUWZANMGOVSCQ | HBEFIYDJKLT
Fig.6 Applying crossover between two parents

After crossover, some keys are subjected to mutation. Mutation prevents the algorithm from being trapped in a local minimum. A random binary vector of the same length as the cipher key is generated and then used to produce two offspring keys from two parent candidate keys in the following way. Child 1 is formed by copying the entries (in the same positions) from parent 1 corresponding to unit entries in the random binary vector. The outstanding permutation entries are filled into child 1 in the order in which they occur in parent 2. Similarly, child 2 is formed by copying the entries (in the same positions) from parent 2 corresponding to zero entries in the random binary vector. The outstanding permutation entries are filled into child 2 in the order in which they occur in parent 1. This process is illustrated in Fig.7 [12, 19, 20].

Random binary vector (26 binary digits)   1 1 0 0 0 1 0 1 0 1 0 0 0 0 1 1 1 1 1 0 1 0 0 0 1 0
First parent (chosen randomly)            PRXAWLTIUGEVJFQHDOBCYMSKNZ
Second parent (chosen randomly)           XKPHMZANUBQVSFGTCORJEDLWYI
1st child                                 PRXKMLZIAGUVSFQHDOBTYCJENW
2nd child                                 RXPHMTAGUEQVSFOBCYKJNDLWZI
Fig.7 Applying mutation for two parents

After mutation has taken place the resulting set of candidate keys forms the new population P(1). The crossover and mutation procedures are applied to this new key pool in order to produce the population P(2), and so forth, until some final population P(T) is reached (either after a prespecified number of generations, or when the minimum candidate key fitness exceeds some acceptable threshold).
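The three operators of this section can be sketched together in Python. This is a hedged illustration rather than the authors' MATLAB code: the function names are hypothetical, the wheel is treated as a line of length equal to the total fitness instead of 360 degrees, fitness values are assumed non-negative, and the alphabetical order used to fill blanks after crossover is an assumption read off the example of Fig.6.

```python
import random
import string

LETTERS = string.ascii_uppercase


def sus_select(fitnesses, n, rng=random):
    """Stochastic universal sampling: one spin, then n equally spaced pointers."""
    step = sum(fitnesses) / n        # equal step, i.e. 1/n of the wheel
    start = rng.random() * step      # the single random spin
    chosen, i, cumulative = [], 0, fitnesses[0]
    for k in range(n):
        pointer = start + k * step
        # Advance to the slice containing the pointer; thin slices may be skipped.
        while cumulative < pointer and i < len(fitnesses) - 1:
            i += 1
            cumulative += fitnesses[i]
        chosen.append(i)
    return chosen


def single_point_crossover(p1, p2, cut):
    """Swap the tails after `cut`, blank repeated letters, refill the blanks."""
    def child(head, tail):
        kept = [c if c not in head else None for c in tail]
        # Assumption: blanks are filled with unused letters in alphabetical order.
        missing = iter(c for c in LETTERS if c not in head and c not in kept)
        return head + "".join(c if c is not None else next(missing) for c in kept)
    return child(p1[:cut], p2[cut:]), child(p2[:cut], p1[cut:])


def mask_offspring(p1, p2, mask):
    """Binary-vector operator of Fig.7 (called mutation in the text)."""
    def child(keep_from, fill_from, bit):
        # Positions where the mask equals `bit` are copied from keep_from.
        fixed = {i: keep_from[i] for i, m in enumerate(mask) if m == bit}
        used = set(fixed.values())
        # Remaining letters enter in the order they occur in the other parent.
        rest = iter(c for c in fill_from if c not in used)
        return "".join(fixed[i] if i in fixed else next(rest) for i in range(len(mask)))
    return child(p1, p2, 1), child(p2, p1, 0)


if __name__ == "__main__":
    print(sus_select([4.0, 1.0, 3.0, 2.0], 2, random.Random(0)))
    print(single_point_crossover("PKXAMLTIUBESJFGHCORQYDVWNZ",
                                 "XRPUWZANMGOVSCQTDFBJEHLKYI", 15))
```

Both key operators always return valid permutations, which is the reason for the repair step after the single point cut: a plain tail swap would generally duplicate some letters and lose others.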
VII. RESULTS
The attack on a mono-alphabetic substitution cipher was implemented for different population sizes. Fig.8 shows the relation between the fitness and the number of generations for population sizes of 20, 40, 60, and 80. It is clear from Fig.8 that the best fitness is reached after 300-400 generations. Table 1 shows the fitness values for a population of 20 keys and the number of correct letters obtained. It is clear that the best solution is key 4, where the number of correct letters is 17. In the printed Table 1, bold lettering marks those letters in the population that appeared to be correct. The true key, which is RNKIYUJEFCSZGOATMLPDWHVQBX, is not in the final pool; this is due to the limited length of the ciphertext. Fig.9 shows the time required (elapsed) to finish the algorithm for different population sizes. The time required increases as the population size increases.

[Fig.8: four panels, for population size = 20, 40, 60 and 80, each plotting fitness% against number of generations (100 to 1000)]
Fig.8 Fitness value for different values of population

Table 1: The best solution for fitness value and correct no. of letters when the population is 20 and after 400 generations
No  Population                  Fitness  Correct letter
1   RNCIYFJEBOSZDGAPMKUXLHVQTX  95       14
2   RNKIYUHGTMJFVDAECLPZWOQSBX  89       11
3   RNKIYBUELQHZWOASXFPDGTVCJM  94       11
4   RNKIYJUELQHZGOASMFPDWTVCBX  94       17
5   OSTIYBUELQKZWRAHXPFDGNVCJM  85       7
6   RNTIYBUELQKZWOAHXPFDGSVCJM  94       10
7   OHTIYBUELQNZWRASMPFDKGVCJX  94       9
8   RHTIYBKELQAZWOGSXPFDVNUCJM  89       7
9   RNTIYBKELQAZWOGSXPFDVHUCJM  89       9
10  RNCIYFJEBOSZDGAPMKUXLHVQTW  90       13
11  RNCIYBJEDGSZFKAOMLPTUHVQXW  91       13
12  RNJVYULSFQKZICMTHEPDWOAGBX  88       11
13  RNBIYUHGTMJFVDAECLPZWOQSKX  89       10
14  RSJGYUKDLNXIWEAHMPFOVTZQBC  87       7
15  RNSVYUJOLCGIWAEHMKPDZQTFBX  89       10
16  AGKMWFJDLSCIUEYTNOPRZHVQBX  88       9
17  RNSVYUEDPCJIWKLHMTFOZQGABX  88       8
18  RNTIYJKELQAZWOGSXPFDVHUCBM  89       10
19  ONTIYJKELQAZWRGSXPFDVHUCBM  89       8
20  RNTIYBKELQAZGOWSXPFDVHUCJM  89       10

[Fig.9: four panels, for population size = 20, 40, 60 and 80, each plotting elapsed time (sec) against number of generations (100 to 1000)]
Fig.9 Elapsed time for different no. of population

VIII. CONCLUSION
In this paper a genetic algorithm attack on a simple cryptographic cipher, the mono-alphabetic substitution cipher, was implemented successfully. The algorithm was implemented in MATLAB. Different parameters were tested, such as the population size and the time required to finish the algorithm for different numbers of generations. It is apparent from the results that increasing the population size above 20 was not helpful in retrieving the original key. This is evident in Fig.8, where the highest fitness was attained after 400 generations regardless of the population size. This can be explained by the great number of possible keys, i.e. 26!, which makes any population small by comparison. Using a GA to attack a mono-alphabetic substitution cipher proved to be an efficient method of cryptanalysis based on comparing the frequency of letter occurrence with that of the model text.

REFERENCES
[1] M. Ralph & W. Ralph, A Word-Based Genetic Algorithm for Cryptanalysis of Short Cryptograms, American Association for Artificial Intelligence (www.aaai.org), pp.229-233, 2003.
[2] D. Bethany, Genetic Algorithm in Cryptography, MSc. Thesis, Computer Engineering, Rochester Institute of Technology, Rochester, New York, July 2004.
[3] W. R. Grundlingh & J. H. Van Vuuren, Using Genetic Algorithms to Break a Simple Cryptographic Cipher, retrieved March 31, 2003 from http://dip.sun.ac.za/`vuuren/abstract/genetic.htm, submitted 2002.
[4] O. David, Evolutionary Algorithm for Decryption of Monoalphabetic Homophonic Substitution Ciphers Encoded as Constraint Satisfaction Problems, July 12-16, Atlanta, Georgia, USA, 2008.
[5] R. Spillman, M. Janssen, B. Nelson, & M. Kepner, Use of a Genetic Algorithm in the Cryptanalysis of Simple Substitution Ciphers, Cryptologia 17(1), pp.31-44, January 1993.
[6] A. J. Clark, Optimization Heuristics for Cryptology, PhD Thesis, Queensland University of Technology, February 1998.
[7] L. C. Washington, Introduction to Cryptography with Coding Theory, Pearson Education, Inc., 2nd edition, 2006.
[8] A. K. Verma, Mayank Dave, and R. C. Joshi, Genetic Algorithm and Tabu Search Attack on the Mono-Alphabetic Substitution Cipher in Adhoc Networks, Journal of Computer Science 3 (3), pp.134-137, 2007.
[9] G. J. Simmons, Contemporary Cryptology, The Science of Information Integrity, The Institute of Electrical and Electronics Engineers, Inc., New York, 1991.
[10] D. Kahn, The Codebreakers, The New American Library, Inc., USA, 1973.
[11] W. Stallings, Cryptography and Network Security, Principles and Practices, Pearson Education, Inc., 4th edition, 2005.
[12] S. N. Sivanandam, S. N. Deepa, Introduction to Genetic Algorithms, Springer-Verlag Berlin Heidelberg, 2008.
[13] T. Ragheb & A. Subbanagounder, Applying Genetic Algorithms for Searching Key-Space of Poly-alphabetic Substitution Ciphers, The International Arab Journal of Information Technology, Vol. 5, No. 1, pp.87-91, January 2008.
[14] R. L. Haupt & Sue Ellen Haupt, Practical Genetic Algorithms, John Wiley & Sons, Inc., 2nd Edition, New York, 2004.
[15] L. D. Callimahos & W. F. Friedman, Military Cryptanalytics, Part II, National Security Agency, Washington DC, 1956.
[16] L. D. Callimahos & W. F. Friedman, Military Cryptanalytics, Part I, National Security Agency, Washington DC, 1956.
[17] B. Schneier, Applied Cryptography: Protocols, Algorithms and Source Code in C, John Wiley & Sons, Inc., New York, 1994.
[18] E. Pakize, O. Ali, T. Salih, Continuous Optimization Problem Solution With Simulated Annealing and Genetic Algorithm, 5th International Advanced Technologies Symposium (IATS'09), May 13-15, Karabuk, Turkey, 2009.
[19] S. Tang, K. F. Man, S. Kwong and Q. He, Genetic Algorithm and their Applications, IEEE Signal Processing Magazine, November, pp.22-37, 1996.
[20] D. F. Buthainah & A. A. Hamza, Enhanced Traveling Salesman Problem Solving by Genetic Algorithm Technique (TSPGA), World Academy of Science, Engineering and Technology 38, pp.296-302, 2008.