On the Cryptographic Patterns and Frequencies in Turkish Language Mehmet Emin Dalkılıç1 and Gökhan Dalkılıç2 Ege University International Computer Institute, 35100 Bornova, øzmir, Turkey
[email protected] http://www.ube.ege.edu.tr/~dalkilic 2 Dokuz Eylül University, Computer Eng. Dept., 35100 Bornova, øzmir, Turkey
[email protected] http://bornova.ege.edu.tr/~gokhand
1
Abstract. Although Turkish is a significant language with over 60 million native speakers, its cryptographic characteristics are relatively unknown. In this paper, some language patterns and frequencies of Turkish (such as letter frequency profile, letter contact patterns, most frequent digrams, trigrams and words, common word beginnings and endings, vowel/consonant patterns, etc.) relevant to information security, cryptography and plaintext recognition applications are presented and discussed. The data is collected from a large Turkish corpus and the usage of the data is illustrated through cryptanalysis of a monoalphabetic substitution cipher. A new vowel identification method is developed using a distinct pattern of Turkish—(almost) non-existence of double consonants at word boundaries.
1
Introduction
Securing information transmission in data communication over public channels is achieved mainly by cryptographic means. Many techniques of cryptanalysis use frequency and pattern data of the source language. The cryptographic pattern and frequency data are usually obtained by compiling statistics from a variety of source language text such as novels, magazines and newspapers. Such data is available for many languages. A good source is an Internet site “Classical Cryptography Course by Lanaki” (http://www.fortunecity.com/skyscraper/coding/379/lesson5.htm) which includes data for English, German, Chinese, Latin, Arabic, Russian, Hungarian, etc. No such data online or otherwise can be located for Turkish which is a major language used by a large number of people. Turkish, a Ural-Altaic language, despite its being one of the major languages of the world [1], it is one of the “lesser studied languages” [2]. Even less studied are the information theoretic parameters (e.g., entropy, redundancy, index of coincidence) and cryptographic characteristics (patterns and frequencies relevant to cryptography) of the Turkish language. Earliest work (that we are aware of) on the information theoretic aspects of the Turkish is presented by Atlı [3]. Although Atlı calculated the digram entropy and gave word length and consonant/vowel probabilities, he could not go T. Yakhno (Ed.): ADVIS 2002, LNCS 2457, pp. 144−153, 2002. Springer-Verlag Berlin Heidelberg 2002
On the Cryptographic Patterns and Frequencies in Turkish Language
145
beyond the digram entropy due to insufficient computing resources. In a more recent time, Koltuksuz [4] addressed the issue of cryptanalytic parameters of Turkish where he extracted n-gram entropy, redundancy and the index of coincidence values up to n = 5. Adopting Shannon’s entropy estimation approach [5], present authors empirically determined the language entropy upper-bound of Turkish as 1.34 bpc (bits per character) with a corresponding redundancy of roughly 70% [6]. Data presented in this paper is obtained from a large Turkish text corpus of size 11.5 Megabytes. The corpus --contains files that are filtered so that they consist solely of the 29 letters of the Turkish alphabet and the space, is the union of three corpora; the first one is compiled from the daily newspaper Hürriyet by Dalkılıç [7], the second one contains 24 novel samples of 22 different authors by Koltuksuz [4], and the last one consists of mostly news articles and novels collected by Diri [8]. This paper presents a variety of Turkish language patterns and frequencies (e.g., single letter frequency profile, letter contact patterns, common digrams, trigrams and frequent words, word beginnings, vowel/consonant patterns, etc.) relevant to information security, cryptography and plaintext recognition. Presented data is put to use to solve the following mono-alphabetic substitution cipher problem found in [9]. DLKLEöPFÜ FLMLTU FLLö CBÖL ÖIBJL öÜNLEPö APEYPKÜBÜB SIBPJP MLYLB SÇKZPA DPBNPEPFÜBP SÜö. CEöLÖLYÜ öPZPFYCMH PZZÜF LÖLFUBL OPøÜE. ALøV JPZYPBZÜ öPYBPJÜ MHZ. FCB ÜDHNH öPYBPBÜB MLJELùUBÖL. Throughout the cryptanalysis example uppercase letters for ciphertext and lowercase for plaintext (typeset in Courier font) are used to improve readability.
2
Letter Frequencies
Turkish alphabet contains eight vowels {A, E, I, ø, O, Ö, U, Ü} and twenty-one consonants {B, C, Ç, D, F, G, ö, H, J, K, L, M, N, P, R, S, ù, T, V, Y, Z} totaling to 29 letters. In lower case, vowels {a, e, ı, i, o, ö, u, ü} and consonants {b, c, ç, d, f, g, ÷, h, j, k, l, m, n, p, r, s, ú, t, v, y, z} are written as shown. {I, ı} and {ø, i} being two different letters may be confusing to many readers who are not native (Turkish) speakers. Table 1 shows the individual Turkish letter probabilities if space is suppressed in the text. That table also introduces the frequency ordering of Turkish as AEøNRLIKDMYUTSBOÜùZGÇHöVCÖPFJ. Table 1. Normal Turkish letter frequencies (%) in decreasing order
A E ø N R L
Æ Æ Æ Æ Æ Æ
11.82 9.00 8.34 7.29 6.98 6.07
I K D M Y U
Æ Æ Æ Æ Æ Æ
5.12 4.70 4.63 3.71 3.42 3.29
T S B O Ü ù
Æ Æ Æ Æ Æ Æ
3.27 3.03 2.76 2.47 1.97 1.83
Z G Ç H ö V
Æ Æ Æ Æ Æ Æ
1.51 1.32 1.19 1.11 1.07 1.00
C Ö P F J
Æ Æ Æ Æ Æ
0.97 0.86 0.84 0.43 0.03
146
M.E. Dalkiliç and G. Dalkiliç
2.1 Letter Groupings Unlike letter frequencies and their order which fluctuate considerably, group frequencies are fairly constant in all languages [10]. The following are some useful Turkish letter groups where each group is arranged in decreasing frequency order. • • • • • • •
Vowels {A, E, ø, I, U, O, Ü, Ö} 42.9% High freq. consonants {N, R, L, K, D} 29.7% Medium freq. consonants {M, Y, T, S, B} 16.2% Low freq. consonants {ù, Z, G, Ç, H, ö, V, C, P, F, J } 11.3% High freq. vowels {A, E, ø, I} 34.3% Highest freq. letters {A, E, ø, N, R} 43.4% High freq. letters {A, E, ø, N, R, L, I, K, D} 63.8%
2.2 Cryptanalysis Using Letter Patterns Monoalphabetic substitution ciphers replace each occurrence of a plaintext letter (say a) with a ciphertext letter (say H). Table 2 shows the frequency counts for the example ciphertext given in Section 1. Clearly, an exact match between these counts and the normal plaintext letter frequencies of Table 1 is not expected. Nevertheless, it is very likely that highest frequency ciphertext letters {P,L,B,Ü} are substitutes for letters from the highest frequency normal letters set which is {a,e,i,n,r}. It is expected that the low frequency ciphertext symbols {Ç,O,T,V,ù,G,R} resolve to the letters from the low frequency plaintext letters set of { ú,z,g,ç,h,÷,v,c, p,f,j}. Table 2. Letter frequencies in the example ciphertext
P L B Ü ö F
3
Æ Æ Æ Æ Æ Æ
21 20 16 13 9 8
E Y Z J M Ö
Æ Æ Æ Æ Æ Æ
7 7 7 5 5 5
C H N A K S
Æ Æ Æ Æ Æ Æ
4 4 3 3 3 3
U D I ø Ç O
Æ Æ Æ Æ Æ Æ
3 3 2 2 1 1
T V ù G R
Æ Æ Æ Æ Æ
1 1 1 0 0
Letter Contact Patterns
Letter contact data (transition probabilities) is an important characteristic of any language because contacts define letters through their relations with one another. For instance, in Turkish vowels avoid contact, doubles are rare. These behaviors can be observed from Table 3 which holds the normal contact percentage data for Turkish. Table 3 is modeled after a similar table for English by F. B. Carter reproduced in [10]. Taking any one letter, say A: On the left, it was contacted 16% of the time by L, 10% by D, etc., and 99.43% of its total contacts on that side were to consonants. On
V
00.57 73.18 68.05 77.36 35.58 00.12 88.82 16.57 99.86 85.99 00.05 00.78 80.88 79.49 57.54 63.72 97.89 00.69 00.09 90.62 93.20 72.68 94.07 49.41 00.18 00.02 88.31 94.24 96.32
C
99.43 26.82 31.95 22.64 64.42 99.88 11.18 83.43 00.14 14.01 99.95 99.22 19.12 20.51 42.46 36.28 02.11 99.31 99.91 09.38 06.80 27.32 05.93 50.59 99.82 99.98 11.69 05.76 03.68
C3 H4 N5 S5 T5 B6 R8 M8 Y9 K9 D10 L16 L3 M3 N4 R7 E8 ø18 A39 ɖ 3 U3 I4 R4 ø4 O11 A21 E21 N22 K4 U5 R5 N5 ɖ 7 E13 A17 ø33 M3 Y3 ø4 L10 E12 A16 R18 N26 C4 K4 B4 S5 G5 V5 T6 R6 Y6 N7 M8 D15 L16 U3 ɖ 4 I4 O4 ø15 E23 A35 A3 ø3 E3 V6 U6 Z8 Y10 L17 R19 N20 ɖ 3 ɐ3 O10 U11 ø15 I16 E18 A24 R3 U5 E9 ø11 A59 Ɂ 3 ù4 Y5 ö6 K6 M6 T8 S9 L10 D11 N12 R12 Ɂ 3 G4 ö4 T6 M6 S6 K7 N8 L9 R10 D11 B14 ɖ 4 ø6 R15 E16 O26 A28 ù3 L3 ɖ 4 U4 O7 R7 ø10 I11 E18 A25 T3 ù3 L3 U4 Y5 R5 I6 K6 N7 O8 E10 ø12 A14 ù3 ɖ 3 T4 N4 U7 R8 I8 L9 E13 ø14 A17 O5 ɖ 6 U7 I17 E17 ø20 A23 B4 T5 D8 Ɂ 8 S13 K14 Y33 Ɂ 3 Y4 B9 K10 D12 S16 G41 ɖ 3 R3 U6 E10 O10 I11 ø12 A37 ɐ3 ɖ 4 U6 I6 O10 ø16 E22 A28 M3K4 ɖ 4 N5 I5 U5 R7 ø15 E18 A23 O3 R4 ɖ 8 E8 U9 I18 ø23 A26 L3 U4 T5 ø6 R8 ù9 S10 K10 A15 E18 C3 Y4 ö4 T6 S6 M7 K7 L8 R9 N10 D13 B14 Ɂ 3 ù3 Z3 S5 B5 K6 M6 N6 L6 R7 Y8 G11 D13 T13 ɖ 3 ø4 U5 A28 E41 O4 ɖ 5 ɐ6 U8 I10 E15 ø22 A25 E7 ɐ9 U10 ɖ 11 I15 A21 ø21
A B C Ɂ D E F G ö H I ø J K L M N O ɐ P R S ù T U ɖ V Y Z
R19 N16 L8 K8 Y6 M5 D5 S4 ù4 T4 Z3 H3 ö3 ø38 A22 U16 E13 ɖ 4 O3 A33 E31 U11 ø10 I9 ø22 I15 A15 E15 O14 ɖ 5 L4 M3 T3 E27 A24 ø19 I12 U9 ɖ 5 O3 R22 N17 L9 K8 T7 M6 D5 Y5 S5 V3 ö3 C3 A31 E20 ø12 I11 T6 L4 R3 O3 U3 E31 ø22 ɐ18 ɖ 15 A6 U3 I3 ø28 I27 U12 R8 A7 L7 E4 ɖ 3 A42 E18 ø12 U4 I3 ɖ 3 T3 O3 N31 R11 L10 K9 ù8 M7 Y7 Z5 ö4 S3 N22 R17 L11 Y8 M7 ù6 K5 Z5 S4 Ɂ 4 T3 ø36 A19 E17 L7 U6 I6 D5 O3 A26 ø14 L10 E10 I8 T7 O6 U6 ɖ 3 A29 E24 ø11 I8 D5 M5 U4 L3 A29 E22 ø15 I11 U7 L4 ɖ 4 D3 D17 E13 ø12 I12 A10 L9 U6 C4 M3 R28 L21 N16 K9 ö4 C4 Y4 R20 N19 Y17 Z15 L7 ö4 T4 S3 A31 E12 I11 L10 T6 O6 R5 ø5 M4 S3 A16 ø14 I12 D11 E9 L6 U6 M5 K4 T4 S3 A18 ø16 I16 E13 T9 O7 U7 ɐ3 ɖ 3 T16 I14 A13 L11 ø11 E9 M7 K6 U4 ɖ 4 A19 E16 ø15 I14 ɖ 8 U6 L5 T4 M4 O3 N20 R15 L10 M9 Y7 K6 ù5 Z5 ö5 T4 S4 N23 R15 Z9 K8 L7 ù7 Y6 M6 S5 E43 A24 ø8 R5 U5 L5 A29 O17 E15 L8 I7 ø5 ɖ 4 U3 D3 A19 E18 L14 ø11 I11 D7 ɖ 6 U4 M3
Table 3. Normal (expected) contact percentages of Turkish digrams
V 00.81 98.53 96.81 88.22 98.70 00.16 80.83 98.37 81.57 84.98 00.11 00.47 86.19 74.87 79.93 87.83 57.15 00.25 00.02 67.27 60.19 83.26 56.60 82.03 00.62 00.12 82.06 81.82 69.70
C 99.19 01.47 03.19 11.78 01.30 99.84 19.17 01.63 18.49 15.02 98.89 99.53 13.81 25.13 20.07 12.17 42.85 99.75 99.98 32.73 39.81 16.74 43.40 17.97 99.38 99.88 17.94 18.18 30.30
On the Cryptographic Patterns and Frequencies in Turkish Language 147
148
M.E. Dalkiliç and G. Dalkiliç
the right, it was contacted 19% of the time by R, and 0.81% of the time by vowels. Note that the table only contains contacts with a frequency of 3% or more. The most marked characteristics of Turkish letter contacts are vowels do not contact vowels on either side and doubles are rare. Since no consonant avoids vowel contact, vowels and consonants are very much distinguishable. Only significant doubles are TT and LL for consonants and AA for vowels. Doubles for vowels are so rare that even AA did not make into the Table 3. 3.1 Most Frequent Digrams and Trigrams In Turkish, among 292 digrams about one third and among the 293 trigrams about one tenth constitute the 96% of the total usage. In Table 4, none of the first 100 digrams contains doubles. Fifty of them are in the form consonant-vowel (CV), 42 are in the form vowel-consonant (VC) and only 8 are in the form (CC). Usage of the first ten digrams in Table 4 sums up to 16.9%, first 50 to 47.9% and first 100 to 69.5%. Table 4. The 100 Most Frequent Digrams in Turkish (Frequencies in 100,000) AR LA AN ER øN LE DE EN IN DA øR Bø KA YA MA
2273 2013 1891 1822 1674 1640 1475 1408 1377 1311 1282 1253 1155 1135 1044
Dø ND RA AL AK øL Rø ME Lø OR NE RI BA Nø EL
1021 980 976 974 967 870 860 785 782 782 738 733 718 716 710
NI AY YO EK RD TA AM DI SA øY Kø UN NA AD YE
703 698 686 683 681 670 638 637 624 619 618 606 602 592 588
OL Sø LI RE SI Mø TE ET øM Tø HA AS BU VE IR
586 578 576 566 565 564 562 560 541 537 528 527 516 508 503
Aù NL TI EM ÜN DU GE AT SE ED UR ON KL IL øù
500 496 494 494 492 487 480 479 457 452 452 452 447 438 434
BE KE EY ES IK RL MI øK CA LD CE NU Iù øZ LM
433 424 421 411 407 393 392 379 379 362 361 359 355 353 353
KI RU öø AZ øS Gø öI AH YL ÜR
350 349 347 343 343 342 340 338 324 319
Observe that Turkish letter contact data (Table 3) is not symmetric. For instance BU is in the table but its reverse UB is not. High frequency digrams (Table 4) with rare reverses (i.e., reverses are not in Table 3) e.g. ND, OR, RD, DI, OL, BU, NL, TI, ÜN, DU, GE, ON, RL, LD, YL and high frequency digrams with reverses which are also high frequency digrams (i.e., both the digram and its reverse are in Table 4) e.g., AR, LA, AN, ER, øN, LE, DE, EN, IN, DA, øR, KA, YA, MA, RA, AL, AK, øL, Rø, ME, Lø, NE, Rø, EL, AY, EK, TA, AM, SA, UN, NA, AD, YE, Sø, LI, RE, Mø, TE, øM, HA, AS, IR, SE, ED, UR, IL are useful for distinguishing letters from each other. Table 5 shows that none of the most common Turkish trigrams contain two vowels or three consonants in sequence; that is, no VVV, VVC, CVV, or CCC patterns. Most common trigram patterns in Table 5 are VCV (46 times), followed by CVC (35 times). CCV and VCC are seen 11 and 8 times, respectively. The first 10, 50 and 100 trigrams represent, respectively, 7.2%, 19.0%, and 28.7% of total usage.
On the Cryptographic Patterns and Frequencies in Turkish Language
149
Table 5. The 100 Most Frequent Trigrams in Turkish (Frequencies in 100,000) LAR BøR LER ERø ARI YOR ARA NDA øNø INI ASI DEN NDE RøN øLE
1237 952 949 764 757 643 521 482 432 428 387 383 383 372 367
ANI AMA RIN NLA DAN IND EDø ADA AYA KAR ALA LAN ENø SIN øND
362 357 345 338 338 336 326 321 316 299 298 296 294 294 291
ESø NIN YLE ADI øYO ELE øNE SøN ANL KLA ERE ALI ELø øYE BøL
283 280 277 273 271 271 266 265 263 262 262 258 256 255 246
øLø BAù ARD NøN RDU MIù OLA IöI Eöø EME INA ANA KEN øÇø IYO
245 243 242 239 231 229 227 226 223 223 222 220 218 217 217
RLA Møù YAN ECE AYI LMA øöø EDE TAN NDø KAL ONU UNU END ÇøN
216 213 212 209 207 207 207 206 205 204 204 201 200 199 198
AöI ORD GEL MAN ACA ÖYL KAD ERD ORU RAK DøY KLE VER EMø GÖR
194 194 194 192 192 191 187 183 178 177 172 171 170 169 169
RDI SON ILA BEN CAK øRø EYE AùI ÇIK KAN
169 168 167 166 165 163 163 162 160 159
3.2 Cryptanalysis Using Letter Contact Patterns Vowels usually distinguish themselves from consonants by not contacting each other. Table 6 shows the letter contact frequencies for the six most frequent ciphertext letters {P,L,B,Ü,ö,F}. Highest frequency letters {P,L} do not contact each other and almost certainly are vowels. Letters {B,ö} both contact {P,L}, and thus they are consonants. Letter Ü does not contact either of the {P,L} and it is likely to be a vowel. Letter F must be a consonant because it contacts two (presumable) vowels {L,Ü}. When the table is extended to include all ciphertext letters, it is determined that {P,L,Ü,C,H,I,Ç} are very likely to be vowels. Table 6. Letter frequencies in the example ciphertext
P L B Ü ö F
P --4 -4 --
L -1 1 -1 2
B 3 1 -4 ---
Ü --2 -1 2
ö 1 1 -1 ---
F 3 1 -1 ---
Now, let us focus on the two doubles LL in FLLö and ZZ in PZZÜF. These doubles may be associated to the only significant doubles of Turkish {tt,ll,aa} i.e., {L,Z} Æ {t,l,a}. Due to frequency and vowel analysis, L is a strong candidate for a, we may temporarily1 bind LÆa. This makes letter P the strongest candidate for letter e i.e., PÆe. Letter Ü is a vowel and it is either one of {ı,i }. BÜB is a repeated trigram with the ‘121’ pattern and in Table 5, the only ‘121’ patterned trigrams
1
Cryptanalysis is mostly a trial and error process and all bindings are temporary.
150
M.E. Dalkiliç and G. Dalkiliç
with starting and ending with a consonant (i.e., CVC) are NIN and NøN. Thus, we may bind letter BÆn and stop here to continue later.
4 Word Patterns Primary vowel harmony rule is that all vowels of a Turkish word are either back vowels {A,I,U,O} or front vowels {E,ø,Ü,Ö}. Secondary vowel harmony rule states that (i) when the first vowel is flat {A,E,I,ø}, the following vowels are also flat e.g., BAKIRCI, øSTEK, (ii) when the first vowel is rounded {U,O,Ü,Ö}, the subsequent vowels are either high and rounded {U,Ü} or low and flat {A,E}, and (iii) low and rounded vowels {O,Ö} can only be in the first syllable of a word. Last Phoneme Rule is that Turkish words do not end in the consonants {B,C,D,G}. Each of these rules has exceptions. Using a root word lexicon, Gungor [11] determined that only 58.8% of the words obey the primary vowel harmony rule. The secondary vowel harmony rule is obeyed by 72.2%. The most obeyed rule is the last phoneme rule with 99.3%. The average word length of Turkish is 6.1 letters about 30% more than that of English. Words with 3 to 8 letters represent over 60% of total usage in Turkish text. 4.1 Common Words, Beginnings, and Endings When word boundaries are not suppressed in ciphertext, frequent word beginnings, endings and common words provide a wealth of information. The most frequent 100 Turkish words and the most common 50 n-grams for each of the other categories together with their percentage in total usage are given below. Common words BøR VE BU DE DA NE O GøBø øÇøN ÇOK SONRA DAHA Kø KADAR BEN HER DøYE DEDø AMA HøÇ YA øLE EN VAR TÜRKøYE Mø øKø DEöøL GÜN BÜYÜK BÖYLE NøN MI IN ZAMAN øN øÇøNDE OLAN BøLE OLARAK ùøMDø KENDø BÜTÜN YOK NASIL ùEY SEN BAùKA ONUN BANA ÖNCE NIN øYø ONU DOöRU BENøM ÖYLE BENø HEM HEMEN YENø FAKAT BøZøM KÜÇÜK ARTIK øLK OLDUöUNU ùU KADIN KARùI TÜRK OLDUöU øùTE ÇOCUK SON BøZ VARDI OLDU AYNI ADAM ANCAK OLUR ONA BøRAZ TEK BEY ESKø YIL BUNU TAM øNSAN GÖRE UZUN øSE GÜZEL YøNE KIZ BøRø ÇÜNKÜ GECE (23%) Note that any n-gram enclosed within a pair of punctuation mark(s) and space(s) is counted as a word. For instance “BAKAN’IN” (minister’s) is taken as two words “BAKAN” and “IN”. The only one-letter word in Turkish is O(it/he/she). The list contains many non-content words such as BøR (one), VE (and), BU (this), DE DA (too/also), NE (what), Kø (that/who/which), AMA (but), øLE (with) and fewer conceptual words such as TÜRKøYE (Turkey), ZAMAN (time), øNSAN (human), GÜZEL (beautiful).
On the Cryptographic Patterns and Frequencies in Turkish Language
151
Digram word endings. AN EN øN AR IN DA ER DE Dø AK LE Nø NA NE NI øM DI Rø RI OR EK YE RA DU UN YA Kø øR LA IM Lø Sø IK IR LI ET TI Tø CE SI UM Iù RE öI øZ øK øù MA IZ Bø (73.1%) Trigram word endings. LAR DAN LER DEN YOR ARI INI NDA øNø ERø øNE INA NDE NIN NøN RDU YLE MIù AYA ASI Møù RAK IöI RIN CAK ESø RDI ARA øYE NRA MAK MEK TAN øöø DAR RøN EYE MAN LIK RUM UNU ADA RDø ADI KEN DIR TEN DøR LøK YLA (41.6%) Digram word beginning . Bø KA YA DE BA BU GE VE OL DA HA SA BE GÖ SO KO TA Gø SE NE HE AL GÜ YE AN Dø øÇ KE Kø AR TE ÇO DÜ KU øN VA øS ME KI DO PA ON øL ÇA DU YO MA TÜ ÇI Mø (67.3%) Trigram word beginnings. BøR BAù øÇø KAR GEL GÖR SON BEN OLA KAD YAP BøL KAL VER KEN ÇIK DEö VAR GÜN YAN GøB øST BAK DøY TÜR HER ARA OLM DED ÇOK DÜù DAH BUN GER OLD YER KON GEÇ PAR DUR KUR BøZ ANL ÇOC YAR YIL BUL SEN OLU YOK (30.6%) A careful observation of Turkish word endings and beginnings given above reveal a distinct feature of Turkish; the first two and the last two letters of a word contain a vowel. In other words, (almost) no Turkish word starts or ends with a consonantconsonant (CC) or vowel-vowel (VV) pattern. Very few words (about 2%), mostly foreign origin e.g. TREN (train), KREDø (credit), RøNG (ring) do not obey this rule. A vowel identification method is developed for Turkish using this “no CC or VV patterns at word boundaries rule” and presented in the next subsection. 4.2 A New Vowel Identification Method for Turkish When spacing is not suppressed in a ciphertext for a mono-alphabetic cipher the following technique can be employed to distinguish vowels from consonants. First, make a list of digram word beginnings and endings. Let us call it the PairList. Then pick a pair containing a high frequency (in the ciphertext) letter. Remove the pair from the PairList. Then, create two empty lists List1 and List2 and put one letter of the pair to List1 and the other to the List2. Next, repeat the following steps until all elements in List1 and List2 are marked (processed), (i) pick and mark first unmarked element (say X) in List1 or List2. In the PairList find each pair in the form XY or YX and put Y to the list that X does not belong to and remove that pair from the PairList. At the end of the process, remove duplicates from both lists and smaller list will be (very likely) vowels and the other list will contain consonants. For those few words that do not obey the “no CC or VV patterns at word boundaries” rule may cause a letter to end up in both lists. If that is the case, get two counts: separately count the number of times the letter contacts to the members of List1 and List2. Since a contact between the members of a list indicates either CC or VV pattern, if one count is dominant remove the letter from that list. For example if letter X is seen many times with the elements of List1 and few times with the elements of List2, remove it from List1, and keep it in List2. If no count is dominant, remove it from both lists.
152
M.E. Dalkiliç and G. Dalkiliç
At the end, the PairList may contain pairs whose letters not placed in either list because they do not make contact (at word beginnings or endings) to any other letter in lists (List1 and List2). In such situations, again count number of contacts to each list’s elements for both letters of the left behind pairs, this time using all contacts, not only the digrams at word beginning and endings. Then, using these counts determine whether they fit in the vowel list or the consonant list. Let us illustrate this method for our ongoing example: PairList = {DL,FL,CB, ÖI,öÜ,AP,SI,ML,SÇ,DP,SÜ,CE,öP,PZ,LÖ,OP,AL,JP,öP,MH,FC,ÜD, FÜ,TU,Lö,ÖL,JL,Pö,ÜB,JP,LB,PA,BP,Üö,YÜ,ÜF,BL,ÜE,øV,ZÜ,JÜ, HZ,CB,NH}. First, pick the DL pair and create List1 ={D}, List2 ={L}. Next, pick D from List1 and process pairs containing D i.e., DP and ÜD resulting in List1={D*}, List2 ={L,P,Ü} where D* means D has been processed. Then, pick L from List2 and process pairs containing L e.g., FL,ML,AL,… producing List1={D*,F,M,A,Ö,ö, J,B}, List2={L,P,Ü}, and continue until all letters in both lists are processed. Final lists are formed as List1={D*,F,M,A,Ö,ö,J,O,Z,B,E,Y,S}, List2={L,P,Ü, C,I,H,Ç}. Since List2 is shorter, it contains vowels. Pairs {øV} and {TU} are left in the Pairlist. In the ciphertext, ø occurs twice and contacts P, Ü, L and they are all vowels. Thus, ø must be a consonant and V must be a vowel. Similar analysis adds T and U to the consonant and vowel lists respectively.
4.3 Cryptanalysis Using Word Patterns We temporarily marked {L,P,Ü,C,I,H,Ç,U,V} as vowels and bound LÆa, PÆe, BÆn, and ÜÆ{ı,i}. The primary vowel harmony rule gives us a way to distinguish between ı and i: If ÜÆı association is correct, Ü will coexist in many ciphertext words with LÆa, otherwise ÜÆi is true and Ü will be seen together with PÆe in many ciphertext words. Since {PÆe,ÜÆi} seen together in eight words while {LÆa,ÜÆı} in only three words, the likely option is ÜÆi. Let us concentrate on the next two highest frequency letters ö and F which appear in a rare pattern öLLF Æ ?aa?. There are not too many four letter words with the unusual pattern ?aa?. There are only 7 matching words; faal, maaú, naaú, saat, vaat, vaaz, zaaf. Except the word saat each candidate contains a letter from the low frequency consonants group {f,ú,v,z}. Thus, it is likely that FÆs, and öÆt. At this point there are many openings to explore. -a-a-tesi sa-a-- saat –n-a –-n-a ti-a-et -e—-e-inin -ne-e –a-an ----e- -en-e-esine –it. –-ta-a-i te-es---e—-is a-as-na –e-i-. –a-- -e-—-n-i te-ne-i ---. s-n i---- te-nenin –a—-a—-n-a. The partial words -a-a-tesi, –it, s-n, te-nenin can easily be identified as Pazartesi(Monday), git (go), son (last), and teknenin (yatch’s). Furthermore, since we have already identified t, and a, only remaining candidate for l is Z. Putting all this together, we have the partial decryption: pazartesi sa-a— saat on-a –-n-a ti-aret –erkezinin gne-e –akan g-zle- pen-eresine git. Orta-aki telesko—-
On the Cryptographic Patterns and Frequencies in Turkish Language
ellis a-as-na –evir. –a-ip--- teknenin –a-ra—-n-a.
-elkenli
tekne-i
–-l.
153
son
Completion of the decryption is left to the curios reader. Full decryption reveals that the ciphertext contains a single substitution error. Can you find it?
5 Conclusion We have presented some Turkish language patterns and frequency data compiled from a large text corpus. The data presented here is relevant not only to the classical cryptology but also to the modern cryptology due to its potential use in automated plaintext recognition and language identification. We have also demonstrated two things; first, the data’s usage on a complete cryptanalysis example and the second, new insight can be attained through careful and systematic study of language patterns. We have discovered a distinct pattern of Turkish language and used it to develop a new approach for vowel identification. What we could not address due to the limited space are (i) the fluctuations of the data for short text lengths, and (ii) the application of the data to other cipher types, especially substitution ciphers without word boundaries and the transposition ciphers. Our future work plan includes the investigation of n-gram versatility (the number of different words in which the n-gram appears), and positional frequency of n-grams.
References 1. Schultz, T. and Waibel, A.: Fast Bootstrapping of LVCSR Systems with Multilingual Phoneme Sets. Proceedings of EuroSpeech’97 (1997) 2. Oflazer, K.: Developing a Morphological Analyzer for Turkish. NATO ASI on Language Engineering for Lesser-studied Languages, Ankara, Turkey (2000) 3. Atlı, E.: Yazılı Türkçede Bazı Enformatik Bulgular. Uyg. Bilimlerde Sayısal Elektronik Hesap Makinalarının Kullanılması Ulusal Sempozyum, Ankara, Turkey (1972) 409-425 4. Koltuksuz, A.: Simetrik Kriptosistemler için Türkiye Türkçesinin Kriptanalitik Ölçütleri. PhD. Dissertation, Computer Eng. Dept., Ege University, øzmir, Turkey (1995) 5. Shannon, C.E.: Prediction and Entropy of Printed English. Bell System Technical Journal. Vol. 30 no 1 (1951) 50-64 6. Dalkilic, M.E. and Dalkilic, G.: On the Entropy, Redundancy and Compression of Contemporary Printed Turkish. Proc. of. Intl. Symp. on Comp. and Info. Sciences (2000) 60-67 7. Dalkilic, G.: Günümüz Yazılı Türkçesinin østatistiksel Özellikleri ve Bir Metin Sıkıútırma Uygulaması. Master Thesis, Int’l. Computer Inst., Ege Univ., øzmir, Turkey (2001) 8. Diri, B.: A Text Compression System Based on the Morphology of Turkish Language. Proc. of International Symposium on Computer and Information Sciences (2000) 12-23. 9. Shasha, D.:Dr. Ecco’nun ùaúırtıcı Serüvenleri (Turkish translation: The Puzzling Adventures of Dr. Ecco.) TUBITAK Popüler Bilim Kitapları 24 (1996) 10. Gaines, H. F.: Cryptanalysis. Dover, New York (1956) 11. Güngör, T.: Computer Processing of Turkish: Morphological and Lexical Investigation PhD. Dissertation, Computer Eng Dept., Bo÷aziçi Univ., østanbul, Turkey (1995)