Eavesdropping on encrypted VoIP conversations: phrase spotting attack and defense approaches

Oleksii Chyrkov, School of ICT, KTH Royal Institute of Technology, Stockholm, Sweden
Vasily Prokopov, School of ICT, KTH Royal Institute of Technology, Stockholm, Sweden

October 14, 2011
Abstract

Voice over IP (VoIP) has recently become an important part of our day-to-day life. As VoIP technology evolves, matures and becomes increasingly popular, it also gains the attention of attackers who wish to eavesdrop on VoIP conversations. In this paper we first describe an attack that can identify phrases spoken within encrypted VoIP calls under certain (but commonly occurring) circumstances. Then we propose and analyze several methods to protect against the phrase spotting attack. Finally, we introduce a model of a voice coder (vocoder) protected from this type of attack.

1 Introduction

As VoIP technologies become more and more integrated into everyday life, millions of people speak over the Internet, but very few of them understand the security issues related to voice communications. While most people are aware that conversations over the conventional Public Switched Telephone Network (PSTN) may be eavesdropped, users rarely think that someone could listen to their VoIP calls. Even those who do use encryption are probably not familiar with the attacks demonstrated in [1] and [2]. The authors of these two papers reveal that while it is non-trivial to eavesdrop on the entire conversation carried over an encrypted VoIP channel, it is quite easy to find out which language was used and, moreover, to recover many specific phrases uttered during the conversation.

To make things worse, it turns out that an eavesdropper can detect with high precision that specific phrases were used in a discussion without ever having heard the actual speech of the user. This phrase spotting technique is clearly harmful to privacy: it may be used to obtain banking, health-care, and other sensitive personal information.

The paper is organized as follows:

• Section 1 introduces the topic to the reader,
• Section 2 describes the phrase spotting technique used to eavesdrop on VoIP conversations,
• Section 3 proposes and analyzes several feasible protection methods,
• Section 4 provides information on exotic protection methods,
• Section 5 is an introduction to the model of a Variable Bit Rate (VBR) vocoder robust against the phrase spotting attack and simulated within the MATLAB environment,
• Section 6 presents the results obtained from the vocoder simulation,
• Conclusions and future work complete the document.
2 Phrase spotting technique

The vulnerability that lays the groundwork for the attack technique revealed in [1] is that existing mechanisms of efficient voice encoding and encryption do not hide all the information contained in the original voice signal. Some contemporary and ubiquitous VoIP CODECs encode phonemes with packets of different lengths, and the correlation between packet length and phoneme remains after the encoded speech is encrypted. This occurs because commonly used CODECs perform per-packet adaptation by choosing different bit rates for each packet depending on the speech samples it contains. For instance, a CODEC may adjust its output bit rate based on the speaker’s pitch or volume. In the simplest case a CODEC may just detect and suppress silence, which also makes its bit rate variable. Generally, this adaptive behavior attempts to achieve a good balance between audio quality and channel throughput usage. The set of variable bit rate CODECs includes:

• Qualcomm Code-Excited Linear Prediction (QCELP) [3], which is used in Code Division Multiple Access (CDMA) systems,
• Speex [4], while operating in VBR mode,
• Skype’s proprietary SILK CODEC [5],
• Adaptive Multi-Rate Wide Band (AMR-WB), standardized by ITU-T as a speech CODEC in Recommendation G.722.2 [6] and designed for use in mobile networks.

The logic underlying this phrase spotting attack is as follows: rather than eavesdropping on the entire conversation, the attacker simply wants to find out whether a specific word or phrase has been uttered during the conversation. To achieve this, the phrase is dissected into its most probable phonemes, which are then converted using the chosen CODEC into a sequence of packets of particular lengths. This sequence of packets is then matched against the packet stream of the encrypted VoIP conversation. To deal with the fact that different speakers may pronounce the same phrase in quite different ways and that pronunciation varies from time to time, “a profile hidden Markov model is used to build a speaker-independent model of the speech” [1]. All the information an attacker needs to achieve a sufficiently high probability of detecting whether a specific phrase has been uttered during the conversation is:

• The entire encrypted conversation or a part of it that contains the whole phrase or word,
• The language of the conversation, which can be determined using techniques described in [2],
• Information on how phonemes are mapped to packet lengths for the particular CODEC.

It is argued in [1] that acquiring the information about the phoneme-length mapping is non-trivial but feasible, as the CODECs are widely available and an attacker can simply use them as “black boxes” to convert pre-recorded speech and analyze the output. As the number of phonemes in, e.g., the English language is sufficiently small, models can be created even for completely unknown words that are not contained in the dictionary.

What makes the attack possible is the Code-Excited Linear Prediction (CELP) technique used by numerous popular voice CODECs, including all of those mentioned earlier. The CELP technique is described in [7]. Essentially, CELP CODECs use codebooks for mapping each speech sample to the particular codebook entry which is closest to the original speech pattern in terms of spectrum (Figure 1). Then the codebook entry indices are placed in the encoded VoIP packets that are to be sent across the network. This encoding keeps the overall bit rate required for speech encoding sufficiently low ([7] mentions a bit rate as low as 4.8 Kbps). The CELP technique is almost 30 years old and its designers did not take privacy into consideration.

We should note that it is theoretically possible to perform a phrase spotting attack without using Markov chains. In that case the presence of specific phrases or words would be detected by means of simple correlation analysis. However, the precision of such an attack would drop significantly. Despite this, such a method could be used under certain circumstances, e.g. when the phrase in question is long and specific enough and the conversation is also sufficiently long. Detection precision would also be higher if the attack is perpetrated against phonetically simple languages having fewer phonemes than English, such as French and Spanish [8].
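The simple correlation analysis mentioned above can be illustrated with a short Python sketch. It is not the profile hidden Markov model of [1]; it only slides a template of expected packet lengths over an observed stream and reports the best-matching offset. The function name `sliding_correlation` and the toy packet lengths are assumptions made for illustration.

```python
from math import sqrt

def sliding_correlation(stream, template):
    """Return (best_offset, best_score) of the template within the stream.

    Both arguments are lists of packet lengths. The score is the Pearson
    correlation between the template and each equally long window.
    Returns (-1, -1.0) if no window has a defined correlation.
    """
    m = len(template)
    if m == 0 or len(stream) < m:
        raise ValueError("template must be non-empty and no longer than stream")
    t_mean = sum(template) / m
    t_dev = [t - t_mean for t in template]
    t_norm = sqrt(sum(d * d for d in t_dev))
    best_offset, best_score = -1, -1.0
    for offset in range(len(stream) - m + 1):
        window = stream[offset:offset + m]
        w_mean = sum(window) / m
        w_dev = [w - w_mean for w in window]
        w_norm = sqrt(sum(d * d for d in w_dev))
        if t_norm == 0 or w_norm == 0:
            continue  # constant window or template: correlation undefined
        score = sum(a * b for a, b in zip(t_dev, w_dev)) / (t_norm * w_norm)
        if score > best_score:
            best_offset, best_score = offset, score
    return best_offset, best_score

# Toy data (hypothetical): packet lengths in bytes observed on the wire,
# and the lengths an attacker expects for the phrase of interest.
stream = [24, 24, 40, 40, 24, 40, 56, 40, 24, 24, 24, 40]
template = [40, 56, 40, 24]
offset, score = sliding_correlation(stream, template)
```

A real attack would additionally have to tolerate insertions and deletions of packets caused by differences in pronunciation, which is the reason [1] uses a profile hidden Markov model instead.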
Figure 1: Generic CELP encoder. (Diagram labels: Input speech, 20 ms interval; Bit rate selection; Fixed codebooks: Low quality, Medium quality, High quality; Analysis by synthesis loop; Excitation codebook; Adaptive codebook; VoIP packet.)
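The codebook lookup at the heart of CELP can be sketched in a few lines of Python. This is a deliberately simplified illustration: it compares waveforms directly in the time domain and omits the synthesis filter and perceptual weighting used by a real CELP coder. The function name `best_codebook_entry` is hypothetical, and the codebook size (1024 random sequences of length 40) is borrowed from the vocoder model described later in this paper.

```python
import random

def best_codebook_entry(target, codebook):
    """Return (index, gain) of the entry minimizing the squared error.

    For a candidate c the optimal gain is g = <target, c> / <c, c>, and the
    remaining error is |target|^2 - <target, c>^2 / <c, c>.
    """
    best_index, best_gain, best_error = -1, 0.0, float("inf")
    energy_target = sum(t * t for t in target)
    for index, entry in enumerate(codebook):
        energy = sum(c * c for c in entry)
        if energy == 0:
            continue  # an all-zero entry cannot represent anything
        cross = sum(t * c for t, c in zip(target, entry))
        error = energy_target - cross * cross / energy
        if error < best_error:
            best_index, best_gain, best_error = index, cross / energy, error
    return best_index, best_gain

# Hypothetical random codebook and a target that is a scaled copy of entry 123.
rng = random.Random(0)
codebook = [[rng.gauss(0.0, 1.0) for _ in range(40)] for _ in range(1024)]
target = [0.5 * c for c in codebook[123]]
index, gain = best_codebook_entry(target, codebook)
```

Only the selected index and a few quantized coefficients are transmitted, which is why the encoded frame is so much smaller than the raw samples.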
3 Feasible protection methods

3.1 CBR CODECs

Using a Constant Bit Rate (CBR) CODEC is a straightforward defense against this phrase spotting attack. In this case the output is represented by a uniform stream of fixed-size packets. If the payload is encrypted, the correlation between the speech and the corresponding bit stream is completely destroyed, making the phrase spotting attack inapplicable. However, CBR prevents the VoIP system from decreasing resource usage because CBR CODECs do not adapt their output to the speech.

3.2 VBR CODECs influenced by multiple factors

Another method that could potentially make phrase spotting attacks much harder is to use a special kind of VBR CODEC. These vocoders vary their output in accordance not only with speech parameters, but also with additional factors such as available network throughput, system resources, etc. The more sources of influence taken into account by the vocoder, the more unpredictable the outcome is at each particular point in time. However, this approach does not completely eliminate the eavesdropping threat as, for example, an eavesdropper may train the wiretapping system to work in a perfect environment. Then, when such perfect conditions occur during a VoIP call, i.e. the VoIP client has as many resources as it needs for operation, the eavesdropper will still be able to perform the attack on the conversation, because the output from the CODEC will depend only on speech.

3.3 Adaptive codebook

The authors of the CELP technique mention in [7] that their fixed codebook speech encoding method delivers worse results in terms of efficiency than an adaptive codebook method. However, they state that the usage of an adaptive codebook is computationally intensive, specifically stating that “it took 125 seconds of Cray-1 CPU to process 1 second of the speech signal” [7]. The Cray-1, however, was built in 1976; it was capable of performing 160 million floating-point operations per second and had only 8 megabytes of memory [9]. A review [10] says that a 2.8 GHz Pentium 4 is 31 times faster than the Cray-1. Therefore, current technology is capable of actively using adaptive codebook encoding. Our opinion is that using an adaptive codebook, while not being a protection measure by itself, should be the option of choice as it makes eavesdropping harder and more resource-demanding.
3.4 Padding to a fixed length

It is stated in [1] that there is a technique which completely eliminates the possibility of eavesdropping on an encrypted CELP-encoded VoIP conversation: padding each and every packet to a constant length. As the packets on which the experiment in [1] was conducted can be up to 512 bytes long, the only way to reduce the attack efficiency to zero is to pad every packet to 512 bytes and then encrypt it. This is stated to add about 30% overhead to the conversation traffic. However, padding packets to 256 bytes reduces attack efficiency to about 4% in terms of the probability of guessing whether a specific phrase has been said. The traffic overhead in this case is about 16% [1], which is affordable in many scenarios yet could have a negative impact in some applications, e.g. when running a VoIP application over a dialup connection or over a high-tariff mobile network.

3.5 Reordering packets

One of the possible ways to protect against a phrase spotting attack is to introduce a sophisticated packet reordering technique into the VoIP system. The idea is simple: as a particular word or phrase consists of a finite number of phonemes coming in a certain order and these phonemes are encoded into an ordered set of packets, we can easily destroy the correlation between actual speech and the output bit stream by reordering packets. Although this seems to be a simple and effective solution, there are still some issues to be addressed in an actual implementation:

• The reordering scheme should be dynamic and unpredictable for an eavesdropper.
• There should be a secure mechanism for the receiver to obtain information from the sender about how to correctly reassemble packets.

The downside of the proposed approach is that reordering requires a set of packets to reorder, so packets have to be accumulated. Accumulating packets at the sending side introduces delay, which might be unacceptable for interactive VoIP conversations. Nevertheless, this approach is relevant for non-interactive VoIP applications that allow some buffering. To address all these issues we propose to implement a reordering mechanism as follows:

1. Both end-points have fixed pre-configured tables of n = p! permutation schemes potentially existing for a set of p packets. This set of p packets is referred to as a reordering segment.
2. The sending side system accumulates p VoIP packets, introducing a delay of p × 20 ms in the common case, and randomly chooses one permutation scheme out of the n available from the fixed table.
3. The sender reorders packets within a segment according to the scheme chosen and sends the packets.
4. Along with sending the reordered set of VoIP packets, the sender also communicates to the receiver an index pointing to a particular permutation scheme stored in the shared table and information about which packets the permutation has to be applied to. This communication occurs over a cryptographically secured channel established at the beginning of the conversation. Public key cryptography can be used for this purpose.
5. Now the receiver has all the information required to reproduce the original packet sequence.

The proposed reordering technique is illustrated in Figure 2. Note that some additional (overhead) traffic is introduced by the separate encrypted control channel for sending information identifying the current reordering scheme. The control message itself is relatively small, but the process of establishing and maintaining a secure channel will consume some additional resources.

Figure 2: Reordering defense approach. (Diagram labels: Sender, REORDERING; Data channel and Secure control channel across a shared packet network, e.g. Internet; Receiver, REORDERING BACK. Each end-point holds a permutation table mapping indices 1 ... n to Scheme 1 ... Scheme n.)

One may notice that Real-time Transport Protocol (RTP) sequence numbers are still available in plain text while the payload is encrypted. This is the case even with Secure Real-time Transport Protocol (SRTP) [11]. Thus, if the packets are reordered by the sender, an eavesdropper is still able to order them properly by examining the sequence field in the RTP header of every packet. The question is whether this helps an eavesdropper to defeat the proposed reordering technique and to figure out what the VoIP conversation is about. Here the answer is “no”, because the reordering takes place before RTP encapsulation, although both are application layer procedures. RTP-encapsulated chunks are pushed further down to the transport layer, where they are handled and encapsulated by the User Datagram Protocol (UDP). This means that the chunks of voice information are reordered before they are sent by the VoIP application to lower layers and sub-layers, where each of these chunks receives an RTP sequence number. This illustrates a benefit of having a layered system (Figure 3).

The weak point of the proposed method is, however, that theoretically it might be defeated by brute force guessing of the reordering scheme used. If an eavesdropper has the whole set of VoIP packets containing the entire conversation and also manages to guess the segment length p, which might be different for various systems, then the eavesdropper may try to break the stream into segments and apply all possible reordering schemes to each of these segments. This will require a maximum of p! guesses for each segment. However, two facts make this attack infeasible:

• First, after each guess the phrase spotting technique needs to be applied to the segment to see if there is some phrase that could be matched. It will take an enormous amount of time and computational resources to apply this to the whole conversation.
• Secondly, the segment is taken out of context and therefore may not represent a complete phrase or even a word, which makes the brute force guessing attack much more difficult if not completely infeasible.

In general, the proposed reordering protection method seems to be feasible; however, it is not a state-of-the-art solution. The fact that this technique may be exposed to a brute-force attack and is not applicable to interactive VoIP environments forces us to look for other potential defense approaches.

3.6 Padding to a variable length

Another protection technique that we propose is based on padding each packet to a different size. The padding size is random. In order to use this method, each participant of the VoIP conversation needs to have an asymmetric encryption key pair (e.g. RSA) and has to be capable of producing a large amount of sufficiently random data. The process of communication would be as follows:

1. Speech is digitized and CELP-encoded, producing a sequence of p packets.
2. For each packet p[i], a random number R[i] is generated.
3. Each packet is padded with R[i] random bytes and then encrypted using conventional encryption technologies.
4. The sequence of p random numbers is encrypted with the receiver’s public key and sent to the receiver via a logically separate (out-of-band) channel.
5. The receiver receives the encrypted random numbers and decrypts them with his private key.
6. The voice data packets are decrypted and then unpadded according to the corresponding number obtained from the out-of-band channel.

It is clear that this method does not offer a strict one-hundred-percent resistance against the phrase spotting attack. The advantage of this method is that it lowers the probability of success for an attacker (to under 10% as described in [1]), while introducing only a slight increase in traffic overhead.

There are two obvious points of vulnerability in this protection method: compromising one’s private key and using an unreliable source of randomness. As methods of ensuring public key cryptography reliability are discussed in many sources and publications and are somewhat out of the scope of this paper, we will focus on the second vulnerability.

Until recently it has been quite difficult to obtain a large number of truly random numbers. The technique which proved to provide the highest level of randomness was obtaining input data from something outside the computer itself, such as keystroke timing or, in some cases, even processing photos of a lava lamp [12]. One can notice that while being suitable for one-time use cases, e.g. generating RSA key pairs or initiating the Diffie-Hellman algorithm, such sources are not suitable for streaming large amounts of data.

Fortunately, in 2011 Intel announced a fully digital hardware Random Number Generator (RNG) named “Bull Mountain” [13] that will be built into all new Intel processors starting from the Ivy Bridge model. In our opinion this generator could become a base element of eavesdropping protection techniques employed by new VoIP clients optimized for modern processors. The RNG in question is capable of producing up to 3 billion 256-bit random numbers per second, which is more than enough to be used with any VoIP conversation, even a multi-user conference.

Figure 3: Place of reordering in a layered system. (Diagram labels: Sender: Input speech, ADC, Reordering, RTP encapsulation at Layer 7 Application; UDP encapsulation at Layer 4 Transport; IP encapsulation at Layer 3 Network. Receiver: IP decapsulation, UDP decapsulation, RTP decapsulation, Reordering, DAC, Output signal.)
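The padding and unpadding steps of Section 3.6 can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the upper bound `MAX_PAD` and the function names are hypothetical, the standard `secrets` module stands in for the hardware random number source, and the encryption of both the payload and the list of pad lengths (steps 3 to 5) is deliberately left out.

```python
import secrets

MAX_PAD = 32  # assumed upper bound on the padding per packet, in bytes

def pad_packets(packets):
    """Append R[i] random bytes to packet i; return (padded, pad_lengths)."""
    pad_lengths = [secrets.randbelow(MAX_PAD + 1) for _ in packets]
    padded = [p + secrets.token_bytes(r) for p, r in zip(packets, pad_lengths)]
    return padded, pad_lengths

def unpad_packets(padded, pad_lengths):
    """Strip the last R[i] bytes from packet i (R[i] may be zero)."""
    return [p[:len(p) - r] for p, r in zip(padded, pad_lengths)]

# Toy CELP frames (hypothetical sizes): the sender pads them, and the
# receiver restores them using the pad lengths obtained out of band.
packets = [bytes(24), bytes(40), bytes(24), bytes(40)]
padded, pad_lengths = pad_packets(packets)
restored = unpad_packets(padded, pad_lengths)
```

In a complete system the list `pad_lengths` is exactly the information that has to travel over the out-of-band channel, so its confidentiality is as important as that of the voice payload.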
4 Exotic methods of protection

As mentioned in [1] and [2], there is one fact that makes the attack on an English conversation relatively simple: the overall number of phonemes in the English language is comparatively small. [8] states that there are 44 basic phonemes. The profile hidden Markov model used for the attack has to be trained on diphones (pairs of consecutive phonemes) and a certain number of triphones, so theoretically the total number of samples on which the model has to be trained is around 14000. However, the real number of diphones and triphones used in speech is much smaller (e.g., German is estimated to have around 2500 diphones [14]).

Theoretically this means that speaking in a language which has more phonemes than English would lead to a more resource-demanding and therefore more expensive attack. Moreover, the attacker would have to record speech samples for a large dictionary of the language in question, which may not be as widely available as an English dictionary. While this is an example of security through obscurity, as there are few languages possessing both of these properties, one could consider speaking Ithkuil with its 82 elementary phonemes.

Another method of protecting VoIP conversations from eavesdropping, or at least mitigating the attack technique described above, is abandoning any VoIP clients that use VBR CELP. While this could be technically possible for new VoIP users, the issue that arises here is that many popular VoIP clients including Skype [15] use VBR CELP, and millions of VoIP users around the world may not wish to change their software for a security improvement.

5 Vocoder implementation

In order to reveal the correlation between speech and the lengths of packets in an output bit-stream and also to demonstrate the most promising method of protection against the phrase spotting attack, we have implemented a VBR CELP vocoder in MATLAB. A CBR vocoder implementation was taken as a foundation [16]. The flowchart describing the operation of the implemented model is available in Appendix A. Our simplest VBR vocoder has a codebook containing 1024 sequences of length 40 and operates in two modes:

• High bit rate (16 Kbps) CELP.
• Low bit rate (9.6 Kbps) CELP.

The model performs per-frame adaptation based on the average power of the input signal, calculated for each frame. If the average power exceeds a configurable threshold, then high bit rate CELP is applied; otherwise low bit rate CELP is used to encode the frame. Specifically for our case: silence and background noise are low-power signals, hence they are encoded in low bit rate mode. When the speaker is uttering a phrase, the average signal power exceeds the threshold, therefore high bit rate CELP is applied to the frame.

In fact, the same “pure” CELP function is applied to the frame in both modes. It generates the codebook index k as well as several coefficients in the form of full precision 64-bit numbers, as [17] suggests. Then, to make it practical, coefficient quantization is performed. Tables 1 and 2 [18] reveal the numbers of bits allocated for each parameter. After quantization, the length of the low bit rate frame is 192 bits and the length of the high bit rate frame is 320 bits.

Table 1. Bit allocation for 16 Kbps CELP.
Parameter                 Bits/param.   Bits/frame
Codebook index, k         10            40
12 LPC coefficients       12            144
Gain, θ0                  13            52
Pitch filter coeff., b    13            52
Lag of pitch filter, P    8             32
Σ                                       320

Table 2. Bit allocation for 9.6 Kbps CELP.
Parameter                 Bits/param.   Bits/frame
Codebook index, k         10            40
12 LPC coefficients       6             60
Gain, θ0                  7             28
Pitch filter coeff., b    8             32
Lag of pitch filter, P    8             32
Σ                                       192

This vocoder implementation does not apply cryptographic algorithms to encrypt the output it produces. However, it estimates and considers the impact of encryption on the frame length. As should be clear by this point, the phrase spotting attack is based on analysis of frame lengths, and SRTP in the common case adds exactly 32 bits (the authentication tag [11]) of overhead to each frame. Thus an attacker would still be able to differentiate between low bit rate and high bit rate encrypted frames, as their sizes would be 224 and 352 bits respectively. The information required to perpetrate the attack therefore remains available even after encryption.

To obscure this information and to destroy the correlation between speech and the bit rate of the output stream, the VBR CELP vocoder implements frame padding. In the special case when the vocoder has only two possible output bit rates, the padding procedure is reduced to padding low bit rate frames to the size of high bit rate frames. Here we assume that the attacker performing the phrase spotting attack knows which CODEC is used and is familiar with its modes of operation. Therefore, padding low bit rate frames to some random length between 224 and 352 bits does not make sense: an attacker would still be able to recognize a low bit rate frame as long as its length is less than 352 bits. Though the straightforward approach of padding all short frames to 352 bits will provide protection from the phrase spotting attack, it would not be a neat solution due to its overhead (see Section 3). That is why we have decided to pad only some percentage of short packets, one fourth by default. This parameter is available for tuning in the vocoder settings.

The implemented model also synthesizes the error signal, plots it and plays it back. Overhead statistics are provided, including padding and encryption overhead. MATLAB code for the main vocoder module is available in Appendix B. The whole set of MATLAB files and speech samples used to build and test the model is available at [19].
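The frame-length behavior described above can be summarized in a short Python sketch (the authors' model itself is written in MATLAB). The constants come from the text: 192 and 320 bits per quantized frame, 32 bits of SRTP overhead, a default threshold of 0.001 as in the Appendix B settings, and a padding fraction of one fourth. Treating each short frame as padded independently with that probability is an assumption, as is the function name `frame_lengths`.

```python
import random

LOW_BITS, HIGH_BITS = 192, 320   # quantized frame lengths (Tables 2 and 1)
SRTP_TAG_BITS = 32               # authentication tag added to each frame

def frame_lengths(frame_powers, threshold=0.001, pad_fraction=0.25, rng=None):
    """Return the estimated on-the-wire length in bits of every frame."""
    rng = rng or random.Random()
    lengths = []
    for power in frame_powers:
        bits = HIGH_BITS if power > threshold else LOW_BITS
        # Assumption: each short frame is padded independently at random.
        if bits == LOW_BITS and rng.random() < pad_fraction:
            bits = HIGH_BITS  # pad a short frame up to the long frame size
        lengths.append(bits + SRTP_TAG_BITS)
    return lengths

powers = [0.0002, 0.004, 0.0005, 0.02]          # hypothetical frame powers
print(frame_lengths(powers, pad_fraction=0.0))  # [224, 352, 224, 352]
```

With padding disabled the two frame sizes map directly onto silence and speech; with a non-zero padding fraction some of the 224-bit frames are reported as 352 bits, so a long frame no longer implies speech.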
6 Simulation results

The CELP vocoder was applied to four different speech samples from the TIMIT database [20]. Each signal is a 4-second-long English sentence sampled at 8 kHz. The next step was to compare the lengths of the output bit streams generated by the CELP encoder in two different scenarios:

• With padding disabled, and
• With padding enabled (coefficient 1/4).

The goal was to demonstrate that the padding mechanism, while providing protection from the phrase spotting attack, does not eliminate the benefit provided by the VBR coder in terms of output stream size reduction.

6.1 Speech sample 1

Speech sample 1 (Figure 4) contains the continuous-speech utterance of the sentence: “The prices have gone up enormously in spite of the technological advances.” Compared with the other three samples, this sample is the most saturated in terms of signal power. Therefore, the benefit of applying a VBR coder to this sample is relatively small: the output message is reduced by 18560 bits, or 29%, in comparison to the CBR output produced by 16 Kbps CELP. With padding enabled, the vocoder was able to reduce the size of the output stream by 21.3% on average compared to CBR 16 Kbps CELP.

6.2 Speech sample 2

Speech sample 2 (Figure 5) contains the continuous-speech utterance of the sentence: “In just ten years the climate changed noticeably in the country.” Savings with padding disabled: 32.6%. Savings with padding enabled: 24.8% on average. Thus, about 8% is the expense of providing protection from the phrase spotting attack.

6.3 Speech sample 3

Speech sample 3 (Figure 6) contains the continuous-speech utterance of the sentence: “The speaker announced the winner.” In contrast to sample 1, speech sample 3 is the least saturated in terms of signal power. Consequently, the advantage of using a VBR coder is the largest, with the output message size reduced by 34.8%. With padding enabled the reduction falls to 26.7%.

6.4 Speech sample 4

Speech sample 4 (Figure 7) contains the continuous-speech utterance of the sentence: “Why do you want to go alone?” Surprisingly, the savings ratio with padding disabled is the same for samples 2 and 4: 32.6%. With padding enabled the savings are 24.3%.
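The figures reported for speech sample 1 in Section 6.1 can be cross-checked with a few lines of arithmetic, written here as a Python sketch. It assumes 200 frames per sample (4 s of speech in 20 ms frames), frame sizes of 320 and 192 bits without encryption overhead, and that exactly one fourth of the short frames is padded; the function name `savings` is hypothetical.

```python
HIGH_BITS, LOW_BITS = 320, 192
FRAMES = 200  # assumption: 4 s of speech split into 20 ms frames

def savings(low_frames, pad_fraction=0.0):
    """Expected relative saving versus CBR 16 Kbps CELP."""
    cbr_bits = FRAMES * HIGH_BITS
    saved_bits = low_frames * (HIGH_BITS - LOW_BITS) * (1.0 - pad_fraction)
    return saved_bits / cbr_bits

# 18560 saved bits at 128 bits per short frame correspond to 145 short frames.
low_frames = 18560 // (HIGH_BITS - LOW_BITS)
no_padding = savings(low_frames)           # 0.29, i.e. the reported 29%
with_padding = savings(low_frames, 0.25)   # 0.2175, close to the reported 21.3%
```

Under these assumptions the expected saving with padding is 21.75%, which is in line with the 21.3% average measured in the simulation; the small difference is plausible given that the padded frames are chosen at random.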
Figure 4: Speech sample 1. (Two panels: “Original signal/Error signal”, amplitude versus time index; “Average power of the frame”, P(F) versus frame F.)

Figure 5: Speech sample 2. (Two panels: “Original signal/Error signal”, amplitude versus time index; “Average power of the frame”, P(F) versus frame F.)

Figure 6: Speech sample 3. (Two panels: “Original signal/Error signal”, amplitude versus time index; “Average power of the frame”, P(F) versus frame F.)

Figure 7: Speech sample 4. (Two panels: “Original signal/Error signal”, amplitude versus time index; “Average power of the frame”, P(F) versus frame F.)
6.5
8
Overall simulation conclusions
A table compiling together all simulation measurements is included in Appendix C. The main outcome derived from the gathered data is predictable: the proposed padding protection technique slightly reduces the benefit of using a VBR vocoder. The amount of reduction is defined by the padding coefficient. While having 25 percent of randomly padded packets in the output stream will considerably harden the phrase spotting attack, it will not completely eliminate its feasibility. Finding a suitable randomness coefficient is a separate issue, which might be addressed in future work.
List of Acronyms
ADC Analog-to-Digital Converter AMR-WB Adaptive Multi-Rate Wide Band CBR Constant Bit Rate CDMA Code Division Multiple Access CELP Code-Excited Linear Prediction CPU Central Processing Unit DAC Digital-to-Analog Converter PSTN Public Switched Telephone Network
7
Conclusions and future work
QCELP Qualcomm CELP
After suggesting several methods of protecting VoIP conversations against the phrase spotting attack, we have come to the conclusion that the most promising approach is to randomly pad the packets produced by a VBR coder to different lengths. The simulation in MATLAB that has been undertaken proves that this approach produces less traffic overhead than padding packets to a constant length. One should note that our calculations include only overhead for the data traffic, not control traffic, such as the initial dialogue to establish an encrypted control channel. In future work it would be logical to enhance our simple model of a VBR CELP vocoder to have more than two levels of output bit rate. Other enhancements could include a more efficient criterion of selecting the output bit rate, as the current criteria is based only on the average signal power. To make the system complete, it would also be rational to implement an encryption scheme based on public key cryptography to transfer the information about the padding length securely. The problem of destroying correlation between speech and corresponding packet stream can be solved twofold:
RNG Random Number Generator RTP Real-time Transport Protocol SRTP Secure Real-time Transport Protocol UDP User Datagram Protocol VBR Variable Bit Rate Vocoder Voice Coder VoIP Voice over IP
References [1] C. V. Wright, L. Ballard, S. E. Coull, F. Monrose, and G. M. Masson, “Uncovering spoken phrases in encrypted voice over IP conversations,” ACM Transactions on Information and System Security, vol. 13, pp. 35:1 – 35:30, Dec. 2010. [2] L. Khan, M. Baig, and A. M. Youssef, “Speaker recognition from encrypted VoIP communications,” Digital Investigation, vol. 7, pp. 65–73, Oct. 2010.
1. By manipulating resulting packet stream.
[3] “High rate speech service option 17 for wideband spread spectrum communication systems,” Technical Report 45, Telecommunications Industry Association, Nov. 1997.
2. By manipulating incoming speech. In this paper we were focusing on the first approach. Second approach, being as promising as the first one, defines a scope for a future work.
[4] J. Valin, “Speex: a free codec for free speech,” (Dunedin, New Zealand), Jan. 2006.
11
[5] K. V. Soerensen, K. Vos, and S. S. Jensen, “SILK speech codec.” tools.ietf.org/html/draft-vos-silk-02, Sept. 2010.
[19] V. Prokopov and O. Chyrkov, “MATLAB VBR CELP vocoder implementation files,” Oct. 2011. [20] J. S. Garofolo, “TIMIT acoustic-phonetic continuous speech corpus.” www.ldc.upenn.edu/Catalog/CatalogEntry.jsp ?catalogId=LDC93S1, 1993.
[6] “G.722.2 : Wideband coding of speech at around 16 kbps using adaptive Multi-Rate wideband (AMR-WB),” Recommendation G.722.2, ITU-T, July 2003. [7] M. Schroeder and B. Atal, “Code-excited linear prediction (CELP): high-quality speech at very low bit rates,” vol. 10, pp. 937–940, Institute of Electrical and Electronics Engineers. [8] P. B. de Mareuil, C. Corredor-Ardoy, and M. Adda-Decker, “Multi-language automatic phoneme clustering,” (San-Francisco), pp. 1209–1212, 1999. [9] “Cray history.” cray.com/About/History.aspx. [10] “1979 review of the cray-1 supercomputer,” July 2006. [11] D. McGrew, “The secure real-time transport protocol (SRTP).” tools.ietf.org/html/rfc3711, Mar. 2004. [12] “How lavarand works.” web.archive.org/web/19980521144845/ http://lavarand.sgi.com/cgi-bin/how.cgi, May 1998. [13] G. H. Hofemeier, “Bull mountain – it’s just exciting and important technology,” June 2011. [14] M. Reference, Signal processing quick study guide for smartphones and mobile devices. Mobile Reference, Jan. 2007. [15] R. Chirgwin, “Linguists use sounds to bypass skype crypto,” The Register, May 2011. [16] A. Sørensen, “Speech coding and recognition course - exercises.” www.itu.dk/courses/TKG/E2005/exercises.html. [17] I. C. Society, I. of Electrical, E. Engineers, and I. S. Board, IEEE standard for floating-point arithmetic. New York, NY :: Institute of Electrical and Electronics Engineers„ 2008. [18] “MATLAB exercise 3: Speech coding and synthesis..”
Appendices

A
VBR CELP vocoder operation flowchart

[Flowchart; its blocks are listed here in reading order]
1. START
2. Data input
3. For each N samples [frame]:
   • Pure CELP analyzer
   • Average signal power calculation
   • Average power exceeds threshold?
     no: Coefficient quantization for low bit rate output
     yes: Coefficient quantization for high bit rate output
   • Padding and encryption estimation
   • CELP synthesizer
4. Data output
5. Input and error signal plotter
6. Input and error signal playback
7. END
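The rate decision at the centre of the flowchart can be illustrated with a short sketch. It is not the authors' celp function: the input signal is synthetic, the variable names high_rate and frame are hypothetical, and the average power of a frame is assumed to be the mean of its squared samples. Only N = 160 and threshold = 0.001 are taken from the vocoder settings.

```matlab
% Hypothetical sketch of the per-frame rate decision (not the authors' code).
% Assumption: average power of a frame = mean of its squared samples.
N = 160;                    % frame length in samples (from the vocoder settings)
threshold = 0.001;          % power threshold (from the vocoder settings)

% Synthetic input: one quiet frame followed by one loud frame.
x = [0.001*ones(N,1); 0.2*ones(N,1)];

F = floor(length(x)/N);     % number of complete frames
av_pow = zeros(F,1);        % average power per frame
high_rate = false(F,1);     % true -> quantize for the high bit rate output

for f = 1:F
    frame = x((f-1)*N+1 : f*N);
    av_pow(f) = mean(frame.^2);
    high_rate(f) = av_pow(f) > threshold;   % "yes" branch of the flowchart
end

disp(high_rate');
```

With this input the quiet frame takes the low bit rate branch and the loud frame takes the high bit rate branch, which is exactly the length difference that a phrase spotting attack exploits.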
B
MATLAB code of VBR CELP vocoder (main block)

% celp_vbr --> CELP vocoder
%
% The script invokes cascaded CELP analyzer and synthesizer which is
% applied to signal vector x in frames of length N, and the synthesized
% signal is returned in xhat. The LP analysis performed in each frame is
% of order M, and the perceptual weighting filter W(z) = A(z)/A(z/c) is
% determined by the constant c.
%
% The excitation parameters k, theta0, P, and b are used to generate the
% excitation sequence.
%
% CELP analyzer is adapted to generate various bit rate output for each
% frame depending on the average power of input signal. The system has
% two levels: 16 Kbps CELP and 9.6 Kbps CELP.
%
% The function also implements a padding mechanism to make the phrase
% spotting attack harder. Percentage of padded frames is controlled, hence
% the level of protection against phrase spotting attack is variable.
%
% <See Also>
% celpa --> Cascaded CELP analyzer and synthesizer
%
% Vasily Prokopov, ICT, Royal Institute of Technology - Sweden
% Oleksii Chyrkov, ICT, Royal Institute of Technology - Sweden
%
% Last revised: October 13, 2011
%-----------------------------------------------------------------------

% Vocoder settings
vbr = 1;             % VBR mode enabled/disabled
padding = 0;         % Padding enabled/disabled
percentage = 4;      % Percentage of padded packets (100/percentage %)
threshold = 0.001;   % Threshold for switching between 16k/9.6k CELP

% Initialize input parameters
load ma1_2;
x = ma1_2;           % Original message
N = 160;             % Frame length
L = 40;              % Block length
M = 12;              % No. of LPC coefficients, prediction order
c = 0.8;
Pidx = [16 160];     % Bounds for optimization of P
randn('state', 0);
cb = randn(L, 1024); % Codebook of 1024 entries

% Invoke CELP function producing VBR output
[xhat, e, k, theta0, P, b, av_pow, msglen, F] = ...
    celp(x, N, L, M, c, cb, Pidx, threshold, vbr, padding, percentage);

% Plot the signals
figure('outerposition', [0 0 500 400])
subplot(2,1,1)
plot([x xhat])
title('Original signal / Error signal', 'fontname', 'Arial', 'fontsize', 10)
xlabel('Time index', 'fontname', 'Arial', 'fontsize', 10)
ylabel('Amplitude', 'fontname', 'Arial', 'fontsize', 10)
axis([0 32000 -0.4 0.55])
subplot(2,1,2)
plot([av_pow])
title('Average power of the frame', 'fontname', 'Arial', 'fontsize', 10)
xlabel('Frame, F', 'fontname', 'Arial', 'fontsize', 10)
ylabel('P(F)', 'fontname', 'Arial', 'fontsize', 10)
axis([0 200 0 0.02])

% Signal playback
soundsc(x, 8000);
soundsc(xhat, 8000);

% Calculate savings and benefits
benefit = 100 - (msglen/64000*100);
fprintf(1, '\n=======================================\n', msglen);
fprintf(1, 'Output message length, encryption disabled: %u bits.\n', msglen);
fprintf(1, 'Output message length, encryption enabled (SRTP): %u bits.\n', msglen + F*32);
fprintf(1, 'Savings compared to 16 Kbps CELP: %.1f%% or %u bits.\n', benefit, 64000 - msglen);
C
Data gathered from VBR CELP vocoder simulation

                                                 Sample 1  Sample 2  Sample 3  Sample 4
CBR CELP 16 Kbps
  Output length (bit)                               70400     70400     70400     70400
VBR CELP 16/9.6 Kbps
  Output length, padding disabled (bit)             51840     49563     48128     49563
  Benefit (%)                                        29        32.6      34.8      32.6
  Output length, padding enabled (bit, average)     56755     54528     53325     54835
  Benefit (%, average)                               21.3      24.8      26.7      24.3
  Expense of protection (%, average)                  7.7       7.8       8.1       8.3
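As a worked check, the Sample 1 figures can be reproduced from the formulas in the main block of Appendix B. The sketch below rests on an assumption inferred from the numbers rather than stated in the text: the tabulated output lengths include the per-frame overhead of F*32 bits that the main block adds when encryption is enabled, with F = 200 frames (32000 samples in frames of 160). The variable names are hypothetical.

```matlab
% Worked check of the Sample 1 figures (a sketch under stated assumptions).
% Assumption: tabulated lengths include the F*32-bit overhead from the
% main block, with F = 200 frames (32000 samples / 160 samples per frame).
F = 200;
overhead = F*32;            % 6400 bits

cbr_len = 70400;            % CBR CELP 16 Kbps output length, bits
vbr_len = 51840;            % VBR output length, padding disabled, bits
pad_len = 56755;            % VBR output length, padding enabled, bits (average)

cbr_msg = cbr_len - overhead;   % 64000 bits, the constant in the main block

% Same formula as "benefit" in the main block, applied to both VBR cases.
benefit_vbr = 100 - (vbr_len - overhead)/cbr_msg*100;
benefit_pad = 100 - (pad_len - overhead)/cbr_msg*100;

% Expense of protection: the benefit given up by enabling padding.
expense = benefit_vbr - benefit_pad;

fprintf('Benefit, padding disabled: %.1f%%\n', benefit_vbr);
fprintf('Benefit, padding enabled:  %.1f%%\n', benefit_pad);
fprintf('Expense of protection:     %.1f%%\n', expense);
```

Under this reading the three values round to 29.0, 21.3 and 7.7, matching the Sample 1 column.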