Covert Channel detection in VoIP streams

Gonzalo Garateguy
Department of Electrical and Computer Engineering, University of Delaware, Newark, DE 19711
[email protected]

Gonzalo R. Arce
Department of Electrical and Computer Engineering, University of Delaware, Newark, DE 19711
[email protected]

Juan Pelaez
U.S. Army Research Laboratory, Adelphi, MD 20783
[email protected]

Abstract—This paper presents two approaches to detecting VoIP covert channel communications using compressed versions of the data packets. Both approaches are based on specialized random projection matrices that take advantage of prior knowledge about the normal traffic structure. The reduction scheme relies on the assumption that normal traffic packets belong to a subspace of smaller dimension, or that they can be enclosed in a convex set. We show that, by incorporating this information in the design of the random projection matrices, the detection of anomalous traffic packets can be performed in the compressed domain with only a slight performance loss with respect to the uncompressed domain. The detection algorithm is validated on real data captured on a test bed designed for that purpose.

I. INTRODUCTION

VoIP is one of the most popular services in IP networks, used not only at the user level but also for inter- and intra-company communications and as a replacement for traditional analog land lines. With the increase in traffic volume due to VoIP services, their suitability for steganographic purposes has become an important threat to network security. Recent studies [1]–[4] have proposed many techniques to disguise steganographic information in media and call signaling protocols, showing a surprising capacity to exfiltrate large amounts of information. Considering the inevitable convergence of voice, video and data communications in both commercial and tactical environments, new techniques to uncover VoIP covert channels are of high interest to prevent exfiltration of sensitive information. Among all the protocols used in VoIP communications (e.g. SIP, SDP, RTP, RTCP), media protocols represent the biggest threat to network security: in an average call, media traffic packets account for 99% of the total number of packets transmitted. The method proposed in this paper focuses on the analysis of the RTP media transport protocol, but the same technique can be applied to other signaling and control protocols. Since a typical VoIP call lasts on the order of minutes, considerable amounts of data have to be analyzed to effectively detect potential covert channel communications. In this context, in-depth inspection of the packets becomes a very computationally intensive task, which can degrade the quality of a VoIP call or even make communication impractical given the sensitivity of voice traffic to latency. Our proposed solution uses compressive sensing techniques to acquire sketches of the data packets and then perform the detection and classification

Fig. 1. Test bed used to capture the VoIP traffic data. [Diagram: SIP signaling server running OpenSIPS, SIP client A (10.43.42.2/24), SIP client B, and a sniffing station running Wireshark.]

in the compressed domain, thus reducing processing time and storage capacity. We present a procedure to design specialized random matrices, as in [5], [6], which take advantage of prior knowledge about the normal traffic structure. Once the data dimensionality is reduced, classification is performed using support vector machines based on training samples from both anomalous and normal data. We show that classification using support vector machines gives good performance at high compression ratios, as stated in [7].

II. DATA CAPTURE AND FORMATTING

The traffic used to test the detection algorithms was generated in a test bed built for that purpose. A diagram of the setup is depicted in Figure 1. The test bed consists of a signaling server executing OpenSIPS [8], two clients capable of executing several types of softphones in both Windows and Linux environments, and a sniffing station used to capture signaling and media traffic. The sniffing station runs the Wireshark software, which includes a module to detect SIP signaling and identify the individual RTP streams associated with each of the active VoIP calls. The VoIP analysis module of Wireshark can extract all the packets associated with a call and save them in the RTPdump format specified in [9]. The client stations are dual-boot machines capable of executing Windows

978-1-4244-9848-2/11$26.00©2011 IEEE

Fig. 2. RTP packet format. Header layout by bit offset: bits 0–1 Ver., bit 2 P, bit 3 X, bits 4–7 CC, bit 8 M, bits 9–15 PT, bits 16–31 Sequence Number; bits 32–63 Timestamp; bits 64–95 SSRC identifier; bits 96+ CSRC identifiers (optional); followed by the RTP header extension (optional), payload, RTP padding, pad count, SRTP master key identifier (MKI, optional) and authentication tag.

[Figure 3 diagram: a stream of RTP packets is arranged into the data matrix, with the header fields occupying nh·L rows and the payloads (np−nh)·L rows of the m rows in each of the n columns.]

and Linux operating systems; the clients used are the X-Lite softphone for Windows and the Twinkle softphone for Linux. In addition, an exfiltration client installed at Client A allows transmitting non-voice data to Client B. These packets are considered attacks and are the ones we try to identify.

A. Data matrix formation

The data used in the simulations is prepared as follows. The stream of captured packets is divided into groups, and each group of packets is arranged as a column of the data matrix D. As the size of the packets might change during a single call, we group a number of packets that does not exceed a preset number m of bytes per column. If the number of bytes would exceed the limit, the last packet in the group is mapped onto the following column and the remaining bytes of the present column are filled with 0. In this way the data matrix D always has dimensions m × n, where m is fixed and n can change according to the number of groups formed from the captured packets. The standard fields of the RTP packet headers (see Figure 2) are mapped to the beginning of the columns. The remaining bytes in the packet (including the payload) are each associated with one entry immediately after the headers (see Figure 3), and all these values are normalized to the interval [0, 1] by dividing them by 255. Considering the format of the RTP packets, most of the fields take values in a small set: Ver ∈ {2}, P ∈ {0, 1}, X ∈ {0, 1}, CC ∈ {0, ..., 15}. Some increase monotonically from one packet to the next, for example the Timestamp and Sequence Number fields, and some, such as the SSRC field, remain fixed for the whole call. Since the Sequence Number, Timestamp and SSRC fields take large values with respect to the other fields, only the difference between the present and the previous packet is stored in the data matrix.

B. Normal and anomalous data

The exfiltration attacks were carried out using an attacking tool developed by Salare Security [10].
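The column-packing rule described in Section II-A can be sketched as follows. This is a minimal illustration, not the authors' code; the function name `pack_packets` is hypothetical, and it assumes each packet is shorter than the column budget m.

```python
import numpy as np

def pack_packets(packets, m):
    """Pack a stream of packets (sequences of byte values in 0..255)
    into columns of at most m bytes, zero-padding each column.
    Hypothetical sketch of the grouping rule; assumes len(pkt) <= m."""
    columns, current = [], []
    for pkt in packets:
        if current and len(current) + len(pkt) > m:
            current += [0] * (m - len(current))   # pad the column with 0
            columns.append(current)               # close this column
            current = []                          # packet starts next column
        current += list(pkt)
    if current:
        current += [0] * (m - len(current))
        columns.append(current)
    # normalize byte values to the interval [0, 1] by dividing by 255
    D = np.array(columns, dtype=float).T / 255.0
    return D  # shape m x n, with n depending on how many groups formed
```

In a real implementation the header fields would first be remapped (differences for Sequence Number, Timestamp and SSRC) before packing; this sketch only shows the grouping and normalization.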
This tool injects the steganographic content in the payload of the RTP packets while keeping the header fields at their typical values. Different types of data were exfiltrated, i.e. JPG images, PDF files and text files. The codec declared in the headers

Fig. 3. The data matrix has n columns. nh is the number of fields in the header that are mapped to the beginning of each column, np is the maximum number of bytes per packet allowed, and L is the number of packets per column. If a group of L packets exceeds np bytes, the last packet is mapped to the next column.

of the exfiltration packets was G.711, which was also the codec used in the normal traffic generated. Several minutes of normal traffic calls were recorded using speech audio content.

III. DIMENSIONALITY REDUCTION AND SIGNAL SEPARATION

In the proposed algorithm, the data used to classify the traffic is sampled and compressed while taking advantage of prior knowledge about the normal signal structure. Random projections have shown a great capacity to capture the fundamental characteristics of signals with considerable structure, making it possible to perform filtering, detection and classification in a dimensionally reduced space. It has been shown that the loss incurred by dimensionality reduction via random projections, with respect to classification in the high dimensional space, can be bounded with arbitrary precision as long as the number of dimensions of the reduced space and the random matrices are chosen properly [11]–[14]. Moreover, if prior information about the normal behaviour of the signal is known (i.e. a basis of a subspace containing normal signals), the performance of classification and detection can be further improved. If the data is represented as a vector x ∈ R^N (i.e. one of the columns of our data matrix), the random projections are calculated as y = Φx, with Φ a random matrix of size M × N, M ≪ N. In this case we take advantage of the prior knowledge about the normal traffic structure in the design of the random matrix Φ. We assume two basic models: the first is the subspace or affine space model, and the second is a convex set covering model. In the first case we assume that normal traffic belongs to a particular subspace S of smaller dimension than the ambient

978-1-4244-9848-2/11$26.00©2011 IEEE

space R^N, and also that anomalous vectors lie in a different subspace or affine space which may have a small intersection with the normal traffic space. In the second case, the model assumes that there exists a convex set containing only normal traffic vectors and excluding all the anomalous vectors.

A. Subspace model

Under the subspace model the random matrix Φ is designed as the composition of two linear transformations. The first transformation projects the data vectors onto S⊥, and then a random matrix is used to reduce the dimensionality. The orthogonal projection matrix is defined as P_{S⊥} = I − B(B^T B)^{−1} B^T, where B is a matrix whose columns generate the subspace S. The random projection is performed by means of a random matrix G of dimensions M × N, with M < N, such that

  g_{i,j} =  +√(3/M)  with probability 1/6,
              0        with probability 2/3,    (1)
             −√(3/M)  with probability 1/6.
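Drawing G with the distribution in equation (1) is straightforward; the sketch below is an illustration (the function name is ours, not from the paper).

```python
import numpy as np

def sparse_random_matrix(M, N, seed=None):
    """Draw an M x N matrix with entries +sqrt(3/M) w.p. 1/6,
    0 w.p. 2/3, and -sqrt(3/M) w.p. 1/6, as in equation (1)."""
    rng = np.random.default_rng(seed)
    v = np.sqrt(3.0 / M)
    return rng.choice([v, 0.0, -v], size=(M, N), p=[1 / 6, 2 / 3, 1 / 6])
```

Given an orthonormal basis B of the estimated subspace, the operator of the subspace model is then applied as `G @ (np.eye(N) - B @ B.T)`, i.e. Φ = G · P_{S⊥}. Since two thirds of the entries of G are zero, the projection can also be computed with sparse matrix routines.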

The specialized random matrix for the subspace model assumption is then Φ = G · P_{S⊥}. Matrices of this type are known to fulfill the RIP property if the dimension of the subspace S⊥ is small enough [15], which ensures that distances are almost preserved in the low dimensional space. To estimate a basis of the subspace we use a training sample {x_i}_{i=1}^T, with T > N, taken from a segment of normal traffic. Before the estimation, the mean value of the sample is subtracted: x̂_i = x_i − µ, where µ = (1/T) Σ_{i=1}^T x_i. We use a variation of MacQueen's algorithm [16] to train a set of Nq points, with Nq ≤ N, that generate the space of normal traffic. The algorithm is the following.
1) Take Nq points at random from the training sample A = {x̂_i}_{i=1}^T and denote that set by Q = {z_i}_{i=1}^{Nq}.
2) Initialize the indices j_i = 1 for all i = 1, ..., Nq.
3) Take a new point x̂ from A\Q.
4) Find the z_i that is closest to x̂ in the L1 norm and update the point z_i by z_i = (j_i · z_i + x̂)/(j_i + 1).
5) Update the j_i associated with the above z_i by setting j_i = j_i + 1.
6) Repeat steps 3 to 5 until there are no remaining points in A\Q.
This algorithm is similar to the k-means algorithm for vector quantization and finds the optimal vector quantizer according to the distribution of the data and a given distance metric. In particular, since our data is contained in a subspace, the quantization points are also contained in that subspace. If the number of quantization vectors Nq is equal to the dimension of the subspace S and the resulting vectors are linearly independent, they form a basis of the subspace, while if Nq is smaller, the subspace they generate is included in S. Changing the number of vectors Nq allows us to control the dimension of the

subspace generated and therefore the sparsity of the vectors P_{S⊥} x̂. The drawback of choosing a number of vectors Nq smaller than the dimension of the subspace is that some of the normal traffic components will appear in the projection, reducing the maximum achievable compression ratio. A way to estimate the dimension of the subspace is to set Nq = N, calculate the SVD of the matrix Z whose columns are the elements of Q = {z_i}_{i=1}^{Nq}, and then select Nq such that 1 − Σ_{i=1}^{Nq} σ_i(Z) / Σ_{i=1}^{N} σ_i(Z) ≈ 0.001, where {σ_1(Z), ..., σ_N(Z)} are the singular values of Z. With this value of Nq, the clustering algorithm is run again to obtain a generator of S. Orthonormalizing the elements of Q simplifies the computation of the orthogonal projection matrix to P_{S⊥} = I − BB^T.

B. Convex set covering model

Even though the subspace model appears to be a good model for the type of VoIP traffic tested in our experiments, it might be the case that attacks and normal traffic lie in the same or almost the same subspace but are still separable. If we can find a convex set that includes most of the normal vectors while keeping the anomalous vectors outside the set, we can take advantage of it to improve the performance of the classifier in the compressed domain. In a similar fashion as the projection onto a subspace, we define the projection onto the orthogonal complement of the convex set as P_{C⊥}(x) = x − P_C(x), where P_C(x) = argmin_{y∈C} ||x − y||_2.
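The MacQueen-style quantization of Section III-A (running-mean updates of Nq points drawn from the sample) can be sketched as follows. This is an illustrative implementation under our own naming, not the authors' code.

```python
import numpy as np

def macqueen_quantize(X, Nq, seed=None):
    """Variant of MacQueen's algorithm from Section III-A.
    X: T x N array of mean-subtracted training vectors.
    Returns Nq quantization points lying in the span of the data."""
    rng = np.random.default_rng(seed)
    T = X.shape[0]
    idx = rng.choice(T, size=Nq, replace=False)
    Z = X[idx].copy()                  # initial points taken from the sample
    counts = np.ones(Nq)               # the indices j_i, initialized to 1
    for t in np.delete(np.arange(T), idx):
        x = X[t]
        i = np.argmin(np.abs(Z - x).sum(axis=1))          # closest in L1 norm
        Z[i] = (counts[i] * Z[i] + x) / (counts[i] + 1)   # running-mean update
        counts[i] += 1
    return Z
```

Because each update is a convex combination of sample points, the returned points stay in any subspace that contains the training data, which is the property the subspace model relies on.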

The calculation of the projection P_C(x) for a general convex set can be very complex; however, for some sets such as half planes, balls or ellipsoids it can be easily defined. In this work we use an elliptical closed convex set oriented along the principal vectors of a training sample, with scale parameters along those axes given by the singular values of the sample. From the training sample of normal traffic {x̂_i}_{i=1}^T we form a matrix A ∈ R^{N×T} whose columns are the elements of the sample. This matrix can be decomposed as A = UΣV^T, where U and V are unitary square matrices and Σ is a rectangular matrix with nonzero elements only on the diagonal. The principal directions along which the data is distributed are given by the columns of U, while the scales along each of the directions are given by the associated singular values σ_i(A). Without loss of generality we can assume that N < T and form the matrix Σ_N by eliminating the last T − N columns of Σ. If we also restrict the matrix V in the same manner to obtain V_N, we have that A = U Σ_N V_N^T. Based on this restricted representation, the projection onto the ellipsoid is defined by

  P_{Eρ}(x) = U Σ_N^{1/2} P_{Bρ}(Σ_N^{−1/2} U^T x)    (2)

where the operator P_{Bρ} is the projection onto the centered ball of radius ρ in R^N:

  P_{Bρ}(x) =  x,             if ||x||_2 ≤ ρ,
               ρ x / ||x||_2, if ||x||_2 > ρ.    (3)
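Equations (2) and (3) translate directly into code. The sketch below is an illustration under our own naming; it also includes the regularization Σ̂_N = Σ_N + δI discussed later in the text, since the singular values of a low-rank sample can be zero.

```python
import numpy as np

def project_ball(x, rho):
    """Projection onto the centered L2 ball of radius rho, equation (3)."""
    n = np.linalg.norm(x)
    return x if n <= rho else rho * x / n

def project_ellipsoid(U, s, x, rho, delta=1e-6):
    """Mapping of equation (2): rotate by U^T, rescale by Sigma_N^{-1/2},
    project onto the ball, then undo the rescaling and rotation.
    U: N x N principal directions, s: singular values of the sample."""
    s_hat = s + delta                      # regularize: Sigma_N + delta*I
    z = (U.T @ x) / np.sqrt(s_hat)         # Sigma_N^{-1/2} U^T x
    return U @ (np.sqrt(s_hat) * project_ball(z, rho))

def complement_projection(U, s, x, rho):
    """P_{C_perp}(x) = x - P_C(x), applied before the random projection."""
    return x - project_ellipsoid(U, s, x, rho)
```

Note that a vector already inside the ellipsoid is mapped to itself (the inner ball projection is the identity there), so its complement projection is zero, exactly as the subspace model maps normal vectors to zero.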

C. Classification

Our main goal is to determine the feasibility of classifying sketches of the data packets, given a priori information about the structure of normal and anomalous traffic. To that end, we employ compressed training samples from normal and anomalous traffic to determine the best separating surface between the two classes. The simplest and fastest classifier is a linear kernel support vector machine; even though other kernels might yield better results, we leave this analysis for future work and focus on the basic linear kernel here. Given a training sample of normal traffic {x̂_i^n}_{i=1}^T and a sample of attacks {x̂_i^a}_{i=1}^T, both with elements in R^N, we obtain the compressed training samples {ŷ_i^n}_{i=1}^T and {ŷ_i^a}_{i=1}^T by applying the dimensionality reduction operator to the original elements: ŷ_i^n = G P(x̂_i^n) and ŷ_i^a = G P(x̂_i^a). Here the operator P(·) refers either to the projection onto the orthogonal subspace S⊥ or to the projection onto the orthogonal complement of the convex set, C⊥. Additionally we associate the labels l_i^n = 1 with the samples in the first group and the labels l_i^a = −1 with the samples in the second group, forming the sets of labeled and compressed training samples {(ŷ_i^n, l_i^n)}_{i=1}^T and {(ŷ_i^a, l_i^a)}_{i=1}^T. The optimal separating hyperplane is given by two parameters, the normal vector to the plane w and a bias scalar b. These parameters are the solution of the following problem:

  minimize_{w,b,ξ}  (1/2) w^T w + C Σ_i ξ_i    (4)
  subject to  l_i^n (w^T ŷ_i^n + b) ≥ 1 − ξ_i^n,  i = 1, ..., T,
              l_i^a (w^T ŷ_i^a + b) ≥ 1 − ξ_i^a,  i = 1, ..., T,
              ξ_i^{a,n} ≥ 0,

which finds the hyperplane with maximal margin and minimal misclassification over the selected training samples. The


The projection defined in equation (2) is the composition of several transformations. First we rotate the vector x by means of the unitary matrix U, which aligns the principal vectors with the coordinate axes. Then we rescale the vector by means of Σ_N^{−1/2}. The composition of these two transformations maps vectors inside an ellipsoid to vectors inside a sphere, allowing us to use P_{Bρ}(x) to find the projection onto a ball and then invert the rotation and scaling. Clearly the size of the ellipsoid is controlled by the parameter ρ. In the following we will see that a correct selection of this parameter is fundamental to the performance of the classifiers as the compression ratio increases. Since the matrix A is usually low rank, say rank(A) = r < min(N, T), there will be N − r elements that are zero or near zero on the diagonal of Σ_N. This presents a problem in the calculation of Σ_N^{−1/2}, but it can easily be overcome by modifying the matrix to Σ̂_N = Σ_N + δI, with δ > 0 a small value. After calculating the projection onto the orthogonal complement of the convex set, the dimensionality is reduced using a random matrix of the class defined in equation (1). The final dimensionality reduction operator is Φ(x) = G · P_{Eρ}^⊥(x).
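Once the compressed samples are formed, problem (4) can be solved with any soft-margin SVM solver. As a self-contained illustration (the paper does not specify a solver), the sketch below uses a Pegasos-style subgradient descent for a linear SVM; function names and the training schedule are our own assumptions, not from the paper.

```python
import numpy as np

def train_linear_svm(Y, labels, C=1.0, epochs=50, seed=None):
    """Minimal subgradient-descent solver for the soft-margin SVM of
    equation (4). Y: n x M compressed vectors, labels in {+1, -1}.
    A sketch, not a production solver."""
    rng = np.random.default_rng(seed)
    n, M = Y.shape
    w, b = np.zeros(M), 0.0
    lam = 1.0 / (C * n)               # Pegasos regularization parameter
    t = 0
    for _ in range(epochs):
        for i in rng.permutation(n):
            t += 1
            eta = 1.0 / (lam * t)     # decaying step size
            margin = labels[i] * (w @ Y[i] + b)
            w *= 1.0 - eta * lam      # shrink w (regularization term)
            if margin < 1:            # hinge-loss subgradient step
                w += eta * labels[i] * Y[i]
                b += eta * labels[i]
    return w, b

def classify(w, b, y):
    return np.sign(w @ y + b)         # g(y) = sign(<w, y> + b)
```

In practice the rows of Y would be the vectors G·P(x̂) produced by the operators above; any off-the-shelf linear SVM would serve equally well.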


Fig. 4. The different curves present the results using different numbers of vectors (Nq = 315, 300, 280, 150, 40) to estimate a basis of the subspace S. Each point was calculated by averaging the results of 10 different classifiers learned using different training samples. The compression ratio is the quotient between the number of rows and the number of columns of the random matrix G used in that series of experiments.

classification function based on this hyperplane is given by g(x) = sign(⟨w, x⟩ + b).

IV. EXPERIMENTAL RESULTS

To test the two approaches for compressed classification we recorded several minutes of conversations in our test bed. The anomalous packets were generated by injecting .pdf, .txt and .jpg files into the payloads of the RTP packets. The data matrices were then generated from these streams as described in Section II-A. The number of rows required to store a normal traffic packet is 9 for the header fields and 160 for payload bytes, while for anomalous packets it is 9 for the header and 161 for the payload. Taking that into account, we set the number of rows of the data matrices to 340, which suffices to accommodate 2 normal or anomalous packets. As a consequence of the different packet lengths, normal and anomalous data matrices differ in the last 2 rows, with the normal data matrix having lower rank than the anomalous data matrix. For this reason we generate 2 different data sets: the first corresponds to the original data matrices with 340 rows, and the second to the restriction of these matrices to the first 338 rows. The classification approach based on the subspace model was tested on the first data set, while the convex set covering approach was tested using the second data set. Both datasets are available for download at http://www.ece.udel.edu/~garategu/CISS2011-data/. Figure 4 depicts the results of the classification using the subspace model on the first dataset. The probability of correct detection for each level of compression was calculated by averaging the correct classification rate of 10 different classifiers trained from 1000 normal and anomalous samples. For each of the classifiers we calculate the rate of correct classification using a sample of 4500 points different from the ones used in the training stage. It can be seen that the performance of



Fig. 5. Classification performance for the subspace model on the second data set, varying the level of compression and the number of vectors Nq (2, 10, 30, 90, 180) used in the basis estimation.


Fig. 6. Classification performance for different scales of the ellipsoid (ρ = 0, 0.4, 0.667, 0.8, 1.067) used to calculate the projections, using the second dataset.

the classification decreases with the compression rate if the dimension of the estimated subspace is smaller than the true dimension, which is 321 in this case. When the number of vectors in the basis approaches the dimension of the subspace, all the normal vectors are mapped to 0 by P_{S⊥}, while the anomalous vectors still have components outside the subspace. The good performance achieved here can be attributed to the fact that the subspace assumption clearly holds: normal traffic vectors have their last 2 components equal to 0, while anomalous vectors do not. In the case of the second data set the problem is more challenging, since it is not obvious that the anomalous and normal vectors belong to different subspaces. Figure 5 shows the results of repeating the experiment of Figure 4, but this time using the second data set. We can see that the projection onto the orthogonal subspace actually degrades the average performance of the classifier as Nq approaches the dimensionality of the normal traffic subspace. These results confirm that the subspace model assumption does not hold for this dataset. If we use the convex set covering model instead (see Figure 6), the classification accuracy improves, but it strongly depends on the selection of the parameter ρ, which defines the size of the ellipsoid. In this simulation the average probability of correct classification was calculated by averaging the results of 10 different classifiers for each compression ratio.

V. DISCUSSION AND CONCLUSIONS

We have presented two simple methods for the classification of VoIP data packets that take advantage of knowledge about the structure of normal traffic. The subspace model method has the advantage of being very simple and yields excellent performance if the vectors are clearly separable. It only requires the multiplication of one column of the data matrix by the projection matrix Φ and the computation of the discriminative function g(Φx) to label each new data vector. The computation of Φ involves the estimation of a basis of the normal subspace, but this operation can be completed offline and repeated over long periods of time to account for variations in the normal traffic structure. On the other hand, if the anomalous traffic packets lie in the same or nearly the same subspace as the normal traffic, the model becomes too broad (see Figure 5) and we actually lose separability by using the subspace information. The convex set covering model, on the contrary, is more powerful, but at the same time requires more computations in the high dimensional space. Even though the matrices U Σ_N^{1/2} and Σ_N^{−1/2} U^T can be precomputed offline from a training sample of normal traffic, the projection operator requires the computation of the norm of each data vector, a comparison operation and possibly one multiplication of a scalar by a vector before multiplying by the random matrix G. We think that this work shows promising results and demonstrates that the incorporation of prior knowledge makes it possible to compress network traffic data while keeping relevant information that can be used for classification, detection or analysis of statistical behaviour. Future directions of this research include the incorporation of non-linear kernels in the support vector machine, which might help to improve separability between classes. Another possibility to improve separability is to simply increase the dimension of the data vectors: augmenting the columns of the data matrix can be sufficient to separate the subspaces enough that the subspace model can be applied.

REFERENCES

[1] T. Takahashi and W. Lee, "An assessment of VoIP covert channel threats," in Security and Privacy in Communications Networks and the Workshops (SecureComm 2007), Third International Conference on, pp. 371–380, 2007.
[2] J. Lubacz, W. Mazurczyk, and K. Szczypiorski, "Vice over IP," IEEE Spectrum, vol. 47, no. 2, pp. 42–47, 2010.
[3] J. Lubacz, W. Mazurczyk, and K. Szczypiorski, "Hiding data in VoIP," in Proceedings of the Army Science Conference (26th), 2008.
[4] W. Mazurczyk and K. Szczypiorski, "Steganography of VoIP streams," in On the Move to Meaningful Internet Systems: OTM 2008, pp. 1001–1018, Springer, 2008.


[5] Z. Wang, J. Paredes, and G. R. Arce, "Adaptive subspace compressed detection of sparse signals," submitted for publication, 2010.
[6] J. Paredes, Z. Wang, G. Arce, and B. Sadler, "Compressive matched subspace detection," European Signal Processing Conf., 2009.
[7] R. Calderbank, S. Jafarpour, and R. Schapire, "Compressed learning: Universal sparse dimensionality reduction and learning in the measurement domain," http://dsp.rice.edu/files/cs/cl.pdf, 2009.
[8] OpenSIPS, available at http://www.opensips.org/.
[9] RTPdump format specification, available at http://www.cs.columbia.edu/irt/software/rtptools/.
[10] Salare Security webpage, http://www.salaresecurity.com/.
[11] M. Davenport, M. Wakin, and R. Baraniuk, "Detection and estimation with compressive measurements," Dept. of ECE, Rice University, Tech. Rep., 2006.
[12] D. L. Donoho, "Compressed sensing," IEEE Transactions on Information Theory, vol. 52, no. 4, pp. 1289–1306, 2006.
[13] M. Duarte, M. Davenport, M. Wakin, and R. Baraniuk, "Sparse signal detection from incoherent projections," IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp. 305–308, May 2006.
[14] J. Haupt, R. Castro, R. Nowak, G. Fudge, and A. Yeh, "Compressive sampling for signal classification," Fortieth Asilomar Conference on Signals, Systems and Computers (ACSSC '06), pp. 1430–1434, 2006.
[15] R. Baraniuk, M. Davenport, R. DeVore, and M. Wakin, "A simple proof of the restricted isometry property for random matrices," Constructive Approximation, vol. 28, no. 3, pp. 253–263, 2008.
[16] Q. Du and T.-W. Wong, "Numerical studies of MacQueen's k-means algorithm for computing the centroidal Voronoi tessellations," Computers and Mathematics with Applications, vol. 44, no. 3, pp. 511–523, 2002.
