A Memory Efficient Privacy Preserving Representation of Connection Graphs Jan Jusko
[email protected] Cisco Systems, Inc. San Jose, CA 95134 USA
Martin Rehak
[email protected]
Tomas Pevny
[email protected]
Faculty of Electrical Engineering Czech Technical University in Prague Czech Republic
ABSTRACT
Connection graphs are often used for network traffic classification and P2P network analysis. With the appearance of Software Defined Networks (SDN), a novel approach to proactive distributed network management based on the multi-agent paradigm, there is a need to develop specialized graph representations. Once transmitted between elements of an SDN network, they provide answers to specific queries while protecting other information about the graph. In this paper we propose one such graph representation based on Bloom filters and show that it provides a considerable reduction of required memory and strong privacy while keeping a low false positive rate that does not have a negative impact on its intended use.
Keywords graph representation, probabilistic structures, privacy, persistence
1. INTRODUCTION
Connection graphs are commonly used to describe communications in computer networks. Vertices in these graphs represent endpoints of the network communication (e.g. hosts, IP addresses) and edges represent some form of communication. The definition of an edge is not strict and often depends on the purpose for which the graph is created. Connection graphs are equivalent to Traffic Dispersion Graphs, which are used in several network security solutions [15, 14]. An example of a simple connection graph can be seen in Figure 1.
Figure 1: Example of a simple connection graph with unweighted and undirected edges.
Connection graphs can also be naturally used for P2P overlay representation and detection. There are several methods that utilize graphs and graph methods to detect P2P networks. Various types of graphs are also used for theoretical evaluation of P2P networks’ properties [4, 13]. A lot of attention has been given to the field of P2P detection, mainly for two reasons: first, file sharing P2P networks consume a lot of network bandwidth and network administrators try to limit their throughput using traffic shaping; second, several botnets are known to use P2P overlay networks as their C&C channel [10, 6].
In a P2P overlay, or generally in a set of network connections, one might be interested in those that have some special properties. In this work we focus on connections that exhibit a long-term character, i.e. are “persistent”. Persistence was first formalized in [9]; we provide our formalization with minor changes in Section 4. Persistent connections are important because we believe that malware’s C&C connections are persistent.
In this paper we propose a memory efficient representation of a graph whose vertices represent network communication endpoints and whose edges represent persistent connections. The graph and its representation need to be dynamic, to reflect changes in the network activity over time. We want the graph representation to enable in-line network elements to decide quickly which connections are to be escalated for further processing. On the other hand, we want a representation that prohibits extracting any other information from the graph. Moreover, we want it to follow the privacy best practices to avoid any compromise of information. The details of the proposed representation can be found in Section 5.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]. ACySe ’14, May 06 2014, Paris, France. Copyright is held by the owner/author(s). Publication rights licensed to ACM. ACM 978-1-4503-2728-2/14/05 $15.00. http://dx.doi.org/10.1145/2602945.2602947
2. MOTIVATION
Working with graphs brings some technical challenges. The most obvious one is the selection of the optimal in-memory graph representation. For the graph structure itself there are two well established methods: adjacency list and adjacency matrix. Having information about the graph structure itself is usually not enough; additional information about vertices and edges is required as well. A discussion of graph representations can be found in Section 3.
In the field of distributed or multi-agent systems, the question of privacy must also be considered. The graph representation should follow the principle of least privilege [1] and contain only the information it will be queried for. Specifically, in the field of SDN there are predictions that some devices might be exploited using policy misconfigurations, with attackers getting complete control over the network [20]. Having structures that natively protect the data they contain from exploitation is thus crucial in order to enable delegation of decisions. Section 6 provides details on how our graph representation protects the data it contains.
In our scenario we consider a memory-efficient graph representation that can be deployed on in-line network elements. This structure will allow targeted escalation of flows that may participate in peer-to-peer behavior for inspection on a specialized device, allowing for example the discovery of information leaks from the protected networks while maintaining the availability of VoIP services. This memory structure can be populated by a higher-level controller process and pushed down to the network elements. Each network element can then use the implicit representation of the graph structure to make rapid decisions regarding enforcement of a specific network policy, e.g. the escalation for an inspection. In case the network element gets compromised, the data could be accessed only by a limited set of operations. The controller-element hierarchy suggested in the scenario is identical to the Software Defined Networks (SDN) hierarchy [8], depicted in Figure 2.
Figure 2: Visualization of SDN architecture.
3. RELATED WORK
There are two established ways to represent graphs in memory [11]:
• adjacency matrix,
• adjacency list.
When using an adjacency list, vertices are stored as objects, and every vertex stores a list of its adjacent vertices. The adjacency matrix is a two-dimensional matrix where rows represent source vertices and columns represent target vertices. The adjacency list representation allows for storing additional information on vertices. The adjacency matrix allows for assigning weights to the edges; in that case the matrix elements are not boolean. Time complexity of graph operations as well as space complexity of the required storage of the two representations can be found in Table 1.
Considering the space complexity, the table shows that it is more efficient to represent sparse graphs by an adjacency list and dense graphs by an adjacency matrix. However, the space complexities shown consider only the graph structure and do not include storing additional information about the vertices or edges in the graph. Usually when working with graphs, one wants to know what each vertex represents, so at least this information needs to be stored. Other obvious cases are keeping the edge weight or other more intricate edge properties. It is possible, and also encountered in real world examples, that the amount of data required for keeping the vertices’ and edges’ properties actually exceeds the amount of data required to keep the information about the graph structure.
Another consideration is the complexity of the structures used to describe various properties of vertices and edges. If these structures are too complex, they can be a source of issues for garbage-collected languages such as Java. In Java, the Garbage Collector is responsible for freeing the memory of unused objects. If there is a vast number of objects and the object structure is too complex, the garbage collection can take a considerable amount of time or fail altogether. This is explained in [17] and we observed the same behavior in our experiments.
For Java garbage collection, it is better to use a smaller number of larger objects than a vast number of small objects.
Both graph representations provide full control over the graph with every possible graph operation, and it is possible to extract arbitrary information about the graph. However, such a full spectrum of available operations on the graph is not always necessary. Other representations that offer lower space complexity can be used if one needs access only to specific information about the graph and/or only to limited operations/queries on the graph.
Table 1: Space complexity of the storage required by a graph representation and time complexities of various graph operations.
                   Adjacency list    Adjacency matrix
space complexity   O(|V| + |E|)      O(|V|^2)
add vertex         O(1)              O(|V|^2)
add edge           O(1)              O(1)
remove vertex      O(|E|)            O(|V|^2)
remove edge        O(|E|)            O(1)
contains edge      O(|V|)            O(1)
Bloom filters were used to approximately describe the structure of a de Bruijn graph [16]. De Bruijn graphs are directed graphs representing overlaps between sequences of symbols and are widely used in next-generation DNA sequencing. The probabilistic data structure is used to partition assembly graphs into components as a prelude to assembly of the exact graph. The memory footprint was reduced 20- to 40-fold.
A way to exactly represent de Bruijn graphs by Bloom filters was provided by [3]. The goal of the authors was to exactly represent a de Bruijn graph with an efficient implementation of the following operations:
• for any node, enumerate its neighbors,
• sequentially enumerate all the nodes.
The implementation using Bloom filters creates a probabilistic de Bruijn graph, and the exact operation of enumerating node neighbors is attained by recording all troublesome false positives. There are some operations that the proposed graph representation does not support, e.g. evaluating membership of an arbitrary node. The compressed representation reduces the size of the graph from around 30 GB to 3.7 GB.
Besides attempts to compress de Bruijn graphs there are also solutions for compression of graphs representing the World Wide Web [19]. The proposed solution to graph compression is twofold. Firstly, the URLs used to label vertices are compressed by replacing frequent hosts by pointers to a look-up table. The same is done with frequent strings like index.html. The implementation assumes that URLs are in lexicographic order. Further compression is then attained by finding the longest common prefix of two neighboring URLs, keeping the prefix in the first URL and the length of the shared prefix in the second URL. Secondly, the structure of the graph is compressed using a canonical Huffman code for the vertices with the highest in-degree. Hence, these are represented by short encodings. The remaining vertices are then encoded by a Golomb code.
4. PERSISTENCE OF A CONNECTION
The C&C channels utilized by botnets can take many forms and employ various techniques to avoid detection, and improve resiliency and reliability. One important property of all C&C channels that is crucial for botnets to keep their utility is the long term character of the C&C connection. Each bot in the botnet needs to receive new commands from its master and/or upload stolen or gathered data to be useful. For P2P networks specifically, persistent connection within the overlay usually signals that this connection
is important for the functionality of the overlay network. Especially in unstructured P2P network overlays, persistent connections are usually connections between ordinary nodes and supernodes. Supernodes serve as bridges between ordinary nodes and the rest of the network. They were used for example in KaZaA, which served as a basis for Skype’s protocol [12]. Today, supernodes can be found in Skype and Gnutella [7, 18]. In general, persistence of relationships can be linked to the efficient operation of any distributed collaborative system.
To capture the long-lived character of connections that are very lightweight and scarce we use a sliding window, which is kept for each of the connections. A connection is represented by its endpoints. The endpoint definition can vary; for example it can be an IP address or an (ip:port) tuple. The sliding window is called the observation window and is composed of several measurement windows, also called bins. Each bin is associated with a specific time interval, e.g. one hour. If a connection between two endpoints occurs within this time interval, the bin contains the value 1, otherwise it contains 0. The persistence of a connection is calculated as
$$p(c, W) = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}_{c,b_i}$$
where $c$ is the connection in question, $W = [b_1, ..., b_n]$ is the observation window composed of $n$ measurement windows, and the function $\mathbf{1}_{c,b_i}$ is equal to 1 if the connection was active at least once during the measurement window $b_i$ and to 0 otherwise. By this definition, the persistence is simply the fraction of bins in the observation window in which the connection was active. This measure gives the same weight to lightweight connections as it gives to heavy traffic connections. Thanks to this, we can discover “regular” and lightweight communication.
A connection is considered to be persistent if its persistence exceeds the persistence threshold, which is set by means of prior knowledge or experiments. In [9] the authors show on their experimental data that less than 20% of all connections have persistence higher than 0.2. This is in accordance with the assumption that most connections are active very sporadically or only once. The data also show that the persistence threshold should be chosen somewhere between 0.5 and 0.8. The authors chose 0.6 as their threshold value.
The size of the observation window determines how long we keep track of all connections, i.e. how long the memory of the algorithm is. The size of one bin determines the resolution of the algorithm. If the observation window consists of ten bins and each bin represents one hour, it is well suited to track bots that contact their master once an hour or every 30 minutes. However, tracking bots that connect to their C&C once a day is impossible with this setting. In general, one cannot say that all bots connect to the C&C server with the same or a similar time interval, and we cannot state in advance what time intervals will be used. Therefore, we use several bin sizes simultaneously, which gives us several resolutions to work with. This way we are able to detect channels occurring approximately every hour as well as channels occurring roughly once a day.
When using several bin sizes, we have several observation windows that all have the same size in terms of bins but cover different lengths of time. Thus we end up with several persistence values for a single connection. In this case, the overall persistence of the connection is determined as the maximum persistence over the resolutions
$$p(c, W) = \max_{j} p^{(j)}(c, W).$$
Figure 3: Visualization of storage scheme for determining persistence of a connection [9].
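As an illustration of the two definitions in this section, the following minimal Python sketch computes the persistence of one connection from its bins and then takes the maximum over two resolutions. The function names, the example bin values, and the choice to treat a persistence that reaches the threshold as persistent are assumptions made for the example, not part of the original scheme.

```python
def persistence(bins):
    """Fraction of bins (0/1 flags) in which the connection was active."""
    return sum(bins) / len(bins)


def overall_persistence(windows):
    """Maximum persistence over several resolutions (one window per bin size)."""
    return max(persistence(w) for w in windows)


# Hypothetical connection observed at two resolutions, 10 bins each.
hourly = [1, 0, 0, 0, 1, 0, 0, 0, 1, 0]        # active in 3 of 10 one-hour bins
four_hourly = [1, 1, 1, 0, 1, 1, 1, 1, 0, 1]   # active in 8 of 10 four-hour bins

THRESHOLD = 0.6  # threshold value reported for [9] in the text
p = overall_persistence([hourly, four_hourly])
is_persistent = p >= THRESHOLD  # assumption: reaching the threshold counts
```

In this example the connection looks sporadic at the one-hour resolution but regular at the four-hour resolution, which is the situation the multi-resolution scheme is meant to catch.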
5. PERSISTENCE GRAPH REPRESENTATION
The representation we propose is specially designed for graphs where vertices represent communication endpoints and edges represent persistent connections between them. In order to decide which connections are “persistent”, and thus which edges the graph should contain, we need to keep additional data structures for each connection. We also require that the graph changes with time according to the observed network communication. Since the graph changes with time, it is inefficient to create static copies at certain time periods to be sent to the in-line network elements, as these would quickly become obsolete. The most efficient solution is for the controller and the in-line network elements to share the same data structure.
The presented persistence measurement scheme using observation windows is memory heavy, especially on huge networks. The heaviest burden lies in storing the information that identifies the connections, i.e. the two endpoints. The keys in the map actually take more space than the information about the connections’ occurrences. If a certain level of error in deciding whether a connection is persistent (i.e. an edge is in the graph) is acceptable, we can utilize probabilistic set representations. These do not strictly contain elements, but provide a way to say whether some element is present in a set with a certain error. At the same time, the inability of probabilistic set representations to enumerate their elements is what we are looking for from a privacy preserving standpoint.
Our choice is the Bloom filter [2], which is a memory-efficient set representation. A Bloom filter can say whether a certain element is contained in the set it represents with one-sided error: it can never be wrong if it claims the element is not in the set, however it might be in error when claiming the element is in the set.
Bloom filters use a combination of independent hash functions and a bit field, and guarantee a certain false positive probability if the number of elements does not exceed a threshold specified when the Bloom filter is created. It is also clear that while Bloom filters can answer whether they contain some element, they cannot enumerate all elements they contain.
Figure 4: Structure of the sliding window when using bloom filters.
Considering the intended use of the graph representation, some false positives do not pose a problem and would result only in escalating more flows for further inspection. The rate of false positives can be kept below 1%, which should not present any performance challenges in the further processing. The questions are then:
• how do we measure persistence using Bloom filters, and
• how do we represent a graph structure that can say whether it contains a specific edge but reveals no additional information.
Persistence Measurement Using Bloom Filters. Instead of keeping a map with connections as keys and their respective observation windows as values, we keep an array of n Bloom filters, where n is the size of the observation window. Each Bloom filter represents a bin in the observation window. Unlike the original representation, where we kept one observation window for each connection, here we keep only one observation window for all connections. A sketch of the data structure can be found in Figure 4.
Recording an occurrence of a connection is simple: it is “applied” to the Bloom filter that represents the currently active bin. Determining the persistence of a connection is simple as well. Every Bloom filter in the array is queried whether it contains the connection in question. The persistence of the connection is the number of Bloom filters that claim they contain the connection, divided by the number of Bloom filters in the array.
Finally, there is the question of how the sliding of the observation window is implemented. It is particularly simple in this implementation: the observation window slides by removing one Bloom filter from the array and adding a new one.
There are three issues we need to address when using Bloom filters instead of classic sets and maps:
• what is the false positive rate of this implementation?
• how do we determine the size of the Bloom filters (which we need to know in advance)?
• how do we protect the Bloom filters from overflowing?
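The data structure described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation evaluated in this paper: the class names, the use of salted SHA-256 digests in place of independent hash functions, the filter size of 2^16 bits with 7 hash functions, and the example endpoints are all hypothetical.

```python
import hashlib
from collections import deque


class BloomFilter:
    """Minimal Bloom filter; salted SHA-256 stands in for independent hashes."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # One bit position per salt value.
        for salt in range(self.num_hashes):
            digest = hashlib.sha256(f"{salt}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


class PersistenceWindow:
    """One observation window shared by all connections: n Bloom filters."""

    def __init__(self, n_bins=10, num_bits=1 << 16, num_hashes=7):
        self._make = lambda: BloomFilter(num_bits, num_hashes)
        # A deque with maxlen drops the oldest bin automatically on append.
        self.bins = deque((self._make() for _ in range(n_bins)), maxlen=n_bins)

    def record(self, src, dst):
        """Apply a connection occurrence to the currently active bin."""
        self.bins[-1].add(f"{src}|{dst}")

    def slide(self):
        """Start a new bin; the oldest one falls out of the window."""
        self.bins.append(self._make())

    def persistence(self, src, dst):
        """Share of bins whose filter claims to contain the connection."""
        key = f"{src}|{dst}"
        return sum(key in bf for bf in self.bins) / len(self.bins)


window = PersistenceWindow()
for hour in range(10):
    if hour > 0:
        window.slide()
    if hour < 7:  # the connection is seen in the first 7 of 10 hours
        window.record("10.0.0.1:1234", "192.0.2.5:443")
p_seen = window.persistence("10.0.0.1:1234", "192.0.2.5:443")
```

Note that the window stores no endpoint identifiers at all; a caller can only ask about a connection it already knows, which is the property the privacy argument relies on.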
Bloom filters provide a guarantee of the false positive rate given that their capacity is not exceeded. The false positive rate can be chosen arbitrarily and impacts the size of the Bloom filter. In our use case we are not interested in the false positive rate of a single Bloom filter; we want to know the probability that a connection which is not persistent is reported as persistent. This probability will differ based on its true value of persistence and the persistence threshold value and can be expressed as
$$e(c) = \sum_{i=p_t - p(c)}^{n - p(c)} \binom{n - p(c)}{i} \, fpp^{\,i} \, (1 - fpp)^{\,n - p(c) - i} \qquad (1)$$
where $p(c)$ is the true persistence of the connection as defined in Section 4 multiplied by the size of the observation window (i.e. the number of bins in which the connection was active), $p_t$ is the persistence threshold (likewise expressed as a number of bins, as the summation limits require), $n$ is the size of the observation window and $fpp$ is the false positive probability of the used Bloom filters. To give an example, if we use an observation window of 10 bins, Bloom filters with 1% false positive probability and a persistence threshold of 0.6, the probability of a connection with persistence 0.5 being reported as persistent is slightly less than 5%, and the probability for a connection with persistence 0.4 is around 0.15%. Please note that if the persistence of a connection is higher than the persistence threshold then $e(c) = 0$.
All false positive probability guarantees of Bloom filters are based on the assumption that the projected capacity is not exceeded. The projected capacity must be known before the Bloom filter is created. Network traffic volume follows trends based on the time of day and day of the week and also depends on the type of the network, therefore we must adapt the size of the used Bloom filters as well. For example, in a corporate network the traffic volume is at its highest during working days between 8am and 6pm, and the network is less utilized during the weekends. To determine the proper size of a Bloom filter we employ a memory structure that stores the maximal encountered traffic volume for specific days of the week and hours of the day. When a Bloom filter is about to be created, its size is determined based on the maximal traffic volume in the period of time the Bloom filter should cover.
It is possible that a sporadic event occurs in the network (e.g. a huge scan or DDoS) and the number of connections in the network rises considerably. For such an event we employ a safeguard that monitors the number of inserts to the Bloom filter. If the number of inserts reaches the projected size value, a new Bloom filter is created and all following connection occurrences are stored into it.
When querying for an occurrence of a connection, both the original bloom filter and additional bloom filters are queried.
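Equation (1) can be checked numerically with a short Python sketch. The function name and its argument names are hypothetical; persistence and threshold are passed as bin counts, as in Equation (1), and the window of 10 bins matches the setup used later in the paper.

```python
from math import comb


def edge_error(true_bins, threshold_bins, n, fpp):
    """Equation (1): probability that Bloom filter false positives lift a
    connection active in `true_bins` of `n` bins to `threshold_bins` or more."""
    if true_bins >= threshold_bins:
        return 0.0  # the connection is persistent, so no false report
    free = n - true_bins               # bins where the connection was absent
    need = threshold_bins - true_bins  # false positives required
    return sum(comb(free, i) * fpp**i * (1 - fpp)**(free - i)
               for i in range(need, free + 1))


# Threshold 0.6 of a 10-bin window, 1% false positive probability per filter.
e_05 = edge_error(5, 6, 10, 0.01)  # true persistence 0.5
e_04 = edge_error(4, 6, 10, 0.01)  # true persistence 0.4
```

Under these parameters `e_05` comes out slightly below 5% and `e_04` at roughly 0.15%, in line with the example given above.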
Graph Representation Using Bloom Filters. A list of Bloom filters provides a solution for effectively measuring the persistence of a connection without storing information about specific connections. We actually do not need to add any other data structures to represent a graph that can only answer queries about the existence of an edge. The graph has only edges that represent persistent connections, therefore a query whether the graph contains an edge is equivalent to determining whether the given connection is persistent. This can be evaluated by queries on the Bloom filters as stated earlier.
6. PRESERVING PRIVACY
Deployment in a multi-agent or distributed system emphasizes the need to maintain privacy, since different agents can have different goals and parts of a distributed system can be compromised. Therefore, to limit the possibility of the information being misused, it is of utmost importance to limit the information provided to other parties or parts of the system to the absolute minimum. In this section we analyze the options attackers have to retrieve information from the graph that was not intended to be shared. We calculate the expected false positive probability of the graph representation and show what impact it has on the results obtained by a brute-force attack.
The proposed graph representation supports the following queries:
• query about the existence of an edge,
• query about the existence of a path (extension of the previous case).
The graph representation is based on a combination of Bloom filters, which have one-sided error. Therefore, every query for the existence of an edge that is present in the graph always returns a correct answer. The situation is different when querying for the existence of a missing edge: the answer can be wrong. To evaluate the probability of a wrong answer we need to realize that the existence of an edge in the graph is equivalent to the fact that the connection the edge represents is persistent, i.e. its persistence is higher than the persistence threshold. This probability can be evaluated using Equation (1). It can be seen that the probability of error depends on the properties of the used Bloom filters as well as on the persistence measurement setup. Table 2 shows the probability of error for queries for a missing edge for two choices of persistence threshold in the case that the Bloom filters have a projected false positive rate of 1%. The data in the table show that the probability of error drops considerably with decreasing persistence of the connection.
The situation is similar when querying for a specific path: if the path exists, the query is always answered correctly. If the path does not exist, the probability of an erroneous answer depends on the number of edges in the path that are not present in the graph and on the persistence of the connections that represent those missing edges. The probability of a wrong answer when querying for the existence of a path can be expressed as
$$e(p) = \prod_{c \in p} e(c) \qquad (2)$$
where $p = (c_1, c_2, ..., c_n)$ is the path in question and $e(c)$ is the probability of error from Equation (1). Clearly the probability of error quickly decreases with every missing edge in the path.
Bloom filters guarantee their false positive rate given their capacity is not exceeded, and in our implementation we take extra measures to guarantee that it never happens. Hence the error rates expressed by Equations (1) and (2) are guaranteed as well.
Table 2: Probability of error (in %) when querying for a missing edge, given the persistence threshold $p_t$ (columns) and the true persistence of the connection the edge represents (rows). The Bloom filters have a projected false positive rate of 1% and the observation window consists of 10 bins. The entries for true persistence 0.3 and 0.4 are ordered as Equation (1) yields them.
true persistence   p_t = 0.6        p_t = 1.0
0.0                2.03 × 10^-8     10^-18
0.1                1.22 × 10^-6     10^-16
0.2                6.78 × 10^-5     10^-14
0.3                3.4 × 10^-3      10^-12
0.4                0.15             10^-10
0.5                4.9              10^-8
0.6                0                10^-6
0.7                0                10^-4
0.8                0                10^-2
0.9                0                1
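As a small worked instance of Equation (2), the sketch below multiplies the per-edge error probabilities of the missing edges of a path. Restricting the product to the missing edges is our reading of Equation (2) for this example (edges that are present are always answered correctly), and the function name is hypothetical. The two input values are the Table 2 entries for true persistence 0.5 and 0.4 at threshold 0.6, converted from percent.

```python
from math import prod


def path_error(missing_edge_errors):
    """Equation (2), taken over the missing edges of the queried path."""
    return prod(missing_edge_errors)


# A path with two missing edges: error 4.9% and 0.15% respectively.
e_path = path_error([0.049, 0.0015])
```

With these inputs the path query errs with a probability of about 7 × 10^-5, far below the error of either edge alone, which illustrates how quickly the error decays with every additional missing edge.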
Since Bloom filters do not support enumeration of their elements, the following operations are not supported by the proposed graph representation:
• enumerate a specific node’s communication partners,
• list all nodes,
• list all edges,
• check for indirect communication between nodes.
Nevertheless, the aforementioned queries could be answered by repeated querying for the existence of an edge. In the following paragraphs we show that even if the attacker completed a successful brute-force attack and retrieved the set of all edges as claimed by the graph representation, the resulting set would be poisoned by a considerable number of false positives. For each of the attacks we determine the number of queries to contains_edge() an attacker needs to make, and then, based on theoretical properties and experimental results, we show the expected number of false positives contained in the attack results.
To enumerate all communication peers of a specific endpoint (a vertex in the graph) the attacker has to issue ∼10^9 (i.e. 255^4) queries to contains_edge() of the graph representation. This holds for the IPv4 address space; for IPv6 the number is even greater. The actual number of queries will be slightly lower than 255^4 since the attacker does not need to query for connections to private ranges, e.g. 10.x.x.x.
Enumerating all connections recorded in the graph is equivalent to finding all communication partners for all nodes in the graph. If the attacker does not know the IP range of the addresses contained in the graph, ∼10^18 queries need to be issued. If the edges are undirected, the number of queries is reduced by a factor of two. In reality, the graph always covers only the communication of a small subset of IP addresses, because the entity that created and keeps the graph has only partial knowledge of the communication on the Internet. Therefore it can only record connections leaving and entering the range it has vision of, and connections within this range. An attacker who knows this range can limit the set of possible endpoint combinations to those where at least one side of the communication is within this range. In that case the attacker needs to issue ∼10^9 queries multiplied by the size of the known IP range.
Enumeration of all vertices in the graph requires the same brute-force attack as enumeration of all edges in the graph, therefore the number of required queries is the same.
Determining whether there was an indirect communication between two communication endpoints is equivalent to determining whether there is a path between two nodes in a graph. This can be determined by Dijkstra’s algorithm (or more precisely its special case for unweighted graphs, breadth-first search) [5]. However, to run the algorithm one must know the structure of the graph, which means one needs to enumerate all nodes and edges. These two tasks were already covered.
We have already shown the error rate for connections with various levels of persistence. To estimate the number of false positives obtained in the attack we need to determine the prior probabilities of connections of various persistence. To do that we performed an experiment on a university network, which consists of approximately 1000 hosts. The network traffic was collected for 20 hours during a working day. The number of flows within one 5 minute interval ranges from 37,000 at night to 240,000 during peak hours. We measure the persistence of connections with endpoints in the form ip:port, use measurement windows of 1, 2 and 4 hours, and have 10 measurement windows in the observation window. The percentage of connections with a given persistence can be found in Table 3. Please note that while Table 3 does not include connections with persistence value 0 (connections that never occurred), we do consider them in the calculation of the expected number of false positives obtained in a brute-force attack.
Having the empirical probability of occurrence of a connection with specific persistence levels, we can determine the expected probability of error of the query contains_edge() using the proposed graph representation
$$E = \sum_{i=1}^{p_t - 1} e(c(i)) \times p_o(c(i)) \qquad (3)$$
where $e(c(i))$ is the probability of error for a connection with persistence value $i$, calculated according to (1), and $p_o(c(i))$ is the probability of occurrence of a connection with the specific persistence based on experimental data. Equation (3) evaluates to 3.2 × 10^-9 when using persistence threshold 0.6. Further on we use this expected error and assume the attacker has no knowledge about the IP range covered by the connection graph.
When enumerating the communication partners of a given node an attacker needs to do 255^4 queries with probability of error 3.2 × 10^-9. That yields approximately 14 false positives. In the remaining three attacks ∼10^18 queries must be issued. Multiplying by the expected error rate yields ∼2.9 × 10^10 false positives. That is more false positives than there are real connections in the graph describing the network from our experiment on the university network.
The presented results show that the number of false positives is so high that the brute-force attack on the graph representation lacks any usefulness. Moreover, the false positives give the graph representation a plausible deniability property [1]. The owner of the graph can always claim that a specific connection in the graph is a false positive rather than a true value.
Table 3: In the experiment we monitored the distribution of connections with specific persistence. The connections were in the form (ip:port, ip:port). Most of the connections occur only once and considerably fewer connections occur more than twice.
persistence   # of connections   %
0.1           2,052,707          97.987
0.2           37,186             1.775
0.3           2,975              0.142
0.4           598                0.029
0.5           241                0.012
0.6           190                0.009
0.7           175                0.008
0.8           114                0.005
0.9           189                0.009
1.0           499                0.024
total         2,094,874          100
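The false positive counts quoted in this section can be reproduced with simple arithmetic, sketched here in Python. The variable names are hypothetical. For the full enumeration we assume undirected edges, i.e. half of the 255^8 ordered address pairs; this is one reading that reproduces the reported figure, not a statement of how the original numbers were obtained.

```python
EXPECTED_ERROR = 3.2e-9  # value of Equation (3) at threshold 0.6, from the text

# Enumerating the peers of one node: one query per candidate IPv4 address.
peer_queries = 255 ** 4
fp_single_node = peer_queries * EXPECTED_ERROR

# Enumerating all edges: all address pairs, halved for undirected edges
# (assumption made for this example).
edge_queries = 255 ** 8 / 2
fp_all_edges = edge_queries * EXPECTED_ERROR
```

Under these assumptions `fp_single_node` is about 13.5, matching the approximately 14 false positives stated above, and `fp_all_edges` is about 2.9 × 10^10.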
7. PERFORMANCE EVALUATION
In the performance evaluation we focus on the following properties of the proposed graph representation:
• memory consumption,
• real number of false positives.

In the experiment we create the proposed graph representation of the network communication from the network of a telco provider. The network encompasses tens of thousands of users and has a throughput of 40 Gbps, with the number of flows per 5 minutes ranging approximately from 2 million to 11 million. The data set spans 3 days in November 2011. When measuring the persistence of connections we used measurement windows of 1, 2 and 4 hours. The observation window consists of 10 bins. The persistence threshold was fixed at 0.6. Finally, the size of the Bloom filters was set to a value that guarantees a 1% false positive probability.

In the first part of the evaluation we compared the memory requirements of the persistence measurement scheme proposed in Section 4, combined with a graph represented by adjacency lists, with the memory requirements of the proposed graph representation. Whereas in the first case the data occupied 14 GB of memory, the latter required only 300 MB, which is roughly 46 times less.

The remaining question is whether such compression causes an unacceptable rate of false positives. For the purpose of comparison we imitated a real use case on the same data set, in which a graph is created from network traffic and then queried for the existence of edges that represent connections occurring in the network. First, we created a graph based on the connections occurring in the data set. Some of the connections were persistent, but most were not. Then we queried for the existence of edges representing real connections in the data set. Querying the exact graph representation gives the exact set of persistent connections, denoted $S_e$. The same set of queries issued against the proposed graph representation yields $S_p$. Note that, because a Bloom filter produces no false negatives, $S_e \subseteq S_p$. We determine the false positive probability of the proposed graph representation as

$$p_{fp} = \frac{|S_p \setminus S_e|}{|S_a|} \qquad (4)$$

where $S_a$ is the set of all connections occurring in the data set. The attained false positive rate was 0.4%.
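The measurement in Equation (4) can be illustrated with a toy Python sketch. It is not the authors' implementation: the class `BloomEdgeSet`, the SHA-256 double-hashing scheme, the synthetic endpoint names and the helper `run_demo` are all assumptions made for illustration, and persistence measurement is skipped by simply declaring a subset of connections persistent. The filter is sized with the standard Bloom filter formulas for a target false positive probability of 1%, matching the setting described above.

```python
import hashlib
import math


class BloomEdgeSet:
    """Toy Bloom filter that stores edges given as (source, destination) pairs."""

    def __init__(self, expected_items, fp_probability=0.01):
        # Standard sizing: m = -n ln(p) / (ln 2)^2 bits and k = (m / n) ln 2 hashes.
        bits = -expected_items * math.log(fp_probability) / math.log(2) ** 2
        self.num_bits = max(1, math.ceil(bits))
        self.num_hashes = max(1, round(self.num_bits / expected_items * math.log(2)))
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, edge):
        # Assumed scheme: double hashing from two 64-bit halves of a SHA-256 digest.
        digest = hashlib.sha256(f"{edge[0]}->{edge[1]}".encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add_edge(self, edge):
        for pos in self._positions(edge):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def contains_edge(self, edge):
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(edge))


def false_positive_rate(s_e, s_p, s_a):
    """Equation (4): size of (S_p minus S_e) divided by the size of S_a."""
    return len(s_p - s_e) / len(s_a)


def run_demo(num_persistent=2_000, num_total=20_000):
    """Build synthetic sets S_e, S_p and S_a; the first edges are deemed persistent."""
    all_edges = [(f"host{i}:1024", "server:443") for i in range(num_total)]
    s_a = set(all_edges)
    s_e = set(all_edges[:num_persistent])
    bloom = BloomEdgeSet(expected_items=len(s_e), fp_probability=0.01)
    for edge in s_e:
        bloom.add_edge(edge)
    s_p = {edge for edge in s_a if bloom.contains_edge(edge)}
    return s_e, s_p, s_a


if __name__ == "__main__":
    s_e, s_p, s_a = run_demo()
    print(f"p_fp = {false_positive_rate(s_e, s_p, s_a):.4f}")
```

Because the denominator in Equation (4) is the set of all connections rather than only the non-persistent ones, the measured rate can fall below the 1% design target of the filter.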
8. CONCLUSION
In this paper we presented a specialized connection graph representation that can be used in SDN. It enables in-line network elements to decide quickly which connections should be escalated for further processing, while preventing extraction of any other information from the graph. Moreover, the graph itself is dynamic and is adjusted over time to reflect the current state of network communication. We showed that the memory requirements were reduced approximately 46 times and that the measured false positive rate was 0.4%. Even though a brute-force attack can be used to extract additional information from the graph, we showed theoretically that the results obtained would be heavily poisoned by false positives produced by the probabilistic structures. This gives the representation a plausible deniability property.
9. ACKNOWLEDGMENTS
This work was supported by the project of the Czech Ministry of Interior No. VG20122014086.
10. REFERENCES
[1] Anderson, R. J. Security Engineering: A Guide to Building Dependable Distributed Systems, 2nd ed. John Wiley & Sons, Inc., New York, NY, USA, 2008.
[2] Bloom, B. H. Space/time trade-offs in hash coding with allowable errors. Commun. ACM 13, 7 (July 1970), 422–426.
[3] Chikhi, R., and Rizk, G. Space-efficient and exact de Bruijn graph representation based on a Bloom filter. In Algorithms in Bioinformatics, B. Raphael and J. Tang, Eds., vol. 7534 of Lecture Notes in Computer Science. Springer Berlin Heidelberg, 2012, pp. 236–248.
[4] Davis, C. R., Neville, S., Fernandez, J. M., Robert, J.-M., and McHugh, J. Structured peer-to-peer overlay networks: Ideal botnets command and control infrastructures? In ESORICS (2008), S. Jajodia and J. López, Eds., vol. 5283 of Lecture Notes in Computer Science, Springer, pp. 461–480.
[5] Dijkstra, E. W. A note on two problems in connexion with graphs. Numerische Mathematik 1, 1 (1959), 269–271.
[6] Dittrich, D., and Dietrich, S. P2P as botnet command and control: a deeper insight. In Proceedings of the 3rd International Conference on Malicious and Unwanted Software (Malware 2008) (2008), pp. 46–63.
[7] Ehlert, S., Petgang, S., and Magedanz, T. Analysis and signature of Skype VoIP session traffic. In 4th IASTED International (2006).
[8] Open Networking Foundation. Software-defined networking: The new norm for networks. https://www.opennetworking.org/images/stories/downloads/sdn-resources/white-papers/wp-sdn-newnorm.pdf. Accessed: 2014-3-15.
[9] Giroire, F., Chandrashekar, J., Taft, N., Schooler, E., and Papagiannaki, D. Exploiting temporal persistence to detect covert botnet channels. In Proceedings of the 12th International Symposium on Recent Advances in Intrusion Detection (Berlin, Heidelberg, 2009), RAID ’09, Springer-Verlag, pp. 326–345.
[10] Grizzard, J. B., Sharma, V., Nunnery, C., Kang, B. B., and Dagon, D. Peer-to-peer botnets: overview and case study. In Proceedings of the First Workshop on Hot Topics in Understanding Botnets (Berkeley, CA, USA, 2007), HotBots’07, USENIX Association, pp. 1–1.
[11] Gross, J. L., and Yellen, J. Graph Theory and Its Applications. 2005.
[12] Guha, S., Daswani, N., and Jain, R. An Experimental Study of the Skype Peer-to-Peer VoIP System. In IPTPS’06: The 5th International Workshop on Peer-to-Peer Systems, Microsoft Research.
[13] Ha, D. T., Yan, G., Eidenbenz, S., and Ngo, H. Q. On the effectiveness of structural detection and defense against P2P-based botnets. In Proceedings of the 39th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN’09) (Estoril, Lisbon, Portugal, June 2009).
[14] Iliofotou, M., chul Kim, H., Faloutsos, M., Mitzenmacher, M., Pappu, P., and Varghese, G. Graption: A graph-based P2P traffic classification framework for the internet backbone. Computer Networks 55, 8 (2011), 1909–1920.
[15] Iliofotou, M., Pappu, P., Faloutsos, M., Mitzenmacher, M., Singh, S., and Varghese, G. Network monitoring using traffic dispersion graphs (TDGs). In Proceedings of the 7th ACM SIGCOMM Conference on Internet Measurement (New York, NY, USA, 2007), IMC ’07, ACM, pp. 315–320.
[16] Pell, J., Hintze, A., Canino-Koning, R., Howe, A., Tiedje, J. M., and Brown, C. T. Scaling metagenome sequence assembly with probabilistic de Bruijn graphs. Proceedings of the National Academy of Sciences (2012).
[17] S., C. Object count impact on garbage collection performance. http://jroller.com/slobodan/entry/object_count_impact_on_garbage, September 2005.
[18] Stutzbach, D., Rejaie, R., and Sen, S. Characterizing unstructured overlay topologies in modern P2P file-sharing systems. IEEE/ACM Trans. Netw. 16, 2 (Apr. 2008), 267–280.
[19] Suel, T., and Yuan, J. Compressing the graph structure of the Web. In Data Compression Conference, 2001. Proceedings. DCC 2001. (2001), IEEE, pp. 213–222.
[20] Venugopalan, R. 2014 threats predictions: Software defined networking promises greater control while increasing security risks. http://blogs.mcafee.com/mcafee-labs/2014-threats-predictions-software-defined-networking-promises-greater-control-while-increasing-security-risks. Accessed: 2014-3-6.