international journal of critical infrastructure protection 5 (2012) 108 –117
Available online at www.sciencedirect.com
www.elsevier.com/locate/ijcip
A global reference model of the domain name system Yakup Koc- a, Almerima Jamakovicb, Bart Gijsenc,n a
Department of Multi-Actor Systems, Systems Engineering Group, Delft University of Technology, P.O. Box 5015, 2600 GA Delft, The Netherlands b Communications and Distributed Systems Group, Institute of Computer Science and Applied Mathematics, University of Bern, Neubru¨ckstrasse 10, CH-3012 Bern, Switzerland c TNO – Netherlands Organization for Applied Research, P.O. Box 5050, 2600 GB Delft, The Netherlands
ar t ic l e in f o
abs tra ct
Article history:
The domain name system (DNS) is a crucial component of the Internet. At this time, the DNS
Received 17 May 2012
is facing major changes such as the introduction of DNSSEC and Internationalized Domain
Accepted 9 August 2012
Name extensions (IDNs), the adoption of IPv6 and the upcoming extension of new generic
Available online 24 August 2012
top-level domains. These changes can significantly impact the behavior of the DNS. This
Keywords:
paper presents a global DNS reference model for predicting DNS traffic behavior under specific
Domain name system (DNS)
conditions. The quantitative reference model is intended to be used for analyzing ‘‘what-if’’
Global reference model
scenarios—for example, how would DNS query rates at the recursive and authoritative name
Traffic behavior prediction
servers increase if DNSSEC validation errors were to cause more ServFail responses to be sent to DNS clients? The DNS reference model takes into account all relevant components present in the DNS architecture. Real-world data from recursive resolvers is analyzed statistically in order to characterize the system variables that describe query behavior at each of the independent system components. In addition, experimental results that characterize DNS client behavior and data from the literature are used to model the behavior of authoritative name servers. The reference model is validated by comparing the model predictions with the behavior observed in real-world operations. The validation results demonstrate the accuracy of the model predictions. A what-if scenario dealing with the effect of ServFail responses on DNS traffic flow is also presented to demonstrate the applicability of the model. & 2012 Elsevier B.V. All rights reserved.
1.
Introduction
The Internet has become an essential part of modern society. As a consequence, the stability of the Internet, including the domain name system (DNS) as a key component, is crucial. The DNS is primarily used to translate human-readable domain names into the corresponding Internet protocol (IP) addresses that are used for routing purposes. For instance, because of the DNS, one can use cnn.com instead of 157.166.255.19. The data for mapping domain names to n
IP addresses is stored in a tree-structured distributed database, where the mapping responsibility for each domain is assigned to designated authoritative name servers (NSs). The authoritative NSs are responsible for their particular zones, which typically are the root, top-level domain (TLD) and second-level domain (SLD). This distributed mechanism makes the DNS distributed and resilient against failure [10]. The top layer of the DNS hierarchy is facing major changes. These include cryptographically signing the authoritative NS with DNSSEC, deploying new generic TLD names by allowing
Corresponding author. E-mail addresses:
[email protected] (Y. Koc-),
[email protected] (B. Gijsen).
1874-5482/$ - see front matter & 2012 Elsevier B.V. All rights reserved. http://dx.doi.org/10.1016/j.ijcip.2012.08.001
international journal of critical infrastructure protection 5 (2012) 108 –117
domains such as .bank, and deploying internationalized domain name extensions (IDNs), possibly with non-ASCII characters. Another key change is the uptake of IPv6 that is intended to make the Internet ‘‘future proof.’’ These developments can impact DNS stability and, indirectly, the continuity of the entire Internet. For example, the query load towards an authoritative NS is expected to increase [5,12] and a specific type of DNS query response (i.e., ServFail) may increase [2]. These challenges require enhanced public awareness and more research focused on understanding DNS behavior in the evolving DNS landscape. This paper presents a global DNS reference model for analyzing what-if scenarios—for example, how would DNS query rates at the recursive and authoritative name servers increase in the event that DNSSEC validation errors cause more ServFail responses to be sent to DNS clients? The global reference model takes into account the typical DNS architecture: starting from a client operating system with its stub resolver and application browser, then to the recursive resolver (present mostly at an Internet service provider), and finally to the authoritative NS, which includes the root, TLD and SLD servers. Real-world data from recursive resolvers is statistically analyzed to characterize the system variables that describe query behavior at each of the independent system components; the data used was provided by SURFnet, which serves a large number of academic customers in The Netherlands. DNS client behavior is characterized using an experimental study conducted by TNO and SIDN, and the DNS behavior of authoritative name servers is characterized in detail using data from the literature. The reference model is validated using Monte Carlo simulation to generate DNS behavior predictions, which are compared with behavior seen in real-world data. The validation results demonstrate the accuracy of the model predictions. Finally, a what-if scenario involving the impact of ServFail responses on DNS traffic is presented to demonstrate the applicability of the reference model. Overall, the paper establishes a path towards the proper understanding of the DNS behavior in the increasingly evolving DNS landscape.
2.
Related work
The measurement and characterization of DNS traffic is an active field of research. Most contributions are based on DNS query monitoring and data analysis, focusing mainly on the upper DNS hierarchy. For example, the vast majority of research attempts to address the issue of characterizing DNS traffic at the root: CAIDA and the Measurement Factory have performed numerous monitoring studies focused on the traffic at the root (see, e.g., [4,5,13]). Aside from traffic analysis at the core component of the upper DNS hierarchy, researchers have attempted to characterize DNS traffic at the recursive resolver and at the client side. For example, Brandhorst and Pras [3] provide a statistical analysis of DNS traffic at the recursive resolver while other researchers [1,15] compare the performance of caching recursive resolvers with respect to query response time and querying behavior towards the root. Also, Gijsen [7] has
109
attempted to characterize the querying behavior of specific clients (operating systems and application browsers). Equally relevant to our research are experimental studies that seek to understand the effectiveness of DNS caching [6,8,16]. Several researchers lament the lack of data for longterm research and analysis in support of DNS performance, stability and security. For example, Castro et al. [5] highlight the need for data to evaluate the DNS during its expected transition phase. Although there is substantial literature on the characterization of the traffic and querying behavior at each individual hierarchical level of the DNS, hardly any attention has been paid to the study of the entire DNS. Several researchers (see, e.g., [1]) pinpoint the limitations of the current DNS deployment and its influence on the performance of applications. However, to this day, there is a little understanding of how the DNS behaves as a whole, especially when the expected changes are incorporated and the querying mechanism needs detailed analysis. By introducing a reference model of the entire DNS, this paper makes an important contribution to the fundamental understanding of DNS behavior in the evolving DNS landscape.
3.
DNS reference model
This section discusses the details of the DNS reference model, including its general features and assumptions, and system variables.
3.1.
General features and assumptions
A primary concern is the scalability of the DNS system when, for example, redundant DNS traffic is sent to the recursive resolvers and authoritative NSs. We therefore create a reference model at the flow level, being only interested in the query flow distribution at an arbitrary point in time. Consequently, the notion of time does not play a role and the distribution of DNS queries only depends on the behavior of various components of the DNS system. We choose to distinguish between the following generic components in the DNS: (i) clients with their operating systems (and the corresponding stub resolvers) and application browsers; (ii) recursive resolvers; and (iii) authoritative NSs with the root, TLD and SLD NSs. Fig. 1 shows the generic components of the DNS system and the interactions between them. In our model, we assume that all clients of the same configuration type are independent and have identical querying behavior, so they can be modeled as one aggregate client. The same holds true for the recursive resolvers and authoritative NSs (root, TLD and SLD). This assumption enables us to control the entire system by only adjusting the input parameters for a single client, a single recursive resolver, and a single root, TLD and SLD. Furthermore, we model the querying behavior with a system variable called the ‘‘query multiply factor,’’ which reflects the number of queries that are re-initiated by a component after a negative query response. The caching behavior of a component depends strongly on the TTL values and inter-arrival time of queries, having a stochastic and state dependent
110
international journal of critical infrastructure protection 5 (2012) 108 –117
Fig. 1 – DNS modeling structure. behavior [8]. Moreover, the caching mechanism is not modeled as a state (i.e., whether or not the domain name is in the cache), but rather by a probability that a queried domain name is in the cache of the corresponding system component. We refer to this system variable as the ‘‘cache hit ratio.’’ Finally, we assume that the type of a response to an initial query at the authoritative NS follows a certain distribution. This variable is referred to as the ‘‘response distribution at authoritative name servers.’’ The values of these system variables are obtained by analyzing real-world data comprising 10 sets of 30,000 DNS packets captured at an UNBOUND resolver over a duration of 14 s.
3.2.
System variables
This section describes the system variables in detail.
3.2.1.
Cache hit ratio
The ‘‘cache hit ratio’’ expresses the probability that a queried domain name is in the cache of a system component under consideration. The cache hit ratio notion is different for the client and recursive resolver sides; the two sides are, therefore, treated separately. The value at the client side expresses the probability that a query is answered with a certain response type from the cache. Whether or not the received DNS data is cached depends on the response type. For instance, application browsers cache only DNS response types that provide valid data (i.e., valid, valid 4512 B and truncated) and the NXdomain response type. The cache hit ratio values for the operating system and application browser are rather complicated to determine. Therefore, considering the relative scale of the DNS reference model, we assume ‘‘rule of thumb’’ values for the operating system and application browser (Tables 1 and 2). For specific cases where the cache hit ratios are known, the known values should be used in place of the table values. Note also that the tables demonstrate that we distinguish between operating systems and browsers because the cache behavior can depend on them. In Tables 1 and 2, ‘‘Total’’ represents the percentage of the total traffic that is responded to from the operating system/ application browser cache. Correspondingly, the 22% value for the IE8 application browser indicates the amount of traffic
Table 1 – Cache hit ratio values for four operating system types. Response type
Windows XP (%)
Windows 7 (%)
Linux (%)
Mac OSX (%)
Total Valid Valid ð4512 BÞ NXdomain Truncated
25 22 1
25 22 1
0 0 0
25 22 1
1 1
1 1
0 0
1 1
Table 2 – Cache hit ratio values for three application browser types. Response type
IE8 (%)
Firefox (%)
Safari (%)
Total Valid Valid ð4512 BÞ NXdomain Truncated
25 22 1 1 1
25 22 1 1 1
25 22 1 1 1
that is responded with the ‘‘Valid’’ response type. The cache hit ratio values for the operating system and application browser are typically lower than the corresponding values for the recursive resolver because these are single client caches. The cache hit ratio values at the recursive resolver are quite different from those at the client side. From the caching point of view, queries arriving at the recursive resolver are classified into four groups:
Non-cached queries are queries that are not in the cache.
These queries have to be sent to the root directly and domain name resolution is performed by the resolver until the name is resolved. TLD-cached queries are queries for domain names for which only the NS for the extension is in the cache of the resolver. This means that TLD-cached queries are sent directly to the TLD NS, skipping the root. SLD-cached queries are directly sent to the SLD NS. The TLD and SLD of these queries are known by the caching resolver.
111
international journal of critical infrastructure protection 5 (2012) 108 –117
Domain-cached queries occur when the entire request is in the cache. The probability that an incoming query is located in one of these groups is given by the cache hit ratio. The cache hit ratio values for the UNBOUND resolver are determined by analyzing the data captured at an UNBOUND resolver in the SURFnet network. This dataset comprises 300,000 DNS packets divided into 10 subsets of 30,000 DNS packets each. Table 3 presents the cache hit ratio values for the recursive resolver for one of the subsets. To verify the stochastic nature of the parameter under consideration, we analyzed the 10 subsets of 30,000 DNS packets. After verifying that the subsets were large enough to obtain statistically independent results, we concluded that the variance of the average cache hit ratio values was low. Also, using the statistical Shapiro–Wilk test [11], we concluded that the hypothesis stating that the values were distributed according to a normal distribution cannot be rejected. Readers are referred to [9] for additional details regarding the statistical analysis.
3.2.2.
Response distribution at authoritative name servers
The ‘‘response distribution at authoritative name servers’’ variable expresses the fraction of response types issued at an authoritative NS in response to incoming initial queries. These values are different for the root, TLD and SLD NSs. Table 4(a) shows a detailed breakdown of the response types at authoritative NSs while Table 4(b) presents the values of the variable for one of the SURFnet datasets. Because we used a relatively small dataset, only seven queries towards the root are seen in Table 4(a). It is interesting to note that each of the seven initial queries to the root are replied with an NXdomain response type. This observation is explained by the effectiveness of the caching mechanism in the recursive resolver. Since the dataset was not large enough to determine the response distribution at the root NSs, we have used the values presented in [14]. The distribution of response types at the root are provided in the column labeled ‘‘Root(%)’’ in Table 4(b). As in the case of the cache hit ratio variable, we examined the variance in the average response type distribution values for 10 data subsets. The conclusion is similar: the variance around the average relative values presented in Table 4(b) is low. Also, based on the statistical Shapiro–Wilk test [11], the hypothesis that the values are distributed according to a normal distribution function cannot be rejected.
3.2.3.
Query multiply factor
The ‘‘query multiply factor’’ variable expresses the number of queries that are (re)sent by a component in reaction to specific responses depending on the response type (NXdomain, valid Table 3 – Cache hit ratio values for the UNBOUND recursive resolver. Cached Domain
UNBOUND (%)
TLD-cached SLD-cached Domain-cached Non-cached
4.1 41.1 54.7 0.1
Table 4 – Response distribution at authoritative name servers. Response type
Root
TLD
SLD
(a) Referrals A AAAA CNAME MX PTR NXdomain Not Imp. Refused ServFail NS SOA TXT Format Error
0 0 0 0 0 0 7 0 0 0 0 0 0 0
377 0 0 0 1 2 31 0 5 2 0 0 0 0
741 1,072 20 673 23 105 510 89 270 199 3 1 6 5
(b) Response type Valid NXdomain ServFail Refused
Root (%) 8.1 91.5 0.4 0
TLD (%) 90.9 7.4 0.5 1.2
SLD (%) 71.1 13.7 7.9 7.3
responses, etc.). Determining the values for this system variable involves a detailed characterization of the querying behavior at the client and recursive resolver sides. In the case of the client, experiments in a laboratory environment have shown that when a negative response is received for an initial query, the client may automatically resend new identical repeat queries [7]. The number of repeat queries depends strongly on the client operating system and application browser. Clients with different operating system and application browser combinations react differently when they receive a given type of response to their initial queries. Table 5 shows the query multiply factor values per response type for different operating system and application browser types [7]. Note that the values in the table include the initial queries (in the case of the ServFail response, a Linux-Firefox client sends eight queries, including the initial query). To understand the querying behavior of the recursive resolver, we analyzed the data of the two most popular resolvers, UNBOUND and BIND9. The resulting query multiply factor values for the recursive resolvers are presented in the second and third columns of Table 6. The fourth and fifth rows in the table present the values of a binary model input parameter called ‘‘shielding factor.’’ A shielding factor value of zero indicates that repeat queries from the client that were already repeated by the resolver are not be sent towards the authoritative side again. For example, when a query times out, an UNBOUND resolver retries the query six times. Also, any retries that are sent from the client during the retrying period are not forwarded to the authoritative side by the UNBOUND resolver.
4.
DNS model operation
As mentioned previously, the model reflects the DNS architecture: (i) client with its operating system (and the
112
international journal of critical infrastructure protection 5 (2012) 108 –117
Table 5 – Query multiply factor values for various client operating system and application browser types. Response type
Windows XP
Windows 7
Linux
Mac OSX
IE8
Firefox
Safari
Valid NXdomain Partial ServFail Time-out Refused Truncated
1 1 1 1 4 4 2
1 1 1 1 4 4 2
1 2 2 4 4 4 2
1 2 2 4 4 4 2
1 1 1 1 1 1 1
1 2 2 2 2 2 1
1 1 1 1 1 1 1
Table 6 – Query multiply factor and shielding factor values for two recursive resolvers. Response type
QMFUNBOUND
QMFBIND9
ShieldUNBOUND
ShieldBIND9
Valid NXdomain Partial ServFail Time-out Refused Truncated
1 1 1 5 7 5 1
1 1 1 2 7 1 1
0 0 0 0 0 0 0
0 0 1 1 0 1 0
corresponding stub resolver) and application browser; (ii) recursive resolver; and (iii) authoritative NS at the root, TLD and SLD NSs. Fig. 2 shows the DNS reference model as it is implemented in Microsoft Excel. Essentially, the reference model is a linear flow-based model that follows from the summation of multiple distinct flow paths. As seen in Fig. 2, the left-hand side (client side) of the reference model is divided into three different parts: (i) query to root; (ii) query to TLD; and (iii) query to SLD. This partitioning helps to determine the number and types of responses going back to the client from the root, TLD and SLD NSs. Three steps are involved in the operation of the DNS reference model. In Step I, the initial queries go from the client side to the authoritative NS side. In Step II, the responses to the initial queries are returned from the authoritative NS side to the client side. In Step III, the repeat queries determined by the response type and query multiply factor are sent from the resolver or the client side to the authoritative NS side. This section discusses three steps in detail. The scenario parameters specified in Table 7 are used to illustrate the main concepts.
4.1.
Step I
In the first step, the client sends the initial queries to the authoritative NS via the client’s operating system, application browser and recursive resolver. This process is observed in the first row of the DNS reference model in Fig. 2. In this example, the client generates 1.1 queries per time unit (qpt). It sends queries to the application browser, which responds with 25% of the query rate (0.27 qpt) from the cache (Table 1) and forwards 0.83 qpt to the client operating system. In Fig. 2, the number of incoming queries at the operating system (Linux in our example) is equal to the number of outgoing queries. This is because Linux does not implement
caching (Table 2). Because we assume that 1000 clients simultaneously query the resolver (Table 7), the rate of queries arriving at the resolver is 825 qpt. The queries arriving at the resolver are classified into four groups. The distribution of the queries over these classes is based on the cache hit ratio for the resolver (Table 3). According to Table 3, 0.1% of the queries belong to the noncached group (i.e., the queries are forwarded directly to the root) and 4.1%, 41.1% and 54.7% are forwarded to the TLD, SLD and the domain-cached group, respectively. Consequently, 54.7% of the queries are directly answered by the recursive resolver while 45.3% of the queries undergo (a portion of) the recursive resolution process. After the recursive resolution has been initiated at the root, it is performed until the entire domain name is resolved. Thus, queries that are responded at the root with the valid response type are sent to the TLD NS by the recursive resolver. Queries that are again qualified as valid at the TLD NS are sent to the SLD NS. After receiving the response from the SLD NS, the domain name resolution process for the initial queries is completed. The distribution of response types at the root, TLD and SLD can be found by using the response distribution for authoritative NS values shown in Table 4(a). The total numbers of the queries going from a particular resolver to the root, TLD and SLD NSs are 0.83, 40.49 and 368.2 qpt, respectively. Recall that 454.6 qpt is answered by the resolver itself. Taking into account the scenario parameter that specifies that 100 resolvers are querying the root, TLD and SLD NSs, the total numbers of the initial queries at the root, TLD and SLD NSs are 83, 4049 and 36820, respectively.
4.2.
Step II
In the second step, the responses from the authoritative NS are sent back to the client. The response stream from the authoritative NS to the client is distinguished by response type. We assume that the authoritative NS answers all the queries. As seen in Table 6, for each ServFail response, the resolver initiates four extra repeat queries to the authoritative NS; six new repeat queries are initiated for each time-out response. In the DNS reference model, we assume that the repeat query has the same response as the initial query. Therefore, in total, five ServFail responses are gathered at the resolver although just one of them is sent back to the client. The behavior for time-out responses is similar. Following this line of reasoning, negative responses from the root, TLD and SLD NSs are sent back from the resolver to the client side
international journal of critical infrastructure protection 5 (2012) 108 –117
113
Fig. 2 – DNS reference model overview as implemented in Microsoft Excel. (For interpretation of the references to color in this figure caption, the reader is referred to the web version of this article.)
Table 7 – Reference model parameters for the scenario. Number of simultaneously active DNS clients Fraction of IPv6 clients relative to the total number of clients Number of simultaneously active recursive resolvers Primary and secondary NS (average number)
1000 10% 100 1
while positive responses are only sent after the recursive resolution process has been completed. Because we assume that repeat queries have the same response as the initial queries, positive responses at the SLD can only be the result of the positive responses starting from the root. Since we assume that each resolver serves 1000 identical users simultaneously, the number of responses at the client can be found by simply dividing the value at the resolver by 1000. These responses are sent to the user via the operating system and browser. There are two additional points concerning the transfer of valid responses to the user. The first point concerns valid 4512 B responses that may lead to time-out responses when going from the recursive resolver to the client operating system. This point is included in the model to analyze the effect of residential gateways that can block UDP packets with size larger than 512 B. In this case, a valid 4512 B response does not arrive at the client and is treated as a time-out response by the client. The second point concerns responses by the client operating system and browser. As
explained in Step I, a fraction of the initial queries is immediately returned because the two components have already queried the domain names in their caches. These responses (from Firefox and Linux in our example) are aggregated to the total response. They are shown in Fig. 2 between ‘‘User,’’ ‘‘Firefox’’ and ‘‘Linux’’ using green arrows that point to the valid, valid 4512 B, NXdomain and truncated responses. Recall that these two client components cache only valid query response types (valid, valid 4512 B and truncated) and the NXdomain response type.
4.3.
Step III
In this step, for each negative response, new repeat queries are reinitiated from the client towards the authoritative side. Since response streams coming from authoritative NSs are modeled as separate flows, it is possible to determine how many new repeat queries are reinitiated from the client to the authoritative side. The repeat queries from the operating system and application browser are sent based on the query multiply factor values in Table 5. Whether or not a repeat query is sent towards the authoritative side depends on the type of resolver and the type of response for which a repeat query is reinitiated. For example, UNBOUND caches valid and NXdomain responses. Hence, all repeat queries due to NXdomain responses are in the cache and they are answered by UNBOUND. According to Fig. 2, 0.061 qpt of the NXdomain responses go back from the SLD NS to the client. From Table 5, the query
114
international journal of critical infrastructure protection 5 (2012) 108 –117
multiply factor has a value of two for an NXdomain response for both Firefox and Linux. Therefore, the value of 0.061 qpt is multiplied by two when passing through Firefox and again by two when going through Linux. Consequently, a total of 0.244 qpt is sent by the client towards the resolver instead of the initial 0.061 qpt. Recall that the query rate determined by the query multiply factor equals the total number of queries sent (initial plus repeat queries). Therefore, 0.179 repeat queries (in addition to the 0.061 sent initially) are sent from the operating system to the recursive resolver. The same procedure is followed for other response types and all the repeat queries are sent to the resolver. However, the resolver (based on its caching property) sends to the authoritative NS only those responses for which caching does not play a role. For example, repeat queries due to NXdomain responses arrive at the resolver, but they are not sent to the authoritative NS because the resolver deploys negative caching (Table 6). As a result, 1178 and 3240 repeat queries arrive at the root, TLD and SLD, respectively.
5.
DNS model validation
In this section, the DNS reference model is validated using a new dataset captured at an UNBOUND recursive resolver in the SURFnet operational network. The new dataset comprises 30,000 DNS packets collected over a time period of 51 s. Although model validation using a single dataset captured at a specific time of day is rather limited, it still provides a good indication of the capability of the DNS reference model to capture DNS querying behavior. The model is validated by first analyzing the data and obtaining input parameter values that are used to enable the model to predict traffic flows. Then, simulations are executed and the model predictions are compared with the real-world data. Finally, a sensitivity check is performed on the model. Anomalies in the dataset were removed before analysis. In particular, we observed that some misconfigured clients sent many repeat queries for the same domain names although they had received positive answers to their queries. For the ‘‘cleaned’’ dataset, the number of the initial queries may be determined at different points of interest (POIs) in the system. The determination of the initial query numbers is crucial because the DNS reference model is calibrated with the initial queries at the different points of interest. The notion of a ‘‘repeat query’’ is used to determine the number of initial queries. The second query of two queries in defined as a ‘‘repeat query’’ if it has the same domain name, query type and destination level as the first query. Additionally, the time difference between the two queries must be smaller than some value d. For repeat queries at a recursive resolver, d is determined to be 13 s; for an authoritative NS, d is 3 s. These values are determined by analyzing the behavior of the client and recursive resolver. Recall that in the case of a ServFail response, Linux clients send seven repeat queries towards the recursive resolver. The time difference between the initial query and the last repeat query is measured to be at most 13 s. On the other hand, in case of a ServFail response, the time difference between the initial query and the last repeat
query from the UNBOUND towards the authoritative NS is measured to be 3 s. Having defined the repeat query notion, initial queries from a given dataset of aggregated queries (initial and repeat queries) were obtained for the recursive resolver and the root, TLD and SLD NSs. The corresponding values are shown in Table 8. The last model input parameter to be obtained from the real-world data is the distribution of initial queries over the operating system types (Windows, Linux and Mac). In fact, this parameter indicates the fraction of the initial queries generated by a specific client type. The operating system corresponding to each DNS packet is identified by its fingerprint (IP TTL value). The results are shown in Table 9. The input parameter values obtained from the real-world data serve as a starting point for the validation process. Because the data is captured at a single UNBOUND recursive resolver, the number of simultaneously active resolvers is one. Additionally, the model is calibrated at the initial query rate found at the resolver instead of the number of initial queries at the users. As a consequence, the number of simultaneously active DNS clients is calculated backwards – we found that for a query rate of 1 qpt, 9700 users generate 7131 initial queries at the recursive resolver. Therefore, in this case, the number of simultaneously active DNS clients in the scenario is 9700. The effect of secondary NSs is ignored by assuming that there are no secondary NSs and that the fraction of IPv6 is zero for all clients. Having obtained the input parameters for the model, we proceeded to execute the model. Note that we were interested in the numbers of initial and repeat queries at different points of interest in the model. The simulation was repeated 30,000 times for each of the combinations of client operating system (Windows, Linux and Mac) and application browser (Windows-IE, Mac-Safari and Linux-Firefox). The outcomes of the simulations are histograms showing the distributions of the initial and repeat queries at each point of interest. Fig. 3 shows an example histogram with the repeat query distributions (i.e., number of runs with a particular number of repeat queries) at the recursive resolver. The average values from the distribution of initial and repeat queries at each point of interest are provided in Table 10. Note that the outcome values are weighted by the distributions of client operating systems obtained from the real-world dataset (Table 9). Next, we compare the simulation results with the results obtained from the real-world data. In particular, we compare Table 8 – Initial and repeat queries at different points of interest in real-world data. Query type
Resolver
Root
TLD
SLD
Initial Repeat
7131 2360
13 0
204 8
3414 723
Table 9 – Distribution of queries over operating system types based on initial queries. OS
Linux (%)
Windows (%)
Mac (%)
Fraction
60.1
30.2
9.7
115
international journal of critical infrastructure protection 5 (2012) 108 –117
Recursive Resolver Repeat Queries
Table 12 – Repeat-initial query ratios at different points of interest for the real-world data and the DNS reference model.
60 Frequency
50 40
Repeat-initial ratio
Resolver (%)
TLD (%)
SLD (%)
30
Real world data DNS reference model
24.8 17.5
3.8 1.4
17.5 10.4
20 10 11 1 11 0 7 12 0 3 12 0 9 13 0 5 14 0 1 14 0 7 15 0 3 15 0 90 16 5 17 0 1 17 0 7 18 0 30 18 9 19 0 5 20 0 1 20 0 7 21 0 3 21 0 90
0 Query number Fig. 3 – Repeat query distribution at recursive resolver. Table 10 – Initial and repeat queries arriving at different points of interest in the DNS reference model. Query type
Resolver
Root
TLD
SLD
Initial Repeat
7204 1530
8 0
350 5
3,030 350
Table 11 – Query ratios at different points of interest obtained from the real-world data and the DNS reference model. Query ratio
Resolver (%)
Root (%)
TLD (%)
SLD (%)
Real world data DNS reference model
68.5 70
0.1 0.1
1.5 2.8
29.9 27.1
the fraction of total queries as well as the ratio of the repeat and initial query rates at each point of interest. For instance, in the DNS reference model, the fraction of total queries at the recursive resolver is the ratio of the number of queries at the resolver and the number of queries in the entire system, i.e., (7204þ1530)/(7204þ1530þ8þ350þ5þ3030þ 350)¼ 70%. Table 11 shows the fractions of total queries at the points of interests for the real-world data and the DNS reference model. It can be seen that the DNS reference model predicts the query ratios for the points of interest with reasonable accuracy. The small errors are most likely due to the stochastic nature of the system variables. The second test point of the DNS reference model concerns the repeat-initial query ratio at the points of interest. This ratio indicates the fraction of repeat queries with respect to the total number of queries at each point of interest. For instance, in the DNS reference model, the fraction of repeatinitial queries at the resolver is equal to 1530/(1530þ7204)¼ 17.5%. Table 12 shows the repeat-initial ratio at different points of interest for the real-world data and the DNS reference model. Note that the model underestimates the fraction of repeat queries. Apparently, the repeat query rates in practice are larger than the repeat behavior included in the model. A difference of 7.3% is observed at the recursive resolver. This error is likely due to IPv6 clients. IPv6 clients send two
queries, A and AAAA, as a pair for address resolution. When a negative response is received from the recursive resolver, repeat queries are also re-sent in pairs, meaning that more repeat queries are sent than indicated by the query multiply factor values given in Table 5. The error at the authoritative NS might be due to the secondary NSs. For example, in case of a ServFail response, each additional NS causes five extra queries. First, two repeat queries are sent to the primary NS. If a ServFail response is received, then the secondary NS is queried. If the secondary NS also returns a ServFail response, then UNBOUND again queries the primary NS. This querying pattern continues until each NS is queried five times. Thus, a ServFail response produces a total of 10 queries instead of five in the case of a single NS. We also investigated the sensitivity of the model predictions with respect to statistical variations of the input parameters. We concluded that the dispersion in the distributions is small and that all the values are concentrated around the mean. Also, the variances of the system input variables and output values are comparable. This implies that the DNS reference model does not amplify the uncertainty due to the random system input variables.
6.
Case study
As stated earlier, the DNS is facing several major changes, among them the introduction of DNSSEC. DNSSEC introduces the risk of an increase in ServFail responses due to validation errors and other factors. For example, any error made in a DNSSEC signature at the authoritative NS side results in a validation error at the recursive NS. By default, the recursive NS feeds back the validation error to the client side as a ServFail response. We evaluate the impact of the potential increase in ServFail responses and quantify the increase of DNS traffic towards the recursive resolver, the TLD and SLD NSs as a function of the relative increase of ServFail responses. First, we investigate the impact of an increase in responses from the TLD NS that trigger the resolver to return a ServFail response to the client. In this case, we keep the fraction of ServFail responses at the root and SLD NS constant. We vary the value of the model input parameter response distribution for ServFail at the TLD (Table 4(a)) and observe the values predicted by the model for the DNS traffic increase from clients towards the resolver and from the resolver towards the TLD NS. For the other model input parameters, we use the default parameter values presented in the tables in Section 3.2. In this case, we focused on the behavior of the UNBOUND resolver. For the DNS query volumes and the
116
international journal of critical infrastructure protection 5 (2012) 108 –117
7.
Conclusions
The global DNS reference model is intended to help assess the scalability of the DNS system in a variety of what-if scenarios. The reference model incorporates components
TLD Servfail responses increase's impact on DNS traffic 40
1.8 1.6 1.4 1.2 1 0.8 0.6 0.4 0.2 0
We also investigated the traffic increase towards the recursive resolver and towards the authoritative NS as a consequence of the ServFail increase at the SLD. The results are shown in Fig. 5. Note that the DNS traffic towards the authoritative NS increases significantly with the increase in ServFail responses. For example, in case of 10% additional ServFail responses at the SLD, the DNS traffic increases 35% towards the SLD NSs. We also observe that the increase in ServFail responses at the SLD NSs has a larger impact on the DNS traffic than the increase in ServFail responses at the TLD layer.
DNs traffiic increase towards TLD NS (%)
DNS traffic increase towards recursive resolver (%)
distribution of traffic for the various operating system and browser combinations, we used the real-world data as described in the previous section. The results are presented in Fig. 4. The x-axis represents the percentage of increase in ServFail responses at the authoritative TLD NS. For example, a value of 2% implies that an additional 2% of ServFail responses are added to the percentage of ServFail responses specified in Table 4(a). The y-axis represents the predicted relative increase of DNS traffic towards the recursive resolver and towards the authoritative NS. The figure shows a more or less linear relationship between the ServFail and DNS traffic increases. However, the DNS traffic increase towards the TLD is much larger than from the client towards the resolver. More detailed analysis (not presented here) can explain these results. Not explicitly shown in Fig. 4 is the observation that additional ServFail responses triggered by responses from the TLD do not increase DNS traffic between any points of interest.
0
35 30 25 20 15 10 5 0
2 4 6 8 Servfail response increase at TLD NS (%)
0
2 4 6 8 Servfail response increase at TLD NS (%)
SLD Servfail response increase's impact on DNs traffic 50
25
DNS traffic increase towards SLD NS (%)
DNS traffic increase towards recursive resolver (%)
Fig. 4 – Impact of increase in ServFail responses at TLD NS on DNS traffic.
20
15
10
5
0
0
2 4 6 8 10 12 Servfail response increase at SLD NS (%)
40
30
20
10
0
0
2 4 6 8 10 12 Servfail response increase at SLD NS (%)
Fig. 5 – Impact of increase in ServFail responses at SLD NS on DNS traffic.
international journal of critical infrastructure protection 5 (2012) 108 –117
that capture the DNS behavior of client operating systems and application browsers, recursive resolvers and authoritative name server levels. Key parameters used to model DNS behavior include cache hit ratio, query/response distribution at the authoritative side and query multiply factors. The values of these parameters are obtained from real-world data, from experiments with DNS clients and from the results of data analyses conducted by other researchers. Validation results demonstrate that the model predictions closely pproximate the behavior seen in real-world data. Also, the outputs of the DNS reference model are not very sensitive to the variations in the input parameters. The applicability of the model is highlighted using a case study that demonstrates how a potential increase in ServFail responses at some domain level impacts DNS traffic flows throughout the global DNS system. Further work needs to be done to validate the model with data from different DNS environments. Although a dataset comprising 300,000 DNS packets was used in the validation effort, analysis of larger datasets is required to determine the response distribution at the root. Since the response distribution at authoritative name servers variable is critical to the initial-repeat query ratio, it is important to extend the data analysis with additional and larger datasets to obtain representative values for all the system variables. Also, extending the model with the effects of a secondary NS, more negative caching features and IPv6 enabled hosts are important areas of future work. Indeed, we expect that modeling these factors would reduce errors and yield more accurate predictions of DNS querying behavior. Several researchers have noted that a significant portion of the DNS traffic in real-world datasets is generated by a limited number of clients. Our analysis has confirmed this observation, which may result, for example, from misconfigured name servers or clients. Detecting these types of behavior requires some data engineering in order to obtain clear interpretations of DNS data. Moreover, developing advanced algorithms for detecting DNS anomalies will contribute to a better understanding of DNS behavior that will ultimately benefit current and future Internet operations.
r e f e r e n c e s
[1] B. Ager, W. Mu¨hlbauer, G. Smaragdakis, S. Uhlig, Comparing DNS resolvers in the wild, in: Proceedings of the Tenth Annual Internet Measurement Conference, 2010, pp. 15–21.
117
[2] R. Arends, R. Austein, M. Larson, D. Massey, S. Rose, RFC 4035: Protocol Modifications for the DNS Security Extensions, 2005. [3] C. Brandhorst, A. Pras, DNS: A statistical analysis of name server traffic at local network-to-Internet connections, in: Proceedings of the IFIP International Workshop on Networked Applications, 2006. [4] S. Castro, D. Wessels, M. Fomenkov, K. Claffy, A day at the root of the Internet, ACM SIGCOMM Computer Communication Review 38 (5) (2008) 41–46. [5] S. Castro, M. Zhang, W. John, D. Wessels, K. Claffy, Understanding and preparing for DNS resolution, in: Proceedings of the Second International Conference on Traffic Monitoring and Analysis, 2010. [6] B. Duska, D. Marwood, M. Feeley, The measured access characteristics of World-Wide-Web client proxy caches, in: Proceedings of the USENIX Symposium on Internet Technologies and Systems, 1997, p. 3. [7] B. Gijsen, DNSSEC client analysis, presented at the Domain Name System Operations Analysis and Research Center Workshop, 2011. [8] J. Jung, E. Sit, H. Balakrishnan, R. Morris, DNS performance and the effectiveness of caching, IEEE/ACM Transactions on Networking 10 (5) (2002) 589–603. [9] Y. Koc-, Quantitative Modeling of DNS, Master’s Thesis, Department of Telecommunications, Delft University of Technology, Delft, The Netherlands, 2011. [10] P. Mockapetris, K. Dunlap, Development of the Domain Name System, ACM SIGCOMM Computer Communication Review 18 (4) (1988) 123–133. [11] S. Shapiro, M. Wilk, An analysis of variance test for normality (complete samples), Biometrika 52 (3/4) (1956) 591–611. [12] T. Toyono, K. Ishibashi, K. Toyama, Clear and present increase in the number of DNS AAAA queries, in: Presented at the Domain Name System Operations Analysis and Research Center Workshop, 2006. [13] D. Wessels, Is your caching resolver polluting the Internet?, in: Proceedings of the ACM SIGCOMM Workshop on Network Troubleshooting, 2004, pp. 271–276. [14] D. Wessels, M. Fomenkov, Wow, that’s a lot of packets, in: Proceedings of the Fourth International Workshop on Passive and Active Measurement, 2003. [15] D. Wessels, M. Fomenkov, N. Brownlee, K. Claffy, Measurements and laboratory simulations of the upper DNS hierarchy, in: Proceedings of the Fifth International Workshop on Passive and Active Measurement, 2004, pp. 147–157. [16] A. Wolman, G. Voelker, N. Sharma, N. Cardwell, A. Karlin, H. Levy, On the scale and performance of cooperative web proxy caching, in: Proceedings of the Seventeenth ACM Symposium on Operating System Principles, 1999, pp. 16–31.