Proceedings of the 8th IEEE Workshop on Information Assurance, U.S. Military Academy, West Point, NY, 20-22 June 2007
The Observability Calibration Test Development Framework

Barbara E. Endicott-Popovsky, IEEE Student Member, and Deborah A. Frincke, IEEE Senior Member
Abstract— Formal standards, precedents, and best practices for verifying and validating the behavior of low layer network devices used for digital evidence collection on networks are badly needed—initially so that device owners and data users can document the behavior of these devices for courtroom presentation, and ultimately so that calibration testing and calibration regimes become established and standardized as common practice for both vendors and their customers [1]. The ultimate intent is to achieve a state of confidence in device calibration that allows the network data gathered by these devices to be relied upon by all parties in a court of law. This paper describes a methodology for calibrating forensic-ready low layer network devices based on the Flaw Hypothesis Methodology [2,3].

Index Terms—digital forensics, life cycle, networks, network forensics

I. INTRODUCTION
This paper describes a methodology for calibrating forensic-ready network devices—a first step toward an independent standard for calibrating low layer network devices used to collect forensic data. Such calibration would support attestation that all pertinent data was collected without loss, or indicate when data loss may have occurred through the actions of a low layer device. Manufacturers rarely provide conclusive information about low layer network device performance [4,5]. While they may provide general specifications, few guarantee actual device behavior in relevant circumstances [4]. Some may be reluctant to provide this information because they deem it proprietary [1]. Others may be engaging in what members of the IETF Network Working Group have called "specmanship"—"giving their products a better position in the marketplace, often using smoke and mirrors to confuse potential users" [6]. Regardless of the reason, a method of accurately calibrating device behavior is needed.
Manuscript received March 21, 2007. B. E. Endicott-Popovsky is with the Center of Information Assurance and Cybersecurity, the Information School and UWIT Tacoma, University of Washington, Seattle, WA 98105 (phone: 206-284-6123; fax: 206-216-0537; e-mail: [email protected]). D. A. Frincke is with the Pacific Northwest National Laboratory, Richland, WA 99352 (e-mail: [email protected]).
Customers have placed little demand on vendors to change this status quo. Because most networks run well below capacity, typically at 10-30% utilization, boundary behavior has not been a major concern [4,5]. When traffic volumes increase beyond 30%, the simplest solution is to add more capacity [4,5,7,8]. Thus, whether network devices perform at the limits of their capacity, or according to specification, has not been an issue. Even discerning the true nature of the device used in "collecting" data can require additional effort on the part of the consumer. In a study of a dozen commercially available 'hubs,' Allen discovered that many were actually switches in terms of their functionality [9]. The rationale put forward by vendors is that customers who purchase such hubs are actually getting the increased functionality of a switch for the price of a hub, and therefore are not harmed [9].
However, the behavior and performance of network devices do matter under certain circumstances where issues of trust and reliance are involved, such as advanced troubleshooting or the collection of forensic data [1,4]. In the case of switches versus hubs, Allen asserts that there could indeed be harm if network troubleshooters are misled by these devices and spend unnecessary effort pursuing wrong solutions [9]. Use of wrongly described low layer network devices can likewise have a negative impact on forensic evidence collection. A majority of these devices were never designed for forensic use, although they may be marketed as "forensic" [10]. Producing true "forensic" low layer features is not an easy task. Even if vendors wished to provide forensically sound documentation of device performance, there is no clear guidance for designing validation tests. Standards and precedents for digital evidence and forensic tool validation are just beginning to emerge as case history builds in the judicial system [11].
In this paper, we focus on a framework for device calibration. While consideration has been given to validating software tools used for forensic data recovery, little to no attention has been paid to the calibration of the hardware devices used to produce network data and how they behave "in the field" [1,12]. Unfortunately, failing to explore and validate the behavior of these devices could lead to inadmissible evidence and failed legal action, as described in [1]. One way opposing counsel seeks to discredit an expert witness is by challenging foundation testimony. In cases relying
on digital evidence, foundation testimony would involve describing the computer forensic tool used, the data that it collects, and why it works [11]. With information technology and cyber-enabled activities becoming ever more important factors in legal proceedings [13], the need for both network device calibration and associated calibration regimes will become increasingly important to law enforcement and users alike [14].

II. SOLUTION FROM THEORY & PRACTICE

We propose the Observability Calibration Test Development Framework (OCTDF) to guide the development of appropriate calibration test protocols for low layer network devices. The perspective taken is that of the first responder to a digital incident, i.e., the network administrator, on whom the responsibility will fall to collect forensically sound data that will survive expected courtroom challenges [1]. In addition, these individuals are likely to be called as expert witnesses [13]. They must be able to speak competently to the reliability of these devices in order to establish confidence in the accuracy and completeness of the data they collect [13]. The following areas of theory and practice provide a basis for the OCTDF:
1) Baselining, defined in [14] as "systematically viewing network point-to-point dataflow to analyze communication sequencing, extract accurate metrics, develop a technical synopsis of the network and make recommendations."
2) RFCs (Requests for Comments), guidelines developed by the Internet Engineering Task Force (IETF) that capture Internet best practices, developed and published collaboratively.
3) Flaw Hypothesis Methodology (FHM), a penetration test methodology first developed for identifying operating system flaws, then widely used as the basis for developing software testing protocols for detecting the presence of security flaws [2,3].

A. Baselining

The goal of baselining is to "use quantitative statistical measurements to identify key network issues relevant to supporting the mission of the business" [14]. It is used to provide assurance that a network is stable and operating reliably, and also to support network decision-making regarding investment in increased capacity. Baseline testing typically employs devices such as protocol analyzers, which are tools that provide characterization metrics of network workloads.
At a minimum during baselining, available capacity and utilization are determined—the latter defined as "the amount of capacity used by a specific network segment over a specific time interval" that encompasses a typical business cycle [14]. Both average and peak utilization are of interest in determining adequate capacity, as is behavior when malformed packets are present, given the volumes of bad traffic on networks related to criminal and misuse activities [15]. Throughout baselining, utilization versus capacity is constantly monitored and compared to previous history. In a benign situation, utilization is expected to mirror business operations; anomalies indicate potential problems that require further testing and troubleshooting. Although the majority of networks are not well tested, baselining should be performed at regular intervals (e.g., annually or semi-annually) and whenever changes are made to a network [14]. Usually the period baselined is a week, although some enterprises may also experience quarterly and annual peak network loads, requiring development of additional baseline performance data during those times.
While the goal of baselining is to provide a basis for managing network reliability, some of its general principles and elements may be applicable to calibration testing:
1) Confirming capacity. Any device used for the purpose of collecting forensic data should be neutral to the data collection process within the target system. In other words, it should not interfere, i.e., drop packets, under typical network loads. Such behavior would raise doubts as to the reliability of any evidence collected by the device. In some cases, especially with high bandwidth networks, the typical load may be near capacity.
2) Testing at regular intervals, at peak loads, and when there are changes to the network. Calibration testing should be repeated on a regular basis and when the network is reconfigured. Devices can deteriorate over time due to mishandling or aging of components, and networks change dynamically.
3) Historical documentation. Records should be kept that provide a history of device performance during calibration testing. This will provide support for foundation evidence that devices behave as intended.
4) Baselining tools. Protocol analyzers, specialized test beds, automated test suites, and other tools for baselining networks may all be useful for calibrating network devices.
Besides providing assurance that the device is behaving correctly, the kinds of information in items 1-3 above, coupled with actual baseline performance data for the network segment where the device is operating, would
allow characterization of the forensic data collected by the device, such as a description of its probable completeness at the time it was collected. For example, if the network typically runs at 20% capacity, and it is known from calibration testing that the device functions at full line rate capacity under these conditions, then an expert witness might assert that the data collected by the device at that time is, for all intents and purposes, complete.
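To make the utilization arithmetic behind such an assertion concrete, the short sketch below computes average and peak utilization from periodic byte-counter samples, in the spirit of the baselining metrics described above. It is a minimal illustration only; the sampling interval, counter values, and function names are our assumptions, not part of any cited tool.

```python
# Minimal sketch: average and peak utilization for a network segment,
# computed from byte counts observed in fixed sampling intervals.
# Values and names are illustrative assumptions, not a vendor API.

def utilization_percent(bytes_transferred: int, interval_seconds: float,
                        link_capacity_bps: float) -> float:
    """Percent of link capacity used during one sampling interval."""
    bits = bytes_transferred * 8
    return 100.0 * bits / (link_capacity_bps * interval_seconds)


def summarize(samples: list, interval_seconds: float,
              link_capacity_bps: float):
    """Return (average, peak) utilization over a series of byte-count samples."""
    values = [utilization_percent(b, interval_seconds, link_capacity_bps)
              for b in samples]
    return sum(values) / len(values), max(values)


if __name__ == "__main__":
    # Example: bytes observed per 60 s interval on a 100 Mb/s segment.
    per_interval_bytes = [90_000_000, 150_000_000, 220_000_000, 80_000_000]
    avg, peak = summarize(per_interval_bytes, 60.0, 100e6)
    print(f"average utilization {avg:.1f}%, peak {peak:.1f}%")
```

A running record of such figures across calibration intervals is one form the historical documentation described in item 3) above could take.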
B. Requests for Comments (RFCs)

Figure 1a Test configuration with single testing device [6]
RFC guidelines are produced by volunteer working groups of practitioners and academicians under the governance of the Internet Engineering Steering Group (IESG), the Internet Architecture Board (IAB), or the Internet Research Task Force (IRTF), which are all chartered by the Internet Society (ISOC) to ensure the smooth operation of the Internet [16]. RFCs are advisories, as opposed to standards, and develop through collaborative iteration among interested volunteers over extended periods of time. RFC documents consist of "technical proposals, standards and ideas about packet-switched networks," and cover "many aspects of computer networking including protocols, procedures, programs and concepts as well as meeting notes and opinions" [17].
Two RFCs are relevant to this research: RFC 2544, Benchmarking Methodology for Network Interconnect Devices [6], and its companion document, RFC 1242, Benchmarking Terminology for Network Interconnect Devices [18]. RFC 2544 discusses and defines tests that can characterize the performance of network devices for the purpose of comparing products from different vendors [6]. It acknowledges that vendors do not adhere to agreed-upon standards, making it difficult for specifications to be relied upon [6]. Therefore, it provides an olio of tests that can be assembled like building blocks to create an appropriate test suite for assessing a device for use within a target network. RFC 1242, written several years earlier, defines the terminology used to describe the benchmarking tests and test results in RFC 2544. These RFCs can be adapted to support the design and development of calibration test protocols. Some of their most relevant guidelines are:
1) Recommendations for test bed configuration. RFC 2544 acknowledges that "off the shelf" test gear capable of performing all suggested tests does not exist, making test bed configuration an important aspect of the calibration protocol. Further, it notes that "the ideal way" to implement RFC 2544 tests is to use a testing device with both sending and receiving ports [6] (Figure 1a), explaining that a single testing instrument can easily determine whether all packets sent were received. In contrast, the configuration in Figure 1b requires that "both sender and receiver be controlled by a single computer which is time intensive and tedious to set up" [6].
Figure 1b Test configuration with both sender and receiver [6] DUT=Device Under Test
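As a rough illustration of the sent-versus-received bookkeeping that the single-tester configuration of Figure 1a (item 1 above) makes straightforward, the sketch below offers a fixed number of constant-size test frames to the device under test and compares the count seen on the tester's receive port. The send and count callables are hypothetical stand-ins for real test gear; this is not an implementation of RFC 2544, only of the frame-count comparison it recommends.

```python
# Sketch of the sent-vs.-received comparison enabled by a single tester:
# one instrument both offers frames to the DUT and counts what comes back,
# so loss is simply offered minus received. The send/count callables are
# hypothetical stand-ins for real test equipment, not an RFC 2544 API.

from dataclasses import dataclass


@dataclass
class TrialResult:
    frame_size: int   # bytes; held constant within a trial
    offered: int      # frames offered to the DUT
    received: int     # frames counted on the tester's receive port

    @property
    def loss_percent(self) -> float:
        return 100.0 * (self.offered - self.received) / self.offered


def frame_loss_trial(send_frames, count_received, frame_size, offered):
    """Offer `offered` frames of `frame_size` bytes, then read the receive count."""
    send_frames(frame_size, offered)
    return TrialResult(frame_size, offered, count_received())


if __name__ == "__main__":
    # Simulated DUT that silently drops 2% of offered frames (illustration only).
    state = {"offered": 0}
    result = frame_loss_trial(
        send_frames=lambda size, n: state.update(offered=n),
        count_received=lambda: int(state["offered"] * 0.98),
        frame_size=64,
        offered=100_000,
    )
    print(f"{result.loss_percent:.2f}% frame loss at {result.frame_size}-byte frames")
```

In a calibration record intended for courtroom use, such trials would be repeated across frame sizes and offered loads, with zero loss at typical network utilization being the behavior an expert witness would want to attest to.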
2) Recommendations for traffic configuration. The basic recommendations are that the bi-directional performance of a duplex device should be tested with the "same data rate offered in both directions." The document implies the use of UDP packets, as opposed to TCP/IP packets, and suggests that the frame size in any single test be consistent [6].

C. Flaw Hypothesis Methodology (FHM) [2,3]

The Flaw Hypothesis Methodology was initially conceived to probe the weaknesses of an operating system. It has since become the basis for penetration testing in general, systematically exposing the security flaws in a computer system. Preparatory to any application of the FHM, evaluators must: 1) establish test goals, 2) define the boundaries of the system to be tested, 3) assume the viewpoint of a potential penetrator, and 4) determine available testing resources (which may limit the testing approach). After the preparatory phase, the four-step FHM process is followed [2,3]:
1) Flaw generation. This step begins with an analysis of the target system. A team of experts convenes to conduct a design review and hypothesize flaws in the design. There are ten categories for the identification and generation of flaws:
a) Known flaws from similar systems
b) Unclear design architecture
c) Security control bypass
d) Incomplete interface design
e) Deviation from security policy
f) Deviation from initial assumed conditions
g) System anomalies
h) Operational practices
i) Development practices
j) Implementation errors
2) Flaw confirmation. Once identified, flaws are confirmed and prioritized. The preferred method of confirmation is 'desk checking': examining system documentation, code, etc., to determine the existence of flaws. (Live testing is conducted only when desk checking is not possible.)
3) Flaw generalization. Once confirmed, flaws are generalized into categories that enable a team of evaluators to discover patterns in the system, leading to further identification of similar flaws.
4) Flaw elimination. The final step involves repairing the system. This step may not be the responsibility of a penetration testing team, but it could be a follow-on activity. Once fixed, the system may be re-tested to ensure the flaw has been eliminated.
To appropriately calibrate devices used for collecting digital evidence from networks, it is important to consider the sorts of challenges that might be posed to foundation testimony regarding the performance and reliability of these tools as data collectors. Such an inspection could identify additional tests that would make the calibration protocol more robust from an evidentiary view. The Flaw Hypothesis Methodology, adapted for the purpose of developing an appropriate suite of calibration tests, provides an approach that takes foundation challenge considerations into account.
Although the Flaw Hypothesis Methodology has received wide acceptance [19,20], it is highly dependent on the experience, skills, and abilities of the team assembled to conduct the penetration test. For this reason, recruiting the right participants at the beginning of the process is a key to success.
These three methodologies, combined, formed the basis for developing the Observability Calibration Test Development Framework (OCTDF).

III. THE OCTDF

The OCTDF generalizes an approach to calibration from the perspective of first responders to a digital crime, who are typically network operators within organizations and who rely on low layer network devices, such as switches and taps with span port capability, to collect network traffic data as evidence of malicious or negligent behavior. This three-step framework is presented and described in Figure 2.
Step 1: Identify Potential Challenge Areas and Environment. Identification is accomplished by briefly modeling the network interactions of interest, given a particular digital forensic investigation scenario, and then using that information to identify whether lost network data could damage evidence value. The result is a set of pairs of behavior to be observed (digital forensic investigation activities and potential legal challenges to digital forensic data collection) and the circumstances under which that observation is expected to take place (digital forensic investigation scenario).
Step 2: Identify Calibration Testing Goals. Given each pairing of behavior to be observed and the likely circumstances of observation, calibration test goals are identified that support demonstrating evidence value.
Step 3: Devise a Test Protocol. From the testing goals identified in Step 2, a testing protocol is devised that will provide appropriate calibration for the device in question.
Figure 2. Framework for calibration test protocols
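Read as data, the three steps of Figure 2 form a simple pipeline from hypothesized challenges, to calibration test goals, to concrete test cases drawn from building blocks such as those in RFC 2544. The sketch below is one assumed way of recording those artifacts, offered for illustration only; the structure, field names, and example content are ours, not a prescribed format.

```python
# Illustrative record-keeping for the three OCTDF steps: Step 1 pairs a
# behavior to be observed with the scenario in which it matters, Step 2
# turns each pair into a calibration test goal, and Step 3 maps goals onto
# concrete procedures (e.g., RFC 2544-style building blocks). The structure
# and example values are assumptions for illustration only.

from dataclasses import dataclass
from typing import List


@dataclass
class ChallengePair:        # Step 1 output
    behavior: str           # behavior to be observed
    scenario: str           # circumstances under which it is observed


@dataclass
class TestGoal:             # Step 2 output
    challenge: ChallengePair
    goal: str               # what the calibration must demonstrate


@dataclass
class TestCase:             # Step 3 output
    goal: TestGoal
    procedure: str          # e.g., a frame-loss test repeated per frame size


def example_protocol() -> List[TestCase]:
    pair = ChallengePair(
        behavior="aggregator tap forwards every frame to its monitoring port",
        scenario="segment running at its typical 20-30% utilization")
    goal = TestGoal(pair, "demonstrate zero frame loss at and above typical load")
    return [TestCase(goal, "frame-loss trials at fixed frame sizes and rising offered load")]


if __name__ == "__main__":
    for case in example_protocol():
        print(case.goal.goal, "->", case.procedure)
```

Keeping the Step 1 pair attached to each resulting test case preserves the link back to the courtroom challenge the test is meant to answer.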
In Step 1, potential legal challenge areas are identified. The perspective assumed is that of the expert witness testifying to the validity of the data collected by a particular device. The question is asked: 'In what ways would loss of network data, under typical usage scenarios, be challenged by opposing counsel?' and further, 'What tap behaviors must an expert witness understand to be credible in front of a jury?' This step corresponds to the Flaw Generation step of the FHM, in which design flaws are hypothesized in the target system, assuming the viewpoint of a potential cyber intruder. In the OCTDF, the 'flaws' are foundation evidence vulnerabilities related to device performance that could be challenged as negatively affecting evidence about network forensic data collection.
Step 2 describes calibration test goals designed to overcome the challenges identified in Step 1, given the perspective of an expert witness testifying to the validity of network forensic evidence. The expert witness would want to confirm that the device is neutral to the forensic data collection process, i.e., that it would be reasonable to expect that all important data was gathered—that no data packets were dropped, given the level of activity experienced on the network at the time. (Note that the
level of activity on the network would be confirmed by previous baselining studies showing typical activity levels by day of the week and time of day.) This step corresponds to recommendations in RFC 2544 that testing goals be established before tests are designed [6,20]. It is important to "ask the right question" when designing a calibration test in order to clarify its goal: 'What is it you are trying to prove or disprove?' [21].
Step 3 devises a testing protocol that addresses the testing goals from Step 2. The purpose of this step is to ensure that the test and test environment are appropriate for supporting expert testimony. The process involves developing a "comprehensive" suite of stress tests that examines the behavior of the device in isolation so it can be characterized appropriately. This corresponds to the Flaw Generalization and Flaw Elimination steps of the Flaw Hypothesis Methodology. The potential flaws in foundation testimony caused by mischaracterization, or lack of understanding, of the behavior of the device are matched to tests that verify the specific behavior of the device necessary for confirming its performance as a forensic tool. RFC 1242 and RFC 2544, which describe the types of tests recommended for comparing network devices, are sources of building blocks that can be organized into an appropriate calibration test protocol.
The OCTDF is not specific to any one kind of low layer network device. It can be applied to any device, from simple to complex. Executing it properly requires an understanding of how the device is employed in the collection of forensic data, what type of data it collects, and under what circumstances. A starting reference would be the manufacturer's documentation that accompanies the device and any available procedures describing how it is used.

IV. APPLICATION

The Observability Calibration Test Development Framework has had limited application to date. In an exemplar case, a low layer network device, specifically an aggregator tap, was selected and calibrated following the OCTDF. The results, discussed in [1], demonstrated the usefulness of FHM-style thinking in network device calibration. As a result, the OCTDF methodology was further refined and is presented here.
The Observability Calibration Test Development Framework is itself part of a larger methodology under development, the Network Forensics Development Life Cycle (NFDLC, Figure 3), first introduced at the 7th IEEE Workshop on Information Assurance in 2006 [22]. The purpose of the NFDLC is to embed forensics into networked systems [22]. Current network forensic investigations are costly, non-scalable, ad hoc activities.
Forensic readiness is a strategy for improvement [22,23]. We anticipate that organizations will find that selective implementation of forensic readiness is good security policy.
Figure 3 OCTDF in the NFDLC
V. CONCLUSIONS AND FUTURE WORK

As courtroom admissibility requirements become important considerations for networked systems, the demand for forensically ready systems over costly and non-scalable ad hoc investigations will necessitate a change in network protection strategies and design, including calibration of the low layer network devices used to collect forensic data in networks. Calibration tests can provide a source of foundation evidence that these devices function as expected and that the data they collect can be relied upon. Failure to validate the behavior of these devices could lead to inadmissible evidence and failed legal action. Although there are as yet no protocols for such tests, a methodology can be developed from advanced practices—network baselining techniques and specific RFCs addressing benchmark testing—and an adaptation of the Flaw Hypothesis Methodology.
The Flaw Hypothesis Methodology is highly dependent on the knowledge and skills of the team assembled to conduct the penetration test, making recruiting the right participants a key to success. This holds true for the OCTDF. In order to develop a useful and robust calibration protocol for low layer devices, the right team of participants must be assembled to collaborate on the work. By implication, the RFC process, with its access to the main architects of the Internet and leading networking academicians and practitioners, will
be ideal for vetting the framework and any calibration methodology created using it. Future work will involve further refinement of the OCTDF, creation of a generally accepted calibration standard for low layer network devices, and continued development of the NFDLC methodology, leading to implementation in a user network.

VI. REFERENCES

[1] Endicott-Popovsky, B.E., Chee, B. and Frincke, D. "Role of calibration as part of establishing foundation for expert testimony," in Proceedings of the 3rd Annual IFIP WG 11.9 Conference, January 29-31, Orlando, FL.
[2] Weissman, C. (1973). "System Security Analysis: Certification, methodology and results." Tech Report No. SP-3728, System Development Corporation.
[3] Weissman, C. (1995). "Penetration testing." In M. Abrams, S. Jajodia, and H. Podell (Eds.), Information Security: An Integrated Collection of Essays, pp. 269-296. Los Alamitos, CA: IEEE Computer Society Press.
[4] Allen, N., Level 3 Escalation Engineer, Fluke Networks, Fluke Corporation, Everett, WA. Personal interviews, 2006.
[5] Chee, B., Director of the Advanced Network Computing Laboratory, University of Hawaii at Manoa, Honolulu, HI. Personal interviews, Summer 2006.
[6] Bradner, S. and McQuaid, J. (1999). "RFC 2544: Benchmarking Methodology for Network Interconnect Devices," IETF Network Working Group. Retrieved March 14, 2007 from http://www.faqs.org/rfcs/rfc2544.html.
[7] Mauffer, T., Director of Technical Marketing, Mu Security, San Jose, CA. Personal telephone interviews, 2007.
[8] Newman, D., Chief Network Testing Engineer, Network Test, Inc., San Jose, CA. Personal telephone interviews, 2007.
[9] Allen, N. (2006, September). "Are You Seeing What You Expected?: Troubleshooting in a switched network environment." Presentation given at the Agora, University of Washington, Seattle, WA.
[10] Fisher, A. "Monitoring for piece of mind: NetOptics taps provide complete visibility without interference," in The Tao of Network Security Monitoring. Retrieved August 13, 2006 from http://www.netoptics.com/news/default.asp?pageid=33&prid=393.
[11] Orton, I., King County (Washington) Deputy Prosecutor. Personal interviews, August 24, 2006 and September 29, 2006.
[12] NIST, Computer Forensics Tool Testing (CFTT) Project. Retrieved January 16, 2007 from http://www.cftt.nist.gov/.
[13] Smith, F. and Bace, R. (2003). A Guide to Forensic Testimony. San Francisco: Addison Wesley.
[14] Nassir, D. (2000). Network Performance Baselining. Indianapolis, IN: MTP.
[15] Mirkovic, J., Dietrich, S., Dittrich, D. and Reiher, P. (2005). Internet Denial of Service: Attack and Defense Mechanisms. Upper Saddle River, NJ: Prentice Hall Professional Technical Reference.
[16] IETF, "Overview." Retrieved February 20, 2007 from http://www.ietf.org/overview.html.
[17] RFC Editor. (November 2006). "Overview of RFC Document Series." Retrieved March 15, 2007 from http://www.rfc-editor.org/RFCoverview.html.
[18] Bradner, S. (1991). "RFC 1242: Benchmarking Terminology for Network Interconnection Devices." IETF Network Working Group. Retrieved March 14, 2007 from http://www.faqs.org/rfcs/rfc1242.html.
[19] Bishop, M. (2003). Computer Security: Art and Science. New York: Addison-Wesley.
[20] Bishop, M. (2004). Introduction to Computer Security. New York: Addison-Wesley Professional.
[21] Chee, B. "Testing Notes: Asking the Right Questions." Slashdot post, June 15, 2006.
[22] Endicott-Popovsky, B. and Frincke, D. "Embedding forensic capabilities into networks: addressing inefficiencies in digital forensics investigations," in Proceedings of the Seventh IEEE Systems, Man and Cybernetics Information Assurance Workshop, 21-23 June 2006, United States Military Academy, West Point, NY, pp. 133-139.
[23] Tan, J. (2001). "Forensic Readiness." Cambridge, MA: @Stake.