Testing Distributed Systems with TMT

Klaus Berg, Sabine Canditt, Anja Hentschel, Erwin Reyzl, Peter Zimmerer

Common address and fax for all authors:
Dept. ZT SE 1, Siemens AG, Otto-Hahn-Ring 6, D-81730 Munich, Germany
Fax: 0049-89-63640898
Abstract
TMT is a Test and Monitoring Tool for distributed systems. It has a graphical monitoring front-end that visualizes system-internal registration, communication and synchronization events based on recorded traces. TMT employs a flexible concept of tracing backends that allows automated tracing for different types of distributed systems (for standard middleware as well as for proprietary systems). Among various trace analysis functions, TMT offers special test functionality: the interaction mode allows the user to directly influence the System Under Test during runtime. This is useful to set up specific test conditions as well as to enable fault injection during robustness testing. Test Reports serve as documentation of test runs. Using the regression test support, the user can perform a Test for Identity and a Test for Equivalence that considers the concurrency aspects of a distributed system.
Keywords: Monitoring – Test – Instrumentation – Trace Analysis – Distributed Software

1 Motivation
Distributed software systems play an increasingly important role in many technology areas (e.g. e-commerce, business applications, telecommunications, traffic, industry automation). Their importance is augmented by the availability of standard middleware platforms (Microsoft COM/DCOM, OMG CORBA, Java RMI), for which a growing dissemination is anticipated [1]. To deliver high quality in a short time, there must be efficient methods to validate correct behavior and find errors. However, typical failures in distributed systems do not arise during the test of single components, but for the first time during integration and system test. Furthermore, they are often sporadic, unforeseeable and thus especially hard to find. There is no doubt about the importance of efficient testing; the tool market for automated testing of software and systems is anticipated to grow by approx. 30 % per year [2]. There is a wide spectrum of tools for test case definition, generation, management and execution. However, little support is offered to locate typical "distribution errors". TMT [3] is a Test and Monitoring Tool that addresses this problem and reduces the testing effort in the system integration phase. Based on recordings of application-internal processing, TMT graphically visualizes the course of events. Additionally, it offers functionality for trace analysis and test and thereby helps to understand, validate and debug the complex behavior of distributed applications. By building a bridge between the source code view during unit test ("whitebox") and the application view during system test ("blackbox"), TMT opens a "greybox" view of the distributed system. As opposed to comparable tools that are suitable only for certain middleware platforms (SilkObserver [4], VS Analyzer [5] and EWatch [6]), TMT employs a flexible concept of tracing backends.
As such, TMT can be applied on proprietary systems as well as on standard middleware systems and on heterogeneous systems composed of various components.
In this article, we introduce TMT with its basic characteristics, concentrating on its test functions.
2 TMT Overview

2.1 Principles
The TMT architecture is characterized by the following features:
• TMT is based on a generic abstraction model that is applicable to arbitrary distributed systems (standard, proprietary, heterogeneous).
• A general trace event interface allows for arbitrary Tracing Backends (TBE): individually traced proprietary systems as well as automatically traced systems using standard middleware platforms.
• A test and monitoring front-end with a graphical user interface centrally visualizes the course of events and supports analysis, validation and error tracking.
There are two ways to transport trace information to TMT (Figure 1):
• sending it to TMT via socket (online mode, optionally with user interaction),
• storing it in a local file that is later read by TMT (offline mode).
Figure 1: TMT Tracing Concept
TMT is written in pure Java 2 to support the mainstream platforms Windows NT and Unix.
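The two transport modes can be sketched in Java as follows. This is a minimal illustration only; the line-based event format and all class and method names are assumptions for this sketch, not the actual TMT trace interface.

```java
import java.io.IOException;
import java.io.PrintWriter;
import java.io.Writer;
import java.net.Socket;

// Sketch of the two trace transport modes (online via socket, offline via file).
public class TraceTransport {
    // Format one trace event as a single text line (hypothetical wire format).
    static String format(String eventType, String task, long timestamp) {
        return timestamp + ";" + task + ";" + eventType;
    }

    // Online mode: send the event line to the TMT front-end over a socket.
    static void sendOnline(Socket tmt, String line) throws IOException {
        PrintWriter out = new PrintWriter(tmt.getOutputStream(), true);
        out.println(line);
    }

    // Offline mode: append the event line to a local trace file
    // that TMT reads later.
    static void storeOffline(Writer traceFile, String line) throws IOException {
        traceFile.write(line);
        traceFile.write('\n');
    }
}
```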
2.2 Tracing
To inform TMT of an event, the System Under Test (SUT) has to be instrumented, i.e. tracing points have to be added. This can be done automatically (i.e. without touching the application source code) if a preinstrumented standard middleware platform is used. Another possibility is manual instrumentation, where library calls have to be inserted at the tracing points (i.e., the source code has to be modified). If the SUT already delivers tracing information in a different format, converters can easily be attached to transform it to the TMT format.
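Manual instrumentation as described above might look like the following sketch, where a library-style tracing call is inserted at each tracing point of the application code. The `event` method and the recorded string format are assumptions for illustration; the real TMT library API is not shown in this article.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of manual instrumentation: hypothetical tracing calls are inserted
// at the tracing points of the application source code.
public class Instrumented {
    static final List<String> trace = new ArrayList<>();

    // Hypothetical tracing library call: records one event.
    static void event(String type, String thrower, String info) {
        trace.add(type + ":" + thrower + ":" + info);
    }

    // Application code with tracing points added manually.
    static String handle(String msg) {
        event("TASK_RECEIVE_MESSAGE", "Worker", msg); // tracing point
        String reply = msg.toUpperCase();             // actual processing
        event("TASK_SEND_MESSAGE", "Worker", reply);  // tracing point
        return reply;
    }
}
```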
2.3 Abstraction Model
TMT interprets the recorded traces based on its generic abstraction model. The elements (components) of the SUT are grouped as follows:
• The SUT is a network of Devices.
• Devices house Processes.
• Processes house active entities (Tasks) and passive entities (Service Providers).
Tasks and Service Providers communicate via message passing or Remote Procedure Calls (RPCs). TMT also knows certain control mechanisms (Task creation, start, stop and destruction; the join mechanism). Single actions to be recognized by TMT are called events. For example, a message passed between Tasks produces at least two events: a TASK_SEND_MESSAGE event (created by the sending Task) and a TASK_RECEIVE_MESSAGE event (created by the receiving Task). The abstraction can be applied to a wide range of distributed applications (e.g. by mapping objects to Service Providers and threads to Tasks).
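The Device/Process/element grouping can be sketched as a small in-memory model. This is a hypothetical illustration of the abstraction, not TMT's internal representation; all names are assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Sketch of the element hierarchy: the SUT is a network of Devices,
// Devices house Processes, Processes house Tasks and Service Providers.
public class Hierarchy {
    // device -> process -> list of Tasks / Service Providers
    static final Map<String, Map<String, List<String>>> sut = new TreeMap<>();

    static void add(String device, String process, String element) {
        sut.computeIfAbsent(device, d -> new TreeMap<>())
           .computeIfAbsent(process, p -> new ArrayList<>())
           .add(element);
    }

    static List<String> elementsOf(String device, String process) {
        return sut.getOrDefault(device, Map.of())
                  .getOrDefault(process, List.of());
    }
}
```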
2.4 Graphical and Textual Representation
TMT displays traces in different views:
• The List View in the Control Panel (Figure 2) is the textual representation of the trace, i.e. all events are displayed in the order of their appearance.
• The Sequence Chart View (Figure 3) is the graphical representation of the course of events. Tasks and Service Providers are displayed as horizontal lines; events are displayed as icons on these lines. TMT determines and displays causal relations between events (e.g. the connection between a TASK_SEND_MESSAGE and a TASK_RECEIVE_MESSAGE event is drawn as a line). Events may be displayed either in logical order with constant distances or according to their timestamps.
• The Event Details View (Figure 4) is the textual representation of the complete event information (e.g. message contents and call parameters).
• The Hierarchy View (Figure 5) is the tree-like representation of the application structure (composed of Devices, Processes, Tasks and Service Providers).
Figure 2: Control Panel and List View
Figure 3: Sequence Chart View
Figure 4: Event Details View
Figure 5: Hierarchy View
The Control Panel is operated like a video recorder with buttons for (fast) forward and backward navigation in the trace. Events selected in one view are automatically selected in the other views. It is possible to step and jump comfortably within one view and between views. The combination of the views presents the trace information in a way that is easy to handle and to survey. TMT offers sophisticated analysis functionality that helps to deal with distributed systems (merging trace files, condensing information, checking traces for incorrectness, dealing with unsynchronized device clocks, overseeing self-defined diagnosis conditions, data and performance monitoring).
3 Test
There are various tools on the market supporting the test process for different purposes. TMT offers specific test support based on single test runs, but no help in planning the overall test process. Its typical usage is not during the execution of an entire test suite (e.g. "does the entire system behave in the same way as it did before?") but in the detailed analysis (e.g. "where exactly is the difference between two test runs?"). The following list gives a more detailed classification of the TMT test features:
• test of robustness (interaction)
• test documentation
• comparison of test runs (regression testing)
TMT does not offer any help in:
• test planning and management
• requirement analysis and generation of test cases
• code analysis
• coverage analysis
• load test
• GUI test
• test drivers
TMT offers active (influencing the SUT) and passive (documenting and comparing test runs) test functionality.
3.1 Interaction
Interaction is a means to directly influence the SUT via instrumentation. It can be used in the TMT online mode together with the following communication events:
• RPC Client Begin / End
• RPC Server Begin / End
• Task Send / Receive Message
The operation is illustrated using the RPC example in Figure 6: When a Task wants to trace an RPC, it sends its information to TMT. The tracing of an RPC starts with the event RPC_CLIENT_BEGIN. In normal mode, the calling Task then continues with its operation (i.e. invoking the RPC on the Service Provider). In interaction mode, the application stops after the submission of the RPC_CLIENT_BEGIN event and waits for notification by TMT. TMT opens an interaction dialogue window (Figure 7) that offers the following possibilities:
• Change Values (RPC call parameters)
• No Change
• Abort RPC
The calling Task then actually performs the changes required by the user (i.e. changes the parameters before invoking the RPC or continues without executing the RPC).
Figure 6: RPC with User Interaction
Using interaction, the user can influence the system's behavior during runtime. Specific test conditions can thus be set up easily, erroneous information can be corrected, and faults can be injected on purpose to test the system's robustness.
Figure 7: Interaction dialogue
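The effect of the three dialogue choices on the pending RPC can be sketched as follows. The enum values and the method are illustrative assumptions; the actual TMT notification protocol between front-end and SUT is not shown here.

```java
// Sketch of the interaction mode: after emitting RPC_CLIENT_BEGIN, the
// calling Task waits for TMT's decision and applies it before invoking
// the RPC.
public class Interaction {
    enum Decision { NO_CHANGE, CHANGE_VALUES, ABORT_RPC }

    // Returns the call parameter actually used for the RPC,
    // or null if the RPC is aborted.
    static String applyDecision(String param, Decision d, String newValue) {
        switch (d) {
            case CHANGE_VALUES: return newValue; // user edited the call parameter
            case ABORT_RPC:     return null;     // RPC is not executed at all
            default:            return param;    // continue unchanged
        }
    }
}
```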
3.2 Test Documentation
TMT knows two different ASCII file types for documentation: the Snapshot and the Test Report (Figure 8). The Snapshot is created for one specific view. It contains the meta-information of the test run: the trace file, the last processed event, the date and time, and arbitrary comments. The Test Report, which is generated based on the data given in the Snapshot, contains all the relevant information of the view. View-specific attributes such as time, event number, event thrower, identifiers etc. are documented in a table-like fashion. Using filters, it is possible to exclude attributes (columns) as well as specific events (lines) from this Test Report table. Thus, various Test Reports (with different filter settings) may be generated based on one Snapshot; the meta-information has to be entered only once.
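The column and line filtering of a Test Report table can be sketched as follows. The column names and the pipe-separated row layout are assumptions for illustration; the real report format may differ.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Sketch of Test Report generation with filters: attributes (columns)
// and specific event types (lines) can be excluded from the table.
public class ReportFilter {
    static List<String> report(List<Map<String, String>> events,
                               Set<String> dropColumns,
                               Set<String> dropEventTypes) {
        List<String> rows = new ArrayList<>();
        for (Map<String, String> e : events) {
            if (dropEventTypes.contains(e.get("event"))) continue; // drop line
            StringBuilder row = new StringBuilder();
            for (Map.Entry<String, String> col : e.entrySet()) {
                if (dropColumns.contains(col.getKey())) continue;  // drop column
                if (row.length() > 0) row.append(" | ");
                row.append(col.getValue());
            }
            rows.add(row.toString());
        }
        return rows;
    }
}
```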
3.3 Regression Test Support

Figure 8: Test File Generation and Regression Test Procedure
During regression testing, a run reflecting the current system status is compared to an older, correct one (Figure 8). To pass the regression test, the two runs are required to be "the same". However, the definition of "the same" is not straightforward in distributed systems. There are attributes that naturally differ between two runs, e.g. the timestamps of events and the element and communication identifiers that depend on the sequence of occurrence. The sequence of events is not deterministic due to the underlying concurrency of tasks operating independently. With TMT, one can precisely define the criteria for two test runs to be "the same". The comparison of runs is based on generated Test Reports in a textual table format. Applying filters, one can exclude irrelevant attributes (columns) from the comparison (e.g. time and identifiers) as well as specific events (lines).
TMT can perform a "Test for Identity" or a "Test for Equivalence" (Figure 9). The basic test is the Test for Identity, which compares two Test Reports containing all the view-specific trace attributes not explicitly excluded by filtering. For two traces to be identical, the sequence and contents (not explicitly excluded by filtering) of events have to be the same. This test is enhanced by the Test for Equivalence, which considers the concurrency aspects of a distributed system. For two traces to be equivalent, the sequence and contents (not explicitly excluded by filtering) of events have to be the same only with respect to each single Task or Service Provider. See Figure 9 for an example: Trace 1 and Trace 2 do not have the same timestamps. Still they are identical, as the sequence of events and their contents are the same (provided the time has been explicitly filtered out, i.e. not been considered during the comparison of the Test Reports). Comparing Trace 3 to Trace 1, one finds that the sequence of events is not the same (consider dark grey 1 and light grey 2). The traces are not identical, but equivalent, as the sequence of events and their contents are the same for each element (Server, Client1 and Client2). Now compare Trace 4 to Trace 1: the sequence of events is not the same for the Server element (consider dark grey 3 and light grey 3), i.e. the traces are neither identical nor equivalent. With the filtering possibilities and the Test for Equivalence, one can precisely define the criteria that have to be fulfilled to perform suitable regression testing.
Figure 9: Test for Identity and Equivalence
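After filtering, the two comparison criteria reduce to comparing event sequences, globally for identity and per element for equivalence. The following minimal sketch uses a hypothetical `element:content` string encoding for filtered events; it illustrates the criteria, not TMT's actual comparison code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Sketch of the Test for Identity vs. the Test for Equivalence.
public class RegressionCompare {
    // Identity: the global sequence of (filtered) events must be the same.
    static boolean identical(List<String> a, List<String> b) {
        return a.equals(b);
    }

    // Equivalence: for each Task / Service Provider, the projected
    // subsequence of its events must be the same in both traces.
    static boolean equivalent(List<String> a, List<String> b) {
        return project(a).equals(project(b));
    }

    // Group the trace into per-element subsequences.
    static Map<String, List<String>> project(List<String> trace) {
        Map<String, List<String>> perElement = new TreeMap<>();
        for (String event : trace) {
            String element = event.split(":")[0];
            perElement.computeIfAbsent(element, e -> new ArrayList<>()).add(event);
        }
        return perElement;
    }
}
```

In the spirit of Figure 9, swapping the order of events of two independent clients keeps the traces equivalent, while reordering events within one element breaks equivalence.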
4 Perspectives
Currently, TMT is used in several Siemens-internal projects (business units Automotive, Information and Communication Networks) with encouraging feedback. However, especially for users applying TMT to their system for the first time, the instrumentation effort cannot be neglected. TMT requires information (e.g. unique identifiers for Tasks and Service Providers) that is not readily available and must be generated specifically. Most users want to touch their source code as little as possible. With our current work, we therefore focus on supporting users in instrumenting their systems. One major target is to provide instrumented middleware platforms for DCOM, CORBA, and Java RMI applications. A DCOM tracing backend will be available by 9/2000. In a recent project, automatic instrumentation for a CORBA application has been achieved by eavesdropping on the network traffic. The possibility of converting existing traces to the TMT format is also an interesting option for potential users, and we have already developed several converters. Together with our users, we work on finding and developing the best individual tracing solution.
We will also extend TMT with new testing and analysis facilities. A new feature of TMT, already implemented in a first version, concerns the analysis of races. It is well understood that races, besides the interleaving of concurrent actions and internal nondeterminism, are a form of nondeterminism that is often inherent to concurrent systems. Depending on the execution order of events, the global state of the system reached after a race is ambiguous. Races are a potential source of software faults that are very difficult to detect during test execution in concurrent systems. They require repeated test runs of the same test case with varying execution speeds of test events and an analysis of the test result after each test run. As a possible alternative to dynamic testing, potential races can be found and analyzed based on traces observed during the execution of the concurrent system. This means that no specification of the concurrent distributed system is needed. Instead, potential race conditions are deduced from the mere observation of traces. The test engineer then has to decide whether a certain constellation of message exchanges really constitutes a race. The trace analysis is based on a suitable definition of a happens-before relation between communication events. Together with the existing testing possibilities (Test for Identity and Test for Equivalence), the race analysis will support the user in evaluating a trace.
Another new issue is to provide TMT with new performance monitoring capabilities. Using the TMT Data Monitoring, it is already possible to visualize system- and application-specific measurable data, which enables the user to get a detailed view into the state of program algorithms. The user can monitor system data like CPU and memory usage as well as arbitrary data of the application, like the value of certain variables. In the future, we plan to extend these features by a statistically pre-processed presentation of communication-related data, e.g. the RPC round-trip times (from RPC_CLIENT_BEGIN to RPC_CLIENT_END), which may be the basis for the derivation of performance figures such as throughput, latencies and utilization. This capability will extend the user's scope from a low-level view of single events or groups of events to a high-level view of the overall performance.
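The core of such a trace-based race analysis can be sketched as follows: a happens-before relation is derived from per-element program order plus send-to-receive edges, and two messages arriving at the same element are flagged as a potential race when their send events are unordered. This is a deliberately simplified model for illustration, not TMT's actual algorithm; events are plain integer indices and the edge lists are supplied by the caller.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Sketch of deducing potential races from an observed trace via a
// happens-before relation over communication events.
public class RaceAnalysis {
    // Event i happens before event j if there is a path of edges
    // (per-element program order and send->receive edges) from i to j.
    static boolean happensBefore(int i, int j, List<Set<Integer>> succ) {
        Deque<Integer> stack = new ArrayDeque<>(succ.get(i));
        Set<Integer> seen = new HashSet<>();
        while (!stack.isEmpty()) {
            int e = stack.pop();
            if (e == j) return true;
            if (seen.add(e)) stack.addAll(succ.get(e));
        }
        return false;
    }

    // Two messages received by the same element form a potential race
    // if their send events are unordered under happens-before.
    static boolean potentialRace(int send1, int send2, List<Set<Integer>> succ) {
        return !happensBefore(send1, send2, succ)
            && !happensBefore(send2, send1, succ);
    }
}
```

For example, two clients sending independently to one server are flagged as a potential race, while adding an ordering edge between the two sends removes the flag.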
5 Summary
In this article, we have introduced the Test and Monitoring Tool TMT, which provides support for virtually any kind of distributed application. TMT's active test support allows the user to interact directly with the System Under Test during runtime. Furthermore, TMT offers passive test functionality to document and compare different test runs. Criteria for regression testing may be flexibly defined.
6 References
[1] GartnerGroup. Middleware Deployment Trends: Survey of Real-World Enterprise Applications. Strategic Analysis Report, April 1999.
[2] OVUM Ltd. OVUM Report on Software Testing Tools. 1999.
[3] Berg, Canditt, Hennig, Hentschel, Reyzl, Schmitz-Foster. Monitoring with TMT – Insight into Distributed Systems. Proceedings of the PDPTA, Las Vegas, June 2000.
[4] http://www.segue.com
[5] http://msdn.microsoft.com/vstudio
[6] http://www.averstar.com
7 Acknowledgement
TMT has been developed on behalf of the Siemens department ICN EN HC SE 424. The authors wish to thank their team colleagues Klaus Grabenweger, Andreas Hennig (especially for his valuable tips and tricks in word processing), and Jürgen Schmitz-Foster, who contributed substantially to TMT as it is today.
8 About the Authors
Klaus Berg studied Electrical Engineering at the University of Karlsruhe in Germany and now works on system monitoring and performance engineering. Currently he is an architect and implementor in the team at Siemens that is building TMT. His research interests are Java GUI development and professional print and export support for Java applications. Contact:
[email protected] Sabine Canditt received her diploma in Electrical Engineering from the Technical University Munich in 1986. Since 1989, she has been employed with the Corporate Technology Department of Siemens AG
working on several projects in system development, evaluation, and performance engineering. She has been involved with TMT since the beginning of the project in 1997. Contact:
[email protected] Anja Hentschel received her diploma in Computer Science from the Technical University of Braunschweig in 1990. Afterwards, she joined the Corporate Technology Department of Siemens AG where she now works on system monitoring and performance engineering. Her main focus in the context of TMT has been the development of mechanisms for preprocessing and analysis of monitoring data. Contact:
[email protected] Erwin Reyzl received his diploma in Mathematics from the Ludwig-Maximilian-University Munich in 1986. Since 1987, he has been employed with the Corporate Technology Department of Siemens AG. He has been involved in development projects on parallel computing and telecommunication systems. Since 1997, he has been project manager of the TMT development. Contact:
[email protected] Peter Zimmerer studied Computer Science at the University of Stuttgart, Germany, and received his M.Sc. degree (Diplominformatiker) in 1991. He then joined Siemens AG, Corporate Technology, and has been working in the field of software testing for object-oriented (C++, Java), distributed, component-based, and embedded software. He is co-author of several international conference publications, e.g. at EuroStar, the Conference on Testing Computer Software, and Software Quality Week. Contact:
[email protected]