Robustness Validation in Service Oriented Architectures

Nuno Laranjeiro, Marco Vieira and Henrique Madeira
CISUC, Department of Informatics Engineering, University of Coimbra
3030-290 Coimbra, Portugal
{cnl, mvieira, henrique}@dei.uc.pt

Abstract. The use of Service Oriented Architecture (SOA) in enterprise applications development is increasing very quickly. In a SOA environment providers supply a set of services that must be robust. Although SOA is being used in business-critical environments, there is no practical means to measure or compare the robustness of services. Robustness failures in such environments are dangerous, as they can be maliciously exploited with severe consequences for the attacked systems. This chapter addresses the problem of robustness validation in SOA environments. The approach proposed is based on a set of robustness tests that is used to discover both programming and design errors. Two concrete examples are presented: one focusing on web services and the other targeting Java Message Service (JMS) middleware. The proposed approach is useful for both providers (to validate the robustness of deployed services) and consumers (to select the services that best fit architectural requirements).

Keywords: Benchmarking, Online Information Services, Reliability and Robustness, Testing and Debugging.

1 Introduction

Service Oriented Architectures (SOA) are now widely used to support much of today's business, linking suppliers and clients in sectors such as banking and financial services, transportation, automotive manufacturing, and healthcare, just to name a few. Web services are the key element of modern SOA [9]: they are self-describing components that can be used by other software across the web in a platform-independent manner, supported by standard protocols such as WSDL (Web Services Description Language) and UDDI (Universal Description, Discovery, and Integration) [8]. Services provide a simple interface between a provider and a consumer, where the former offers a set of services that are used by the latter. An important aspect is service composition, which is based on a collection of services working together to achieve an objective. This composition is normally a "business process" that describes the sequencing and coordination of calls to the component services. Thus, if one component fails, the composite service may suffer an outage.

Although SOAs using web services are increasingly being used in complex business-critical systems, current development support tools do not provide any practical way to assess the robustness of service implementations or to compare the robustness of alternative solutions. This chapter proposes an experimental approach to benchmark the robustness of services and of the middleware supporting SOA. Our approach is based

on a set of robustness tests that allows discovering both programming and design problems, including the identification of non-compliances between the specification and the way SOA middleware actually operates. Robustness is characterized according to the following failure modes: Catastrophic, Restart, Abort, Silent, and Hindering (adopted from the CRASH scale [15]). Specification conformance issues are categorized in three classes (from 1 to 3), according to the severity of the detected problems.

The need for practical means to assess and benchmark robustness in SOA is corroborated by several studies that show a clear predominance of software faults (i.e., program defects or bugs) [18], [13], [7] as the root cause of computer failures; given the huge complexity of today's software, the weight of this type of fault will tend to increase. Services in SOA are no exception, as they are normally intricate software components that implement a compound effort, in some cases using compositions of several services, which makes them even more complex. Interface faults, related to problems in the interaction among software components/modules [45], are particularly relevant in service-oriented environments. In fact, services must provide a robust interface to client applications even in the presence of invalid inputs, which may occur due to bugs in the client applications, corruptions caused by silent network failures, or even security attacks.

Classical robustness testing, which inspired the approach proposed in this chapter, is an effective way to characterize the behavior of a system in the presence of erroneous input conditions. It has been used mainly to assess the robustness of operating systems and operating system kernels [25], [15], [29], but the concept of robustness testing can be applied to any kind of interface. Robustness tests stimulate the system under testing through its interfaces, submitting erroneous input conditions that may trigger internal errors or reveal vulnerabilities. The proposed approach consists of a set of robustness tests (i.e., invalid service call parameters) that are applied during execution in order to observe robustness problems of the service itself and of the supporting middleware. This way, it can be used to:
– Evaluate the robustness of services. This is useful in three different scenarios: 1) helping providers to evaluate the robustness of the code of their services; 2) helping consumers to pick the services that best fit their requirements; and 3) helping providers and/or consumers to choose the web services for a given composition.
– Evaluate the robustness of middleware supporting SOA. This is useful for administrators and system integrators to select the middleware that best fits specific requirements by evaluating the robustness of different alternatives. In addition, it is also useful to help administrators tune the system configuration.

The work presented in this chapter is based on previous work on robustness testing for web services [43] and Java Message Service (JMS) middleware [16], the first of which is available as an on-line tool [17]. In this chapter, we formalize the concepts and propose a generic approach for the definition of robustness benchmarks for services in general. This is in fact a key contribution of the chapter, as it defines the basis for the development of robustness benchmarks for service oriented environments.
The effectiveness of the proposed approach is demonstrated through two concrete examples of robustness benchmarking (although the approach is generic and can be used to define other benchmarks for different types of services). The first focuses on benchmarking the robustness of SOAP web services and the second focuses on benchmarking the robustness of Java Message Service middleware implementations.

Several robustness problems have been disclosed in the experiments presented in these examples, including major security vulnerabilities and critical robustness failures. The structure of the chapter is as follows. The next section presents background and related work, addressing relevant aspects of SOA and robustness benchmarking. Section 3 presents the robustness testing approach. Section 4 presents the first benchmark example, which focuses on SOAP web services. Section 5 presents a benchmark for JMS middleware. Finally, Section 6 concludes the chapter.

2 Background and Related Work

Service Oriented Architecture (SOA) is an architectural style that steers all aspects of creating and using services throughout their lifecycle, as well as defining and providing the infrastructure that allows heterogeneous applications to exchange data. This communication usually involves participation in business processes that are loosely coupled to their underlying implementations. SOA represents a model in which functionality is decomposed into distinct units (services) that can be distributed over a network and can be combined and reused to create business applications. This profile makes integration technologies such as web services and JMS natural candidates for SOA implementations [9].

A dependability benchmark is a specification of a standard procedure to measure both the dependability and the performance of computer systems or components. The main goal is to provide a standardized way to compare different systems or components from a dependability point of view. Compared to typical performance benchmarks, such as the TPC (Transaction Processing Performance Council) [41] and SPEC (Standard Performance Evaluation Corporation) [33] benchmarks, which consist mainly of a workload and a set of performance measures, a dependability benchmark adds two new elements: 1) measures related to dependability; and 2) a faultload that emulates real faults experienced by systems in the field [44]. This way, the main elements of a dependability benchmark are:
– Measures: characterize the performance and dependability of the system under benchmarking in the presence of the faultload when executing the workload. The measures must be easy to understand and must allow the comparison between different systems in the benchmark domain.
– Workload: represents the work that the system must perform during the benchmark run. The workload emulates real work performed by systems in the field.
– Faultload: represents a set of faults and stressful conditions that emulate real faults experienced by systems in the field.
– Benchmark procedure and rules: description of the procedures and rules that must be followed during the benchmark implementation and execution.
– Experimental setup: describes the setup required to run the benchmark.

Robustness is a particular aspect of dependability, and in this sense a robustness benchmark can be considered a form of dependability benchmark. A robustness benchmark typically consists of a suite of robustness tests or stimuli [25], which stimulate the system in a way that triggers internal errors, and in that way exposes both

programming and design errors. Systems can be differentiated according to the number of errors uncovered.

Many relevant studies [32], [24], [5], [14], [10] evaluate the robustness of software systems; nevertheless, [15] and [29] present the most popular robustness testing tools, Ballista and MAFALDA, respectively. Ballista is a tool that combines software testing and fault injection techniques [15]. Its main goal is to test software components for robustness, focusing especially on operating systems. Tests are made using combinations of exceptional and acceptable input values of parameters of kernel system calls. The parameter values are extracted randomly from a database of predefined tests, and a set of values of a certain data type is associated with each parameter. The robustness of the target operating system (OS) is classified according to the CRASH scale. Initially, Ballista was developed for POSIX APIs (including real-time extensions). More recent work has adapted it to the Windows OS [30] and also to various CORBA ORB implementations [26].

MAFALDA (Microkernel Assessment by Fault injection AnaLysis and Design Aid) is a tool that allows the characterization of the behavior of microkernels in the presence of faults [29]. MAFALDA supports fault injection both into the parameters of system calls and into the memory segments implementing the target microkernel. MAFALDA has been improved (MAFALDA-RT) [28] to extend the analysis of faulty behaviors to include the measurement of response times, deadline misses, etc. These measurements are quite important for real-time systems and are possible due to a technique used to reduce the intrusiveness related to the fault injection and monitoring events. Another study [20] has been carried out to extend MAFALDA in order to allow the characterization of CORBA-based middleware.

A study on the robustness properties of Windows device drivers is presented in [21]. Recent OS kernels tend to become thinner by delegating capabilities to device drivers (which presently represent a substantial part of the OS code, since nowadays a computer interacts with many devices), and a large part of system crashes can be attributed to these device drivers because of residual implementation bugs. In this study, the authors present an evaluation of the robustness properties of Windows (XP, 2003 Server, and Vista), concluding that in general these OSs appear to be vulnerable, and that some of the injected faults can cause systems to crash or hang.

One of the first examples of robustness testing applied to web services can be found in [31]. This paper proposes a technique to test web services using parameter mutation analysis. The WSDL document is parsed and mutation operators are applied to it, resulting in several mutated documents that are used to test the service. Despite the effort put into this approach, the parameter mutation operators are very limited. A similar approach is presented in [47]. Although it represents a more complete study, the tight coupling to the XML (eXtensible Markup Language) technology prevents any generalization of the tests (i.e., they do not apply to other technologies).

Concerning performance benchmarking of web services, two key benchmarks have been proposed: the TPC Benchmark™ App (TPC-App) [42], a performance benchmark for web services infrastructures; and the SPEC jAppServer2004 benchmark [34], designed to measure the performance of J2EE 1.3 application servers.
In addition, major vendors (e.g., Sun Microsystems, Microsoft) have also undertaken efforts towards the comparison of web services performance [23], [39]. Regarding JMS, there are several studies that focus on performance or Quality of

Service (QoS) assessment. In [37] the performance of two popular JMS implementations (Sun Java System Message Queue and IBM WebSphere MQ) is tested. Performance testing has also been applied in publish/subscribe domains in the presence of filters [22]; this work studies the scalability properties of several JMS implementations. In [6], an empirical methodology for evaluating the QoS of JMS implementations is presented, focusing on performance and message persistence. Concerning benchmarking initiatives, SPEC provides the SPECjms2007 benchmark [35], whose goal is to evaluate the performance of enterprise message-oriented middleware servers based on JMS. Despite the several JMS performance and QoS studies, no strategy has been developed for assessing the robustness properties of JMS middleware.

3 Robustness Benchmarking Approach for SOA Environments

A robustness benchmark can be defined as a specification of a standard procedure to assess the robustness of a computer system or computer component in the presence of erroneous inputs (e.g., random inputs, malicious inputs, boundary values, etc.).

The first step in the definition of a new robustness benchmark for SOA environments is the identification of the class of services targeted. Dividing the SOA domain into well-defined areas is necessary to cope with the huge diversity of services available and to make it possible to make choices in the definition of the benchmark components. In fact, most of the components are highly dependent on the application area. For example, the failure modes classification, the operational environment, and the workload are very dependent on the services being considered. The second step is the definition of the benchmark components, including:
– Failure modes classification: characterizes the behavior of the service while executing the workload in the presence of the robustness tests.
– Workload: work that the service must perform during the benchmark run.
– Robustness tests: faultload consisting of a set of invalid call parameters that is applied to the target services to expose robustness problems.

Additionally, the benchmark must specify the required experimental setup and testing procedure. The experimental setup describes the setup required to run the benchmark and typically includes two key elements: the Benchmark Target (BT), which represents the service that the benchmark user wants to characterize, and the Benchmark Management System (BMS), which is in charge of managing all the benchmarking experiments. The goal is to make the execution of the benchmark a completely automated process. Although this is evidently dependent on the actual benchmark specification, we identify three key tasks: experiments control, workload emulation, and robustness test execution. The testing procedure is a description of the steps and rules that must be followed during the benchmark implementation and execution. Although this is closely related to the class of services being targeted, we propose the following generic set of steps:
1. Tests preparation
   1.1. Analysis of the services under testing to gather relevant information.
   1.2. Workload generation.
2. Tests execution
   2.1. Execution of the workload generated in step 1.2 in order to understand the expected correct behavior of the service.
   2.2. Execution of the robustness tests in order to trigger faulty behaviors and thereby disclose robustness problems.
3. Service characterization, including failure modes identification (using the data collected in step 2) and comparison of different alternatives.

3.1 Workload

The workload defines the work that the system must perform during the benchmark execution. Three different types of workload can be considered for benchmarking purposes: real workloads, realistic workloads, and synthetic workloads.

Real workloads are made of applications used in real environments. Results of benchmarks using real workloads are normally quite representative. However, several applications are needed to achieve good representativeness and those applications frequently require some adaptation. Additionally, the workload portability depends on the portability requirements of all the applications used in the workload.

Realistic workloads are artificial workloads based on a subset of representative operations performed by real systems in the domain of the benchmark. Although artificial, realistic workloads still reflect real situations and are more portable than real workloads. Typically, performance benchmarks use realistic workloads.

In an SOA environment, a synthetic workload can be a set of randomly selected service calls. Synthetic workloads are easier to use and may provide better repeatability and portability than realistic or real workloads. However, the representativeness of these workloads is doubtful.

The definition of the workload for some classes of services may be considerably simplified by the existence of workloads from standard performance benchmarks. Obviously, these already established workloads are the natural choice for a robustness benchmark. However, when adopting an existing workload some changes may be required in order to target specific system features. An important aspect to keep in mind when choosing a workload is that the goal is not to evaluate performance but to assess specific robustness problems.

3.2 Robustness Tests

Robustness testing involves parameter tampering at some level. Thus, a set of rules must be defined for parameter mutation. An important aspect is that these rules must focus on limit conditions, which typically represent difficult validation aspects (and are normally the source of robustness problems). Indeed, in component-based software, as in the case of services or SOA in general, it is frequent to encounter relevant issues at contract level (i.e., at interface level) [45]. Frequently, providers and consumers assume different behaviors from a service invocation or from a service response, respectively. Main causes for such disparities include the specification being poorly designed, the implementation on either side not accurately following the specification, etc. In this sense, to assess the robustness of services we propose the use of a large set

of limit conditions when defining the mutation rules, which include:
– Null and empty values (e.g., null string, empty string).
– Valid values with special characteristics (e.g., nonprintable characters in strings, valid dates by the end of the millennium).
– Invalid values with special characteristics (e.g., invalid dates using different formats).
– Maximum and minimum valid values in the domain (e.g., maximum value valid for the parameter, minimum value valid for the parameter).
– Values exceeding the maximum and minimum valid values in the domain (e.g., maximum value valid for the parameter plus one).
– Values that cause data type overflow (e.g., add characters to overflow a string, replace a number by the maximum value valid for the type plus one).

The proposed generic rules, presented in Table 1, were defined based on previous works on robustness testing [15], [29]. Essentially, each rule is a mutation that is applied at testing time to an incoming parameter. For instance, when testing a service that accepts a String parameter, the StrNull test will be applied (as well as all other tests that apply to strings) by replacing the incoming value with a null value that is then delivered to the service.

3.3 Failure Modes Classification

Services robustness can be classified according to the severity of the exposed failures. Our proposal is to use the CRASH scale [15] (named sCRASH to indicate that it concerns a services environment) as the basis for services characterization and to tailor this scale according to the specificities of the class of services targeted. The following points exemplify how the sCRASH failure modes can be generically adapted for an (abstract) services environment:
– Catastrophic: The service supplier (i.e., the underlying middleware) becomes corrupted, or the server or the operating system crashes or reboots.
– Restart: The service supplier becomes unresponsive and a restart must be forced.
– Abort: Abnormal termination when executing the service. For instance, abnormal behavior occurs when an unexpected exception is thrown by the service.
– Silent: No error is indicated by the service implementation on an operation that cannot be concluded or is concluded in an abnormal way.
– Hindering: The returned error code is incorrect.

Note that, in some cases, the use of the sCRASH scale is very difficult or even not possible. For example, when the benchmark is run by service consumers it is not possible to distinguish between a Catastrophic and a Restart failure mode, as the consumer does not have access to the server where the service is running. To face this problem we also propose a simplified classification scale (the sAS scale) that is based on only two easily observable failure modes: Abort and Silent. Although it is a simplified characterization approach, it is still quite useful for service consumers. Obviously, these scales should be adapted to the targeted class of services, as some of the failure modes may not be relevant or may be impossible to observe in specific scenarios.

Table 1. Parameter mutation rules (robustness tests) applied according to the parameter data type.

Type    | Test Name             | Parameter Mutation
String  | StrNull               | Replace by null value
String  | StrEmpty              | Replace by empty string
String  | StrPredefined         | Replace by predefined string
String  | StrNonPrintable       | Replace by string with nonprintable characters
String  | StrAddNonPrintable    | Add nonprintable characters to the string
String  | StrAlphaNumeric       | Replace by alphanumeric string
String  | StrOverflow           | Add characters to overflow max size
Number  | NumNull               | Replace by null value
Number  | NumEmpty              | Replace by empty value
Number  | NumAbsoluteMinusOne   | Replace by -1
Number  | NumAbsoluteOne        | Replace by 1
Number  | NumAbsoluteZero       | Replace by 0
Number  | NumAddOne             | Add one
Number  | NumSubtractOne        | Subtract one
Number  | NumMax                | Replace by maximum number valid for the type
Number  | NumMin                | Replace by minimum number valid for the type
Number  | NumMaxPlusOne         | Replace by maximum number valid for the type plus one
Number  | NumMinMinusOne        | Replace by minimum number valid for the type minus one
Number  | NumMaxRange           | Replace by maximum value valid for the parameter
Number  | NumMinRange           | Replace by minimum value valid for the parameter
Number  | NumMaxRangePlusOne    | Replace by maximum value valid for the parameter plus one
Number  | NumMinRangeMinusOne   | Replace by minimum value valid for the parameter minus one
List    | ListNull              | Replace by null value
List    | ListRemove            | Remove element from the list
List    | ListAdd               | Add element to the list
List    | ListDuplicate         | Duplicate elements of the list
List    | ListRemoveAllButFirst | Remove all elements from the list except the first one
List    | ListRemoveAll         | Remove all elements from the list
Date    | DateNull              | Replace by null value
Date    | DateEmpty             | Replace by empty date
Date    | DateMaxRange          | Replace by maximum date valid for the parameter
Date    | DateMinRange          | Replace by minimum date valid for the parameter
Date    | DateMaxRangePlusOne   | Replace by maximum date valid for the parameter plus one day
Date    | DateMinRangeMinusOne  | Replace by minimum date valid for the parameter minus one day
Date    | DateAdd100            | Add 100 years to the date
Date    | DateSubtract100       | Subtract 100 years from the date
Date    | Date2-29-1984         | Replace by the invalid date 2/29/1984
Date    | Date4-31-1998         | Replace by the invalid date 4/31/1998
Date    | Date13-1-1997         | Replace by the invalid date 13/1/1997
Date    | Date12-0-1994         | Replace by the invalid date 12/0/1994
Date    | Date8-32-1993         | Replace by the invalid date 8/32/1993
Date    | Date31-12-1999        | Replace by the last day of the previous millennium
Date    | Date1-1-2000          | Replace by the first day of this millennium
Boolean | BooleanNull           | Replace by null value
Boolean | BooleanEmpty          | Replace by empty value
Boolean | BooleanPredefined     | Replace by predefined value
Boolean | BooleanOverflow       | Add characters to overflow max size
Object  | ObjectNull            | Set a null object
Object  | ObjectNonSerializable | Set a non-serializable object
Object  | ObjectEmpty           | Set a correct target class empty object
Object  | ObjectPrimitive       | Set an objectified primitive datatype (Boolean, Byte, Short, etc.)
Object  | ObjectCommon          | Set a common datatype (List, Map, Date)
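To make these rules concrete, the following minimal sketch (our own illustrative code; the class and method names are hypothetical and are not part of the benchmarks described in this chapter) shows how a few of the mutations of Table 1 can be produced for string and integer parameters. In an actual benchmark run, each generated value would replace the original parameter in the outgoing request of one injection period (see Section 4.2).

import java.util.Arrays;
import java.util.List;

// Illustrative sketch only: produces a few of the mutations of Table 1 for a
// parameter value before it is delivered to the service under test.
public final class ParameterMutator {

    // StrNull, StrEmpty, StrPredefined and StrOverflow applied to a string parameter.
    public static List<String> mutateString(String original) {
        StringBuilder overflow = new StringBuilder(original);
        for (int i = 0; i < 70000; i++) {
            overflow.append('x');              // exceed typical maximum sizes (StrOverflow)
        }
        return Arrays.asList(null,             // StrNull
                             "",               // StrEmpty
                             "'#$%&/()=?*+",   // StrPredefined (includes an apostrophe)
                             overflow.toString());
    }

    // NumAbsoluteZero, NumMax and NumMaxPlusOne applied to an integer parameter
    // (the "plus one" case deliberately overflows the data type; the original
    // value is not needed by these particular rules).
    public static List<Integer> mutateInt(int original) {
        return Arrays.asList(0, Integer.MAX_VALUE, Integer.MAX_VALUE + 1);
    }

    public static void main(String[] args) {
        System.out.println(mutateInt(42)); // prints [0, 2147483647, -2147483648]
    }
}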

Additionally, the failure modes characterization can be complemented with a detailed analysis of the service behavior in order to understand the source of the failures (i.e., the defects). The goal is to gather the relevant information needed to correct or wrap the identified problems. Obviously, the source of the failures (i.e., the defects) depends on several specificities of the services being tested (e.g., programming language, external interfaces, operational environment), which complicates the definition of a generic classification scale.

An important aspect is that, although our approach specifically targets robustness aspects, in some cases it can also be used to uncover robustness-related compliance problems between the specification and the underlying middleware implementation. This way, we also propose that the benchmark should include a simple and linear classification for non-conformity, or non-compliance, observations. These represent cases where acceptable inputs (as defined by the specification) trigger a problem that should not occur; even if that problem is handled correctly by the middleware implementation, it should never have occurred in the first place, since the inputs were acceptable. This classification should be simple and orthogonal. As a guideline we propose the following three levels (obviously they must be instantiated taking into account the specificities of the services being targeted by the benchmark): Level 1 (a severe non-conformity consisting of a gross deviation from the specification), Level 2 (a medium-severity non-conformity consisting of a partial deviation from the specification), and Level 3 (non-conformities that are the result of an effort to extend a standard specification). See Section 5.3 for the instantiation of this non-conformity classification to JMS environments.
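As a simple illustration of how the consumer-side sAS scale of Section 3.3 can be applied in practice, the sketch below (again illustrative code of ours, not part of any benchmark specification) maps what a consumer can observe for a single robustness test onto the Abort and Silent failure modes.

// Illustrative sketch: consumer-side classification using the simplified sAS scale.
// What counts as an "unexpected" exception is learned from the gold run (step 2.1).
public final class SasClassifier {

    public enum Outcome { NO_FAILURE, ABORT, SILENT }

    public static Outcome classify(boolean concludedCorrectly,
                                   boolean unexpectedExceptionRaised,
                                   boolean errorIndicated) {
        if (unexpectedExceptionRaised) {
            return Outcome.ABORT;   // abnormal termination while executing the service
        }
        if (!concludedCorrectly && !errorIndicated) {
            return Outcome.SILENT;  // no error reported for an operation that failed
        }
        return Outcome.NO_FAILURE;
    }

    public static void main(String[] args) {
        // A call that reports no error although it could not be concluded maps to Silent.
        System.out.println(classify(false, false, false));
    }
}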

4 Assessing the Robustness of SOAP Web Services

SOAP web services provide a simple interface between a provider and a consumer and are a strategic means for data exchange and content distribution [9]. In fact, ranging from online stores to media corporations, web services are becoming a key component of organizations' information infrastructures. In these environments the Simple Object Access Protocol (SOAP) [8] is used for exchanging XML-based messages between the consumer and the provider over the network (using, for example, the HTTP or HTTPS protocols). In each interaction the consumer (client) sends a request SOAP message to the provider (the server). After processing the request, the server sends a response message back to the client with the results. A web service may include several operations (in practice, each operation is a method with one or several input parameters) and is described using WSDL [8]. A broker is used to enable applications to find web services. This section proposes a benchmark for the evaluation of the robustness of SOAP web services.

4.1 Robustness Tests Preparation

Before generating and executing the robustness tests we need to obtain some definitions about the web service operations, parameters, data types, and domains. As mentioned before, a web service interface is described as a WSDL file. This file can be processed automatically to obtain the list of operations, parameters (including return values), and associated data types.
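As an illustration of this automatic processing, the minimal sketch below lists the operation names declared in a WSDL 1.1 document using only the standard Java XML APIs. It is our own example rather than the parser used by the robustness-testing tool, and the WSDL URL is a placeholder.

import java.net.URL;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Illustrative sketch: extract the operation names declared in a WSDL 1.1 description.
public final class WsdlOperationLister {

    private static final String WSDL_NS = "http://schemas.xmlsoap.org/wsdl/";

    public static void main(String[] args) throws Exception {
        String wsdlUrl = "http://example.org/service?wsdl"; // placeholder, not a real service
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        Document wsdl = factory.newDocumentBuilder().parse(new URL(wsdlUrl).openStream());

        NodeList operations = wsdl.getElementsByTagNameNS(WSDL_NS, "operation");
        for (int i = 0; i < operations.getLength(); i++) {
            Element operation = (Element) operations.item(i);
            // Operation elements appear under both portType and binding, so each
            // name may be printed more than once.
            System.out.println(operation.getAttribute("name"));
        }
    }
}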

However, the valid values for each parameter (i.e., the domain of the parameter) are typically not deductible from the WSDL description. This way, the benchmark user must provide information on the valid domains for each parameter, including for parameters based on complex data types, which are decomposed into a set of individual parameters (e.g., in the form of intervals for numbers and dates, and regular expressions for strings).

A workload (a set of valid web service calls) is needed to exercise each operation of the web service under testing. As it is not possible to propose a generic workload that fits all web services, we need to generate a specific workload for the web service under testing. Any of the previously mentioned options (real, realistic, or synthetic workloads) can be considered for web services.

4.2 Robustness Tests Execution

A generic setup to benchmark the robustness of web services includes a robustness-testing tool (which in fact represents the benchmark management system) composed of two main components: a workload emulator (WE) that acts as a service consumer by submitting calls (it emulates the workload) and a fault injector (FI) that automatically injects erroneous input parameters. An important aspect is that the source code of the web service under testing is not required for the generation and execution of the robustness tests. This holds for both the provider and the consumer sides.

The execution of the benchmark includes two steps. In the first step, the workload is run without considering invalid call parameters. The goal is to understand the typical behavior expected from the web service (e.g., expected exceptions). This step includes several tests, where each test corresponds to several executions of a given operation. The number of times the operation is executed during each test depends on the size of the workload generated (which was previously specified by the benchmark user).

In the second step of the benchmark, the workload is run in the presence of invalid call parameters (robustness tests) generated by applying the rules presented in Table 1. As shown in Figure 1, this step includes several tests, where each test focuses on a given operation and includes a set of slots. A given slot targets a specific parameter of the operation and comprises several injection periods. In each injection period several faults (of a single type) are applied to the parameter under testing. The system state is explicitly restored at the beginning of each injection period and the effects of the faults do not accumulate across different periods. However, this is mandatory only when the benchmark is used by the web service provider, as in many cases the consumers are not able to restore the web service state, since the service is being tested remotely.

Fig. 1. Execution profile of step 2 of the benchmark.

During fault injection, the fault injector intercepts all SOAP messages sent by the emulated clients (generated by the workload emulator component) to the server. The XML is modified according to the robustness test being performed and then forwarded to the server. The server response is logged by the tool and used later to analyze the service behavior in the presence of the invalid parameters injected.

To improve the coverage and representativeness of the benchmark, the robustness tests are performed in such a way that the following goals/rules are fulfilled: all the operations provided by the web service must be tested, for each operation all the parameters must be tested, and for each parameter all the applicable tests must be considered. For instance, when considering a numeric parameter, the fifteen tests related to numbers must be performed (see Table 1).

4.3 Characterizing the Web Services

When the benchmark is run from the provider point of view, the robustness of the web services is classified according to the sCRASH scale. On the other hand, when web service consumers run the benchmark the sAS scale should be used (see Section 3.3). All failure modes mentioned before can be easily observed by analyzing the logs of the benchmarking tool (for Abort, Silent, and Hindering failures) and the logs of the application server used (for Catastrophic and Restart failures). The classification procedure is quite simple and consists of matching the service's observed behavior to one of the sCRASH or sAS failure modes.

To complement the failure modes characterization we recommend that the benchmark user perform a detailed analysis of the web service behavior in order to better understand the source of the failures (to be able to correct or wrap those problems). As an example, during our experimental evaluation (see Section 4.4) we observed the following main sources of robustness problems:
– Null references: related to null pointers that reflect no or poor input validation in the service provider.
– Database accesses: exceptional behavior caused by invalid SQL commands.
– Arithmetic operations: typically data type overflow in numeric operations.
– Conversion problems: class cast exceptions or numeric conversion problems.
– Other causes: arguments out of range, invalid characters, etc. These were not very frequent and are thus classified in this generic group.

Obviously, the source of failures depends on the web services (e.g., if the web service has no access to a database then invalid database access operations will never be observed), which complicates the definition of a generic classification scale. Thus we do not propose a classification for failure source characterization. The user is free to use the classification introduced above or any other tailored to the specific services being tested.

4.4 Experimental Evaluation and Discussion

Here we demonstrate the web services robustness benchmark considering two different scenarios: evaluation of 116 publicly available services and comparison of two different implementations of the services specified by the TPC-App performance benchmark [42].
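The fault injector of Section 4.2 applies the mutations of Table 1 to parameter values carried inside the intercepted SOAP envelopes. As a minimal, self-contained illustration of such a mutation, the sketch below builds and tampers with a small SOAP request using the SAAJ API (javax.xml.soap, bundled with Java SE up to version 8); the namespace, operation, and parameter names are hypothetical, and the actual tool works as a proxy on messages produced by the emulated clients rather than building them itself.

import java.io.ByteArrayOutputStream;
import javax.xml.namespace.QName;
import javax.xml.soap.MessageFactory;
import javax.xml.soap.SOAPBody;
import javax.xml.soap.SOAPElement;
import javax.xml.soap.SOAPMessage;

// Illustrative sketch: mutate a parameter of a SOAP request (StrPredefined rule).
public final class SoapMutationExample {

    public static void main(String[] args) throws Exception {
        SOAPMessage request = MessageFactory.newInstance().createMessage();
        SOAPBody body = request.getSOAPBody();

        // Hypothetical operation and parameter (they do not refer to any tested service).
        SOAPElement operation =
                body.addChildElement(new QName("http://example.org/ws", "getWeather", "ns"));
        SOAPElement parameter = operation.addChildElement("cityName");
        parameter.addTextNode("Coimbra");        // original, valid value

        parameter.setValue("'#$%&/()=?*+");      // StrPredefined mutation
        request.saveChanges();

        ByteArrayOutputStream out = new ByteArrayOutputStream();
        request.writeTo(out);                    // mutated message, ready to forward
        System.out.println(out.toString("UTF-8"));
    }
}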

The most important component of the experimental setup is the robustness-testing tool. This tool, available at [17] for public use, includes a fault injection component that acts as a proxy: it intercepts all client requests (generated by the workload emulator component), performs a set of mutations on the SOAP message, and forwards the modified message to the server. The tool logs the server responses, which are analyzed at the end of the experiments to characterize the behavior of the web services.

As mentioned above, two experimental scenarios are considered. In the first, we evaluate the robustness of public web services provided by different relevant parties, including one service owned by Microsoft and another by Xara, and several services used in real businesses in the field. In the second scenario, we applied the benchmark to compare two different implementations of the web services specified by the standard TPC-App performance benchmark [42]. The two implementations (implementations A and B) were developed using N-version programming [4]. In both cases the web services were coded by experienced Java programmers with more than two years of experience. As the application server for the experiments we used JBoss [12], which is one of the most complete J2EE application servers in the field and is used in many real scenarios.

4.4.1 Public Web Services Robustness Results

A set of 116 publicly available web services was tested for robustness (most of these services are listed at [46], a website that references publicly available web services). The following points highlight some important aspects concerning the selected web services:
– Several technologies are considered: .NET (74%), Visual Dataflex (6%), Axis (5%), MS Soap (3%), SOAP/AM, Visual FoxPro, Visual Basic (2%), Delphi, Coldfusion, and Glue (1% each), among others. The predominance of .NET services results from the equally high percentage found in the original list at [46].
– The web services are owned by different relevant parties, including Microsoft and Xara.
– Some web services implement the same functionality (e.g., Text Disguise and Free Captcha Service).
– Some web services are used in real businesses in the field (e.g., Portuguese Postal Office Orders Cost, Mexican Postal Codes, and UPS Online Tracking Web-Service).

Table 2 presents the tested web services and characterizes the observed failures (using the sAS scale, as we are testing the web services as consumers) for each parameter in terms of their source (i.e., database access, arithmetic operation, null reference, and others). Although it would require very particular conditions for a web service to show different behaviors for the same test, we repeated each test 3 times. Note that each operation can include a large number of tests, depending on the number and type of the operation's input parameters. We found no differences between the 3 test runs for all 116 web services tested. The table only presents information about services that revealed robustness problems (parameters and operations that did not disclose any robustness problems are also omitted). The number of failures presented in Table 2 represents distinct failures or similar failures caused by distinct mutations¹. Failure causes are differentiated using the classification presented in Section 4.3.

¹ The services were tested between March 2007 and March 2008. Later tests may lead to different results if the providers modify the implementations (e.g., to correct software bugs).

Table 2. Public web services test results. For each web service that revealed robustness problems, the table identifies the publisher, service name, operation, and parameter, and reports the number of observed failures per cause (database access, arithmetic operation, null reference, conversion, others) as well as Silent failures.

A large number of failures was observed (33% of all tested services presented some kind of failure). Most of the failures correspond to the Abort failure mode, which is related to unexpected runtime exceptions raised during the tests. A total of 89 Abort failures (not counting similar errors triggered by different faults for the same parameter in a given operation) were detected: 34% were marked as null references, 33% as SQL problems, 9% as conversion problems, 4% as arithmetic problems, and 20% as others.

A large percentage of web services revealed robustness problems related to null references. It was observed that most of the services assume that their input parameters are not null. However, this may not always be the case, and it may lead to robustness problems that can have even more serious consequences, such as security issues.

An interesting result is that there is a significant number of Abort failures related to database accesses. Through a detailed observation of the exceptions raised we could observe the lack of SQL prepared statements. In most cases, this was observed when performing the Replace by predefined string mutation using an apostrophe character (a character frequently used in strings in SQL statements), suggesting the presence of SQL Injection vulnerabilities. The typical cause for failures related to arithmetic problems was the use of the maximum value for a numeric type, which resulted in overflows in arithmetic operations. Several other issues were observed, such as:
– Unexpected cast exceptions in the Microsoft service whenever an overflowed Boolean was used as input.
– In the 'Recipes' web service the access to a node of a Red-Black tree object is affected; fortunately, the operation appears to carry no further consequences.
– Arguments out of range (clearly visible in the 'Search for Icelandic Person' service) and other exceptional behavior not properly handled by the service implementations.
– Other abnormal behaviors such as Invalid Memory Reference for a Write Access and Internal Server Error.

Additionally, some Silent failures were observed for two web services: WebService Documentation (owned by west-wind.com) and Code 39 Bar Code (owned by webservicex.net). For the former, robustness tests consisting in the replacement of any parameter by a null value always led to an absence of response from the server. For the latter, an overflowed string in the Title parameter led the server to report a null reference exception; however, web service requests submitted immediately after that Abort failure remained unanswered (which is understood as a Silent failure).

An interesting aspect is that, although implementing the same functionality, Text Disguise and Free Captcha Service present different robustness results (2 and 0 failures observed, respectively). This way, it is clear that, from a consumer point of view, Free Captcha Service is a better choice when considering robustness aspects. The observed problems reveal that the services were deployed without being properly tested, and that our approach can expose the insufficiencies of the tests performed by providers. It is important to emphasize that robustness problems can gain a higher importance if we consider that some can also represent security problems that can be used to exploit or attack the problematic web services (e.g., using SQL Injection).
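Since many of the database-related Abort failures point to the absence of prepared statements, the generic JDBC sketch below (our own illustration; the table and column names are hypothetical and unrelated to the tested services) contrasts the vulnerable pattern, which breaks as soon as the predefined-string mutation injects an apostrophe, with the parameterized alternative that treats the mutated value as plain data.

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

// Illustrative sketch: why the apostrophe mutation exposes missing prepared statements.
public final class CityLookupDao {

    // Vulnerable: an input such as "Lisbon'--" changes the SQL statement itself,
    // typically causing a syntax error (Abort failure) or enabling SQL Injection.
    public static ResultSet findByCityConcatenated(Connection con, String city) throws SQLException {
        Statement stmt = con.createStatement();
        return stmt.executeQuery("SELECT * FROM cities WHERE name = '" + city + "'");
    }

    // Robust: the value is passed as a parameter, so special characters are just data.
    public static ResultSet findByCityPrepared(Connection con, String city) throws SQLException {
        PreparedStatement stmt = con.prepareStatement("SELECT * FROM cities WHERE name = ?");
        stmt.setString(1, city);
        return stmt.executeQuery();
    }
}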

4.4.2 TPC-App Web Services Robustness Results

The second experimental scenario consisted in evaluating the robustness of two implementations of a representative subset of the TPC-App web services (each web service implemented a single operation). A summary of the test results is presented in Table 3. Each table row presents the results for a given parameter of a particular web service and includes a small failure description. As in the previous experiments, the number of failures in this table represents distinct failures or similar failures caused by distinct mutations.

As shown in Table 3, robustness problems related to Abort failures were observed for both solutions. An interesting robustness problem was observed for the newCustomer service in implementation A. Although code targeting the validation of the contactEmail parameter was in place, overly long email addresses caused the web service to throw a StackOverflowException. After some analysis of the code we concluded that the problem resided in the external API that was being used to validate email addresses (Jakarta Commons Validator 1.3.0 [3]). This shows that robustness problems may occur even when programmers pay great attention to code correctness. In fact, the use of third-party software (as is the case in this example) may raise problems that are not obvious to programmers. Furthermore, this type of error can easily appear or disappear when an apparently harmless update is made to the external libraries commonly required by projects. However, such errors can be easily detected with the help of our robustness-testing approach.

Table 3. Robustness problems observed for each TPC-App implementation.

Web Service         | Target Parameter | Impl. A: # Abort Failures | Impl. A: Unexpected Exception | Impl. B: # Abort Failures | Impl. B: Unexpected Exception
changePaymentMethod | customerID       | 1 | null pointer   | 1 | null pointer
changePaymentMethod | paymentMethod    | 1 | null pointer   | 1 | null pointer
changePaymentMethod | creditInfo       | 1 | null pointer   | 0 | -
changePaymentMethod | poId             | 1 | null pointer   | 0 | -
newCustomer         | billingAddr1     | 1 | null pointer   | 1 | null pointer
newCustomer         | billingAddr2     | 1 | null pointer   | 1 | null pointer
newCustomer         | billingCity      | 1 | null pointer   | 1 | null pointer
newCustomer         | billingCountry   | 1 | null pointer   | 0 | -
newCustomer         | billingState     | 1 | null pointer   | 1 | null pointer
newCustomer         | billingZip       | 1 | null pointer   | 1 | null pointer
newCustomer         | businessInfo     | 1 | null pointer   | 0 | -
newCustomer         | businessName     | 1 | null pointer   | 0 | -
newCustomer         | contactEmail     | 1 | stack overflow | 0 | -
newCustomer         | contactFName     | 1 | null pointer   | 0 | -
newCustomer         | contactLName     | 1 | null pointer   | 0 | -
newCustomer         | contactPhone     | 1 | null pointer   | 0 | -
newCustomer         | creditInfo       | 1 | null pointer   | 0 | -
newCustomer         | password         | 1 | null pointer   | 0 | -
newCustomer         | paymentMethod    | 1 | null pointer   | 1 | null pointer
newCustomer         | poId             | 0 | -              | 0 | -
newCustomer         | subjectString    | 0 | -              | 0 | -
newProducts         | cutOffDuration   | 0 | -              | 1 | null pointer
newProducts         | itemLimit        | 0 | -              | 1 | null pointer
productDetail       | itemIds          | 0 | -              | 1 | null pointer

As we can see from the analysis of the results in Table 3, from a consumer standpoint the best option would be to choose the newProducts and productDetail services from implementation A, and the changePaymentMethod and newCustomer services from implementation B. From the provider point of view, it is clear that some software improvements are needed in order to resolve the robustness problems detected. In fact, after performing the robustness tests we forwarded the results to the programmers in order to get the implementations improved (this is what is expected when robustness problems are detected). Through a detailed analysis of the results the programmers were able to identify solutions for the existing software faults and two new versions were developed. The robustness tests were then executed for these new versions and no robustness failures were observed. This shows that this type of testing can be an important tool for programmers to improve the robustness of their solutions.

In this experiment we performed the tests as providers, as we had the source code of the web services being tested. Although only Abort failure modes were observed, we believe that all the failure modes defined by the sCRASH scale are useful for providers. Our results are dependent on the tested implementations and cannot be generalized. Additionally, we believe that there are many services available that cause the application server to crash or hang when invalid parameters are provided. An excellent example can be found in the following section, where catastrophic failures (unobserved up to this point) have been disclosed in JMS middleware.

5 Assessing the Robustness of JMS Middleware

This section proposes an experimental approach for the evaluation of the robustness of Java Message Service (JMS) middleware. The use of JMS is growing quite fast, as it provides a very powerful and easy way for communication and integration in Java enterprise applications [38]. Java Message Service is a Java Application Programming Interface (API) that implements a standard for enterprise messaging. In practice, a JMS provider is a middleware component that provides and manages messaging services. There are many open source and proprietary providers available today (e.g., Apache ActiveMQ, JBoss Messaging, WebSphere MQ, Open MQ).

In a JMS-based environment, a middleware component provides services to create, send, receive, and read messages. Client applications produce and consume those messages by interacting with the services supplied by the middleware. Two messaging models are supported by JMS [38]:
– The point-to-point model is based on message queues, senders, and receivers. Senders send messages to queues, which make those messages available to receivers. Queues retain all messages until they are consumed or until they expire.
– The publish/subscribe model is based on the concept of topics. A publisher sends a message to a topic, which distributes the message to multiple subscribers. Conceptually the subscriber receives the message only if it is online at the time it is published, but the JMS API allows durable subscriptions, loosening this restriction (see the sketch after this list).
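As an illustration of the durable subscriptions mentioned above, the sketch below receives one message from a durable topic subscription. It assumes a JMS 1.1 provider and a ConnectionFactory obtained elsewhere (e.g., via JNDI); the client identifier, subscription, and topic names are illustrative.

import javax.jms.Connection;
import javax.jms.ConnectionFactory;
import javax.jms.Message;
import javax.jms.MessageConsumer;
import javax.jms.Session;
import javax.jms.Topic;

// Illustrative sketch: a durable subscriber in the publish/subscribe model (JMS 1.1).
public final class DurableSubscriberExample {

    public static Message receiveOne(ConnectionFactory factory) throws Exception {
        Connection connection = factory.createConnection();
        connection.setClientID("robustness-bench-subscriber"); // required for durable subscriptions
        Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
        Topic topic = session.createTopic("benchmark.notifications");

        // Messages published to the topic while this subscriber is offline are retained
        // by the provider and delivered on the next connection.
        MessageConsumer subscriber = session.createDurableSubscriber(topic, "bench-subscription");
        connection.start();
        Message message = subscriber.receive(5000); // wait up to 5 seconds
        connection.close();
        return message;
    }
}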

5.1 Fault Model

In a JMS environment a connection factory creates connections that can be used to produce sessions. Each session is able to create message producers or consumers, which in turn have the ability to send/receive messages to/from a destination [38]. The robustness-testing approach is obviously dependent on the JMS environment specificities and consists of modifying messages immediately before they are sent from a producer to a destination, which may be linked to one consumer (in the case of point-to-point messaging) or to several consumers (in the case of publish/subscribe messaging). Message parameter modification is based on the set of mutation rules presented in Table 1, including extensions that take the JMS specificities into consideration. There are three main parts in a JMS message:
– Header: contains fields used to identify and route messages.
– Properties: an extension of the header that provides additional functionality when required by the business logic of client applications.
– Body: the message content itself.

The fault model proposed considers not only the parameters included in these three parts, but also the multiple ways of setting them (e.g., setting properties using the several different methods available: setBooleanProperty, setStringProperty, etc.).

The JMS specification defines several message header fields [38]. Some of those fields are intended to be set by the client application. For instance, a client may need to set the JMSReplyTo field to specify the object to which a reply to the message should be sent. However, there are other fields that are set automatically by the JMS implementation just after the client sends the message (by executing the 'send' or 'publish' method) and before that message is delivered to the destination. The JMS implementation must handle and correctly set all fields that are its own responsibility, even if the client did set them. For example, if a client sets the JMSExpiration field, the JMS send or publish method will overwrite it. The JMS specification defines the whole set of header fields available, their types, and how they are set [38]. It is important to emphasize that header fields are parameters that robustness testing must cover, as they define application-level behaviors that need to be tested under faulty conditions.

Message properties are also relevant for robustness testing. All allowed types and conversions are defined by the JMS specification. For these fields, a value written using a given data type can be read as one or more data types. For example, after setting a boolean property it is possible to retrieve it using the getBooleanProperty method or the getStringProperty method (in the latter case it returns a string representation of the boolean value). The conversions not marked as valid in the JMS specification must throw a JMSException, as they represent exceptional behavior (e.g., setting a byte property and then retrieving the value as a boolean, float, or double should raise a JMSException). In addition, it is possible to set an object of an unspecified type by using the setObjectProperty method (the acceptable object types are limited to the objectified primitive types) and retrieve it with a suitable getter method. In other words, the setObjectProperty method can be used to set, for instance, a Boolean, which can then be retrieved either by using the getObjectProperty method or getBooleanProperty. Testing the correctness of these conversions requires a later verification of the value set.
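The sketch below illustrates this kind of conversion check on a newly created message obtained from a session of the provider under test (the property names are illustrative). According to the specification, reading a boolean property as a string must succeed, whereas reading a byte property as a boolean must raise a JMSException; a provider that accepts the invalid conversion would be flagged as non-conformant.

import javax.jms.JMSException;
import javax.jms.Message;

// Illustrative sketch: verifying property-conversion behavior mandated by the JMS specification.
public final class PropertyConversionCheck {

    public static void check(Message message) throws JMSException {
        message.setBooleanProperty("flag", true);
        // Allowed conversion: a boolean property can also be read as its string representation.
        System.out.println("flag as string: " + message.getStringProperty("flag"));

        message.setByteProperty("small", (byte) 1);
        try {
            // Disallowed conversion: reading a byte property as a boolean must fail.
            message.getBooleanProperty("small");
            System.out.println("NON-CONFORMITY: invalid conversion did not raise JMSException");
        } catch (JMSException expected) {
            System.out.println("Conversion correctly rejected: " + expected.getMessage());
        }
    }
}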
The goal is to check whether the JMS middleware is accurately following the specification. Any non-compliance between the JMS specification and the way it is implemented may compromise the client application's business logic. Thus, properties conversion testing is a key aspect that must be included in a robustness test.

Several data types are allowed for the message body: Stream, Map, Text, Object, or Bytes. At first sight, it may seem less interesting to mutate the message body to exceptional values; in fact, this is most likely to trigger problems in the client applications instead of triggering faults in the middleware itself. However, for the sake of completeness we have decided to consider tests that use valid or null objects for the body.

5.2 Robustness Tests Profile

Our proposal for JMS middleware robustness testing includes three major phases: Phase 1) valid messages are created and sent; Phase 2) faults are injected in the messages being sent; and Phase 3) regular message creation and sending with the goal of exposing errors not revealed in Phase 2. Phase 1 is executed once at the beginning of the tests. Phases 2 and 3 are repeated several times, once for each parameter being tested (this includes header fields, properties, and body). This execution profile is illustrated in Figure 2.

Fig. 2. Robustness-testing profile.

The first phase consists of sending a set of valid messages to a given destination and verifying their delivery at the consumer. Again, any of the previously indicated workload types (real, realistic, or synthetic) can be used, and the goal is to understand the behavior of the JMS implementation without (artificial) faults. This phase represents the golden run that allows checking whether subsequent robustness tests affect message sending and delivery.
In the second phase, a parameter that has not yet been tested is chosen from the list of available parameters. According to the type of the selected parameter, a set of mutation rules is applicable. For instance, if the parameter is numeric, then all the mutations applicable to numbers are applied (see Table 1). A single rule is selected in each injection period, and all messages produced in that period are tampered with according to that rule. Note that all fault types applicable to the parameter in question are eventually considered (one fault type per injection period). Message delivery at the consumer is verified for correctness and consistency.
In the third phase, we let the system run for a predefined period (at least twice the time necessary to run the full workload) in order to check for abnormal behavior not noticed in Phase 2. Notice that a fault may be silently triggered during Phase 2 and only exposed in Phase 3. Additionally, even if a fault is clearly exposed during the injection period, it may or may not persist into Phase 3; persistence may indicate a more severe problem.
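The control flow for one parameter can be outlined roughly as follows. This is only a sketch of the execution profile: the mutation rules, workload execution and delivery checks are abstracted behind hypothetical types and helper methods, not actual benchmark code.

import java.util.List;
import java.util.function.Consumer;

// Outline of the per-parameter execution profile (Phases 2 and 3 of Figure 2).
// All types and helpers are hypothetical stand-ins for the benchmark infrastructure;
// only the control flow mirrors the profile described in the text.
public class RobustnessTestDriver {

    interface MutationRule { void apply(Object message, String parameter); }
    interface Workload {
        void run(Consumer<Object> perMessageAction);          // one full workload execution
        void runFor(int multiplier, Consumer<Object> action); // regular sending, repeated
    }

    void testParameter(String parameter, List<MutationRule> applicableRules, Workload workload) {
        for (MutationRule rule : applicableRules) {
            // Phase 2: one injection period per rule; every message produced in this
            // period is tampered with according to the selected rule before sending.
            workload.run(message -> {
                rule.apply(message, parameter);  // e.g., set to null, empty or overflown value
                sendAndVerify(message);
            });

            // Phase 3: regular (fault-free) sending for at least twice the workload
            // duration, to expose errors silently triggered during Phase 2.
            workload.runFor(2, this::sendAndVerify);

            // Restore the system state so the next rule/parameter starts with no
            // accumulated effects.
            restoreSystemState();
        }
    }

    private void sendAndVerify(Object message) { /* provider-specific send + delivery/content check */ }
    private void restoreSystemState() { /* provider-specific cleanup (e.g., purging destinations) */ }
}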

In all three phases, after each message is sent, it is received and its content is verified. The only exception to this rule is when mutations are being applied to the JMSMessageID: as this is a unique identifier, it is interesting to send two consecutive messages with the same identifier before trying to receive any, so that we can see whether duplicated identifiers affect message delivery. After Phase 3, the system state is restored so that testing can continue with no accumulated effects. Testing then moves on to the following parameter (as the system was restored, parameter order should have no effect) by repeating Phases 2 and 3. The process is fully automated and repeats until there are no more parameters to test.

5.3 Failure Modes

The robustness classification of JMS middleware is also based on the sCRASH scale [15]. An important aspect is that the benchmark can also be used to uncover robustness-related compliance problems between the specification and the JMS implementation. For instance, the specification states that a JMSException must be thrown if the JMS provider fails to send a message due to some internal error. If one of the injected faults causes this behavior while the parameter value is acceptable (as defined by the specification), this represents a non-conformity. Thus, the following classification of non-conformity (or non-compliance) observations is proposed (note that this is an instantiation of the scale presented in Section 3.3 for the JMS context):
– Level 1: A severe non-conformity was observed. The problem affects basic messaging services (e.g., the inability to send or receive messages, or a part of the specification that is ignored).
– Level 2: A medium-severity non-conformity was observed. This can be any non-conformity that is not severe (as defined by Level 1) and is not related to extra functionality (functionality not specified by the standards that was introduced by the JMS provider). It can also represent any kind of restriction over the specification (e.g., restricting the domain of a value to an interval smaller than the one defined in the specification).
– Level 3: Non-conformities that result from an effort to extend the JMS specification (e.g., permitting additional data types as properties).
Note that classifying a given non-conformity is a complex task and depends on the person performing the analysis. Nevertheless, we believe that this straightforward classification is useful to understand the criticality of specification compliance problems.

5.4 Experimental Results and Discussion

In this section we present an experimental evaluation of three different JMS providers. As the JBoss Application Server [12] is one of the most widely used application servers on the market, we decided to test two major versions of JBoss MQ [27] for robustness problems. We also tested the Apache Software Foundation project ActiveMQ [1], given its popularity in the open source community.

Table 4. Problems detected in the JMS implementations.

Provider            | Robustness issues (type, #) | Compliance issues (type, #) | Security issues
JBoss MQ 4.2.1.GA   | Catastrophic, 1             | Level 2, 4                  | DoS attacks
JBoss MQ 3.2.8.SP1  | Catastrophic, 1             | Level 2, 4                  | DoS attacks
ActiveMQ 4.1.1      | Silent, 1                   | Level 2, 4                  | Message suppression

Most JMS implementations are application server independent (i.e., a given JMS implementation can be used in any application server). However, in real scenarios the JMS provider used is typically the one built into the application server being employed (otherwise a serious configuration effort is normally involved). Accordingly, we tested each JMS implementation in its most commonly used container: JBoss AS for JBoss MQ and Apache Geronimo [2] for ActiveMQ.
Several robustness problems were exposed. These problems are of extreme importance, as they also represent major security vulnerabilities. Several minor non-conformities were also found in all three providers. All providers revealed Level 2 non-conformities, and ActiveMQ added a Level 3 non-compliance issue to these. Table 4 summarizes the results, which are discussed in detail in the next sections.

5.4.1 JBoss MQ 4.2.1.GA Results

JBoss MQ is a production-ready messaging product available from JBoss. According to the developers, it was submitted to multiple internal tests and to Sun Microsystems' compliance tests (it is 100% Sun Compatibility Test Suite (CTS) [19] JMS compliant). Considering this scenario, we were expecting to find none or a very small number of robustness problems. In fact, JBoss MQ passed all the robustness tests related to properties and body, and almost all the tests related to the header fields. However, it failed in the tests where JMSMessageID was set to null.
At first, sending a message with this field set to null appeared to cause no harm to the JMS provider, as the message was correctly stored in the internal JBoss JDBC (Java Database Connectivity) database. However, when a receiver tried to read the message an unexpected null pointer exception was thrown, and any subsequent reads of that message were unable to succeed. Furthermore, regular subsequent messaging operations were also affected: consumers were unable to retrieve any valid messages sent afterwards. This failure was classified as Catastrophic, as a corruption of the JMS server occurred. In fact, a restart of the application server was unable to restore the normal behavior of the JMS provider. The only way we found to recover was to manually delete the invalid message information from the repository. This was possible because JBoss MQ uses a JDBC database that is easy to access and modify using SQL (Structured Query Language); if JBoss used a proprietary repository, we would have had to delete the whole message repository. Either way, both solutions are totally unacceptable and difficult to manage.
As this represents a severe failure, we analyzed the JBoss MQ source code in order to find the root cause of the problem. We concluded that, although the JMSMessageID represents a unique identifier for the message, JBoss uses another variable to uniquely identify each message. This allows messages with a null JMSMessageID to be sent and stored. However, when a receiver fetches a message from the repository, JBoss MQ tries to load messages into memory in a Map structure that uses a hash of the JMSMessageID as key (and the whole message as value). This of course produces a null pointer exception, as the hash code of a null reference cannot be computed. This serious problem also exposes a severe security issue that can be exploited to cause Denial-of-Service: it is quite easy for a malicious user to generate a message with a null JMSMessageID and send it to the JMS provider to crash it. This shows the usefulness of our approach in discovering possible security vulnerabilities.
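Conceptually, the test case behind this failure is very simple. Note, however, that the JMSMessageID is normally assigned inside the send call, so the benchmark applies the mutation in the sending path, after the provider fills in its headers; the fragment below only sketches the shape of the test, and the injection hook, queue and connection setup are hypothetical.

import javax.jms.Connection;
import javax.jms.ConnectionFactory;
import javax.jms.Message;
import javax.jms.MessageConsumer;
import javax.jms.MessageProducer;
import javax.jms.Queue;
import javax.jms.Session;
import javax.jms.TextMessage;

public class NullMessageIdTest {

    // Hypothetical stand-in for the benchmark's injection hook. Here it simply applies
    // the mutation before send(); in the actual benchmark the mutation is applied inside
    // the sending path, after the provider has assigned the provider-set header fields
    // (otherwise send() would just overwrite the value).
    interface Mutation { void apply(Message m) throws Exception; }
    static void injectOnSend(Message m, Mutation mutation) throws Exception {
        mutation.apply(m);
    }

    // factory and queue are assumed to be obtained (e.g., via JNDI) for the provider under test.
    static void run(ConnectionFactory factory, Queue queue) throws Exception {
        Connection connection = factory.createConnection();
        connection.start();
        Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);

        MessageProducer producer = session.createProducer(queue);
        TextMessage message = session.createTextMessage("payload");

        injectOnSend(message, m -> m.setJMSMessageID(null));   // the mutation under test
        producer.send(message);

        // In JBoss MQ the message was stored, but consuming it (and any later valid
        // message) failed with a NullPointerException on the provider side.
        MessageConsumer consumer = session.createConsumer(queue);
        Message received = consumer.receive(2000);              // expect the message or a clean error
        System.out.println("Received: " + received);

        connection.close();
    }
}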

Some minor compliance issues were also detected in JBoss MQ 4.2.1.GA. The overflow mutation (i.e., overwriting the parameter with an overflown string) applied to the JMSMessageID, JMSCorrelationID and JMSType fields caused a JMSException wrapping a UTFDataFormatException. Throwing a JMSException is in line with the standard, which specifies that this exception must be thrown if the provider fails to send the message due to some internal error. Nevertheless, the specification does not impose the restriction that the value of these header fields has to be handled as a UTF string; this is the cause of the UTFDataFormatException, as the method chosen to write the string imposes a limit on its length [11]. Choosing a different writing method would easily solve this issue. The same happens when setting a property name with an overflown string (it does not happen when setting the value of the property). These are both Level 2 compliance issues, as they represent non-severe deviations from what is defined by the JMS specification.

5.4.2 JBoss MQ 3.2.8.GA Results

Legacy or older application servers are still frequently used in enterprises, as upgrading a system to a newer version may have a high cost. For this reason, we found it interesting to test the previous JBoss MQ major version. With this experiment we also wanted to verify whether the serious robustness problem detected in the latest version also existed in the previous one.
The same Catastrophic robustness failure occurred in this version. In fact, the results observed (for both robustness and non-conformities) were exactly the same for the two JBoss MQ versions tested. Note that, even though this middleware is normally submitted to a large battery of tests (as stated by the developers), a major robustness problem and security vulnerability passed silently from one version to the next. This is something that could have been easily avoided by testing the middleware using our approach.

5.4.3 ActiveMQ 4.1.1 Results

Apache ActiveMQ is a widely used open source JMS implementation, for which we were also expecting to discover few or no robustness problems. Testing the message properties and body revealed no robustness problems. However, a failure was triggered while setting the JMSMessageID to a predefined value. Mutating the JMSMessageID of a single message posed no problem.

However, tampering in the same way with a second message, before the first was delivered to the consumer (and, consequently, removed from the repository), caused the first message to be overwritten. Note that the JMS specification states that the JMSMessageID field contains a value that uniquely identifies each message. Overwriting messages is unacceptable behavior, as it may lead to message corruption and loss. After analyzing the source code, we concluded that this implementation uses this field as the unique identifier of messages. Although this is an acceptable and natural approach, exceptional cases such as duplicated values should be checked. This failure was classified as Silent (an operation was incorrectly completed and no error was indicated) and may also represent a security issue: attacks that generate duplicate message IDs are quite easy to mount, and the affected messages silently disappear from the provider's repository and are never delivered to consumers.
Similarly to JBoss MQ, some minor compliance issues were detected. JMSMessageID, JMSCorrelationID and JMSType are susceptible to the overflow mutation. In fact, this implementation imposes a limit on the size of these fields (2^15 - 1 characters) that differs from the one imposed by JBoss MQ (and both are incorrect, as no limit is defined in the standard). The same limitation applies to the size of property keys and, when violated, invalidates message delivery. These are Level 2 non-conformity issues, as they correspond to non-severe restrictions of the standard. Another compliance issue was detected while testing the message properties. The JMS specification states that [38]: "The setObjectProperty method accepts values of class Boolean, Byte, Short, Integer, Long, Float, Double, and String. An attempt to use any other class must throw a JMSException." This was not the case with this implementation: it accepts a few other data types (Map, List and Date). This represents a Level 3 non-conformity, as it is clearly a specification extension. Note that a generic JMS-based application may have problems if it is built upon the assumption that these data types are not allowed.

5.4.4 Additional Comments on the JMS Robustness Testing Results

Although only a subset of the failure modes was observed (Catastrophic and Silent failures), we believe that all the failure modes defined by the sCRASH scale are useful. Indeed, some preliminary tests on the latest Java EE 5 [36] JMS reference implementation (OpenMQ 4.1 [40]) exposed an Abort failure. In this case, setting an object property with null as value exposes a robustness problem. This is caused by the fact that this JMS stack checks whether the object type is one of those permitted by the specification; if it is not an allowed type (as is the case of null, which has no type), it tries to build a String (for logging purposes) with information about the object itself. This produces a null pointer exception, as there is no object to extract information from.
The information provided by our approach is highly useful for choosing a robust JMS provider. Additionally, it can provide valuable information to vendors, allowing them to improve the quality of their solutions by fixing potential robustness and security problems. Finally, it is useful to gain insight into the level of compliance of a given implementation with the JMS specification (valuable for consumers, who can choose the implementation that shows fewer compliance disparities, and for middleware developers, who can improve the level of conformity to the standards).
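As a closing illustration of the property-related checks mentioned above (the data types accepted by ActiveMQ and the null-value case that affected OpenMQ), a minimal client-side sketch against the standard javax.jms API could look as follows; the expected behavior noted in the comments is what the specification requires, not what every provider actually does.

import java.util.Date;
import javax.jms.JMSException;
import javax.jms.Message;
import javax.jms.Session;

public class ObjectPropertyChecks {

    static void check(Session session) throws JMSException {
        Message msg = session.createTextMessage();

        // Allowed by the specification: objectified primitive types and String.
        msg.setObjectProperty("allowed", Integer.valueOf(42));

        // Any other class must raise a JMSException per the specification.
        // ActiveMQ 4.1.1 accepted Map, List and Date here (a Level 3 extension).
        try {
            msg.setObjectProperty("notAllowed", new Date());
            System.out.println("Extension/non-conformity: Date accepted as object property");
        } catch (JMSException expectedPerSpec) {
            // Spec-compliant behavior.
        }

        // Exceptional input: a null value. In OpenMQ 4.1 this exposed an Abort failure
        // (an internal NullPointerException) instead of a controlled error.
        try {
            msg.setObjectProperty("nullValue", null);
            System.out.println("Null object property was accepted");
        } catch (JMSException reported) {
            System.out.println("Null object property rejected with a JMSException");
        } catch (RuntimeException robustnessProblem) {
            System.out.println("Robustness problem exposed: " + robustnessProblem);
        }
    }
}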

6 Conclusion

This chapter proposes an experimental approach for the robustness evaluation of Service Oriented Architecture environments. The approach consists of a set of robustness tests that are used to trigger faulty behaviors in services and middleware. Two robustness benchmarks targeting the two key SOA technologies, web services and Java Message Service, were defined and experimentally illustrated. Concerning web services, several robustness problems were revealed in both publicly available and in-house web services. Regarding JMS, our approach proved extremely useful, as it revealed programming and design errors, major security issues, and several non-conformities in three JMS implementations. A failure mode classification scheme was also defined. It enables the categorization of the robustness and compliance properties of the target systems and completes our methodology by providing a simple way to categorize the problems triggered by our robustness-testing approach.

References
1. Apache Software Foundation: Apache ActiveMQ, http://activemq.apache.org/
2. Apache Software Foundation: Apache Geronimo, http://geronimo.apache.org/
3. Apache Software Foundation: Jakarta Commons Validator, http://jakarta.apache.org/commons/validator/
4. Avizienis, A.: The Methodology of N-Version Programming. In: Lyu, M. R. (ed.) Software Fault Tolerance, chapter 2, pp. 23-46. Wiley (1995)
5. Carrette, G. J.: CRASHME: Random Input Testing (1996), http://people.delphi.com/gjc/crashme.html
6. Chen, S., Greenfield, P.: QoS Evaluation of JMS: An Empirical Approach. 37th Annual Hawaii Intl Conf. on System Sciences (HICSS'04), Track 9, Volume 9, p. 90276.2 (2004)
7. Chillarege, R.: Orthogonal Defect Classification. In: Lyu, M. (ed.) Handbook of Software Reliability Engineering, chapter 9. IEEE Computer Society Press, McGraw-Hill (1995)
8. Curbera, F. et al.: Unraveling the Web Services Web: An Introduction to SOAP, WSDL, and UDDI. IEEE Internet Computing, vol. 6, pp. 86-93 (2002)
9. Erl, T.: Service-Oriented Architecture: Concepts, Technology, and Design. Prentice Hall, Upper Saddle River (2005)
10. Fabre, J.-C., Salles, F., Rodríguez, M., Arlat, J.: Assessment of COTS Microkernels by Fault Injection. 7th IFIP Working Conference on Dependable Computing for Critical Applications (DCCA-7), San Jose, CA, USA (1999)
11. Green, R.: UTF: Java Glossary. Roedy Green's Java & Internet Glossary, http://mindprod.com/jgloss/utf.html
12. JBoss: JBoss Application Server Documentation Library, http://labs.jboss.com/portal/jbossas/docs
13. Kalyanakrishnam, M., Kalbarczyk, Z., Iyer, R.: Failure Data Analysis of a LAN of Windows NT Based Computers. 18th Symposium on Reliable Distributed Systems (SRDS-18), Switzerland, pp. 178-187 (1999)
14. Koopman, P. et al.: Comparing Operating Systems Using Robustness Benchmarks. 16th Symposium on Reliable Distributed Systems, pp. 72-79 (1997)
15. Koopman, P., DeVale, J.: Comparing the Robustness of POSIX Operating Systems. 29th Annual Intl Symposium on Fault-Tolerant Computing, Digest of Papers, pp. 30-37 (1999)
16. Laranjeiro, N., Vieira, M., Madeira, H.: Experimental Robustness Evaluation of JMS Middleware. IEEE Services Computing Conference (SCC 2008), pp. 119-126 (2008)
17. Laranjeiro, N., Canelas, S., Vieira, M.: wsrbench: An On-Line Tool for Robustness Benchmarking. IEEE Services Computing Conference (SCC 2008), pp. 187-194 (2008)
18. Lee, I., Iyer, R. K.: Software Dependability in the Tandem GUARDIAN System. IEEE Transactions on Software Engineering, vol. 21, no. 5, pp. 455-467 (1995)
19. Maron, J.: An Overview of the CTS for J2EE Component Developers, http://www2.syscon.com/ITSG/virtualcd/Java/archives/0701/maron/index.html
20. Marsden, E., Fabre, J.-C.: Failure Mode Analysis of CORBA Service Implementations. IFIP/ACM Intl Conf. on Distributed Systems Platforms (Middleware'01), Germany (2001)
21. Mendonça, M., Neves, N.: Robustness Testing of the Windows DDK. 37th Annual IEEE/IFIP Intl Conference on Dependable Systems and Networks (DSN 2007), pp. 554-564 (2007)
22. Menth, M. et al.: Throughput Performance of Popular JMS Servers. Joint International Conference on Measurement and Modeling of Computer Systems, pp. 367-368 (2006)
23. Microsoft Corporation: Web Services Performance: Comparing Java 2 Enterprise Edition (J2EE platform) and the Microsoft .NET Framework - A Response to Sun Microsystems' Benchmark (2004)
24. Miller, B. P., Koski, D., Lee, C. P., Maganty, V., Murthy, R., Natarajan, A., Steidl, J.: Fuzz Revisited: A Re-examination of the Reliability of UNIX Utilities and Services. University of Wisconsin, USA, Research Report CS-TR-95-1268 (1995)
25. Mukherjee, A., Siewiorek, D. P.: Measuring Software Dependability by Robustness Benchmarking. IEEE Transactions on Software Engineering, 23 (6) (1997)
26. Pan, J., Koopman, P. J., Siewiorek, D. P., Huang, Y., Gruber, R., Jiang, M. L.: Robustness Testing and Hardening of CORBA ORB Implementations. Intl Conference on Dependable Systems and Networks (DSN 2001), Gothenburg, Sweden, pp. 141-150 (2001)
27. Red Hat Middleware: JBossMQ, http://www.jboss.org/ (2007)
28. Rodríguez, M., Albinet, A., Arlat, J.: MAFALDA-RT: A Tool for Dependability Assessment of Real-Time Systems. IEEE/IFIP International Conference on Dependable Systems and Networks (DSN 2002), Bethesda, MD, USA (2002)
29. Rodríguez, M., Salles, F., Fabre, J.-C., Arlat, J.: MAFALDA: Microkernel Assessment by Fault Injection and Design Aid. 3rd European Dependable Computing Conference (EDCC-3) (1999)
30. Shelton, C., Koopman, P., DeVale, K.: Robustness Testing of the Microsoft Win32 API. Intl Conf. on Dependable Systems and Networks (DSN 2000), New York, USA (2000)
31. Siblini, R., Mansour, N.: Testing Web Services. 3rd ACS/IEEE International Conference on Computer Systems and Applications, p. 135 (2005)
32. Siewiorek, D. P., Hudak, J. J., Suh, B.-H., Segall, Z.: Development of a Benchmark to Measure System Robustness. 23rd International Symposium on Fault-Tolerant Computing (FTCS-23), Toulouse, France, pp. 88-97 (1993)
33. Standard Performance Evaluation Corporation (SPEC), http://www.spec.org
34. SPEC: SPECjAppServer2004, http://www.spec.org/jAppServer2004/
35. SPEC: SPECjms2007, http://www.spec.org/jms2007/
36. Sun Microsystems Inc.: Java Enterprise Edition, http://java.sun.com/javaee/
37. Sun Microsystems Inc.: High Performance JMS Messaging: A Benchmark Comparison of Sun Java System Message Queue and IBM WebSphere MQ (2003)
38. Sun Microsystems Inc.: Java Message Service (JMS), Sun Developer Network, http://java.sun.com/products/jms/
39. Sun Microsystems Inc.: Web Services Performance: Comparing Java 2 Enterprise Edition (J2EE platform) and .NET Framework (2004)
40. Sun Microsystems Inc.: Open Message Queue, https://mq.dev.java.net/
41. Transaction Processing Performance Council (TPC), http://www.tpc.org
42. Transaction Processing Performance Council: TPC Benchmark App (Application Server) Standard Specification, Version 1.1, http://www.tpc.org/tpc_app/
43. Vieira, M., Laranjeiro, N., Madeira, H.: Assessing Robustness of Web Services Infrastructures. 37th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN 2007), pp. 131-136 (2007)
44. Vieira, M.: Dependability Benchmarking for Transactional Systems. PhD dissertation, University of Coimbra (2005)
45. Weyuker, E.: Testing Component-Based Software: A Cautionary Tale. IEEE Software (1998)
46. XMethods.net: XMethods, http://xmethods.net/
47. Xu, W. et al.: Testing Web Services by XML Perturbation. 16th IEEE International Symposium on Software Reliability Engineering, 10 pp. (2005)