Automated Continuous Testing of Multi-Agent Systems

Cu D. Nguyen, Anna Perini, and Paolo Tonella
Center for Scientific and Technological Research (ITC-irst)
Fondazione Bruno Kessler
Via Sommarive, 18, 38050 Trento, Italy
{cunduy, perini, tonella}@itc.it
Abstract. Agent-based distributed systems are increasingly used in various application domains, where autonomy, proactivity and cooperation are required. Correspondingly, the demands on the quality of the delivered agents are growing. However, testing remains a challenging activity, and systematic and automated approaches are still missing. We propose a novel framework for the continuous testing of multi-agent systems, in which test cases are continuously generated and executed. Two techniques for the automated, continuous generation of test cases are investigated in this paper: (1) random generation; (2) evolutionary mutation. Preliminary experimental results, obtained on a case study, are encouraging and indicate that continuous testing can exercise a multi-agent system more effectively than under the usual time constraints of manual testing.
1 Introduction
Agent-based distributed systems are increasingly used in various application domains. Agents have been recognized as a promising technology for building the next generation of seamless mobility services. They appear in mobile phones and personal digital assistants, helping their owners manage complicated work or do shopping; agents facilitate e-learning and decentralized enterprise management, and are a candidate technology for e-inclusion applications. In the development process of such complex and critical systems, testing becomes crucial in order to ensure a satisfactory level of quality. Studies on testing of Multi-Agent Systems (MAS) are quite preliminary, and only a few works have investigated structured and tool-supported approaches to testing [20, 14, 1]. In fact, MAS testing is a challenging task. The very specific nature of software agents, which are designed to be autonomous, proactive, collaborative, and ultimately intelligent, makes it difficult to apply existing software testing techniques to them. For instance, agents operate asynchronously and execute in parallel, which complicates development and debugging. Agents communicate primarily through message passing instead of method invocation, so traditional
testing approaches are not directly applicable. Agents are autonomous and cooperate with other agents, so they may run correctly by themselves but incorrectly in a community, or vice versa. Moreover, agents can be programmed to learn, so successive tests with the same test data may give different results [15].

In this paper, we propose a framework for the continuous testing of MAS that complements, but does not replace, the manual creation of test suites (sets of related test cases). Continuous testing of MAS is part of a comprehensive goal-oriented testing methodology [11], in which test suites are derived manually from a goal-oriented requirements specification given, for example, in the Tropos [2] notation. In continuous testing, a software agent (called the Autonomous Tester Agent) plays the role of the human tester, by producing test suites and executing them. A network of Monitoring Agents observes the execution (including the exchanged messages) and recognizes crashes and any behavior that does not comply with the agent specifications (given in the form of pre- and post-conditions). The Monitoring Agents report any revealed bugs on the same bug tracking system that human testers use. The main advantage of the Autonomous Tester Agent over a human is that it can generate test cases automatically and run continuously. In this way, the MAS under test is tested more thoroughly and stressed more extensively. This potentially makes a big difference, since MAS faults are typically hard to reveal, in that they require specific conditions and contexts. Continuous testing thus addresses the main weakness of manual testing, which is naturally bound to a limited testing time and to a small number of execution conditions.

The next section introduces related work. Section 3 gives background notions on a goal-oriented software testing methodology. MAS testing challenges and our approach are presented in Section 4. Then, Section 5 describes our framework for the automated continuous testing of MAS. Section 6 discusses experimental results. Finally, Section 7 summarizes this work and outlines our future investigations.
2 Related Work
Saff and Ernst [16, 17] introduced and evaluated a testing technique that uses spare CPU resources to continuously run tests in the background, providing rapid feedback about test failures while the source code is being edited. This technique was implemented on top of JUnit [6] and the Eclipse IDE^1. Our work has a similar inspiration, but we target MAS specifically and aim at automating test case generation.

Rouff [15] discusses the challenges involved in testing single agents and agent communities. He proposes a special tester agent, used to test the other agents individually or within the community they belong to. We share with Rouff the natural choice of testing a MAS using an agent, and we go even further by separating the testing from the monitoring responsibility, the latter being assigned to monitoring agents. This makes our framework more scalable in dealing with the distributed nature of MAS. Differently from Rouff, our final aim is to have a continuously running tester agent which autonomously generates new test cases.

Dikenelli et al. [20] proposed a test-driven MAS development approach that supports iterative and incremental MAS construction. A testing framework, built on top of JUnit [6] and Seagent [4], is used to support the approach. The framework allows writing automated tests for agent behaviors and for interactions between agents. Similarly, Coelho et al. [14] introduced an approach for MAS unit testing built on top of JUnit and JADE [19]. Agents differ from objects in that they communicate through message passing and not by method invocation, so both approaches involve mock agents, which simulate real agents, to interact with the agents under test. However, a number of mock agents need to be implemented in order to test every role of the agents under test, raising scalability issues.

1 http://www.eclipse.org
3 Background
Tropos is an agent-oriented software engineering methodology [2] that guides software engineers in building a conceptual model, which is incrementally refined and extended, from an early requirements model to system design artifacts and then to code. The goal-oriented testing methodology [11] integrates testing into Tropos by means of a systematic way of deriving test suites from Tropos output artifacts. Test suite derivation takes place along the development phases. For instance, based on the results of the Late Requirements and Architectural Design phases, developers derive system test suites to test the system under development at the system level. Similarly, based on the results of the Architectural Design and Detailed Design phases, developers derive agent test suites to test agents individually as well as the interactions among them. The derivation of test suites is realized at the time when developers specify or design the system, thus helping them refine their design and uncover defects early.

Goal-oriented testing aims at: (1) verifying the capability of the system actors (agents, in case a MAS is chosen as the implementation platform) to fulfill their goals; and (2) ensuring that they do not misbehave under abnormal circumstances. Test suites derived following the methodology adhere to these two objectives. To derive test suites for a goal, the methodology exploits the relationships between the goal and other artifacts, such as goals and plans. As an example, given a Means-End relationship between goal G1 and plan P1, we say that G1 is fulfilled when P1 is executed successfully. To test G1, one has to derive test suites that launch P1 and verify the execution results. Specific test inputs (e.g., message content, interaction protocol) and the expected outcome are partially generated from the plan design (e.g., UML activity or sequence diagrams) and are then completed manually by testers. A hypothetical skeleton of such a derived test case is sketched below.
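To make the derivation concrete, the following JUnit-style skeleton sketches what a test case derived for G1 might look like. It is only an illustration: GoalG1Test, launchPlanP1, and the expected outcome are hypothetical names, not artifacts produced by the methodology.

    import junit.framework.TestCase;

    // Hypothetical skeleton of a test case derived from the Means-End
    // relationship between goal G1 and plan P1; all names are illustrative.
    public class GoalG1Test extends TestCase {

        public void testG1FulfilledByP1() {
            // The test input (message content, interaction protocol) is
            // partially generated from the plan design and completed manually.
            String testInput = "request: perform P1";

            // Launch plan P1 through whatever interface the agent exposes
            // (a hypothetical helper here).
            String outcome = launchPlanP1(testInput);

            // G1 is considered fulfilled when P1 executes successfully,
            // i.e., the observed outcome matches the expected one.
            assertEquals("P1 completed", outcome);
        }

        private String launchPlanP1(String input) {
            // Placeholder: a real test would send a message to the agent
            // under test and collect its reply.
            return "P1 completed";
        }
    }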
4 Continuous Testing of Multi-Agent Systems

4.1 Multi-Agent Systems and Testing Issues
Agent-oriented software development involves multiple disciplines, e.g., software engineering, cognitive science, social science, artificial intelligence, machine learning, etc., each with its own view on MAS. In this work, we adopt the software engineering and software testing perspective on MAS. Multi-agent systems are systems composed of multiple software agents that interact with one another in order to achieve their intended goals, as well as the goals of the system as a whole. A MAS is usually a distributed system: its agents can be located on different hosts, and they communicate mainly through message passing. Each host provides a specific environment for the agents located at that host. Agents, in turn, are software systems that have (among others) the following properties: Reactivity, agents are able to sense environmental changes and react accordingly; Proactivity, agents are autonomous, in that they are able to choose which actions to take in order to reach their goals in a given situation; Social ability, that is, agents are interacting entities, which communicate, share knowledge, and may cooperate for goal achievement. Due to these peculiar properties, testing MAS is a challenging task that must address the following issues:

Distributed/asynchronous. Agents operate in parallel and asynchronously. An agent might have to wait for other agents in order to fulfill its intended goals. An agent might work correctly when it operates alone but incorrectly when put into a community of agents, or vice versa. MAS testing tools must have a global view over all distributed agents, in addition to local knowledge about individual agents, in order to decide whether the whole system is behaving according to the specifications.

Autonomous. Agents are autonomous. The same test inputs may produce different results at different runs, since agents might update their knowledge base between two runs, or they may learn from previous inputs, resulting in different decisions made in similar situations.

Message passing. Agents communicate through message passing. Traditional testing techniques, which rely on method invocation, cannot be directly applied.

Environment factor. The environment is an important factor that influences the agents' behaviors. Changing the environment of an agent may affect the test results, even for the same test input sequence.

Black-box MAS. In some cases, a MAS may be seen as a "black box", that is, it may provide no or few observational primitives to the outside world, resulting in limited access to the internal agents' state and knowledge. This kind of MAS can be quite difficult to test, in that the test result (PASS or FAIL) may be hard to assess.
4.2 Automated Continuous Testing
The specific features of MAS (autonomy, proactivity, learning ability, cooperation, etc.) demand a framework that supports extensive and possibly automated testing. We propose to complement manually derived test cases with automatically generated ones, which are continuously run in order to reveal errors associated with conditions that are hard to simulate and reproduce. Testing multi-agent systems can be achieved very naturally by means of a dedicated autonomous tester agent, which continuously interacts with the agents under test, and of monitoring agents, which check those agents' states. Since agents communicate primarily through message passing, the tester agent can send messages to other agents to stimulate behaviors that can potentially lead to fault discovery. The messages sent by the tester agent are those encoded in the test suites, which can in turn be manually derived from goal diagrams, following the goal-oriented testing methodology [11], or automatically generated. It is then the monitoring agents' responsibility to observe the reactions to the messages sent by the tester agent and, in case these are not compliant with the expected behavior (post-conditions violated) or crashes happen, to inform the development team that a fault was revealed.

Since the behavior of a MAS can change over time, due to the mutual dependencies among agents and to their learning capabilities, a single execution of a test suite might be inadequate to reveal the faults. Using an autonomous tester agent allows for an arbitrary extension of the testing time, which can proceed unattended and independently of any other human-intensive activity. Continuous testing of a MAS requires that the tester agent have the capability to evolve existing test suites and to generate new ones, with the aim of exercising and stressing the application as much as possible, the final goal being the possibility to reveal yet unknown faults.

We propose a framework for the continuous testing of MAS, called eCAT (environment for the Continuous Agent Testing). One of its main components is the Autonomous Tester Agent, which is capable of automatically generating new test suites and executing them on a MAS. We consider two automated test case generation techniques for the Autonomous Tester Agent: random generation and evolutionary mutation generation (called evol-mutation from now on). With the random generation technique, the Autonomous Tester Agent generates test suites randomly; with evol-mutation, it generates more effective test suites by mutating the existing ones.

Random Testing The Autonomous Tester Agent is capable of generating random test cases, following the random test data generation strategy [9, 18]. First, the Autonomous Tester Agent selects a communication protocol among those provided by the agent platform, e.g., a FIPA Interaction Protocol [5] in JADE [19]. Then, messages are randomly generated and sent to the agents under test. In order to insert meaningful data into the messages, a model of the domain data, coming from the business domain of the MAS under test, must also be supplied. The message format is that prescribed by the agent environment of choice (such as the FIPA ACLMessage [5]), while the content is constrained by a domain data model. Such a model prescribes the range and the structure of the data that are produced randomly, either in terms of generation rules or in the (simpler) form of sets of admissible data that are sampled randomly. Randomly generated messages are then sent to the agents under test, and it is the Monitoring Agents' responsibility to observe the reactions, i.e., communications, exceptions, etc. happening in the agent system. When a deviation from the expected behavior is found (post-condition violated or crash), it is reported to the development team. The main limitation of random testing of MAS is that long and meaningful interaction sequences are hardly generated randomly. However, it is often the case that agent interaction protocols need only one trigger message, like those specified in [5], or that the agent under test needs only one message to trigger its goals. In these cases, random testing is a cheap and efficient technique that can reveal faults; evidence is provided in the experimental results section. A minimal sketch of the random generation step is given below.
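As an illustration of this generation step, the following sketch shows how a tester agent could assemble and send one random message in JADE. The performative list, the domainData repository, and the class name RandomTesterAgent are simplifying assumptions, not the actual eCAT implementation.

    import jade.core.AID;
    import jade.core.Agent;
    import jade.lang.acl.ACLMessage;

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;
    import java.util.Random;

    // Minimal sketch of the random generation step in a JADE setting.
    // The domain data model is reduced to a flat list of admissible content
    // strings; in eCAT this information would come from the supplied domain
    // data repository, so the field below is only an illustrative stand-in.
    public class RandomTesterAgent extends Agent {

        private final Random random = new Random();

        // A few admissible performatives taken from the FIPA set [5].
        private final int[] performatives = {
            ACLMessage.REQUEST, ACLMessage.QUERY_REF, ACLMessage.SUBSCRIBE
        };

        // Hypothetical domain data (sampled randomly when building messages).
        private final List<String> domainData = new ArrayList<>(
            Arrays.asList("author = {Smith}", "title = {A Paper}"));

        // Build and send one random test message to an agent under test;
        // the reactions are observed by the Monitoring Agents.
        void sendRandomMessage(AID agentUnderTest) {
            ACLMessage msg =
                new ACLMessage(performatives[random.nextInt(performatives.length)]);
            msg.addReceiver(agentUnderTest);
            msg.setContent(domainData.get(random.nextInt(domainData.size())));
            send(msg);
        }
    }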
For the generation of longer sequences, inherently constructed so as to maximize the likelihood of revealing faults, more sophisticated techniques are needed, such as evol-mutation, described next.

Evolutionary Mutation Testing Mutation testing [3, 7] is a way to assess the adequacy of a test suite and to improve it. Mutation operators are applied to the original program in order to artificially introduce known defects. The changed version of the program is called a mutant. For example, a mutant could be created by modifying a branch condition, e.g.,

    if (msg.getPerformative() == ACLMessage.REQUEST)

changed into

    if (msg.getPerformative() == ACLMessage.REQUEST_WHEN)
or by modifying a method invocation (e.g., receive() changed into blockingReceive()). A test case is able to reveal the artificial defects seeded into a mutant if the output of its execution on the mutant deviates from the output of its execution on the original program. In such a case, the mutant is said to have been killed. The adequacy of a test suite is measured as the ratio of killed mutants over all the mutants generated. When this ratio is low, the test suite is considered inadequate and more test cases are added to increase its capability of revealing the artificially injected faults, under the assumption that this will also lead to revealing the "true" faults.

Evolutionary testing [12] is based on the idea of evolving test suites by applying mutation operators to the test cases themselves. In order to guide the evolution towards better test suites, a fitness measure is defined as a heuristic approximation of the distance from achieving the testing goal (e.g., covering all statements or all branches in the program). Test cases with higher fitness values are more likely to be selected for evolution when a test suite is transformed into the next one. We propose to use a combination of mutation and evolutionary testing for the automated generation of the test cases executed by the tester agent in a
given multi-agent environment. Intuitively, we use the mutation adequacy score as a fitness measure to guide evolution, under the hypothesis that test suites that are better at killing mutants are also likely to be better at revealing real faults. The proposed technique consists of the following four steps:

Step 0: Preparation. Given the MAS under test M, we apply mutation operators to M to produce a set of mutants {M_1, M_2, ..., M_n}. One or more mutations are applied to one or more (randomly chosen) agents in M.

Step 1: Test execution and adequacy measurement. The Autonomous Tester Agent executes the test cases {TC_1, TC_2, ..., TC_m} on all the mutants. Initially, test cases can be randomly generated or they can be those derived from goal analysis by the user. The Autonomous Tester Agent then computes the adequacy (fitness value) of each test case: F(TC_i) = K_i / n, where K_i is the number of mutants killed by TC_i and n is the total number of mutants. To increase performance, the executions of the test cases on the mutants are performed in parallel (e.g., on a cluster of computers, with one mutant per node).

Step 2: Test case evolution. The procedure for generating new test cases is as follows:

1: Select randomly whether to apply mutation or crossover
2: if crossover is chosen then
3:   Select two test cases (i, j) with probabilities proportional to F(TC_i) and F(TC_j)
4:   Apply crossover to TC_i and TC_j
5: else
6:   Select a test case with probability proportional to F(TC_i)
7:   Apply mutation
8: end if
9: Add the new test cases to the new set of test cases

The basic mechanisms used to evolve a given test case are mutation and crossover. Mutation consists of a random change of the data used in the messages exchanged in a test case, similarly to the random generation described above: a good test case (according to the fitness value) is selected, one of its messages is chosen randomly, and the content of that message is changed randomly. Crossover consists of the combination of two test cases: two good test cases are chosen, then some data in the second test case replace the data used in the first one, or an entire sequence of messages is taken from the second test case and appended at the end of the first test case, possibly after truncating its message sequence at a randomly selected point.

The encoding of test cases for the evolutionary algorithm is as follows. Since each test case specifies a test scenario that contains a sequence of messages, test case TC_i is encoded as {Msg_{i,1}, Msg_{i,2}, ..., Msg_{i,n_i}}, where n_i is the number of messages specified in TC_i. Crossover and mutation are realized by operating on these messages. In particular, mutation requires a random modification of a message. This can be achieved by resorting to a database of messages, built by collecting all messages from the initial test cases and enriched with domain data, as in random testing. Moreover, the database is gradually enriched during testing with messages returned by the MAS under test and its mutants. Random sampling of this database is used to produce test case mutants. The size and diversity of the messages in the database are crucial properties that determine the ability of mutated test cases to reveal faults. This is the reason why we continuously grow the database as testing proceeds, by capturing and storing the exchanged messages.

Step 3: Decision.

1: if Number of generations > Max number of generations then
2:   DONE
3: else
4:   Create new mutants for the next generation {substitute the current set of mutants with a new one to increase the diversity of mutants (i.e., faults)}
5:   Goto Step 1
6: end if
The algorithm stops when the number of generations exceeds a given maximum. Otherwise, we go back to Step 1 and keep on testing continuously. When no improvement of the fitness values is observed for a number of evolutionary iterations, Step 0 (Preparation) is repeated and a new set of mutants is produced, so that test cases are assessed on a different set of artificial defects. In fact, the lack of progress for some time may indicate either that the residual mutations are too hard (maybe impossible) to reveal or that all mutations are easily revealed by the current population of test cases; hence, the time has come to change mutants. As with random generation, each time the Monitoring Agents observe a deviation from the expected behavior, the development team is informed through a bug report submission. A minimal sketch of the selection, mutation, and crossover mechanisms is given below.
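The following sketch illustrates Step 2 on message sequences. It assumes fitness values have already been computed as in Step 1; TestCaseEvolver, messageDatabase, and the method names are hypothetical simplifications of the described procedure, not the eCAT code.

    import jade.lang.acl.ACLMessage;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.Random;

    // Sketch of one evolution step, under simplifying assumptions: a test
    // case is just a non-empty message sequence, and the message database
    // is a flat, non-empty list that grows as testing proceeds.
    public class TestCaseEvolver {

        private final Random random = new Random();
        private final List<ACLMessage> messageDatabase;

        public TestCaseEvolver(List<ACLMessage> messageDatabase) {
            this.messageDatabase = messageDatabase;
        }

        // Fitness-proportional (roulette-wheel) selection of a test case index,
        // where fitness[i] = K_i / n as computed in Step 1.
        int select(double[] fitness) {
            double total = 0;
            for (double f : fitness) total += f;
            double r = random.nextDouble() * total;
            for (int i = 0; i < fitness.length; i++) {
                r -= fitness[i];
                if (r <= 0) return i;
            }
            return fitness.length - 1;
        }

        // Mutation: replace one randomly chosen message of a good test case
        // with a message sampled from the database.
        List<ACLMessage> mutate(List<ACLMessage> testCase) {
            List<ACLMessage> child = new ArrayList<>(testCase);
            int pos = random.nextInt(child.size());
            child.set(pos, messageDatabase.get(random.nextInt(messageDatabase.size())));
            return child;
        }

        // Crossover: truncate the first test case at a random point and
        // append a tail of messages taken from the second one.
        List<ACLMessage> crossover(List<ACLMessage> a, List<ACLMessage> b) {
            int cutA = random.nextInt(a.size());
            int cutB = random.nextInt(b.size());
            List<ACLMessage> child = new ArrayList<>(a.subList(0, cutA + 1));
            child.addAll(b.subList(cutB, b.size()));
            return child;
        }
    }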
5 eCAT Framework
We propose an agent testing framework, called eCAT^2, that implements our method for automated continuous testing of MAS. The framework facilitates test suite derivation from goal analysis, following the goal-oriented testing methodology, by semi-automatically generating test suite skeletons from the goal analysis diagrams produced by TAOM4E^3, a tool that supports Tropos. The framework also provides GUIs to help human testers specify test inputs. More importantly, eCAT can evolve and generate more test inputs through the evol-mutation or random testing techniques described in the previous section, and run these test inputs continuously to test the MAS. Fig. 1 depicts the high-level architecture of eCAT, which consists of three main components: the Test Suite Editor, allowing human testers to derive test suites from goal analysis diagrams; the Autonomous Tester Agent, capable of automatically generating new test cases and executing them on a MAS; and the Monitoring Agents, which monitor communication among agents (including the Autonomous Tester Agent) and all events happening in the execution environments, in order to trace and report errors.

2 For more information, visit: http://sra.itc.it/people/cunduy/ecat
3 http://sra.itc.it/tools/taom4e
Fig. 1. The eCAT framework: the Test Suite Editor and the Autonomous Tester Agent reside with a Central Monitoring Agent, which controls Remote Monitoring Agents deployed in the environments (Environment 1 on Host 1 through Environment N on Host N) of the MAS under test.
Remote monitoring agents are deployed within the environments of the MAS under test, transparently to the MAS, in order to avoid possible side effects. For instance, if a MAS under test comprises two geographically different environments, one on a mobile phone and the other on a Web server, two remote monitoring agents will be deployed, one per environment. All the remote monitoring agents are under the control of the Central Monitoring Agent, which is located at the same host as the Autonomous Tester Agent. The monitoring agents overhear agent interactions as well as events taking place in the environments, providing a global view of what is going on during testing and helping the Autonomous Tester Agent evaluate the mutants' behaviors. The roles of the Monitoring Agents are threefold: (i) monitoring events (e.g., agent born, agent died) and interactions taking place inside the MAS and its environment, which results in execution traces; (ii) guarding the MAS operations with respect to specified pre-/post-conditions; and (iii) providing execution traces of the MAS under test to the Autonomous Tester Agent. The second role is especially important in testing "black-box" MAS, which do not expose observational interfaces to the outside world or provide interfaces only for perceiving environmental changes. Pre- and post-conditions can be very useful to judge the operation of a MAS and determine the test result correspondingly; a minimal sketch of such a post-condition guard is given below.
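As an illustration of the second role, the following sketch shows how a monitoring agent might guard a post-condition over an overheard reply. PostCondition and BugTracker are hypothetical interfaces, not part of the eCAT API.

    import jade.lang.acl.ACLMessage;

    // Hypothetical predicate-based post-condition over an observed reply.
    interface PostCondition {
        boolean holds(ACLMessage observedReply);
        String description();
    }

    // Hypothetical facade over the bug tracking system used by the team.
    interface BugTracker {
        void report(String description);
    }

    // Sketch of a post-condition guard applied by a Monitoring Agent.
    class PostConditionGuard {

        private final BugTracker bugTracker;

        PostConditionGuard(BugTracker bugTracker) {
            this.bugTracker = bugTracker;
        }

        // Called for every overheard reply of the MAS under test.
        void check(ACLMessage reply, PostCondition condition) {
            if (reply == null || !condition.holds(reply)) {
                // Deviation from the specified behavior: file a bug report
                // on the same tracking system used by human testers.
                bugTracker.report("Post-condition violated: "
                        + condition.description());
            }
        }
    }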
The current version of eCAT is implemented as an Eclipse plug-in and supports the JADE [19] and JADEX [13] platforms. The Autonomous Tester Agent and the Monitoring Agents are implemented as a JADE agent and a JADE tool-agent, respectively. Possible scenarios where eCAT can be exploited are as follows:

One-to-One: one testing thread of the Autonomous Tester Agent is used to test one instance of the MAS under test; test cases are executed sequentially. This scenario is often applied when we run test cases derived from goal analysis to explicitly evaluate the model, goal by goal. We call this kind of test execution goal-oriented testing (the term is used in the experimental section). In addition, the random testing technique is currently implemented for this scenario.

Many-to-Many: multiple testing threads of the Autonomous Tester Agent are used to test a corresponding number of instances/mutants of the MAS under test. Each thread can execute a different test case in order to take advantage of computing resources, or all threads can execute the same test case on all instances/mutants of the MAS to assess the different behaviors (a minimal sketch of this parallel execution is given after this list). These instances/mutants may have different initial knowledge bases, which could lead to revealing different faults. The Many-to-Many scenario is also used in evol-mutation testing.

Many-to-One: multiple testing threads are used to test one single instance of the MAS under test. This scenario is particularly useful in stress testing, to measure performance as well as to expose possible overloading faults.
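The following sketch illustrates the parallel execution underlying the Many-to-Many scenario, with one testing thread per mutant. MutantEndpoint, TestCase, and runTestCase are hypothetical placeholders for the corresponding eCAT internals.

    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    // Illustrative sketch of the Many-to-Many scenario: all threads
    // execute the same test case, one thread per mutant.
    public class ManyToManyRunner {

        public void runOnAllMutants(List<MutantEndpoint> mutants, TestCase tc)
                throws InterruptedException {
            ExecutorService pool = Executors.newFixedThreadPool(mutants.size());
            for (MutantEndpoint mutant : mutants) {
                // Each thread delivers the test case messages to one mutant;
                // deviations observed by the Monitoring Agents mark it as killed.
                pool.execute(() -> runTestCase(mutant, tc));
            }
            pool.shutdown();
            // Wait for all threads before computing the adequacy score.
            pool.awaitTermination(10, TimeUnit.MINUTES);
        }

        private void runTestCase(MutantEndpoint mutant, TestCase tc) {
            // Placeholder: send tc's messages and collect observed reactions.
        }
    }

    // Hypothetical types standing in for the real framework classes.
    class MutantEndpoint { }
    class TestCase { }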
6 Experimental Results
This section describes the experimental results obtained when we used eCAT to test BibFinder. First, we introduce BibFinder, the MAS under test, its features and architectural design. Then, the different testing techniques applied to BibFinder, the testing results, and our evaluation are presented.

BibFinder is a MAS for the retrieval and exchange of bibliographic information in BibTeX format^4. BibFinder is capable of scanning the local drives of the host machine where it runs, searching for bibliographic data in BibTeX format. It consolidates databases spread over multiple devices into a unique one, in which queried items can be quickly searched. BibFinder can also exchange bibliographic information with other agents, in a peer-to-peer manner, thus augmenting its search capability with those provided by other peer agents. Moreover, BibFinder performs searches on and extracts BibTeX data from the Scientific Literature Digital Library^5, exploiting the Google search Web service^6. BibFinder is available for the JADE platform [19]. The next section discusses how BibFinder and its goals have been tested.

4 http://www.ecst.csuchico.edu/~jacobsd/bib
5 http://citeseer.ist.psu.edu
6 http://code.google.com/apis/soapsearch

6.1 Testing BibFinder
We applied three testing techniques included in eCAT when testing BibFinder: (1) random testing, which mainly uncovered bugs that make BibFinder crash; (2) goal-oriented testing, aimed at verifying whether the agents in BibFinder can fulfill their goals; and (3) evol-mutation testing, aimed at revealing more bugs thanks to the possibility of continuous execution.
Goal-oriented testing Based on the relationships in BibFinder's architectural design, we derived 6 test suites to test the fulfillment of the associated goals. This derivation follows the goal-oriented software testing methodology discussed in [11, 10]. These test suites contain 12 test cases specifying 12 different test scenarios.

Goal-oriented testing enhanced by coverage Given a test suite, such as the one derived through goal-oriented testing, statement coverage can be measured and used to make sure that all implemented code has been exercised by at least one test case (excluding any unreachable code). We enhanced goal-oriented testing by manually adding 3 new test cases, in order to reach 100% statement coverage of the main packages. In other words, we complement black-box testing with white-box testing: by analyzing the coverage rate by means of the tool GroboCodeCoverage^7, we identified the uncovered code and added test cases that increase the coverage level, up to 100% coverage.

Random testing In order to apply the random test case generation technique during continuous testing, we pre-defined a library of interaction protocols and a repository of domain data. The interaction protocols include the five FIPA protocols Propose, Request, Request-When, Subscribe, and Query [5], and twenty-one (simple) protocols created from twenty-one different FIPA message performatives, such as AGREE, REQUEST, etc. Domain data have been collected from the test suites derived from the goal model and have been manually augmented with additional possible input values. The Autonomous Tester Agent generates test cases by selecting domain data randomly and combining them with interaction protocols. It continuously generates test cases and executes them against BibFinder. The Monitoring Agents are in charge of observing the whole system, i.e., BibFinder and the JADE platform. Based on the intercepted information, they can recognize the situations in which bugs are revealed (e.g., some agents crash).

Evol-mutation testing The preparation step of evol-mutation testing consists of creating initial test cases, as initial individuals, and creating mutants of the original BibFinder system. The initial population contains the 12 test cases derived with the goal-oriented testing technique. Since BibFinder agents are implemented in JADE, a pure Java platform, we can apply existing object-oriented mutation operators for Java to them in order to create mutants. It would be better to have agent-specific mutation operators but, to the knowledge of the authors, no work has investigated this issue yet; we consider it future work. We adapted the tool MuClipse^8, built on top of µJava [21], to create mutants from the source code of three agents: BibFinderAgent, BibExchangerAgent, and BibExtractorAgent.

7 http://groboutils.sourceforge.net/codecoverage
8 http://muclipse.sourceforge.net
The source code of the supporting classes was left untouched. 24 class-level and 15 statement-level mutation operators [21] were applied to those agents. After combining the results, we obtained 178 mutants of BibFinder to be used in evol-mutation testing.

6.2 Results
We conducted testing experiments with the goal-oriented (G), coverage-enhanced goal-oriented (G+), and random (R) techniques on a computer equipped with 2 GB of RAM and a Core 2 Duo 1.86 GHz processor (named Host in the following). The last technique, evol-mutation testing, was used with the original version of BibFinder running on the Host and 15 mutants running on 3 cluster machines (4 GB RAM, 4 Xeon 3 GHz CPUs). These experiments were repeated 10 times for each technique in order to measure the average time and the ability to discover faults. Each execution consists of a number of execution cycles, in which test cases are run on BibFinder and its mutants. The test cases executed in each cycle are the same for the goal-oriented and coverage-enhanced goal-oriented techniques; they differ across cycles for random testing; and, for evol-mutation testing, the test cases executed in a cycle are those from the previous cycle plus one or two new test cases generated by evolution. To assess the performance of eCAT we considered real bugs of BibFinder that were detected during its development, as well as artificial faults inserted into the code according to the fault seeding method [8]. Details of the faults found are presented in the Appendix.
Fig. 2. Bugs revealed by cycle: (a) real bugs; (b) all bugs. The number of bugs is plotted against time (in cycles) for the G+, R, and M techniques.
Looking at Fig. 2(a), we can notice that random testing is quite effective in detecting fatal bugs: it revealed two real fatal bugs, one of which was not detected by any other technique. Goal-oriented testing revealed moderate bugs, showing that the implemented agents fail to fulfill their goals. These
moderate bugs were uncovered easily, right at the first cycle, because the agents of BibFinder are currently just reactive. Since proactive agents could behave differently at different cycles, more test cycles may be necessary to bring proactive agents to a state that reveals faults; however, more experiments are needed to prove this. Finally, looking at the results, we can see that evol-mutation testing reveals the bugs uncovered by goal-oriented testing (this is expected, since evol-mutation takes the test cases used by the goal-oriented technique as initial inputs) but, more importantly, evol-mutation also revealed bug No. 7, which was not detected by any other technique. This bug was uncovered by mutating a message and enriching its content with data taken from the dynamically constructed database.

To further evaluate the performance of eCAT, we also used the fault seeding method [8]. We involved 3 PhD students with substantial skill and experience in MAS development and asked them to insert realistic bugs (i.e., bugs as similar to real bugs as possible) into BibFinder. We obtained 15 copies of BibFinder, each containing one bug. First, we ran the coverage-enhanced goal-oriented and random techniques to find bugs on these copies. Then, we ran evol-mutation on the remaining copies, i.e., those containing bugs that could not be found by the other two techniques. Because evol-mutation uses the test cases of the coverage-enhanced goal-oriented technique as initial inputs, the bugs found by the latter are a subset of the bugs found by the former, so we only need to run evol-mutation to find the bugs that are left. Eventually, 11 of these bugs were uncovered by one or more of the testing techniques under study. eCAT could not detect 4 bugs pertaining to BibExtractorAgent, even with the evol-mutation technique. These bugs were inserted into the crawling functionality of BibExtractorAgent, by which the agent is able to scan and monitor changes in local directories in order to search for BibTeX files. These directories can be considered as environment to BibFinder, and those bugs can be revealed only by changing this environment. Hence, no available testing technique can reveal any of these 4 bugs, by construction. This is one of the issues mentioned in Section 4.1.

Fig. 3. Log-log plot of the mean time between failures.

The summary of bugs found by each technique is shown in Fig. 2(b), where they are plotted against the testing cycles. The mean time between failures (MTBF) is depicted in the log-log plot in Fig. 3. We can observe that the mean time between two bugs found at the beginning of testing is very small. Then, it tends to increase, although not always monotonically. Since the number of remaining bugs decreases, it becomes harder and harder to reveal them. After the last bug found (around 1 hour from the beginning of testing), no more bugs were revealed by eCAT. In a real development scenario, eCAT can be left running continuously, so as
to try to reveal also those bugs that are associated with a very long mean time between failures and are thus extremely hard (or impossible) to reveal in traditional testing sessions. Going back to Fig. 2(b), we can notice that goal-oriented and random testing are quite effective in the initial testing cycles, when bugs can be revealed by simple and short message sequences and the selection of the input data is not critical to expose them (i.e., there exist large equivalence classes of input data that can be used interchangeably to reveal a given fault). When the remaining bugs become hard to find (in the last testing cycles), goal-oriented and random testing become ineffective, and it is only through evol-mutation that additional faults can be revealed.
7 Conclusion
We introduced a novel approach for the continuous testing of multi-agent systems. The specific nature of software agents, involving autonomy, proactivity, learning capability, etc., makes testing particularly hard and challenging. Continuous testing can be run unattended, for a long time (e.g., during the night), thus exercising the MAS under a huge number of different scenarios and conditions, which would be impossible to achieve by manual testing. A tester agent takes care of continuous, automated test case generation, and monitoring agents report any discovered error through the bug tracking system. The results obtained from the case study indicate that continuous testing has great potential to complement the manual testing activity. In fact, especially for faults involving long message sequences and specific input data, continuous testing seems particularly suited to explore the states that can potentially lead to them. Whenever high reliability (i.e., a long mean time between failures) is the aim, evol-mutation can contribute to the discovery of hard-to-reveal faults, which would probably go unnoticed under goal-oriented and random testing. In our future work, we will further investigate pre- and post-conditions, so that they can specify behavioral constraints on the agents under test; this can potentially guide evol-mutation to reveal faults that violate the specified conditions. In addition, we plan to extend our framework to deal with the remaining MAS testing issues, such as "black-box" MAS and the environment factor, so that the Autonomous Tester Agent can detect faults related to specific environment configurations.
References

1. J. A. B. Blaya, J. M. Hernansaez, and A. F. Gómez-Skarmeta. Towards an approach for debugging multi-agent systems through the analysis of agent messages. Comput. Syst. Sci. Eng., 20(4), 2005.
2. P. Bresciani, P. Giorgini, F. Giunchiglia, J. Mylopoulos, and A. Perini. Tropos: An agent-oriented software development methodology. Autonomous Agents and Multi-Agent Systems, 8(3):203-236, July 2004.
3. R. A. DeMillo, R. J. Lipton, and F. G. Sayward. Hints on test data selection: Help for the practicing programmer. IEEE Computer, 11(4):34-41, 1978.
4. O. Dikenelli, R. C. Erdur, and O. Gumus. Seagent: A platform for developing semantic web based multi agent systems. In AAMAS '05: Proceedings of the Fourth International Joint Conference on Autonomous Agents and Multiagent Systems, pages 1271-1272, New York, NY, USA, 2005. ACM Press.
5. Foundation for Intelligent Physical Agents. FIPA specifications. http://www.fipa.org/specifications.
6. E. Gamma and K. Beck. JUnit: A regression testing framework. http://www.junit.org, 2000.
7. R. G. Hamlet. Testing programs with the aid of a compiler. IEEE Transactions on Software Engineering, 3(4):279-290, 1977.
8. M. J. Harrold, A. J. Offutt, and K. Tewary. An approach to fault modeling and fault seeding using the program dependence graph. Journal of Systems and Software, 36(3):273-295, 1997.
9. H. D. Mills, M. D. Dyer, and R. C. Linger. Cleanroom software engineering. IEEE Software, 4(5):19-25, September 1987.
10. C. D. Nguyen, A. Perini, and P. Tonella. A goal-oriented software testing methodology. Technical report, FBK-irst, 2006. http://sra.itc.it/images/sepapers/gost-techreport.pdf.
11. C. D. Nguyen, A. Perini, and P. Tonella. A goal-oriented software testing methodology. In 8th International Workshop on Agent-Oriented Software Engineering, AAMAS, May 2007.
12. R. Pargas, M. J. Harrold, and R. Peck. Test-data generation using genetic algorithms. Journal of Software Testing, Verification, and Reliability, 9:263-282, September 1999.
13. A. Pokahr, L. Braubach, and W. Lamersdorf. Jadex: A BDI reasoning engine. In Multi-Agent Programming. Kluwer, 2005.
14. R. Coelho, U. Kulesza, A. von Staa, and C. Lucena. Unit testing in multi-agent systems using mock agents and aspects. In International Workshop on Software Engineering for Large-scale Multi-Agent Systems, May 2006.
15. C. Rouff. A test agent for testing agents and their communities. In Proceedings of the IEEE Aerospace Conference, 2002.
16. D. Saff and M. D. Ernst. Reducing wasted development time via continuous testing. In Proceedings of the 14th International Symposium on Software Reliability Engineering (ISSRE 2003), 2003.
17. D. Saff and M. D. Ernst. An experimental evaluation of continuous testing during development. In Proceedings of the 2004 International Symposium on Software Testing and Analysis, pages 76-85, July 2004.
18. P. Thévenod-Fosse and H. Waeselynck. STATEMATE applied to statistical software testing. In Proceedings of the International Symposium on Software Testing and Analysis (ISSTA), pages 78-81, June 1993.
19. TILAB. Java Agent DEvelopment framework (JADE). http://jade.tilab.com/.
20. A. M. Tiryaki, S. Oztuna, O. Dikenelli, and R. C. Erdur. SUnit: A unit testing framework for test driven development of multi-agent systems. In 7th International Workshop on Agent-Oriented Software Engineering, 2006.
21. Y.-S. Ma, J. Offutt, and Y. R. Kwon. MuJava: An automated class mutation system. Software Testing, Verification and Reliability, 15(2):97-133, June 2005.
Appendix: BibFinder Description

Fig. 4 shows how Tropos constructs are used to model the requirements of a MAS that supports users (such as researchers) during bibliographic research. Both the user and the system are represented as actors (circles); user needs are represented in terms of goal dependencies from the actor Researcher to the system actor BibFinder, e.g., by the hard goal Find-BibTeX and the softgoal Fast-and-efficient.
Fig. 4. Late requirements analysis in Tropos notation: the actor Researcher depends on the system actor BibFinder for the hard goal Find-BibTeX and the softgoal Fast-and-efficient; within BibFinder, the goals Managing-local-database, Extracting-BibTeX, and Exchanging-BibTeX contribute to their fulfillment.
Fig. 5 depicts the architectural design of BibFinder in Tropos. The system contains three agents: BibFinderAgent, BibExchangerAgent, and BibExtractorAgent. The role of each agent is briefly described as follows: BibFinderAgent maintains the local BibTeX database and coordinates the operation of the system as a whole; BibExchangerAgent is in charge of querying the local database and exchanging data with external agents (e.g., with other instances of BibFinder); BibExtractorAgent crawls local storage devices looking for BibTeX files, and performs searches on and extracts BibTeX items from the Internet. Each agent in BibFinder is responsible for some goals and depends on the other agents for fulfilling some other goals. Inside each agent, a given goal can be decomposed into sub-goals, resulting in a tree of goals, in which each leaf goal has a specific plan as the means to achieve it.
Fig. 5. Architectural design of BibFinder in TAOM4E
For instance, BibFinderAgent has two root goals, Managing-local-database and Handling-requests; the former is decomposed into the Updating-database and Deleting-BibTeX-item goals. The plan Update-BibTeX, for adding new items or updating existing items in the database, acts as the means to achieve the goal Updating-database. When serving external requests, BibFinderAgent depends on BibExtractorAgent for seeking URLs and on BibExchangerAgent for querying the local database. Similarly, BibExtractorAgent and BibExchangerAgent also have goal decompositions and plans specified to fulfill their goals.

Detailed Test Results

Faults of BibFinder detected by eCAT are presented in Table 1. Fault No. 1 says that BibFinderAgent died when it was asked to parse a BibTeX item; fault No. 2 says that JADE does not support creating a new thread within BibExtractorAgent; fault No. 3 shows that BibFinderAgent fails to forward messages to BibExchangerAgent when those messages come from a different JADE platform; etc. Faults are classified by severity level (Fatal faults make agents die; Moderate faults are associated with discrepancies between implementation and specification). The Cycle/generation column reports the average cycle (generation, in the case of the evol-mutation technique) at which bugs were uncovered. One cycle of random testing costs less time than one cycle of goal-oriented testing or evol-mutation.
Table 1. Results of continuous testing on BibFinder

No | Bug                                                      | Bug type | Cycle/generation | Technique
Real bugs
1  | BibTeX parsing                                           | Fatal    | 14               | R
2  | Using thread in BibExtractorAgent                        | Moderate | 1                | G, G+, M
3  | Forward message error                                    | Moderate | 1                | G, G+, M
4  | No reply to incorrect requests                           | Moderate | 1                | G+, M
5  | Lack a required data field                               | Moderate | 1                | G, G+, M
6  | Update wrong BibTeX                                      | Fatal    | G:1, M:1, R:15   | G, G+, M, R
7  | Add new wrong BibTeX                                     | Fatal    | 18               | M
Artificial bugs
8  | Index out of bound in BibExtractorAgent                  | Fatal    | G+:1, R:9        | G+, R
9  | Always reply null                                        | Moderate | 1                | G+
10 | No answer to any request                                 | Moderate | 1                | G+
11 | Index out of bound in BibExchangerAgent                  | Fatal    | G+:1, R:9        | G+, R
12 | Return incorrect BibTeX                                  | Moderate | 1                | G+
13 | Null exception to an array                               | Fatal    | 1                | G+
14 | Reply wrong performative                                 | Moderate | 1                | G+
15 | Handle invalid request error                             | Moderate | 1                | G+
16 | Infinite loop                                            | Fatal    | 16               | R
17 | Null reference from BibFinderAgent to BibExtractorAgent  | Fatal    | 1                | G+
18 | Null reference from BibFinderAgent to BibExchangerAgent  | Fatal    | 1                | G+

Key: R: Random (0.1 minutes/cycle); G: Goal-oriented (0.13 minutes/cycle with 12 test cases); G+: Coverage-enhanced Goal-oriented (0.16 minutes/cycle with 15 test cases); M: evol-Mutation (3.9 minutes/cycle with 15 initial test cases).