An Automated Distributed Infrastructure for Collecting Bluetooth Field Failure Data

Marcello Cinque (1), Fabio Cornevilli (1,2), Domenico Cotroneo (1), Stefano Russo (1,2)
(1) Dipartimento di Informatica e Sistemistica, Università degli Studi di Napoli Federico II, Via Claudio 21, 80125 Naples, Italy
(2) Laboratorio ITEM - Consorzio Interuniversitario Nazionale per l'Informatica, Via Diocleziano 328, 80124 Naples, Italy
{macinque, cotroneo, sterusso}@unina.it - [email protected]

Abstract

The widespread use of mobile and wireless computing platforms is leading to a growing interest in dependability issues. Several research studies have been conducted on the dependability of mobile environments, but none of them has attempted to identify system bottlenecks and to quantify dependability measures. This paper proposes a distributed automated infrastructure for monitoring and collecting spontaneous failures of the Bluetooth infrastructure, which is increasingly recognized as an enabler for mobile systems. Information sources for failure data are presented, and preliminary experimental results are discussed.

1 Introduction

Recent advances in mobile computing hardware, such as laptop computers and handheld devices, as well as in wireless networking (UMTS, Bluetooth, and 802.11), deliver ever more complex mobile computing platforms, which today encompass a variety of systems, each characterized by specific kinds of mobile terminals and communication protocols. The widespread use of these mobile computing platforms is leading to a growing interest in dependability issues. Examples are health care [1] and aircraft maintenance systems [2]. Although several research studies have been conducted on the dependability of mobile environments, as discussed in section 2, none of them has attempted to identify system bottlenecks and to quantify dependability measures by providing information from field data and by classifying errors and failures. As stated in [3], "there is no better way to understand dependability characteristics of computer systems than by direct measurements and analysis". This paper represents a first step toward a failure data analysis of Bluetooth systems. The use of Bluetooth as a "last meter" access network, as Gerla et al. also proposed [4], represents an opportunistic and cost-effective way to improve the connection availability of the existing, widespread 802.11 networks. Indeed, many portable devices are now equipped with both Bluetooth and 802.11 wireless interfaces. Moreover, as the number of Access Points (APs) increases, Bluetooth behaves better than 802.11 in terms of bandwidth, delay, fairness, and energy efficiency, as previous results have already shown [5]. In order to conduct a measurement-based analysis, two fundamental issues must be addressed:

1. realizing a non-invasive, distributed, and automated infrastructure for monitoring and collecting errors and failures under diverse workload profiles;

2. quantifying dependability characteristics by extracting only significant data from the measurements; to this aim, filtering and coalescing algorithms have to be applied.

The paper presents a distributed automated infrastructure for monitoring and collecting spontaneous failures of Bluetooth piconets. Failure data are gathered from several information sources, such as Bluetooth users, system log files, and emulative applications. Gathering both user-level and system-level failure data helps to understand how user-observed failures are manifestations of system failures. Indeed, system failures can be interpreted as errors with respect to user-level failures. Emulative applications emulate the behavior of Bluetooth users in a random manner, using different workload profiles. The reasons for emulation are threefold: i) it increases the volume of data being collected; ii) emulated users continuously use Bluetooth appliances, allowing the time between failures to be measured; and iii) emulated users are more expert and credible than real ones. In order to understand whether emulation represents an effective way to assess Bluetooth dependability, the results from emulation can be compared with the ones obtained from real users and system logs. Our automated infrastructure has been implemented for Linux systems with BlueZ (the official Linux Bluetooth software stack), and has been applied over a Bluetooth-based testbed made of COTS (Commercial Off-The-Shelf) software and hardware components. The testbed includes both emulative devices and usable devices, used by real users, i.e., our undergraduate and graduate students. The paper also presents preliminary experimental results, in terms of failure classification and stochastic distributions for the Time Between Failures (TBF) and the Time To Recover (TTR). These parameters "provide an overall picture of system and help to identify dependability bottlenecks", as stated in [3].

The rest of the paper is organized as follows. Section 2 introduces some relevant concepts of Bluetooth networks and discusses previous relevant work in this field. Section 3 details our collection strategy. Section 4 describes our distributed automated architecture for collecting and analyzing field data. Section 5 reports the results we have obtained so far, while section 6 concludes the paper.

2 Background and Related Work

2.1 Bluetooth

Bluetooth [6] is a short-range wireless technology operating in the 2.4 GHz ISM band. Many devices such as notebook computers, phones, PDAs, home electric appliances, and other computing devices incorporate Bluetooth wireless technology. The Bluetooth system provides both point-to-point and point-to-multipoint wireless connections. Two or more units sharing the same channel form a piconet. One Bluetooth unit acts as the master of the piconet, whereas the other unit(s) acts as slave(s). Up to seven slaves can be active in the piconet. Multiple piconets with overlapping coverage areas form a scatternet.

Different applications may run on Bluetooth-enabled devices using different network and transport protocols, depending on their needs. Nevertheless, these protocols use a common set of data-link protocols, the Bluetooth core protocols, described in the following.

Baseband: this layer enables the physical RF link between Bluetooth units forming a piconet. It provides two different kinds of physical links, Synchronous Connection-Oriented (SCO) and Asynchronous Connectionless (ACL). ACL packets are used for data only, whereas SCO packets are used for audio as well. All packets can be protected with different levels of Forward Error Correction (FEC) or with a Cyclic Redundancy Check (CRC). Integrity checks and retransmissions are performed, providing a reliable data-link wireless connection.

Link Manager Protocol (LMP): the LMP is responsible for link set-up (connection establishment) between Bluetooth devices, including security aspects such as authentication and encryption. It also provides Bluetooth devices with the inquiry/scan procedure.

Logical Link Control and Adaptation Protocol (L2CAP): this layer provides connection-oriented and connectionless data services to upper layers, with protocol multiplexing capability, segmentation and reassembly operations, and group abstractions. It operates on ACL Baseband links only.

Service Discovery Protocol (SDP): discovery services are a crucial part of the Bluetooth framework. Using SDP, device information, services, and characteristics of services can be retrieved.

The Bluetooth specification also defines a Host Controller Interface (HCI), which provides an API to the baseband controller and the link manager, and access to hardware status and control registers. In this paper the focus is on the use of IP over Bluetooth, since Bluetooth aims to be a last-meter access network for the wireless Internet. The Bluetooth Special Interest Group defined the Personal Area Networking (PAN) profile, which provides support for common networking protocols such as IPv4 and IPv6. The PAN profile exploits the Bluetooth Network Encapsulation Protocol (BNEP) to encapsulate IP packets in L2CAP packets and to provide the Ethernet abstraction.
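To give a concrete feel for how an application traverses this stack, the following minimal sketch performs device inquiry, an SDP lookup, and an L2CAP connection using the PyBluez bindings over BlueZ. PyBluez, the placeholder address, and the PSM value are our own illustrative assumptions, not tools used in the paper.

```python
import bluetooth  # PyBluez bindings over the BlueZ stack (assumed available)

# Inquiry: discover nearby devices (Baseband/LMP inquiry procedure).
nearby = bluetooth.discover_devices(duration=8, lookup_names=True)
for addr, name in nearby:
    print(f"found {addr} ({name})")

# SDP: look up the services advertised by one device.
# "00:11:22:33:44:55" is a placeholder address.
services = bluetooth.find_service(address="00:11:22:33:44:55")
for svc in services:
    print(svc["name"], svc["protocol"], svc["port"])

# L2CAP: connect on a PSM (odd-numbered; 0x1001 is a hypothetical choice).
sock = bluetooth.BluetoothSocket(bluetooth.L2CAP)
sock.connect(("00:11:22:33:44:55", 0x1001))
sock.send(b"hello")
sock.close()
```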

2.2 Related Research

A great deal of research effort has been devoted to the measurement-based analysis of error data collected from real systems. An excellent survey of measurement-based analysis can be found in [3]. In particular, our focus is on data collection and processing, which consists of extracting information from field data and classifying errors and failures. Plenty of studies have focused on this topic, proposing techniques and methodologies for studying error logs collected from a variety of distributed systems. Examples are studies of networks of workstations [7], Windows NT operating systems [8], and, more recently, large-scale heterogeneous server environments [9]. On the other hand, a growing number of works study dependability issues of wireless and mobile infrastructures. A redundancy-based technique has been proposed and evaluated in [10] for improving the dependability of 802.11 networks. Similar work has been done in [11], where the focus is on AP failures. An interesting study on the reliability of ad-hoc wireless networks is presented in [12]. In [13] a dependability analysis of the GPRS network, based on modeling approaches, is presented. Finally, in [14], data collection and processing for a wireless telecommunication system have been addressed: the field failure data are collected from the core entities (base stations) of a cellular telephone system, and an analysis of failure and recovery rates is discussed.

3 Data Collection Methodology

3.1 Assumptions

The proposed field data collection strategy collects failure data from both Bluetooth users (human-reported) and system log files, in order to understand how system-related failures cause user-observable failures to manifest. The collected field failure data concern spontaneous failures, that is, failures manifested during normal system operation, without forcing stressful conditions or failure-prone behaviors. This helps in understanding and modeling the normal faulty behavior of a common Bluetooth application. The collected data concern failures activated in the communication drivers (LMP, L2CAP), in the SDP, in the API (HCI), and in the IP emulation layer (PAN profile/BNEP protocol). Hence, we are not concerned with failures in the firmware and hardware/communication layers. Finally, we assume the use of IP communication over Bluetooth (via the PAN profile). The focus is thus on piconets, since the PAN profile works only on piconets.

3.2 Field Data Sources

The field data have been collected using three kinds of information sources:
i) System log files: failure data concerning the protocols involved in the Bluetooth protocol stack (L2CAP, LMP, HCI, BNEP, SDP), which are stored in the Bluetooth devices' system log files, are collected. We call this kind of data system level data. These data are collected by a dedicated daemon application, the LogAnalyzer (see section 4.1).
ii) Users: a user is any Bluetooth user who is in charge of sending a failure report. We call this kind of data user level data. Since user level data might be unreliable (users are not experts and may forget to send a failure report), they are collected only as a support to interpret system level data.
iii) Emulative applications: user level and system level information is produced by our emulation software, the BlueTest (see section 4.2), an application that emulates the workload a real Bluetooth user may produce and stores information about the failures that the user would have seen. Hence it produces user level data. Furthermore, it uses the Bluetooth protocol stack, thus producing system level data.

3.3 Data Collection Process

Three collection processes are defined, according to the three kinds of sources considered. System level data are gathered from the system logs. In the Linux OS, the system logs are managed by the syslogd daemon. It can be configured to specify log file locations and severity levels, and applications can log information with respect to a specific severity level. Several severity levels are defined. For example, the severity level err means that the logged information relates to failure detection, alerts, critical conditions, and emergencies, whereas a file with severity level info also contains warnings and general information about application behavior. In particular, we configured syslogd to collect err data in a dedicated file (called ErrLog), whereas info data are collected in the common messages log file. Data from the ErrLog and messages files are then gathered by our daemon, the LogAnalyzer. User level data are manually sent by users. The process is straightforward and involves the user's authentication to the data repository and the compilation of a form through a web application. Users have to specify the time of the failure, the observed error message(s) and behavior, the performed recovery action, and the time of the recovery. The data gathering methodology for emulative applications is similar to the one used for system logs, except that data are also collected from the file produced by the emulative applications.
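To make the severity-based routing concrete, the following minimal sketch shows an assumed syslogd configuration fragment (as comments) and how a daemon would emit entries at the err and info levels through Python's standard syslog module; the file locations and messages are illustrative assumptions.

```python
import syslog

# Assumed /etc/syslog.conf fragment routing entries by severity:
#   *.err    /var/log/ErrLog      (failure-related entries only)
#   *.info   /var/log/messages    (warnings and general information)

syslog.openlog("bluetest", syslog.LOG_PID, syslog.LOG_DAEMON)

# An err-level entry ends up in ErrLog (and also in messages,
# since *.info matches err and all higher severities).
syslog.syslog(syslog.LOG_ERR, "bnep0: unable to create PAN connection")

# An info-level entry ends up in messages only.
syslog.syslog(syslog.LOG_INFO, "scan completed: 3 devices found")

syslog.closelog()
```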

4 Collection System Architecture

Figure 1 shows the overall high-level data collection system architecture. Both data producer nodes and a consumer node are shown. A Bluetooth Node (BN) is any appliance (laptop, PC, smart phone, PDA, tablet PC) equipped with Bluetooth facilities and running a LogAnalyzer, which gathers data from log files and sends them to the Log Server Node (LSN). Since the LogAnalyzer gathers data from log files, it does not interfere with the other applications running on the device. Furthermore, it sends data via the wired channel, in order to be non-invasive on the wireless channel. A User Workstation Node (UWN) is any appliance equipped with a web browser and an available access to the Internet. Through a UWN, a user can contact the Log Server and add a data item describing an occurred Bluetooth failure. A Client Simulation Node (CSN) is a device with a Bluetooth interface, a BlueTest client application, and a LogAnalyzer running on it. The BlueTest client application is in charge of emulating the use of the Bluetooth channel established with the Server Simulation Node (SSN). The channel is used according to a random behavior, allowing more workload profiles to be executed at the same time by several CSNs (see section 4.2.2). The SSN is implemented to run as a Bluetooth AP, and it runs a BlueTest server application that accepts connections from clients. The set of all CSNs and the SSN forms a Bluetooth piconet, whose master is the SSN. The BlueTest client saves information about Bluetooth failures and the relative recoveries in a private log file (the TestLog). The LogAnalyzer running on CSNs is responsible for extracting information from both the TestLog and system log files. Finally, the Log Server Node (LSN) is the central repository of the collected data.

Figure 1. Overall architecture (figure omitted: BNs, UWNs, CSNs, and the SSN, each with a Bluetooth and a wired network interface, reach the LSN through the Internet; the LSN hosts the user-log web server, the collection log server, and the DB server)

4.1 LogAnalyzer

The LogAnalyzer activity encompasses four phases: i) extract data from the logs, ii) filter the data, iii) coalesce them, and iv) send them to the Collection Log Server deployed on the LSN; a skeleton of this pipeline is sketched below. It performs its operations every hour by using the crond task scheduler of the operating system. It should be noted that, for devices with poor resources (e.g., PDAs, smart phones), steps ii and iii can also be performed by the Collection Log Server.
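The following minimal Python sketch shows the shape of such a daemon. All names here (the rotated log path, the collection server URL, and the helper functions) are our own illustrative assumptions, not the actual LogAnalyzer code; the filter and coalesce phases are filled in by the sketches in the next two subsections.

```python
import json
import urllib.request

LOG_PATH = "/var/log/ErrLog.1"              # assumed rotated log location
SERVER = "http://lsn.example.org/collect"   # hypothetical Collection Log Server URL

def extract(path):
    """Phase i: read raw entries from the rotated log file."""
    with open(path) as f:
        return [line.rstrip("\n") for line in f if line.strip()]

def filter_entries(entries):
    """Phase ii: placeholder for blacklist/whitelist/S-S filtering (section 4.1.2)."""
    return entries

def coalesce(entries):
    """Phase iii: placeholder for temporal tupling (section 4.1.3)."""
    return [entries]

def send(tuples):
    """Phase iv: ship the coalesced tuples to the Collection Log Server."""
    data = json.dumps(tuples).encode()
    req = urllib.request.Request(SERVER, data=data,
                                 headers={"Content-Type": "application/json"})
    urllib.request.urlopen(req)

if __name__ == "__main__":
    # Scheduled hourly, e.g. via a crontab entry such as:
    #   0 * * * *  /usr/local/bin/loganalyzer.py
    send(coalesce(filter_entries(extract(LOG_PATH))))
```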

4.1.1 Source Log Files and Data Extraction

The data collection system collects data from several sources using several LogAnalyzers. Each daemon operates on a different log file, which can be a system log file or a TestLog file. Since the LogAnalyzer gathers data from ErrLog, messages, and TestLog files, it operates on different kinds of log files. Furthermore, system log files may contain different entry formats from one machine to another. Nevertheless, the LogAnalyzer can easily be configured by describing the entry format in a configuration file. Moreover, the proposed filtering and coalescence solutions are general enough to be adopted in spite of log differences, as explained in the following sections. As far as data extraction is concerned, it is necessary to read entries from the log file without interfering with the other applications that are writing to it. For this reason, the Linux logrotate utility is used. When applied to a log file x, logrotate renames x as x.1, creates a new log file named x, moves the contents of x.1 to x.2, overwrites x.3 with x.2, and so forth. Hence, for instance, applications continue to use the ErrLog file, whereas the LogAnalyzer opens and reads the ErrLog.1 file.

4.1.2 Filtering

Filtering reduces the volume of data to be stored in the LSN database. Three filtering strategies are adopted: blacklist, whitelist, and Start/Stop (S/S). The blacklist is a list of all the words that have to be filtered out: a log entry containing any word belonging to the blacklist is rejected. On the contrary, the whitelist is the list of allowed words: only log entries containing words belonging to the whitelist are kept. Obviously, the two strategies are mutually exclusive. Finally, the S/S strategy collapses into two log entries all those entries related to the same event, for instance, entries related to the system bootstrap procedure. The S/S task is performed by the LogAnalyzer through a pattern matching algorithm. Log entries are collapsed according to S/S patterns, described by two sequences of entries: the start and the stop sequences. The start sequence is a sequence of entries that unambiguously identifies the beginning of the pattern; analogously, the stop sequence unambiguously identifies its end. The blacklist, the whitelist, and the S/S patterns are stored in a configuration file, so that the filtering strategies can be manually configured and tailored. Hence, different filtering strategies can be applied to different log files belonging to different nodes. For example, the info-level log file (messages) of a user's BN can be filtered with a whitelist (including only the entries related to Bluetooth), whereas the ErrLog of a user's BN can be filtered with a blacklist (excluding the information we are not interested in).
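A minimal sketch of the three strategies follows. The word lists are illustrative assumptions (the real LogAnalyzer reads them from its configuration file), and the S/S patterns are simplified to single-entry markers rather than full start and stop sequences.

```python
# Illustrative configuration; the real lists live in a configuration file.
BLACKLIST = {"cron", "sendmail"}            # reject entries containing these words
WHITELIST = {"bluetooth", "bnep", "hci"}    # keep only entries containing these words

def blacklist_filter(entries):
    return [e for e in entries
            if not any(word in e.lower() for word in BLACKLIST)]

def whitelist_filter(entries):
    return [e for e in entries
            if any(word in e.lower() for word in WHITELIST)]

def start_stop_collapse(entries, start, stop):
    """Collapse every run of entries between a start and a stop marker
    into just the two delimiting entries (e.g. a bootstrap sequence)."""
    out, inside = [], False
    for e in entries:
        if not inside and start in e:
            inside = True
            out.append(e)        # keep the start entry
        elif inside and stop in e:
            inside = False
            out.append(e)        # keep the stop entry
        elif not inside:
            out.append(e)        # entries inside the pattern are dropped
    return out
```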

4.1.3 Coalescence

The adopted coalescence algorithm is a temporal-based scheme, also called tupling [15]. It groups multiple log entries on a temporal basis into a unique logical entity, called a tuple. Each tuple contains a set of log entries whose time-stamps belong to the same temporal window. Those entries are hopefully related to the same failure event: indeed, when an error is activated, multiple log entries are written as a burst on the log file, within the same time interval. This kind of analysis is particularly useful for ErrLog files, in order to define failure classes by observing tuples instead of single entries. To be more precise, let X_i denote the i-th entry in the log, and t(X_i) the time-stamp of the entry X_i. The tupling algorithm follows the rule:

IF t(X_{i+1}) - t(X_i) < W THEN add X_{i+1} to the tuple ELSE create a new tuple

where W is a configurable time window. A sensitivity analysis has to be conducted in order to choose the size of W that minimizes truncations (entries related to the same error event grouped in more than one tuple) and collapses (entries related to different error events grouped in the same tuple).
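The rule above translates directly into code. The sketch below assumes each entry is already parsed into a (timestamp-in-seconds, message) pair; entry parsing is left out.

```python
def tuple_entries(entries, window):
    """Group (timestamp, message) pairs into tuples: a new tuple is opened
    whenever the gap to the previous entry is at least `window` seconds."""
    tuples = []
    for ts, msg in sorted(entries):
        if tuples and ts - tuples[-1][-1][0] < window:
            tuples[-1].append((ts, msg))   # t(X_{i+1}) - t(X_i) < W: same tuple
        else:
            tuples.append([(ts, msg)])     # otherwise: create a new tuple
    return tuples

# Example: with W = 60 s the entries below form two tuples,
# one for each burst.
entries = [(0, "hci0: timeout"), (12, "l2cap: conn failed"),
           (300, "bnep0: cannot locate module")]
print(tuple_entries(entries, window=60))
# -> [[(0, ...), (12, ...)], [(300, ...)]]
```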

4.2 BlueTest

The BlueTest (Bluetooth Testbed) is the testbed, along with its applications, that we use to emulate Bluetooth users and to gather more user-level and system-level data. Figure 2 shows the testbed architecture. Basically, it is a Bluetooth piconet with a master, running a BlueTest server application, and n slaves (up to 7) running BlueTest client applications. The master also acts as the AP of the wireless network. The testbed is heterogeneous in terms of devices: it is composed of PCs and laptops equipped with CSR chipset-based Bluetooth USB dongles and running Mandrake Linux 10 or Debian-based distributions, and of Compaq iPaq PDAs running the Linux Familiar 0.7.0 distribution. The devices' clocks are all synchronized by using a Network Time Service. In order to emulate different distances among devices, their antennas (and the PDAs themselves) are periodically moved around the laboratory, at distances from half a meter up to 10 meters from the AP.

Figure 2. BlueTest architecture (figure omitted: BlueTest clients linked over Bluetooth to the BlueTest server acting as AP, which reaches the Log Server Node over the fixed network)

4.2.1 Workload Description

The BlueTest client aims to emulate Bluetooth user operations, under the assumption that the PAN profile is used for IP applications. These operations are: i) scanning for Bluetooth devices, ii) SDP search for a Network AP (NAP), iii) AP connection, iv) role switching from master to slave, v) creation of the IP abstraction via BNEP, vi) sending and receiving IP packets, and vii) disconnection of the device from the AP. Failure messages, their time-stamps, and the distance between the device and the AP at the moment of the failure are all logged to the TestLog file. When a failure occurs, it is automatically recovered, in order to keep the testbed working without human intervention and to determine the coverage of the recovery actions. The recovery actions taken into consideration are those that can be performed by a real user: R1) Bluetooth connection destruction/creation; R2) Bluetooth stack reset; R3) application restart; and R4) system reboot. If a recovery action recovers from the failure, an entry containing the recovery action name and the associated time-stamp is logged to the TestLog file.

Figure 3 shows the workload statechart diagram. As the statechart points out, if no errors occur, the BlueTest client periodically evolves from the state scanning, through connection, use, and disconnection, to the state wait. The scanning state is optional, since different kinds of use can be defined, as explained in depth in the next section. If failures are activated, the client transits to the state recovery, in which recovery actions are undertaken. Different recovery attempts are tried, depending on the warning level (WL) variable. The WL is defined as the number of consecutive transitions from the states scanning, connection, use, or disconnection to the recovery state: if a cycle from scanning to wait produces no errors, WL is equal to 0; otherwise, each time the cycle is interrupted, WL is incremented. A WL value equal to n causes the Rn recovery action to be performed.

Figure 3. Workload statechart diagram (figure omitted: states scanning, connection, use, disconnection, wait, recovery; exceptions lead to recovery, which performs R[WL] and increments WL, rebooting when WL = max; WL is reset to 0 on exit from wait)
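The cycle and the WL escalation can be rendered as the following sketch; the operation and recovery functions are stubs standing in for the real BlueTest steps, and the wait duration is an assumption.

```python
import time

# Stubs for the workload states and the recovery actions R1-R4;
# each stub either returns normally or raises an exception on failure.
def scanning(): pass
def connection(): pass       # SDP search, AP connect, role switch, BNEP up
def use(): pass              # send/receive IP packets
def disconnection(): pass
def R1(): pass               # Bluetooth connection destruction/creation
def R2(): pass               # Bluetooth stack reset
def R3(): pass               # application restart
def R4(): pass               # system reboot

RECOVERIES = [R1, R2, R3, R4]

def run_cycles(log, do_scan=True):
    WL = 0
    while True:
        try:
            if do_scan:              # scanning is optional (section 4.2.2)
                scanning()
            connection(); use(); disconnection()
            WL = 0                   # cycle completed: reset the warning level
        except Exception as exc:
            log.write(f"{time.time()} failure: {exc}\n")
            WL = min(WL + 1, len(RECOVERIES))
            RECOVERIES[WL - 1]()     # WL = n triggers recovery action Rn
            log.write(f"{time.time()} recovery: R{WL}\n")
        time.sleep(10)               # wait state (duration assumed)
```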

4.2.2 Workload Profiles

The BlueTest workload can be configured according to several parameters, so as to emulate different channel utilization behaviors. The set of parameters characterizing the workload is defined as a workload profile. A different workload profile is randomly chosen each emulation cycle, by randomly selecting the values of its parameters. This allows us to emulate a different behavior at each emulation cycle. The parameters defining a workload profile are listed in the following (a sampling sketch follows the list):

S - Scan flag: the value of the flag determines whether the scanning step is executed, since the scanning operation is not always performed. The value of the flag is randomly chosen according to a Bernoulli distribution whose parameter is the scanning probability.

B - Baseband ACL packet type: since different applications can choose different ACL packets on the basis of their requirements, the packet type is also randomly chosen, according to a binomial distribution.

N - Number of IP packets to be sent/received: this parameter, along with the following two, models the volume of data being sent each emulation cycle. Its value is chosen by using a uniform distribution between a minimum and a maximum number of packets.

LS - Average length of sent packets: the length of each packet being sent is chosen according to a uniform distribution, whose expected value is randomly determined each cycle by using another uniform distribution.

LR - Average length of received packets: this parameter has the same meaning as the previous one, except that it relates to received packet size.

All the above parameters are determined by the BlueTest clients, except LR, which is determined by the BlueTest server. It should be noted that setting different values for these parameters means modeling different application behaviors. For instance, if (S, B, N, LS, LR) = (1, DH5, 100, 10, 1000), the BlueTest models an application that performs scanning, uses DH5 ACL packets, sends/receives few packets, and whose received packets are on average bigger than the sent ones; the resulting workload profile may model a short web browsing session. Another example may be (S, B, N, LS, LR) = (0, DM1, 10000, 100, 5), which may model the upload of a big file, without scanning.
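A sketch of the per-cycle profile sampling follows. The distribution bounds and the scan probability are illustrative assumptions, and the two-valued ACL packet-type choice stands in for the binomial selection mentioned above.

```python
import random

ACL_TYPES = ["DM1", "DH5"]   # illustrative subset of Baseband ACL packet types

def sample_profile(scan_prob=0.5):
    """Randomly draw a workload profile (S, B, N, LS, LR) for one cycle."""
    S = 1 if random.random() < scan_prob else 0   # Bernoulli scan flag
    B = random.choice(ACL_TYPES)                  # packet type (binomial in the paper)
    N = random.randint(10, 10000)                 # number of IP packets (uniform)
    LS = random.uniform(5, 1000)                  # mean sent-packet length (uniform)
    LR = random.uniform(5, 1000)                  # mean received-packet length
    return (S, B, N, LS, LR)

# Each emulation cycle draws a fresh profile, e.g.:
print(sample_profile())   # -> e.g. (1, 'DH5', 4821, 13.2, 964.8)
```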

5 Preliminary Experimental Results

The results discussed in this section are mostly user-level data from emulation, over an observation period of six months. Only one CSN and one SSN have been used so far, since these data come from a first testing period of our infrastructure on Linux-based systems with the BlueZ software stack. The claims we make are thus not to be considered definitive. Nevertheless, the results obtained so far can help us to improve our work. For example, it turned out that the collected system-level data were neither sufficient nor significant, since BlueZ does not write much useful failure-related information to the system log files. Hence, one of our next steps will be to modify the BlueZ source code so as to make it more verbose on the ErrLog. The collected data also demonstrate the effectiveness of our approach to field failure data collection for Bluetooth systems. Indeed, by analyzing them, we can provide an overall picture of the dependability of such systems. For example, table 1 reports the MTBF, the MTTR, and the availability with respect to all the user-level failures that have been recognized. The availability is evaluated as Av = 100 * (MTBF - MTTR) / MTBF; with the values in table 1, Av = 100 * (13102.00 - 50.807) / 13102.00 = 99.6%.

MTBF (sec)      MTTR (sec)      Availability
13102.00        50.807          99.6%

Table 1. MTBF, MTTR and availability

Error Message                         #failures      %      MTBF (sec)   Recovery
Bind Failed                              365        67.34       15707       R1
Unable to create L2CAP connection         84        15.50       44444       R4
Receive Timeout Expired                   36         6.64      742029       R1
Unable to create PAN connection           35         6.46      148689       R4
PAN create connection timeout             14         2.58       16112       R4
Switch role command failed                 7         1.29      976608       R4
Cannot send datagram                       1         0.18           -       R1

Table 2. User-level failure classification, occurrence, MTBF, and Recovery

The obtained MTBF shows how often a single Bluetooth device may fail with respect to Bluetooth failures: a failure occurs about every three and a half hours. Nonetheless, the device exhibits a high availability, due to the fact that automatic recovery actions are performed by the testbed workload. This result shows that, even if Bluetooth connections are failure prone, a highly available behavior may be obtained by using automated recoveries. A more detailed analysis is provided in table 2, where a classification of failure modes is proposed. For each identified class, the table shows the number of failures, their percentage, the MTBF, and the adopted recovery action. Hence, for each failure class, we identify the recovery action that has to be used to effectively recover from the failure. Our data also help to identify dependability bottlenecks. Figure 4 shows that most of the failures are manifestations of BNEP errors. This is due to the fact that the creation of a virtual IP interface via BNEP sometimes takes longer than expected. Hence, when an application (the BlueTest, in this case) tries to use such an interface (e.g., for binding), a system-level failure "cannot locate module bnep0" is observed in the system logs, where bnep0 is the virtual IP interface that is supposed to exist. It is thus clear why the "bind failed" failure in table 2 is so frequent. The interesting result is that the failure can be reproduced by using the command line interface provided by BlueZ for the PAN profile. This shows that the BlueTest is an effective way to discover dependability bottlenecks, such as the BNEP interface creation process. Besides, this also represents an example of the usefulness of system-level data. Further insights can be obtained by analyzing the system logs produced by our modified, verbose BlueZ stack.

Figure 4. System-level errors breakup

Another valuable result of our work is the opportunity to model the failures' temporal behavior. We tried to model the TBF of the "bind failed" failure, since it is the only one with enough occurrences so far. Figure 5a shows the TBF histogram and the frequency distribution we obtained by fitting the real data with a lognormal distribution. Figure 5b depicts the percentile diagram of the data against the lognormal percentiles, along with the lognormal parameters. Since the real data almost lie on the lognormal percentile line, we can claim that the "bind failed" TBF exhibits a lognormal behavior. The claim is also supported by table 3, where the results of three different goodness-of-fit tests are reported: the significance levels of the tests (p-values) are large enough to confirm that the lognormal distribution is a good model for the "bind failed" TBF behavior. The same kind of analysis can be done for other failure types; a fitting sketch is given after table 3.

Figure 5. Bind failure TBF: a) lognormal frequency distribution; b) lognormal percentile diagram

Test                     Result        p-value
Kolmogorov-Smirnov       D = 0.101     > 0.25
Cramer-von Mises         W2 = 0.583    > 0.25
Anderson-Darling         A2 = 3.585    > 0.25

Table 3. Goodness-of-fit tests
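As an indication of how such a fit can be reproduced, the following sketch, assuming the TBF samples are available as a plain array of seconds (the values below are placeholders, not our data), fits a lognormal distribution with scipy and runs a Kolmogorov-Smirnov test.

```python
import numpy as np
from scipy import stats

# "Bind failed" time-between-failures samples in seconds (placeholder data).
tbf = np.array([5400.0, 12800.0, 31000.0, 9100.0, 22000.0, 15707.0])

# Fit a lognormal distribution; fixing loc=0 gives the standard
# two-parameter lognormal (shape = sigma, scale = exp(mu)).
shape, loc, scale = stats.lognorm.fit(tbf, floc=0)
print(f"sigma = {shape:.3f}, mu = {np.log(scale):.3f}")

# Kolmogorov-Smirnov goodness-of-fit test against the fitted distribution.
D, p = stats.kstest(tbf, "lognorm", args=(shape, loc, scale))
print(f"KS: D = {D:.3f}, p-value = {p:.3f}")
```

Note that testing against parameters estimated from the same sample makes the plain KS p-value optimistic, which is why complementing it with other tests, as done in table 3, is good practice.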

Finally, table 4 summarizes the MTTR and the occurrence of the recovery actions used. An interesting result is that only the R1 and R4 recovery actions turned out to be effective. Since R1 recovers from the most frequent failures, it is the most frequently performed recovery, and it contributes to drastically reducing the overall MTTR, thus increasing the availability level.

Recovery Action                          #recovery      %      MTTR (sec)
Connection destruction/creation (R1)        402        74.2        11.7
System Reboot (R4)                          140        25.8       163.1

Table 4. MTTR and occurrence of effective Recovery Actions

6 Conclusions and Future Work

This work proposed an automated distributed architecture to collect Bluetooth field failure data. Both the data collection methodology and the details of the software architecture have been discussed. The proposed infrastructure has been implemented for the Linux OS; the obtained results are thus representative of Linux-based Bluetooth systems and, although still preliminary, they provide an overall picture of the faulty behavior of BlueZ Bluetooth applications, along with failure classification, modeling, and dependability bottleneck identification. We believe that the proposed infrastructure can also be used to collect data from other wireless communication systems, such as IEEE 802.11, IrDA, and RFID. Future work will be devoted to extending the testbed to other piconets, to a sensitivity analysis of our coalescence technique (in order to choose an optimal window size), and to making the data representative of a wide spectrum of systems. To this purpose, we are porting the infrastructure to different operating systems, such as MS Windows and some mobile phone OSs, e.g., Symbian OS. Furthermore, we are going to use other implementations of the Bluetooth software stack, such as the Linux OpenBT stack from Axis and the Widcomm Bluetooth stack for MS Windows.

References

[1] J. E. Bardram. Applications of context-aware computing in hospital work - examples and design principles. Proc. of the 19th ACM Symposium on Applied Computing (SAC 2004), March 2004.
[2] M. Lampe, M. Strassner, and E. Fleisch. A ubiquitous computing environment for aircraft maintenance. Proc. of the 19th ACM Symposium on Applied Computing (SAC 2004), March 2004.
[3] R. K. Iyer, Z. Kalbarczyk, and M. Kalyanakrishnam. Measurement-based analysis of networked system availability. In Performance Evaluation - Origins and Directions, Eds. G. Haring, Ch. Lindemann, M. Reiser, Lecture Notes in Computer Science, Springer Verlag, 1999.
[4] M. Gerla, P. Johansson, R. Kapoor, and F. Vatalaro. Bluetooth: "last meter" technology for nomadic wireless internetting. Proc. of the 12th Tyrrhenian Int. Workshop on Digital Communications, 2000.
[5] P. Johansson, R. Kapoor, M. Kazantzidis, and M. Gerla. Personal Area Networks: Bluetooth or IEEE 802.11? International Journal of Wireless Information Networks, Special Issue on Mobile Ad Hoc Networks, April 2002.
[6] Bluetooth SIG. Specification of the Bluetooth System - core and profiles v. 1.1, 2001.
[7] A. Thakur and R. K. Iyer. Analyze-NOW - an environment for collection and analysis of failures in a network of workstations. IEEE Transactions on Reliability, 45(4):560-570, 1996.
[8] J. Xu, Z. Kalbarczyk, and R. K. Iyer. Networked Windows NT system field data analysis. Proc. of the IEEE Pacific Rim International Symposium on Dependable Computing, December 1999.
[9] R. K. Sahoo, A. Sivasubramaniam, M. S. Squillante, and Y. Zhang. Failure data analysis of a large-scale heterogeneous server environment. Proc. of the 2004 International Conference on Dependable Systems and Networks (DSN'04), June 2004.
[10] D. Chen, S. Garg, C. Kintala, and K. S. Trivedi. Dependability enhancement for IEEE 802.11 with redundancy techniques. Proc. of the 2003 International Conference on Dependable Systems and Networks (DSN'03), June 2003.
[11] R. Gandhi. Tolerance to access-point failures in dependable wireless LAN. Proc. of the 9th Int. Workshop on Object-Oriented Real-Time Dependable Systems (WORDS'03), June 2003.
[12] S. Cabuk, N. Malhotra, L. Lin, S. Bagchi, and N. Shroff. Analysis and evaluation of topological and application characteristics of unreliable mobile wireless ad-hoc networks. Proc. of the 10th IEEE Pacific Rim International Symposium on Dependable Computing, 2004.
[13] S. Porcarelli, F. Di Giandomenico, A. Bondavalli, M. Barbera, and I. Mura. Service-level availability estimation of GPRS. IEEE Transactions on Mobile Computing, 2(3), July-September 2003.
[14] S. M. Matz, G. Votta, and M. Malkawi. Analysis of failure recovery rates in a wireless telecommunication system. Proc. of the 2002 International Conference on Dependable Systems and Networks (DSN'02), 2002.
[15] M. F. Buckley and D. P. Siewiorek. A comparative analysis of event tupling schemes. Proc. of the 26th IEEE International Symposium on Fault-Tolerant Computing (FTCS'96), June 1996.