Operating System Reliability from the Quality of Experience Viewpoint: An Exploratory Study

Rivalino Matias Jr., Geycy Dyany Oliveira
School of Computer Science, Federal University of Uberlandia, Uberlandia, Brazil
[email protected], [email protected]

Lucio Borges de Araujo
School of Mathematics, Federal University of Uberlandia, Uberlandia, Brazil
[email protected]

ABSTRACT
In this paper we present an exploratory study on operating system (OS) reliability. We focus on OS reliability from the user quality of experience perspective. Our approach considers not only OS Kernel failures, but also failures observed in OS distribution components, because the OS user experience is also affected in the latter case, regardless of the correct OS Kernel functioning. We adopt this approach under the assumption that ordinary computer users do not have sufficient technical skills to discriminate whether a failure in the OS operation is caused at Kernel level or not. Hence, we look for reliability measures that are closer to the users' perception of OS quality. We analyzed 2,634 OS failures from 106 real computers, and calculated several reliability metrics for the investigated operating systems.

Categories and Subject Descriptors
D.4.5 [Operating Systems]: Reliability – Fault-Tolerance. G.3 [Mathematics of Computing]: Probability and Statistics – reliability and life testing.

General Terms
Measurement and Reliability.

Keywords
Operating System, Software Reliability, and Quality of Experience.

1. INTRODUCTION
Software has become one of the most important tools of modern society. Nowadays, there is probably no other human-made material that is more omnipresent [11]; it is found in home appliances, telecommunications, automobiles, airplanes, science, education, business, and other domains. This ubiquity makes software part not only of regular systems, but also of highly critical systems. This high dependency leads to a scenario where software failures may vary from a simple inconvenience to risking human lives [9]. Therefore, reliability is a quality attribute that has become increasingly important as society depends more and more on software systems [4]. Formally, software reliability is defined as the probability of failure-free software operation for a given period of time, in a particular environment [1]. The size and complexity of software have grown dramatically over the past few decades, and this trend will certainly continue [11]. To deal with this challenging scenario, the discipline of software reliability engineering (SRE) [11] has continually evolved its body of theoretical and practical knowledge. Many research works have been developed in this area, considering different software systems. In this paper, we focus on operating system (OS) reliability. Achieving a highly dependable user-level application is not sufficient, since it runs on top of operating system software that should also guarantee the expected application reliability level. Thus, increasing OS reliability is a major requirement towards the reliability of computing systems as a whole. Modern operating systems have two main characteristics that make them unreliable: i) they are huge in terms of lines of code, and ii) they have very poor fault isolation [20]. We know from the literature (e.g., [10], [14], [22]) that software size is an important factor in software reliability, with failures increasing according to the software size [15]. Figure 1 illustrates this relationship.

[Figure 1. Relationship between software size and quality [15].]

In order to improve OS reliability, designers and programmers must understand the patterns of the main OS failures. It is therefore essential to collect, analyze, and characterize OS failure information from field data. A very important aspect to be considered during these activities is to clearly define the meaning of OS failure. To do so, it is essential to specify which software components are considered part of the operating system. From the OS literature, we observe two different views. The first view considers the OS as a synonym for the OS Kernel, and all other supplemental components (e.g., shell, window manager, boot manager, etc.) are considered parts of the OS distribution. In this view, OS failures are only those related to the OS Kernel; for example, device drivers and system calls malfunctioning. The second view considers the OS as not only the Kernel code, but also the supplemental parts that compose the OS distribution. For example, a failure in the "window manager" application should be counted as an OS failure, differently from the first view, which would not count it, since the "window manager" is a user-level process not running at Kernel level. In this work, we analyze OS failures according to the second view. The reason for our choice is that we are interested in investigating OS reliability from the users' perspective, i.e., based on their quality of experience (QoE). For example, in our approach a failure during a system update procedure, caused by a fault in the "update manager" application, is considered an OS failure. In this case, the user experience with respect to operating system reliability is certainly affected, regardless of the correct OS Kernel functioning at that time. We adopt this approach under the assumption that the ordinary computer user does not have sufficient technical skills to discriminate whether a failure is caused by the OS Kernel or by OS distribution components. For these users, in both cases the OS has failed, and this is the experience perception that reverberates on the OS quality reputation. Therefore, we look for reliability measures that are closer to the users' perception of the quality of the operating system.

This paper is organized as follows. Section 2 discusses related works. Section 3 explains the experimental plan adopted, describing the methods and materials used in our study. Section 4 discusses the obtained results. Finally, Section 5 presents our final remarks.

2. RELATED WORKS
Ganapathi and Patterson [5] analyzed OS field failure data collected from MS Windows machines in the EECS department at UC Berkeley. They used Microsoft's Corporate Error Reporting software to collect failure data related to user-level applications and the operating system. In [5], the definition of OS failure followed the first view described in Section 1, that is, an OS failure is considered only at Kernel level (e.g., bad drivers and faulty system-level routines). They concluded that OS failures are less prevalent than application failures. In [6], Ganapathi et al. analyzed 2,528 occurrences of Windows XP kernel crashes, collected from 617 volunteers through the BOINC project [2]. Similar to [5], they considered as OS failures only Kernel-level events. The authors concluded that poorly written device drivers contributed the most to the OS crashes in their dataset. The authors of [8] analyzed event logs from Windows NT servers. The study considered failure data from 70 Windows NT based mail servers, collected over six months. The results showed that, on average, the servers' uptime was 283.68 hours. Similar to [8], Xu et al. [21] investigated the reliability of Windows NT servers. They collected failure data from 503 servers over four months. They considered as OS failures all unexpected events leading to a system reboot/crash/halt. Based on the failure times, they calculated several reliability metrics. We highlight the MTBFs caused by hardware (92.74 hours), applications (31.52 hours), system configuration (13.61 hours), and maintenance (5.92 hours).

The main differences between our study and the above-cited works are summarized as follows. First, we focus on OS reliability from the user perspective, considering not only Kernel failures but also failures of OS distribution components. In [5] and [6], the OS failures analyzed are only Kernel related. Refs. [8] and [21] focused on system availability, investigating failures that lead to system downtime; failures of the OS Kernel and OS distribution components that did not cause a system downtime were not evaluated. In summary, we conclude that [5] and [6] analyzed OS failures from the OS architects' or programmers' viewpoint, and [8] and [21] from the OS administrator's perspective. Second, the majority of the cited works restricted their dataset analysis to ordinary descriptive statistics. In addition to descriptive analysis, we conduct a more sophisticated statistical evaluation of the time-to-failure data sets.

3. EXPERIMENTAL STUDY
In this section, we present the main aspects of our experimental plan. We describe the data sets and the procedures adopted to collect them. Next, we explain the statistical techniques used to analyze the failure data.

3.1 Material
In this study, we focus on desktop operating systems from the Microsoft Windows family, specifically on Windows 7 (Win7). We chose this operating system due to two main aspects. First, we found that Win7 currently has a very large installed base in different production environments (e.g., commercial, industrial, academic). Second, it provides detailed failure registries as part of its Reliability Analysis Component (RAC) [12], which supplies data about different reliability-related events to the Reliability Monitor (RM) application [13]. The RAC and RM are automatically enabled during the Win7 installation. For collecting and handling the data provided by RAC, we created specific instrumentation, described next.

First, we set up a data repository to store the RAC files collected from real production machines. This repository receives the collected files through a webpage (http://hpdcs.facom.ufu.br/dcs-team/index.php) created for this purpose, where before uploading a file it is required to fill out an online form that helps to characterize the system assigned to the uploaded dataset. Essentially, this form requires information regarding the machine location and usage/application profiles. Alternatively, it is also possible to copy the RAC files manually from their directory (e.g., C:\ProgramData\Microsoft\RAC\PublishedData) to a temporary storage (e.g., USB flash disk) and then upload them later. Sometimes this manual procedure was necessary in order to overcome connectivity problems or security filters that prevented our access to the online form.

Second, we developed programs/scripts that read data from RAC files and process them under different filtering options; for example, selecting all event records related to unsuccessful program installations. In this study, we focus on records related to OS Kernel and OS distribution component failures. In order to define the filtering rules for these categories of OS failures, we had to perform a prior manual classification of the most frequent RAC events observed in our data sets. Specifically, we analyzed the following RAC record fields: "Source Name", "Event Identifier", and "Product Name". The first identifies the category of the OS failure event source (i.e., OS application, OS service, Device driver, or Kernel subsystem). The second is a number (ID) that uniquely identifies the type of Windows event related to the failure. The third indicates the product name associated with a given event, if available. Table 1 shows an excerpt of this classification on the collected data (a minimal sketch of this classification step is shown after the table). We store the filtered data into a MySQL database, so we can easily extract the information we need and process it using different statistical techniques (see Section 3.2).

Table 1. Example of OS Failures
Source Name                                   Category
MsInstaller                                   OS Service
Application Error (explorer.exe)              OS Application
Application Hang (explorer.exe)               OS Application
Microsoft Windows UserPnp                     Device Drivers
Event Log                                     OS Service
Microsoft Windows StartupRepair               OS Service
Microsoft Windows WER SystemErrorReporting    OS Service
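The exact on-disk format of the RAC files is not detailed here, so the following is only a minimal sketch of the classification step, assuming the relevant record fields have already been exported to CSV; the category mapping reproduces just the excerpt in Table 1, and the file name is hypothetical.

```python
import csv
from collections import Counter

# Assumed classification rules, derived from the manual analysis excerpt in Table 1.
SOURCE_CATEGORY = {
    "MsInstaller": "OS Service",
    "Application Error (explorer.exe)": "OS Application",
    "Application Hang (explorer.exe)": "OS Application",
    "Microsoft Windows UserPnp": "Device Drivers",
    "Event Log": "OS Service",
    "Microsoft Windows StartupRepair": "OS Service",
    "Microsoft Windows WER SystemErrorReporting": "OS Service",
}

def classify_events(csv_path):
    """Count OS failure events per category for one machine's exported RAC records."""
    counts = Counter()
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):                     # expects a "Source Name" column
            category = SOURCE_CATEGORY.get(row["Source Name"])
            if category is not None:                      # keep only classified records
                counts[category] += 1
    return counts

# Hypothetical usage: print(classify_events("rac_machine01.csv"))
```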

In this study, we organized the collected data sets in two groups (G1 and G2). The G1 dataset is composed of failure records collected from 50 machines, located at the same academic environment. Differently, the G2 dataset is composed of 56 machines located in distinct corporate environments. Table 2 summarizes the main G1 dataset characteristics.

Table 2. General characterization of G1 (summary)
Country                     Brazil
Language                    Portuguese
Workplace                   Academic
Usage Profile               Desktop
Operating System Version    Windows 7
Application Profile         Office Applications, Software & Web Development, Graphic Editing and Multimedia Applications

Differently, the G2 dataset represents multiple corporate environments, summarized in Figure 2.

[Figure 2. General characterization of G2.]

3.2 Methods
In this subsection, we present the statistical techniques used to conduct the reliability analysis of the sampled operating systems. In addition to regular descriptive statistics, we used the following techniques.

3.2.1 Goodness of Fit (GoF) tests
We applied GoF tests to evaluate the hypothesis that a sample of failure times follows a given probability distribution [16]. These tests measure how well a random sample is modeled by different theoretical probability distribution functions, allowing us to select the distribution function that best fits the sample. We used two widely adopted tests for this purpose: the Anderson-Darling (A-D) and Kolmogorov-Smirnov (K-S) tests.

The A-D test is a quadratic test, given that it is based upon a weighted square of the vertical distance between the empirical and fitted cumulative distribution function (cdf) [3]. Equation 1 describes the A-D statistic, A².

A^2 = -n - \frac{1}{n} \sum_{i=1}^{n} (2i-1) \left[ \ln F(x_i) + \ln\left(1 - F(x_{n+1-i})\right) \right]    (1)

where n is the sample size, F(x) is the cumulative distribution function for the specified distribution, and x_i is the i-th sample observation for the data arranged in ascending order.

The K-S test compares a stepwise empirical cdf with the hypothesized cdf. The maximum difference (D_n) between the estimated cumulative probabilities, for the two cdf's, is calculated by Equation 2 [3].

D_n = \max_x \left| F(x) - S_n(x) \right|    (2)

where F(x) is the cumulative distribution function, and S_n(x) is the empirical distribution function for n observations.
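As an illustration only, both statistics can be computed directly from Equations 1 and 2 for any fitted cdf. The sketch below assumes a 2P-Weibull hypothesis fitted with scipy; the sample is synthetic and purely illustrative.

```python
import numpy as np
from scipy import stats

def anderson_darling(sample, cdf):
    """A^2 statistic (Equation 1) for an arbitrary fitted cdf."""
    x = np.sort(np.asarray(sample, dtype=float))
    n = len(x)
    i = np.arange(1, n + 1)
    F = cdf(x)
    # Weighted squared distance between the empirical and fitted cdf.
    return -n - np.sum((2 * i - 1) * (np.log(F) + np.log(1 - F[::-1]))) / n

def kolmogorov_smirnov(sample, cdf):
    """D_n statistic (Equation 2): largest gap between empirical and fitted cdf."""
    x = np.sort(np.asarray(sample, dtype=float))
    n = len(x)
    F = cdf(x)
    ecdf_hi = np.arange(1, n + 1) / n    # S_n(x) just after each observation
    ecdf_lo = np.arange(0, n) / n        # S_n(x) just before each observation
    return max(np.max(np.abs(F - ecdf_hi)), np.max(np.abs(F - ecdf_lo)))

# Illustrative usage with synthetic times to failure and a fitted 2P-Weibull.
times = stats.weibull_min.rvs(c=1.2, scale=300, size=50, random_state=1)
c, loc, scale = stats.weibull_min.fit(times, floc=0)
fitted_cdf = lambda t: stats.weibull_min.cdf(t, c, loc=loc, scale=scale)
print(anderson_darling(times, fitted_cdf), kolmogorov_smirnov(times, fitted_cdf))
```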

3.2.2 Zero Inflated Data
Based on a preliminary evaluation of the collected data sets, we detected a high number of failure times equal to zero. This data set characteristic is known as zero-inflated data and requires the use of specific statistical techniques; it prevents us from using ordinary methods commonly applied to lifetime analysis. Zero-inflated data is commonly found in areas such as meteorology, ecology, biology, epidemiology, social sciences, and others. To the best of our knowledge, this is the first research work that detects zero-inflated data in an operating system reliability study. Zero-inflated data cause large sample heterogeneity and overdispersion, significantly affecting the data analysis. Ref. [7] discusses this problem and presents some approaches to deal with zero-inflated data adequately. In our work, we evaluated two approaches, Mixture Distributions [18], [19] and Percent Non-Zero (PNZ) [17], and decided to use the latter due to its good results and lower computational cost. PNZ is the percentage of the population with non-zero failure times, given by Equation 3.

PNZ = 1 - \frac{n_f(0)}{n - s(0)}    (3)

where n_f(0) is the number of failures at time zero, n is the sample size, and s(0) is the number of suspensions at time 0. Suspension means a right-censored observation. As can be seen, the suspensions at time 0 are ignored in (3). First, we estimate the reliability model without the zeros, R(t). Next, we compute the reliability model for the complete dataset by multiplying the estimated model by the proportion of non-zero data, i.e., the PNZ. Equation 4 represents this procedure.

R'(t) = PNZ \cdot R(t)    (4)

where R'(t) is the reliability function for the complete data set, and R(t) is the reliability function for the non-zero data subset.
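The sketch below illustrates Equations 3 and 4 under the simplifying assumptions that the sample contains only failure times (time-zero suspensions, if any, are passed as a separate count) and that the non-zero subset is modeled with a 2P-Weibull; all parameter values are illustrative.

```python
import numpy as np
from scipy import stats

def pnz(failure_times, suspensions_at_zero=0):
    """Equation (3): fraction of the population with non-zero failure times.
    Assumes the only suspensions are the ones counted at time zero."""
    t = np.asarray(failure_times, dtype=float)
    n = len(t) + suspensions_at_zero           # total sample size under this assumption
    nf0 = np.count_nonzero(t == 0)             # failures recorded at time zero
    return 1.0 - nf0 / (n - suspensions_at_zero)

def adjusted_reliability(failure_times, t_eval, suspensions_at_zero=0):
    """Equation (4): R'(t) = PNZ * R(t), with R(t) fitted on the non-zero data only."""
    t = np.asarray(failure_times, dtype=float)
    nonzero = t[t > 0]
    c, loc, scale = stats.weibull_min.fit(nonzero, floc=0)       # assumed 2P-Weibull model
    r = stats.weibull_min.sf(t_eval, c, loc=loc, scale=scale)    # R(t) without the zeros
    return pnz(failure_times, suspensions_at_zero) * r

# Example: a sample with two zero-time ("out-of-the-box") failures.
sample = np.concatenate([[0.0, 0.0],
                         stats.weibull_min.rvs(c=1.1, scale=200, size=48, random_state=2)])
print(adjusted_reliability(sample, t_eval=100.0))
```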

4. RESULTS

4.1 Goodness of Fit Test
As described in Section 3.2.1, we first evaluated the adequacy of different density functions to our data sets, in order to select the best-fit model to calculate the OS reliability metrics of interest. We tested the following models: 1-Parameter Exponential, 2-Parameter Exponential, Normal, Lognormal, 2-Parameter Weibull, 3-Parameter Weibull, Gamma, G-Gamma, 3-Parameter Gamma, Logistic, Loglogistic, 3-Parameter Loglogistic, Gumbel, Smallest Extreme Value, and Largest Extreme Value. These are well-known models widely used in reliability engineering studies. Firstly, we tested all models against the individual samples of times to failure obtained from each machine of G1 and G2, respectively. Next, we identified the percentage of samples each model fitted, as shown in Tables 3 and 4.

Table 3. Percentage of model fitting (G1 dataset)
Model             Fitting Percentage
G-Gamma           62%
2P-Weibull        6%
Exponential       6%
2P-Exponential    2%
3P-Weibull        2%
Lognormal         2%
No failure        2%

According to the A-D and K-S test results for G1, we observe that the G-Gamma distribution presents the best fit for 62% of all systems in this group. Note that 2% (one machine) did not present failures and 18% (9 machines) presented fewer than 10 failures, which is not sufficient to apply the A-D and K-S tests; these tests do not produce reliable results for sample sizes of fewer than 10 observations. The models not listed in Table 3 did not fit any sample in G1.

The same procedure was applied to G2. We verified that G-Gamma also showed the best results for individual samples in this group, fitting 38% of all samples. Note that 20% of the G2 samples were not evaluated because they had fewer than 10 failure records. The models omitted in Table 4 did not fit any failure sample (a sketch of this per-sample selection appears after Table 4).

Table 4. Percentage of model fitting (G2 dataset)
Model             Fitting Percentage
G-Gamma           38%
3P-Weibull        18%
2P-Weibull        11%
Lognormal         5%
Loglogistic       4%
1P-Exponential    2%
Logistic          2%
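A hedged sketch of this per-machine selection step: each candidate family is fitted by maximum likelihood and ranked by its K-S statistic (an analogous ranking could be built with the A-D statistic). The candidate list maps a subset of the models above to their scipy counterparts; it assumes non-zero failure times, and samples with fewer than 10 observations would be skipped, as noted above.

```python
from scipy import stats

# Subset of the candidate families named in the text, mapped to scipy distributions.
CANDIDATES = {
    "1P-Exponential": stats.expon,
    "Lognormal": stats.lognorm,
    "2P-Weibull": stats.weibull_min,
    "Gamma": stats.gamma,
    "G-Gamma": stats.gengamma,
    "Loglogistic": stats.fisk,
    "Logistic": stats.logistic,
    "Normal": stats.norm,
}

def rank_models(times):
    """Return (model name, K-S statistic) pairs sorted from best to worst fit."""
    results = []
    for name, dist in CANDIDATES.items():
        params = dist.fit(times)                               # maximum likelihood fit
        ks = stats.kstest(times, dist.cdf, args=params).statistic
        results.append((name, ks))
    return sorted(results, key=lambda r: r[1])
```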

The PNZ values calculated for both groups are listed in Table 5.

Table 5. PNZ for G1 and G2
        G1          G2
PNZ     0.981050    0.987763

Furthermore, we conducted GoF tests for each group as a whole. We clustered all individual samples of the same group and evaluated the goodness of fit. Tables 6 and 7 present the results in terms of the ranking of best fit. In general, Gamma and Weibull showed the best results for both individual and grouped analyses.

Table 6. Ranking of model fitting for G1
Model             Ranking
Gamma             1
2P-Weibull        2
Lognormal         3
Loglogistic       4
Logistic          5
1P-Exponential    6
Normal            7
Gumbel            8

Table 7. Ranking of model fitting for G2
Model             Ranking
2P-Weibull        1
Gamma             2
Lognormal         3
Loglogistic       4
Logistic          5
1P-Exponential    6
Normal            6
Gumbel            7

4.2 Reliability Metrics
Based on the best-fit model found for each group (see previous section), we calculated several metrics for G1 and G2: reliability, probability of failure, warranty time, Bx% life, mean life, and failure rate. The first and second metrics provide the reliability and unreliability of the system, i.e., R(t) and 1-R(t), respectively. The third metric indicates the time, t, for a given demonstrated reliability (e.g., for R=0.85, t ≤ 3 hours). The fourth metric shows the time at which x% of the sampled systems will have failed; for example, a B5% life of 70 hours means that five percent of the operating systems under study will have failed by 70 hours of operation. The fifth metric is the average time to failure, MTTF. The sixth metric is the instantaneous failure rate at a given time. Table 8 presents the estimated reliability metrics for G1 and G2 (a small computational sketch follows the table). For the warranty time we used R=0.90. The B10% life of G1 occurs at 0.0053 hour, against 0.0273 hour for G2. The MTTF is higher in G1 than in G2, and the failure rate at t=100 hours in G2 (0.56%) is roughly double that in G1 (0.27%).

Table 8. Reliability Metrics (time in hours)
                             G1          G2
Number of failures           686         1948
Average per machine          239.75      133.45
Standard deviation           333.05      368.76
Minimum (time to failure)    0           0
Maximum (time to failure)    2491.98     7534.68
Mode (time to failure)       0.0041      0
Warranty time                0.0589      0.1085
B10% life                    0.0053      0.0273
MTTF                         411.48      128.22
Failure rate                 0.0027      0.0056
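For illustration, the sketch below shows how such metrics follow from a fitted model; the 2P-Weibull parameters are assumed, not the paper's estimates, and for an unadjusted model the warranty time at R = 0.90 coincides with the B10% life by definition.

```python
from scipy import stats

shape, scale = 0.9, 250.0                       # assumed Weibull parameters (illustrative)
dist = stats.weibull_min(shape, loc=0, scale=scale)

reliability_at_100h = dist.sf(100.0)            # R(t): probability of surviving past t
failure_prob_at_100h = dist.cdf(100.0)          # unreliability, 1 - R(t)
warranty_time = dist.ppf(1 - 0.90)              # t with demonstrated reliability R = 0.90
b10_life = dist.ppf(0.10)                       # time by which 10% of systems have failed
mttf = dist.mean()                              # mean life
failure_rate_at_100h = dist.pdf(100.0) / dist.sf(100.0)    # instantaneous hazard rate h(t)

print(warranty_time, b10_life, mttf, failure_rate_at_100h)
```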

As can be seen, G2 presents a higher number of failures than G1. Our explanation for this finding is that the operating systems of G1 are managed in a more systematic manner. We know that the maintenance activities (e.g., software updates) of all G1's operating systems are executed uniformly, following the same procedures, and performed by the same technical personnel. Differently, the operating systems of G2 are not under the same maintenance rules. Figure 3 shows the comparison of reliability for both groups. For the first 4,000 hours (or 167 days), G1's systems demonstrate higher reliability than G2's systems. Figures 4 to 7 present the Reliability and Unreliability (i.e., Failure Probability) curves for G1 and G2. In these figures, the straight-line curves represent the theoretical model fitted to the observed data (dotted curves). Finally, we calculated the percentage of failures for each OS failure source considered in this study, which are listed in Table 9. A small plotting sketch for a Figure 3 style comparison is shown after the figure captions below.

[Figure 3. G1 and G2 reliability curves (Reliability, R(t) vs. time in hours).]
[Figure 4. Gamma reliability model (G1).]
[Figure 5. Gamma unreliability model (G1).]
[Figure 6. 2-P Weibull reliability model (G2).]
[Figure 7. 2-P Weibull unreliability model (G2).]
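A small sketch of the Figure 3 style comparison, plotting the PNZ-adjusted reliability of the two groups over time. The PNZ values come from Table 5, but the Gamma and Weibull shape/scale parameters below are placeholders, since the fitted parameter values are not reported in the text.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

t = np.linspace(1, 8000, 500)                                  # hours
# Assumed shapes/scales for illustration only; the paper fits Gamma for G1
# and 2P-Weibull for G2 (Tables 6 and 7).
r_g1 = 0.981050 * stats.gamma.sf(t, a=0.5, scale=820)          # R'(t) for G1
r_g2 = 0.987763 * stats.weibull_min.sf(t, c=0.45, scale=60)    # R'(t) for G2

plt.plot(t, r_g1, label="G1")
plt.plot(t, r_g2, label="G2")
plt.xlabel("hours")
plt.ylabel("Reliability, R(t)")
plt.legend()
plt.savefig("reliability_comparison.png")
```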

Table 9. Percentages of OS Failures per Category
OS Failure Category    G1 (%)    G2 (%)
OS Service             78.28     77.72
OS Application         14.87     22.12
OS Kernel              6.85      0.15

5. CONCLUSION
In this work, we presented an exploratory study on operating system reliability. We used real field data to estimate reliability metrics, considering 2,634 OS failures from different selected sources (OS services, OS applications, and OS Kernel). Based on the analyses of the two data sets, we conclude that G1 presents a higher MTTF than G2 due to the systematic OS maintenance procedures adopted in its production environment, especially those related to system updates. As can be observed in Table 9, G2 presents a higher number of failures in OS applications. This failure category has a significant impact on the OS user experience, since such failures occur more frequently than the other considered OS failure types. Thus, although G1 presents a higher number of OS Kernel failures, these have a lower impact on the general user experience, since they occur less frequently than the other OS failure categories.

6. ACKNOWLEDGMENTS
This work was supported partially by FAPEMIG, CAPES and CNPq.

7. REFERENCES
[1] ANSI/IEEE Standard Glossary of Software Engineering Terminology. 1991.
[2] BOINC, Open-source software for volunteer computing and grid computing. http://www.boinc.berkeley.edu/index.php.
[3] Cullen, C. and Frey, H. C. Probabilistic Techniques in Exposure Assessment: A Handbook for Dealing with Variability and Uncertainty in Models and Inputs. Springer, 1999.
[4] Fuggetta, A. Software Process: A Roadmap. In Proceedings of the Conference on the Future of Software Engineering (Limerick, Ireland, Jun. 04-11, 2000). 25-34.
[5] Ganapathi, A. and Patterson, D. Crash Data Collection: A Windows Case Study. In International Conference on Dependable Systems and Networks (Yokohama, Japan, Jun. 28-Jul. 1, 2005). 280-285.
[6] Ganapathi, A., Ganapathi, V. and Patterson, D. Windows XP Kernel Crash Analysis. In Large Installation System Administration Conference (Washington, DC, Dec. 3-8, 2006). 149-159.
[7] Hinde, J. and Demetrio, C. G. B. Overdispersion: Models and Estimation. Computational Statistics & Data Analysis, 27 (Apr. 1998), 151-170.
[8] Kalyanakrishnam, M., Kalbarczyk, Z. and Iyer, R. Failure Data Analysis of a LAN of Windows NT Based Computers. In Proceedings of the 18th Symposium on Reliable Distributed Systems (Lausanne, Switzerland, Oct. 19-22, 1999). 178-187.
[9] Leveson, N. G. and Turner, C. S. An Investigation of the Therac-25 Accidents. IEEE Computer (July 1993). 18-41.
[10] Lipow, M. Number of Faults per Line of Code. IEEE Transactions on Software Engineering, 8, 4 (July 1982), 437-439.
[11] Lyu, M. R. Software Reliability Engineering: A Roadmap. In Future of Software Engineering (Hong Kong, May 23-25, 2007). 153-170.
[12] Microsoft, Reliability Analysis Component. http://technet.microsoft.com/en-us/library/cc774636(v=ws.10).aspx.
[13] Microsoft, Using Reliability Monitor. http://technet.microsoft.com/en-us/library/cc722107(v=ws.10).aspx.
[14] Murphy, B. Automating Software Failure Reporting. Queue (System Failures issue), 2 (Nov. 2004), 42-48.
[15] Murphy, B. Reliability Estimates for the Windows Operating System. Microsoft Research Cambridge, 2008. http://www.dcl.hpi.uni-potsdam.de/meetings/mshpsummit//slides/brendan.murphy.pdf.
[16] Rayner, J. C. W., Thas, O. and Best, D. J. Smooth Tests of Goodness of Fit: Using R, 2nd edition. John Wiley & Sons, 2009.
[17] ReliaSoft, Life Data Analysis with Zero-Time (Out-of-The-Box) Failures. Reliability HotWire. http://www.weibull.com/hotwire/issue83/hottopics83.htm.
[18] Ridout, M., Demetrio, C. G. B. and Hinde, J. Models for count data with many zeros. In Proceedings of the XIXth International Biometric Conference, Invited Papers (1998). 179-192.
[19] Slymen, D. J., Ayala, G. X., Arredondo, E. M. and Elder, J. P. A demonstration of modeling count data with an application to physical activity. Epidemiologic Perspectives & Innovations, 3 (2006), 1-9.
[20] Tanenbaum, A., Herder, J. N. and Bos, H. Can We Make Operating Systems Reliable and Secure? IEEE Computer, 39, 5 (May 2006), 44-51.
[21] Xu, J., Kalbarczyk, Z. and Iyer, R. K. Networked Windows NT System Field Failure Data Analysis. In Pacific Rim International Symposium on Dependable Computing (Hong Kong, 1999). 178-185.
[22] Yuan, D. and Zhang, C. Evaluation Strategy for Software Reliability. Electronics, Communications and Control (Sep. 9-11, 2011), 3738-3741.