A Method for Evaluating the Impact of Software Configuration Parameters on E-Commerce Sites

Monchai Sopitkamol, Kasetsart University, Dept. of Computer Engineering, Bangkok, Thailand, [email protected]
Daniel A. Menascé, George Mason University, Dept. of Computer Science, MS 4A5, 4400 University Drive, Fairfax, Virginia, [email protected]
ABSTRACT
E-commerce systems are composed of many components with several configurable software parameters which, if properly set, can optimize system performance. The cost and burden of implementing upgrades can be deferred by judicious tuning of these parameters. Tuning, however, is time consuming, so effort should be concentrated on the most relevant parameters. This paper provides a method to evaluate the statistical significance of configurable parameters in e-commerce systems and the interaction effects between each parameter and e-commerce function types, as well as a method to rank the key configurable parameters that significantly impact overall system performance and the performance of the most important Web function types. Both on-line and off-line parameters at each of the e-commerce system layers (Web server, application server, and database server) are considered. The paper discusses the design of a practical, ad-hoc approach that involves conducting experiments on an e-commerce site compliant with the TPC-W benchmark. The performance metrics of interest include response time, system throughput, and the probability of rejecting a customer's request. The paper also provides a set of useful guidelines for tuning the performance of e-commerce sites.
1. INTRODUCTION
Typical e-commerce sites are complex, consisting of hundreds of machines with a large number of software configuration parameters that may take many different values. The number of possible combinations of the values of these parameters can be extremely large. The performance of e-commerce sites depends significantly on the proper setting of these parameters. In addition, interaction effects between parameters and the types of requests made to the system also impact the site's performance. Therefore, it is imperative that one be able to determine the sensitivity of an
e-commerce site's performance to the various configuration parameters. Given that it is not feasible to test all possible combinations of configuration parameters, we provide a practical, ad-hoc experimental methodology based on statistical techniques to 1) identify key configuration parameters that have a strong impact on performance, 2) rank the parameters in order of their relevance to performance, according to criteria designed in this work, and 3) provide an interaction analysis between the various parameters and the types of requests submitted to the site. Our work draws on concepts and techniques from statistical design and analysis of experiments [6, 8] as well as on hypothesis testing [7]. We also evaluated the impact of these configuration parameters under light and heavy workload intensity levels. However, due to space limitations, we only present the results for the high workload intensity level. In the conclusion section, we discuss results for both workload intensity levels.

Most e-commerce sites are organized in a three-tiered architecture consisting of three logical layers of servers that support the various functionalities offered by these sites: the Web server, application server, and database server layers. Web servers act as the front end of an e-commerce site, where HTTP requests are received from the clients and processed according to the type of request or, if necessary, forwarded to the next tier, the application layer. At the middle tier, application servers implement the business logic by invoking server-side applications. Application servers may interact with the next tier, transaction and/or database servers, if they need transaction processing to be performed and/or need to access a database.

It is very difficult, if not impossible, to conduct experiments on a commercial e-commerce site. For that reason, we built an experimental e-commerce site that adheres to the Transaction Processing Performance Council's TPC-W benchmark [12, 14] for e-commerce. This approach gives us the opportunity to run a wide variety of controlled experiments. TPC-W emulates the activities of an on-line bookstore. An emulated browser (EB) simulates a client interacting with the on-line site by performing the different Web functions available to the browser. The primary performance metric reported by TPC-W is the number of Web interactions (i.e., full Web pages returned) processed per second (WIPS). Multiple Web interactions are used to simulate the activity of a retail store and each interaction is subject to a response time (i.e., Web Interaction Response Time (WIRT)
as defined by TPC-W) constraint. In addition to WIPS and WIRT, we also measured the probability of rejection of Web requests, defined as the fraction of submitted requests that were rejected by the site.

This paper is organized as follows. Section 2 presents the methodology for evaluating the significance of each factor and its interaction effects, if any, with e-commerce function types, as well as the methodology for ranking the factors. Section 3 presents the results obtained by applying these methodologies to an experimental testbed. Section 4 presents some practical guidelines for performance tuning of e-commerce systems. Section 5 provides some concluding remarks.
2. THE METHODOLOGY
The methodology for identifying factors that have significant impact on system performance and for ranking them according to their degree of impact consists of three major phases as shown in Fig. 1:
Figure 1: Phases of the Methodology. (The three phases are: Initialization (initialize the test bed); Data Collection (generate data for the ANOVA); and Hypothesis Testing and Ranking (perform a two-way ANOVA, drop factors whose hypotheses are accepted, perform a one-way ANOVA on the remaining factors, and produce the factor ranking results).)
Step 1 Initialization: This involves running experiments on all factors in order to determine the "best" initial level of each factor. This is necessary since it would be counterproductive to perform studies on an unoptimized system that may exhibit poor performance. The details of the initialization process are discussed in the following section.

Step 2 Data Collection: The initialization process yields a system that is reasonably well configured and exhibits reasonable performance under different levels of workload intensity. The goal of this phase is to determine which factors are statistically significant and to determine the interaction effects, if any, between each factor and the types of requests submitted to the system.

Step 3 Hypothesis Testing and Ranking: After determining the factors that are significant and any interaction effects, the next step is to rank the factors in order of their impact on the various performance metrics. The results obtained in the previous phase are used for ranking purposes.
2.1 Initialization

The rationale of the initialization phase is to repeatedly vary the level of each factor, driving the system with the same workload, until a level of that factor is found that gives the relatively best system performance for that round. This results in a set of levels for all factors that, in effect, delivers good system performance and provides a good starting point for the data collection to be performed in the following step.

An example helps to clarify the initialization process. Consider four factors (A, B, C, and D), each with two levels (i.e., (a1, a2), (b1, b2), (c1, c2), and (d1, d2)). The goal is to initialize all the factors by finding the level of each factor that yields the relatively best system performance, in terms of response time. An illustration of each step is shown in Figure 2 and is described below. As the figure shows, the initialization is divided into two passes, because a single initialization pass may not provide a sufficiently good set of factor levels.

Step 1 of the first pass involves marking the factors for which we either i) know the level that would provide the best individual layer performance (i.e., minimum response time), or ii) can intuitively guess their "best" level. These factors have their levels marked with "*" in Fig. 2. Let factors A and C be in this category and assume that their "best" levels are a1* and c2*, respectively. For the remaining factors, we cannot predict their performance impact under different levels. They are marked with a "?". Experiments are performed only on such factors to determine their temporary, initial "best" level.

Step 2 of pass 1 tries to determine the "best" level for factor B, while setting factor D to its default level, d1, and A and C to a1* and c2*, respectively. In this step, experiments are run against both levels of factor B, b1 and b2, one after the other. Assume that, at the end of the experiments, level b2 gives better performance (i.e., smaller response time) than b1. In step 3 of pass 1, factor B is set at its "best" level (b2), keeping the levels of factors A and C the same as in the previous step. We now run experiments on factor D against its levels d1 and d2. Assume that, after the experiments, level d2 yields better performance than d1. As a result, at the end of the first pass, we have a tuple (a1*, b2, c2*, d2) corresponding to factors A, B, C, and D, respectively, which constitutes an initial, temporarily good system configuration.

The result from the first pass is carried over to the second pass. In this pass, the same process from pass one is repeated, except that we first set the ?-marked factors (i.e., B and D) to the levels obtained at the end of pass one, instead of to their default values, and leave the *-marked factors (i.e., A and C) unchanged throughout this pass (i.e., at a1* and c2*, respectively). (Note that, in real situations, we may want to test our intuition about some of the *-marked factors by running experiments on them as well.) In step 2), for each of factors B and D, if the "best" level obtained from pass one (i.e., b2 and d2) is not the same as the default value of that factor (i.e., b1 and d1), experiments are performed against that factor. In this example, since the resulting value, b2, is not the same as the default value, b1, experiments need to be performed against factor B, as step 2) of pass 2 indicates in the figure.
The same situation also applies to factor D, and thus to step 3). At the end of the second pass (step 4), some of the factors marked with "?" might retain their level from the first pass and some might find a better level. In the example, factor B finds a better level, b1, while D remains at its best level, d2.
Figure 2: Example of the Initialization phase.
Factor levels: A = (a1, a2), B = (b1, b2), C = (c1, c2), D = (d1, d2)
Pass One: 1) (A = a1*, B = ?, C = c2*, D = ?); 2) (a1*, B = ?, c2*, D = d1); 3) (a1*, B = b2, c2*, D = ?); 4) (a1*, B = b2, c2*, D = d2)
Pass Two: 1) (a1*, B = b2, c2*, D = d2); 2) (a1*, B = ?, c2*, D = d2); 3) (a1*, B = b1, c2*, D = ?); 4) (a1*, B = b1, c2*, D = d2)

The complete algorithm for the initialization process just discussed considers two classes of factors: discrete factors (with a countable number of levels) and non-discrete factors (with an uncountable number of levels). Experiments are performed against all the levels of discrete factors. It is not possible to examine all levels of non-discrete factors; the binary search procedure shown in Fig. 3 is used to find initial, temporary "best" levels for these factors. In either case, the minimum value of the response time and the corresponding level are derived for each factor, and the factor is set to that level before the next iterations of experiments are executed for the remaining factors whose "best" levels are still unknown. Factors whose pass-one results are equal or close to their respective default values are not experimented with in pass two, because the default value is adequate for those factors.
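As a minimal illustration of the two-pass procedure just described, the following Python sketch iterates over the ?-marked factors, one pass at a time, keeping the *-marked factors fixed. The helper run_experiment(), the dictionaries of levels and defaults, and all other names are illustrative assumptions and not part of the paper's implementation.

    # Illustrative sketch of the two-pass initialization (assumed helper names).
    def initialize(factors, known_best, defaults, levels, run_experiment, passes=2):
        """factors: ordered list of factor names; known_best: *-marked factors and
        their guessed levels; defaults/levels: default level and candidate levels
        per factor; run_experiment(config) returns the measured mean WIRT."""
        config = dict(defaults)
        config.update(known_best)               # *-marked factors keep their level
        unknown = [f for f in factors if f not in known_best]   # ?-marked factors

        for _ in range(passes):
            for f in unknown:
                best_level, best_wirt = config[f], None
                for level in levels[f]:
                    trial = dict(config, **{f: level})   # vary one factor at a time
                    wirt = run_experiment(trial)
                    if best_wirt is None or wirt < best_wirt:
                        best_level, best_wirt = level, wirt
                config[f] = best_level          # carry the winner into the next step
        return config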
Figure 3: Binary search routine to determine the "best" level for non-discrete factors

Figure 4: Algorithm for the Data Collection Process
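The decision boxes of Fig. 3 suggest a routine that first sweeps the level of a non-discrete factor while the measured WIRT keeps improving, and then bisects between the best level found and its neighbor, stopping when a level repeats or the WIRT no longer improves. The sketch below is only a plausible reconstruction under those assumptions: the 10% improvement test comes from the figure's fragments, while run_experiment() and all other names are illustrative.

    # Plausible reconstruction of the Fig. 3 level search for a non-discrete
    # factor; run_experiment(level) is assumed to return the mean WIRT.
    def search_best_level(lo, hi, run_experiment, improve=0.10, max_iter=10):
        # Phase 1: sweep the level geometrically while WIRT keeps improving.
        best_level, best_wirt = lo, run_experiment(lo)
        prev_level, level = lo, lo
        while level < hi:
            prev_level, level = level, min(level * 2, hi)
            wirt = run_experiment(level)
            if wirt < best_wirt * (1 - improve):   # "new WIRT < min by at least 10%?"
                best_level, best_wirt = level, wirt
            else:
                break                              # stopped improving; bracket found

        # Phase 2: bisect between the best level and its neighboring level.
        neighbor = prev_level if best_level == level else level
        for _ in range(max_iter):
            mid = (best_level + neighbor) / 2.0
            if mid in (best_level, neighbor):      # "is the new level being repeated?"
                break
            wirt = run_experiment(mid)
            if wirt < best_wirt:
                best_level, best_wirt = mid, wirt  # midpoint is the new minimum
            else:
                neighbor = mid                     # midpoint is worse; shrink the bracket
        return best_level                          # level kept for this factor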
2.2 Data Collection

The goal of this phase is to generate input data for an ANOVA analysis [7] used to determine the statistical significance of all factors and to determine whether there are statistical interactions between the factors and the fourteen TPC-W Web interactions. Ideally, it would be best to run experiments and produce data points at every possible level of each factor. This is possible for discrete factors, i.e., factors with a countable number of levels. However, it is too expensive and time-consuming for factors with an unbounded number of levels, due to the large number of experiments incurred. As a compromise, the rules and conditions described in Fig. 4 are used for the latter factors to determine how the levels are chosen and how experiments should be executed for each factor.

The first step of the data gathering phase consists of setting all the factors to the initial values obtained in the Initialization phase. Next, factors are divided into two categories: factors with a fixed number of levels (discrete factors) and factors with an unlimited number of levels (non-discrete factors). In the former case, experiments are run on all levels of each factor. In the latter case, we further subdivide the factors into two subcategories. The first subcategory includes factors for which experiments were not performed during the Initialization phase. In this case, we need to limit the total number of levels to be evaluated. We start the experiments with the minimum value of the factor. Levels for successive experiments are doubled until either 1) the maximum value of the factor level has been reached, or 2) a maximum number of levels (four in our case) has been covered and the corresponding experiments for the factor have been completed, as sketched below. The second subcategory includes factors whose results were collected during the Initialization phase. In this subcategory, two or more levels of a factor that exhibit "similar" response times are grouped into the same level, and four levels (or more, when necessary) are selected for the factor's batch of experiments. At the end of each factor's experiments, its level is set back to the initial value before experiments are performed for the next factor. At the end of this step, we have a set of experimental data, each data point corresponding to a combination of factors and levels, which can be analyzed through the use of the statistical techniques discussed in the next section.
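A minimal sketch of that level-selection rule, assuming the factor exposes a minimum and a maximum value; the function name and the four-level cap are merely an illustration of the rule stated above.

    # Choose up to four levels for a non-discrete factor by doubling from its minimum.
    def candidate_levels(min_level, max_level, max_levels=4):
        levels, level = [], min_level
        while level < max_level and len(levels) < max_levels:
            levels.append(level)
            level *= 2
        if len(levels) < max_levels:
            levels.append(max_level)        # stop once the maximum has been reached
        return levels

    # Example: candidate_levels(8, 256) -> [8, 16, 32, 64]
    #          candidate_levels(100, 400) -> [100, 200, 400]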
2.3 Hypothesis Testing and Ranking

The first goal of this phase is to determine whether to accept or reject hypotheses based on the experimental data obtained in the previous step. We perform, for each SUT configuration (i.e., 20 emulated browsers and 100 emulated browsers), the following hypothesis tests:

Hypothesis 1: "Average response times, throughputs, and probabilities of rejection of all Web interaction types are the same at all levels of a factor." The tests are carried out independently for each metric.

Hypothesis 2: "There are no interaction effects between the Web interactions and the levels of a factor."

Hypothesis 3: If either one of the first two hypotheses (or both) is rejected, we perform a final hypothesis test: "For each Web interaction, all average response times, throughputs, and probabilities of rejection are the same among all levels of a factor." Again, the tests are carried out independently for each metric. We do not perform this test if the first two hypotheses are accepted; in that case, the factor is dropped from further consideration.

A two-way ANOVA model is used to test hypotheses 1 and 2, and a one-way ANOVA model is used to test hypothesis 3.
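As a concrete illustration (not the paper's actual scripts), the tests for one factor can be run with standard statistical tooling on a long-format table with one row per observation; the column names level, interaction, and rt below are assumptions.

    # Illustrative ANOVA runs for one factor (assumed column names).
    import pandas as pd
    import statsmodels.api as sm
    from statsmodels.formula.api import ols
    from scipy import stats

    df = pd.read_csv("experiments.csv")   # columns: level, interaction, rt

    # Hypotheses 1 and 2: two-way ANOVA with an interaction term. The C(level)
    # row tests the factor's main effect; the C(level):C(interaction) row tests
    # the factor-by-Web-interaction interaction effect.
    model = ols("rt ~ C(level) * C(interaction)", data=df).fit()
    print(sm.stats.anova_lm(model, typ=2))

    # Hypothesis 3: one-way ANOVA per Web interaction type across factor levels.
    for name, group in df.groupby("interaction"):
        samples = [g["rt"].values for _, g in group.groupby("level")]
        print(name, stats.f_oneway(*samples))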
Ranking: The purpose of the ranking approach is to sort the various factors in decreasing order of impact on the three metrics of interest: response time, throughput, and probability of rejection. The ranking is constructed by comparing the measured minimum, the measured maximum, and the computed range of each performance metric (i.e., response time, throughput, and probability of rejection) due to a factor against the corresponding target performance metric values. Since not all TPC-W Web interactions have the same frequency of access, we focus only on the top four most frequently accessed Web interaction types, namely Search Request (20%), Search Results (17%), Product Detail (17%), and Home (16%), and on the Buy Request (2.6%) interaction, because it is the most frequently accessed interaction that uses Secure Socket Layer (SSL) connections, compared with the other three SSL-related interactions.

The target response time constraints pre-defined by TPC-W for each Web interaction are shown, in part, in the second column of Table 1. The 90% Web Interaction Response Time (WIRT) constraints specified by TPC-W are 90th-percentile response time constraints, i.e., at least 90% of the Web interactions of each type must have a WIRT smaller than the constraint specified (in seconds) for that Web interaction. The 10% WIPS constraints, in Web Interactions Per Second (WIPS), are derived from the 90% WIRT constraints with the help of the Interactive Response Time Law [11]: WIRT = (number of emulated browsers)/WIPS - Z, where Z is the average think time, specified by TPC-W as 7 seconds. Therefore,

WIPS = (number of emulated browsers) / (WIRT + 7).    (1)
The probability of rejection is not defined in TPC-W. We established a constraint for it based on the system availability concept as follows. A rule of thumb used in the on-line industry is that a fault-tolerant Web-based system should have at least 99.99% availability [11]. This translates into a probability of rejection constraint of no more than 0.01% for any Web interaction. Table 1 summarizes all the performance metric constraints for the five selected Web interaction types used during the ranking process for 100 emulated browsers.

Table 1: 90% WIRT, 10% WIPS, and Probability of Rejection Constraints for Each Web Interaction Type
Web Interaction    90% WIRT Constraint (seconds)   10% WIPS Constraint (WIPS)   Probability of Rejection Constraint
Buy Request        3                               10.00                        0.01%
Home               3                               10.00                        0.01%
Product Detail     3                               10.00                        0.01%
Search Request     3                               10.00                        0.01%
Search Results     10                              5.88                         0.01%
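The 10% WIPS column of Table 1 follows directly from Equation (1); a short check (illustrative code, not from the paper):

    # 10% WIPS constraints from Equation (1), 100 EBs, Z = 7 s of think time.
    def wips_constraint(ebs, wirt, think_time=7.0):
        return ebs / (wirt + think_time)

    print(wips_constraint(100, 3))    # 3-s WIRT constraint  -> 10.0 WIPS
    print(wips_constraint(100, 10))   # 10-s WIRT constraint -> 5.88 WIPS
    print(wips_constraint(100, 8))    # 8-s de-facto rule, all interactions -> 6.67 WIPS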
The ranking is also carried out for all the factors considering all fourteen types of Web interactions combined. The response time constraint of eight seconds—a de-facto industry standard [10]—was used here instead of TPC-W’s defined constraints. The throughput is obtained as 6.67 WIPS using Equation (1) with WIRT equal to eight seconds and 100 emulated browsers. The probability of rejection constraint is kept at 0.01% as before. Before the ranking is performed, for each factor and its respective Web interaction, we compute or find, from the “best” level of each factor obtained from experimental data, the following: • the 95% confidence interval of the average value of each performance measure (i.e., response time, throughput or probability of rejection), • the coefficient of variation of each factor’s sample data points,
• the minimum and maximum values of each performance measure (response time, throughput, or probability of rejection) and the respective range (= maximum - minimum).

For each of the five Web interactions in Table 1, all factors are ranked according to the following steps:

Step 1: Divide the factors into three groups (Group 1, Group 2, and Group 3), with Group 1 ranked higher than Group 2 and Group 2 ranked higher than Group 3, for the response time and probability of rejection metrics. The ranks are, however, reversed for the throughput metric (i.e., Group 3 ranks higher than Group 2 and Group 2 ranks higher than Group 1) because SLAs for throughput are specified as a minimum throughput. Figure 5 illustrates how the factors are grouped and the criteria below describe the grouping rules. In the figure, each horizontal line represents the range of values of the response variable, from the minimum to the maximum; the dark circle inside each horizontal line is the average value of the response variable and the surrounding small vertical lines mark its 95% confidence interval. A factor is placed in:

Group 1: if the lower bound of the 95% confidence interval is larger than the requirement in Table 1 for that Web interaction;

Group 2: if the requirement in Table 1 falls within the 95% confidence interval of the average performance measure, or if the upper bound of the 95% confidence interval is smaller than or equal to the requirement and the upper bound of the range is greater than the threshold;

Group 3: if the upper bound of the range of the performance measure is smaller than the requirement in Table 1.
Step 2: Within each group, sort the factors as follows:

(a) Factors with a larger coefficient of variation (CV) are ranked higher than those with a smaller coefficient of variation. This is because a factor with a large coefficient of variation exhibits a wide variation of the performance measure relative to its mean value, which results in a larger impact on system performance.

(b) If the coefficients of variation of two or more factors are the same, the factor with the wider range of response time (throughput, or probability of rejection) values receives the higher rank, because that factor has a wider effect on system performance.

(c) If two factors have the same range of response time (throughput, or probability of rejection) values, the one with the higher maximum response time (higher maximum probability of rejection, or lower minimum throughput) is ranked higher. This is because we wish to reduce the system's overall response time (increase its throughput, or decrease its probability of rejection).

(d) If the maximum response times (maximum probabilities of rejection, or minimum throughputs) of two or more factors are the same, the one with the smallest minimum response time (smallest minimum probability of rejection, or largest maximum throughput) is ranked the highest.

Steps (a)-(d) above are illustrated by the decision tree of Fig. 6, in which L, H, and E indicate lower rank, higher rank, and equal rank, respectively. A sketch of the resulting comparison rule is given below.

Figure 5: Dividing Factors into Three Groups

Figure 6: Factor Ranking Decision Tree
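A minimal sketch of Steps 1 and 2 for the response time metric, assuming the per-factor summary statistics listed above have already been computed from the experimental data (field and function names are illustrative):

    # Group factors against a requirement and sort within groups (response time).
    from dataclasses import dataclass

    @dataclass
    class FactorStats:
        name: str
        ci_low: float     # bounds of the 95% confidence interval of the mean
        ci_high: float
        cv: float         # coefficient of variation of the sample
        minimum: float    # smallest and largest observed response times
        maximum: float

    def group(f, threshold):
        if f.ci_low > threshold:      # Group 1: CI entirely above the requirement
            return 1
        if f.maximum < threshold:     # Group 3: even the maximum meets the requirement
            return 3
        return 2                      # Group 2: everything in between

    def rank(factors, threshold):
        # Steps (a)-(d): larger CV, then wider range, then larger maximum,
        # then smaller minimum rank higher within a group.
        return sorted(factors,
                      key=lambda f: (group(f, threshold), -f.cv,
                                     -(f.maximum - f.minimum), -f.maximum, f.minimum))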
3. APPLYING THE METHODOLOGY
3.1 Description of the Factors

Twenty-eight factors from the three-tiered e-commerce site architecture were chosen for the experiments. The following list briefly describes the factors, organized by the layer in which they reside and sorted in alphabetical order. The Web server layer uses Microsoft Internet Information Server
(IIS) 5.0, the application server layer uses Apache Tomcat 4.1, and the database layer uses Microsoft SQL Server 7.0. Factors 1-13 are the Web server factors, 14-16 the application server factors, and 17-28 the database server factors.

1. Application Optimization: Whether to optimize performance for foreground applications only (more processor resources are given to the foreground program than to background programs) or for all applications (all programs receive equal amounts of processor resources) [3].

2. Application Protection Level: Whether applications are run in the same process as Web services (low), in an isolated pooled process in which other applications are also run (medium), or in an isolated process separate from other processes (high) [13].

3. Connection Timeout: Sets the length of time, in seconds, before the server disconnects an inactive user [5].

4. HTTP KeepAlive: Whether to allow a client to maintain an open connection with the Web server [5].

5. ListenBacklog: Sets the maximum number of active connections held in the IIS queue [4].

6. Logging Location: Sets the specific disk and path where the log files are to be saved [13].

7. MaxCachedFileSize: Sets the size of the largest file that IIS will cache [13].

8. MaxPoolThreads: Sets the number of I/O worker threads to create per processor [4].
17. Cursor Threshold: Tells SQL Server whether to execute all cursors synchronously, or asynchronously [9]. 18. Fill Factor: Sets the default fill factor for indexes when they are built [9]. 19. Locks: Sets the amount of memory reserved for database locks [15]. 20. Max Server Memory: Sets the maximum amount of memory, in MB, that can be allocated by SQL Server to the memory pool [15]. 21. Max Worker Threads: Determines how many worker threads are made available to the SQL Server process from the operating system [9]. 22. Min Memory Per Query: Sets the amount of physical memory in KB that SQL Server allocates to a query [9]. 23. Min Server Memory: Sets the minimum, in MB, to be allocated to the SQL Server memory pool [15]. 24. Network Packet Size: Sets the packet size that SQL Server uses to communicate to its clients over a network [9]. 25. Priority Boost: Whether to allow SQL Server to take on higher priority than other application processes in terms of receiving CPU cycles [9]. 26. Recovery Interval: Defines the maximum time, in minutes, that it will take SQL Server to recover in the event of a failure [15].
9. MemCacheSize: Sets the size of the virtual memory that IIS uses to cache static files [13].
27. Set Working Set Size: Specifies that the memory that SQL Server has allocated cannot be paged out for another application’s use [15].
10. Number of Connections: Sets the maximum number of simultaneous connections to the site [5].
28. User Connections: Defines the maximum number of concurrent user connections allowed to SQL Server [15].
11. Performance Tuning Level: Sets the performance optimization level of IIS according to the expected total number of accesses to the Web site per day [5].

12. Resource Indexing: Whether to allow the Microsoft Indexing Service to index a specific Web directory and the files in that directory [13].

13. worker.ajp13.cachesize: This is not an IIS parameter, but it is configured at the Web server. It specifies the maximum number of sockets that can be kept open between the Web server and an out-of-process Tomcat process [1].

14. acceptCount: Sets the maximum queue length for incoming connection requests when all possible request processing threads are in use [1].

15. minProcessors: Specifies the number of request processing threads that are created when a Tomcat connector is first started [1].

16. maxProcessors: Specifies the maximum number of request processing threads to be created by a Tomcat connector, which determines the maximum number of simultaneous requests that can be handled [1].
3.2 Experimental Testbed

The testbed used in our experiments (see Fig. 7) is an e-commerce site built according to the TPC-W specifications [14]. The system under test (SUT) consists of a Web server, an application server, and a database server. Two workload generator machines were used to guarantee that contention at the workload generator would not limit the system throughput. The workload generator modules emulate client browser sessions that submit requests to the Web server and receive responses back from the Web server after the request has been processed. After a response page is received, the client browser simulates a user "think time" before submitting the next request to the SUT by putting itself to "sleep" for a random interval specified by the TPC-W specifications. The workload intensity at the SUT is varied by changing the number of simultaneous browser sessions. We considered light and heavy workloads. The light workload was generated with 20 emulated browsers (EBs) and the heavy workload with 100 EBs. The CPU utilizations of the Web server, application server, and database server for the light workload range between 10-30%, 10-20%, and 30-60%, respectively, after the warm-up period. The corresponding figures for the heavy workload are around 20-40%, 20-30%, and 80-100%, respectively. The Web server receives requests
from the workload generator, parses them, submits requests, if necessary, to the application server, receives the results back from it, and provides the results to the client running on the workload generator. The application server module takes requests from the Web server, processes each request, and, if needed, generates database queries that are submitted to the database server. After the query results are received from the database server, the application server dynamically generates a response page, which is sent to the Web server. The database server receives database access requests from the application server, executes the queries and/or updates, and sends back the results to the application server.
Figure 7: Experimental Testbed Setup. (The SUT's Web server, application server, and database server machines and the two workload generator machines are interconnected through a 10/100 Mbps hub; the hardware ranges from PIII 667 MHz/512 MB RAM hosts to a Celeron 2.0 GHz/1 GB RAM host and a PIV 2.4 GHz/384 MB RAM host.)
3.3 Workload Model

In order to generate "real world" workloads for the SUT, we used a combination of the TPC-W workload model [14] and the Scalable URL Reference Generator (SURGE) model [2]. Table 2 summarizes the workload model used in our experiments.

Table 2: Workload Model
Category                        Distribution
User Think Time                 Pareto: f(x) = a x^-(a+1), a = 7/6, truncated at 70 sec
User Session Minimum Duration   Negative exponential: f(x) = µ e^(-µx), 1/µ = 15 min, truncated at 60 min
Objects per Page                Varied according to TPC-W's specs
HTML Object Size                Varied according to TPC-W's specs
In-Line Object Size             45% = 5 KB, 35% = 10 KB, 15% = 50 KB, 4% = 100 KB, 1% = 250 KB
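A small numpy sketch of drawing samples from the first two rows of Table 2 (illustrative only; truncation is applied here by simply capping the sample, which is a simplification of a properly truncated distribution):

    import numpy as np

    rng = np.random.default_rng(0)

    def think_time_seconds(a=7.0 / 6.0, cap=70.0):
        # Pareto think time with density f(x) = a x^-(a+1) for x >= 1, capped at 70 s.
        return min(1.0 + rng.pareto(a), cap)

    def min_session_minutes(mean=15.0, cap=60.0):
        # Negative-exponential minimum session duration, mean 15 min, capped at 60 min.
        return min(rng.exponential(mean), cap)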
3.4 Experimental Results

This section describes the results obtained by applying the methodology described in Section 2 to the e-commerce site described above.
3.4.1 Two-Way ANOVA Results

Table 3 shows the results of the tests of null hypotheses 1 and 2 for all factors. The two-way ANOVA results are divided into two main sets, one per hypothesis; within each set, the results are further divided into three columns, one per performance metric: RT for Web interaction response time, X for throughput in Web interactions per second, and PR for probability of rejection. The table indicates with the letter R or A whether a given hypothesis is rejected or accepted, respectively. It can be observed that both null hypotheses 1 and 2 are rejected for most of the factors. Factors for which null hypothesis 1 is accepted for the response time metric include Logging Location, MaxPoolThreads, and Max Worker Threads.
Table 3: Two-Way ANOVA Results for 100 EBs and 100K items
                                Accept Hyp. 1?      Accept Hyp. 2?
Factor Name                     RT    X     PR      RT    X     PR
Web Server
  HTTP KeepAlive                R     R     R       A     R     R
  App. Protection Level         R     R     R       A     R     R
  Connection Timeout            R     R     R       A     R     R
  Number of Connections         R     R     R       A     R     R
  Logging Location              A     R     R       A     R     R
  Resource Indexing             R     R     R       R     R     R
  Performance Tuning Level      R     R     R       R     R     R
  Application Optimization      R     R     R       A     R     R
  MemCacheSize                  R     R     R       R     R     R
  MaxCachedFileSize             R     R     R       R     R     R
  ListenBacklog                 R     R     R       A     R     R
  MaxPoolThreads                A     R     R       R     R     R
  worker.ajp13.cachesize        R     R     R       R     R     R
Application Server
  maxProcessors                 R     R     R       R     R     R
  minProcessors                 R     R     R       R     R     R
  acceptCount                   R     R     R       R     R     R
Database Server
  Cursor Threshold              R     R     R       R     R     R
  Fill Factor                   R     R     R       R     R     R
  Locks                         R     R     R       R     R     R
  Max Worker Threads            A     R     R       R     R     R
  Min Memory Per Query          R     R     R       R     R     R
  Network Packet Size           R     R     R       R     R     R
  Priority Boost                R     R     R       R     R     R
  Recovery Interval             R     R     R       R     R     R
  Set Working Set Size          -     -     -       -     -     -
  Max Server Memory             -     -     -       -     -     -
  Min Server Memory             -     -     -       -     -     -
  User Connections              -     -     -       -     -     -
The acceptance of hypothesis 1 for the Logging Location factor could be explained by the fact that there may not be much difference between writing the Web access log files to the same disk as the one where the image files are stored and writing them to a different disk. This is possibly due to the relatively light disk-write load at the IIS system. Another possible explanation is that the disk write time is relatively small compared to the mean Web interaction response time, which includes the long delays, if any, incurred at the other tiers.

Six levels of MaxPoolThreads, an IIS parameter, were tested: 8, 16, 32, 64, 128, and 256, with resulting mean WIRTs of 25.89, 25.77, 25.55, 26.01, 25.76, and 25.81 seconds, respectively. Thus, it can be seen that increasing the maximum number of pool threads does not necessarily improve system performance (i.e., WIRT) as one might expect. This is likely due to the increased overhead of managing a large thread pool in the IIS process. The same explanation could apply to the Max Worker Threads factor at the SQL Server, whose mean WIRTs for 128, 255, 510, and 1020 worker threads were 25.68, 25.94, 25.72, and 25.55 seconds, respectively. In this case, the increased overhead occurs on the database side (SQL Server).
3.4.2 One-Way ANOVA Results

Table 4 presents the results of the null hypothesis 3 tests for all 28 factors, for each performance metric, for the Buy Request, Home, Product Detail, Search Request, and Search Results interactions.
3.4.3 Ranking Results

For each Web interaction type, the rankings indicate which factors most significantly affect the performance of that interaction type for each performance metric of interest. Tables 5, 6, 7, 8, and 9 present the ranking results by Web interaction type across all three performance metrics (response time, throughput, and probability of rejection). The factor ranked number one has the highest impact on the corresponding metric. For example, Table 5 shows that the factor with the highest impact on the response time of a Buy Request is Cursor Threshold. This factor also has the highest impact on the probability of rejection. However, for the throughput, the factor with the highest impact is Number of Connections.

Table 5: Factor Ranking for Buy Request
Factor                        WIRT   WIPS   Prej
Fill Factor                      9      5      9
Set Working Set Size            11     25     11
Application Protection Level    17     14     17
Network Packet Size              4     15      4
Locks                           17      9     17
MaxCachedFileSize               12      7     12
Http KeepAlive                  18      4     18
minProcessors                   19      3     19
Application Optimization         6     26      6
Resource Indexing               17     11     17
Performance Tuning Level        13     24     13
ListenBacklog                   17     22     17
Number of Connections           16      1     16
Recovery Interval                2      6      2
Connection Timeout               8      2      8
Logging Location                17     21     17
MaxPoolThreads                  15     18     15
worker.ajp13.cachesize           3     16      3
maxProcessors                   20     10     20
Cursor Threshold                 1     19      1
User Connections                 5     13      5
Max Worker Threads              14     17     14
Priority Boost                  22     12     22
Min Memory Per Query            21     23     21
MemCacheSize                     7      8      7
acceptCount                     10     20     10
4. PRACTICAL PERFORMANCE TUNING GUIDELINES
This section provides some useful guidelines for tuning the performance of typical e-commerce sites.
Table 6: Factor Ranking for Home
Factor                        WIRT   WIPS   Prej
Application Optimization        22     19     12
MaxCachedFileSize                3      9      3
minProcessors                   21     17      4
Priority Boost                  22     19     10
worker.ajp13.cachesize          16      3      6
Application Protection Level    17     15     11
Http KeepAlive                   7      1     15
Logging Location                22     10     12
Min Memory Per Query             8      7     16
Network Packet Size              5      2     12
Set Working Set Size             9     11     17
Locks                            2     18      8
Performance Tuning Level        22     19     18
User Connections                13     14     12
MaxPoolThreads                   4      6     14
acceptCount                     14      4      9
Recovery Interval               12     22      2
MemCacheSize                     6     21      7
Number of Connections            1     13     13
Connection Timeout              15     16      1
ListenBacklog                   20      8     12
Max Worker Threads              18     19     12
Resource Indexing               22     19     12
Cursor Threshold                11     12     12
maxProcessors                   10      5     12
Fill Factor                     19     20      5
Table 7: Factor Ranking for Product Detail
Factor                        WIRT   WIPS   Prej
acceptCount                     17     20      5
Priority Boost                  18      3      2
Application Optimization        18      3      5
Logging Location                15     10      5
Min Memory Per Query            21      6     11
Network Packet Size             22      5     12
Set Working Set Size             6      1      7
Connection Timeout              24     18     13
MaxPoolThreads                   4     22      4
MemCacheSize                    13     21      5
Locks                            3     11      5
Http KeepAlive                  18      3     10
Recovery Interval               16      9      5
MaxCachedFileSize                8     14      8
minProcessors                   10     19      5
Performance Tuning Level         5     23      6
Number of Connections            1     13      5
Cursor Threshold                 9     17      5
Max Worker Threads              11     16      5
worker.ajp13.cachesize          20     12      1
ListenBacklog                   14      8      5
maxProcessors                   23      2      3
Application Protection Level    19     15      5
Resource Indexing                7      3      5
Fill Factor                      2      4      5
User Connections                12      7      9
Table 4: One-Way ANOVA Results for 100 EBs/100K items (Accept Hypothesis 3? R = rejected, A = accepted)
                                Buy Request   Home          Product Detail  Search Request  Search Results
Factor Name                     RT  X   PR    RT  X   PR    RT  X   PR      RT  X   PR      RT  X   PR
Web Server
  HTTP KeepAlive                R   R   A     A   A   R     A   R   R       A   R   R       A   R   A
  App. Protection Level         A   R   A     R   R   R     A   R   R       R   R   R       R   R   R
  Connection Timeout            R   R   A     A   R   R     A   R   R       R   R   R       R   R   R
  Number of Connections         R   R   A     R   R   R     R   R   R       R   R   R       R   R   R
  Logging Location              A   R   A     A   R   R     A   R   R       A   R   R       A   A   R
  Resource Indexing             A   R   A     R   A   R     R   R   R       R   R   R       A   R   A
  Performance Tuning Level      A   R   A     R   R   R     R   R   R       R   R   R       A   R   A
  Application Optimization      R   R   A     R   R   R     R   R   R       R   R   R       R   R   R
  MemCacheSize                  R   R   A     A   R   R     R   R   R       R   R   R       R   R   R
  MaxCachedFileSize             R   R   A     R   R   R     R   R   R       R   R   R       R   R   R
  ListenBacklog                 A   R   A     R   R   R     R   R   R       A   R   R       R   R   R
  MaxPoolThreads                R   R   A     A   R   R     R   R   R       R   R   R       R   R   R
  worker.ajp13.cachesize        R   R   A     A   R   R     R   R   R       R   R   R       R   R   R
Application Server
  acceptCount                   R   R   A     R   R   R     R   R   R       R   R   R       R   R   R
  maxProcessors                 R   R   A     R   R   R     R   R   R       R   R   R       R   R   R
  minProcessors                 R   R   A     R   R   R     R   R   R       R   R   R       R   R   R
Database Server
  Cursor Threshold              R   R   A     R   R   R     R   R   R       R   R   R       R   R   R
  Fill Factor                   R   R   A     R   R   R     R   R   R       R   R   R       R   R   R
  Locks                         A   R   A     R   R   R     R   R   R       R   R   R       R   R   R
  Max Worker Threads            R   R   A     A   R   R     A   R   R       R   R   R       R   R   R
  Min Memory Per Query          R   R   A     R   R   R     R   R   R       R   R   R       R   R   R
  Network Packet Size           R   R   A     R   R   R     R   R   R       R   R   R       R   R   R
  Priority Boost                R   R   A     R   R   R     R   R   R       R   R   R       R   R   R
  Recovery Interval             R   R   A     R   R   R     R   R   R       R   R   R       R   R   R
  Set Working Set Size          -   -   -     -   -   -     -   -   -       -   -   -       -   -   -
  Max Server Memory             -   -   -     -   -   -     -   -   -       -   -   -       -   -   -
  Min Server Memory             -   -   -     -   -   -     -   -   -       -   -   -       -   -   -
  User Connections              -   -   -     -   -   -     -   -   -       -   -   -       -   -   -
Table 8: Factor Ranking for Search Request
Factor                        WIRT   WIPS   Prej
Network Packet Size              5      6      3
Application Optimization        10     10      5
ListenBacklog                    8     20     15
worker.ajp13.cachesize           9     18     16
Cursor Threshold                 1     11     10
MemCacheSize                     2     14     11
Priority Boost                  10     10      4
MaxCachedFileSize               10      1     17
Application Protection Level     4     15     12
Http KeepAlive                  10     10     17
Logging Location                10      9      9
Min Memory Per Query             7     13     14
Performance Tuning Level        10     10     17
Set Working Set Size            10      5     17
minProcessors                    6     22     13
Recovery Interval               11      2     18
Locks                            3      7      7
MaxPoolThreads                  10     16      1
acceptCount                     14      3     21
Number of Connections           10     17      8
maxProcessors                   10     12     17
Max Worker Threads              10     21      2
Connection Timeout              12      8     19
Resource Indexing               10     10      6
Fill Factor                     15     19     22
User Connections                13      4     20
Table 9: Factor Ranking for Search Results
Factor                        WIRT   WIPS   Prej
worker.ajp13.cachesize          15     17     15
Application Optimization        20      9     20
Connection Timeout              10     19     10
MaxCachedFileSize                9      6      9
ListenBacklog                   17     10     17
maxProcessors                   11     12     11
Cursor Threshold                 5     15      5
Application Protection Level     2     21      2
Http KeepAlive                  20      9     20
Logging Location                20      8     20
Min Memory Per Query            18      2     18
Network Packet Size             14      3     14
Performance Tuning Level        20      9     20
Set Working Set Size             1      1      1
Max Worker Threads               8      9      8
minProcessors                   12     16     12
MaxPoolThreads                   3     11      3
MemCacheSize                     4     14      4
acceptCount                     16      4     16
Number of Connections            7     13      7
Priority Boost                  20      9     20
Recovery Interval               13      7     13
Locks                            6     20      6
Resource Indexing               20      9     20
Fill Factor                     21     18     21
User Connections                19      5     19
We identify important and unimportant factors, together with their percentage performance gains, under a heavy workload condition. The discussion covers overall system performance and the performance of the most critical Web interactions.

Factors that do not provide significant overall performance gains under a heavy workload include Logging Location, MaxPoolThreads, and Max Worker Threads. The first two are Web server factors and the last one is a database factor. However, tuning the Number of Connections, Min Memory Per Query, and Set Working Set Size factors under a heavy workload provides overall response time gains of 33%, 44%, and 91%, respectively. As for overall throughput, Number of Connections (678%), MemCacheSize (12%), Locks (13%), Min Memory Per Query (66%), Network Packet Size (24%), and Set Working Set Size (2125%) provide throughput boosts above 10%.

As far as the most critical individual Web pages are concerned, Set Working Set Size, a database factor, is the most statistically significant factor and provides the greatest variation in response time for the Home page under both low and high workload intensities. The percentages of variation for the low and high workload levels are 97% and 86%, respectively. In order to boost the Home page throughput, the User Connections factor, a SQL Server parameter, provides the highest gain (2319%) of all twenty-eight factors under a light workload. For a heavy workload, the Set Working Set Size factor gives the highest gain (2245%).

Search Request pages exhibit the highest frequency of access (e.g., 20% in TPC-W) in most e-commerce sites. To minimize the response time of this page, Set Working Set Size must be tuned before other factors because of its statistical significance and its highest percentages of variation, 98% and 97%, under the low and high workload levels, respectively. Similarly to the Home page, under a light workload, the throughput of the Search Request page may be tuned first via the User Connections factor, which provides a throughput difference of 2727%. Under a high workload, however, it is the Number of Connections factor that boosts the throughput the most (2701%). Interestingly, Set Working Set Size is also the most significant factor for improving the response time and throughput of the SSL-connection-oriented Buy Request pages, regardless of the workload intensity level.
5. CONCLUDING REMARKS
We provided a practical, ad-hoc methodology to evaluate the sensitivity of performance metrics to software configurable factors in a three-tiered e-commerce architecture. We performed tests of three null hypotheses for each response variable to determine the statistical significance of 1) the main effects of each factor, 2) the interaction effects between each factor and the fourteen types of Web interactions, and 3) the main effects of each factor for the five most important Web interaction types (i.e., Buy Request, Home, Product Detail, Search Request, and Search Results) independently. In the end, we conducted factor rankings grouped by each performance metric considering i) each of the five most important Web interaction types, and ii) all fourteen Web interaction types combined. The following conclusions were obtained from this study: 1. As far as the null hypothesis 1 is concerned, the two-
way ANOVA results reveal that most factors had a significant impact across all performance metrics and interacted with the fourteen Web interaction types on all performance metrics, for both the small and the large workload configurations. A few factors, however, did not have such a significant impact. Under the small workload configuration (20 EBs with 10,000 books), the Application Protection Level and User Connections factors did not have a significant impact on the response time metric, and the Resource Indexing factor did not have a significant impact on the throughput metric. For the large workload configuration (100 EBs with 100,000 books), the Logging Location, MaxPoolThreads, and Max Worker Threads factors did not have a significant impact on the response time metric. This means that, except for these factors, most factors should be considered when optimizing the system performance under a given workload configuration.

2. As for null hypothesis 2, the two-way ANOVA results reveal a behavior similar to that of hypothesis 1: most factors had interaction effects with the fourteen Web interaction types for both the small and the large workload configurations. There were a few exceptions, however: the User Connections factor did not have such interaction effects for the response time metric under the small workload configuration, and the Number of Connections factor did not have them under the large workload configuration.

3. From the one-way ANOVA results, the null hypothesis 3 tests provide the following conclusions for each of the five most important Web interactions:

Buy Request: The response time of the Buy Request and of the other four Web interaction types was significantly impacted by more factors under the large workload. In other words, a heavy workload increases the number of factors that are significant for improving the response time of those Web interactions. The throughput of the Buy Request interaction was significantly impacted by all factors under the heavy workload level. For a light workload, all factors except Logging Location had a significant impact on the throughput of this interaction; therefore, Logging Location should not be considered when optimizing the throughput of the Buy Request interaction for a light workload. The probability of rejection of the Buy Request interaction was not significantly impacted by most factors under the small workload configuration and was not significantly impacted by any factor under the large workload configuration. This means that the probability of rejection of the Buy Request interaction cannot be improved by varying any factor under a heavy workload level, but it may be improved by varying the Locks, Max Worker Threads, Recovery Interval, and User Connections factors under a light workload level.

Home: Factors that should be given less attention when optimizing the response time of the Home interaction under a heavy workload level are HTTP KeepAlive, Connection Timeout, Logging Location, MemCacheSize, MaxPoolThreads, worker.ajp13.cachesize, and Max Worker Threads.
In contrast to the throughput results for the Buy Request interaction, the throughput of the Home interaction was significantly impacted by all factors when the workload is light. For a heavy workload, all factors except two (HTTP KeepAlive and Resource Indexing) can be tuned to provide different throughputs for the Home interaction. The probability of rejection of the Home interaction was significantly impacted by most factors under the small workload configuration and by all factors under the large workload configuration. Factors that should not be considered for tuning the probability of rejection at a light workload level include HTTP KeepAlive, Application Protection Level, Resource Indexing, Performance Tuning Level, Application Optimization, maxProcessors, Locks, and Max Worker Threads.

Product Detail: Five factors that should be avoided when optimizing the response time of the Product Detail interaction under a heavy workload level are HTTP KeepAlive, Application Protection Level, Connection Timeout, Logging Location, and Max Worker Threads. Similarly to the Buy Request results, the throughput of the Product Detail interaction was significantly affected by almost all factors (all except Cursor Threshold) under the small workload configuration, and by all factors under the large workload configuration. Consequently, the Cursor Threshold factor should not be considered for throughput optimization of the Product Detail interaction under a light workload level. The probability of rejection of the Product Detail interaction showed results similar to those of the Home interaction: most factors had a significant impact on it under a light workload, and all factors had a significant impact on it under a heavy workload. The same behavior applies to the Search Request results.

Search Request: Factors that do not need to be considered when optimizing the response time of the Search Request interaction under a heavy workload level are HTTP KeepAlive, Logging Location, and ListenBacklog. Under a light workload, the throughput of the Search Request interaction was significantly impacted by most factors, the exceptions being MaxCachedFileSize and Locks. Under a heavy workload, all factors had a significant impact on the throughput of this Web interaction.

Search Results: Four factors that should not be considered when optimizing the response time of the Search Results interaction under a heavy workload level are HTTP KeepAlive, Logging Location, Resource Indexing, and Performance Tuning Level. Almost all factors had a significant impact on the throughput of the Search Results interaction under both light and heavy workloads. Under a light workload, the MemCacheSize and Locks factors did not have such a significant impact and therefore should not be considered. Under a heavy workload, the only factor
that did not have such a significant impact is Logging Location. All factors had a significant impact on the probability of rejection of the Search Results interaction under a heavy workload. Similar behavior was observed at the light workload level for almost all factors, the exceptions being HTTP KeepAlive, Resource Indexing, and Performance Tuning Level.

4. As far as the rankings are concerned, Table 10 shows the highest-ranking factors for the five most important Web interaction types and for each of the three performance metrics (response time, throughput, and probability of rejection). The table shows that Set Working Set Size is the highest-ranked factor for Search Results for all three performance metrics. This indicates that, under heavy load, a proper setting of the amount of main memory available to the SQL Server is crucial for good performance. It can also be seen from the table that Cursor Threshold is the best factor to tune to improve the response time of the Buy Request and Search Request Web interactions. The table also shows that eight factors, namely Cursor Threshold, Number of Connections, Set Working Set Size, HTTP KeepAlive, MaxCachedFileSize, Connection Timeout, worker.ajp13.cachesize, and MaxPoolThreads, are the ones to be tuned to improve the response time, throughput, and probability of rejection of the five most important Web interactions.
Table 10: Best-Ranking Factors by Web Interaction Type for 100 EBs/100K Items for Each Performance Metric
Web Interaction    Best Factor (WIRT)       Best Factor (WIPS)       Best Factor (Prej)
Buy Request        Cursor Threshold         Number of Connections    Cursor Threshold
Home               Number of Connections    HTTP KeepAlive           Connection Timeout
Product Detail     Number of Connections    Set Working Set Size     worker.ajp13.cachesize
Search Request     Cursor Threshold         MaxCachedFileSize        MaxPoolThreads
Search Results     Set Working Set Size     Set Working Set Size     Set Working Set Size
The process and guidelines provided in this paper can be quite helpful to performance engineers who need to limit the number of experiments required to configure a complex software system. Others have tackled a similar problem in a slightly different context. Yilmaz et al. [16, 17] have looked at the problem of continuous quality assurance for performance purposes. They show an approach that limits the number of software configurations to be tested when software changes are made, while providing good levels of performance assurance.
Acknowledgements
The work of Monchai Sopitkamol was partially supported by the Royal Thai government and the work of Daniel Menascé was partially supported by grant NMA501-03-1-2022 from the National Geospatial-Intelligence Agency (NGA).
6. REFERENCES
[1] A. Bakore et al., Professional Apache Tomcat, Wrox Press, 2002.
[2] P. Barford and M. Crovella, "Generating Representative Web Workloads for Network and Server Performance Evaluation," Proc. ACM SIGMETRICS Conf., Madison, WI, July 1998.
[3] M. Baute, "White Paper: Multi-Tier Performance Tuning," www.gasullivan.com/whitepapers/PerformanceTuning A.pdf.
[4] B. Curry et al., "The Art and Science of Web Server Tuning with Internet Information Services 5.0," www.microsoft.com/technet/treeview/default.asp?url=/technet/prodtechnol/iis/iis5/maintain/optimize/iis5tune.asp.
[5] "IIS 5.0 Documentation," Microsoft Windows 2000 Help.
[6] R. Jain, The Art of Computer Systems Performance Analysis, John Wiley & Sons, 1991.
[7] D. M. Levine, P. P. Ramsey, and R. K. Smidt, Applied Statistics for Engineers and Scientists: Using Microsoft Excel & MINITAB, Prentice Hall, 2001.
[8] R. L. Mason, R. F. Gunst, and J. L. Hess, Statistical Design and Analysis of Experiments with Applications to Engineering and Science, 2nd ed., John Wiley, 2003.
[9] B. McGehee, "SQL Server Configuration Performance Checklist," http://sql-serverperformance.com/sql server performance audit5.asp.
[10] D. A. Menascé and V. A. F. Almeida, Scaling for E-Business: Technologies, Models, Performance, and Capacity Planning, Prentice Hall, Upper Saddle River, NJ, 2000.
[11] D. A. Menascé and V. A. F. Almeida, Capacity Planning for Web Services: Metrics, Models, and Methods, Prentice Hall, Upper Saddle River, NJ, 2002.
[12] D. A. Menascé, "TPC-W: A Benchmark for E-commerce," IEEE Internet Computing, May/June 2002.
[13] Microsoft, "Windows 2000 Performance Tuning, White Paper," March 22, 2000, http://www.microsoft.com/windows2000/professional/evaluation/performance/reports/perftune.asp.
[14] Transaction Processing Performance Council, www.tpc.org.
[15] E. Whalen et al., SQL Server 2000 Performance Tuning: Technical Reference, Microsoft Press, 2001.
[16] C. Yilmaz, A. S. Krishna, A. Memon, A. Porter, D. C. Schmidt, A. Gokhale, and B. Natarajan, "Main Effects Screening: A Distributed Continuous Quality Assurance Process for Monitoring Performance Degradation in Evolving Software Systems," Proc. Intl. Conf. Software Engineering (ICSE'05), ACM Press, St. Louis, Missouri, May 15-21, 2005.
[17] C. Yilmaz, A. Memon, A. Porter, A. S. Krishna, D. C. Schmidt, A. Gokhale, and B. Natarajan, "Preserving Distributed Systems' Critical Properties: A Model-Driven Approach," IEEE Software, Nov./Dec. 2004.