MySQL performance on Itanium 2

Morten Olsen & Mads Kristensen

November 5, 2004

Contents

1 Introduction
2 Background
  2.1 The Itanium 2 processor
  2.2 Performance monitoring
    2.2.1 Bubble accounting
    2.2.2 Memory hierarchy
  2.3 Optimizing with the Intel C++ compiler
    2.3.1 Automatic optimizations
    2.3.2 Interprocedural optimizations
    2.3.3 Profile-guided optimizations
    2.3.4 Parallelization optimizations
3 Method
  3.1 Compile options
  3.2 Workloads
    3.2.1 Data organization and indexes
    3.2.2 Workloads
  3.3 Measuring performance
  3.4 Hardware, software, and practicalities
4 Results
  4.1 Performance impact of compiler optimizations
    4.1.1 Optimization levels
    4.1.2 Interprocedural optimizations
    4.1.3 Profile-guided optimizations
  4.2 Execution time characteristics
    4.2.1 Cycle accounting
    4.2.2 Data cache stall accounting
  4.3 Profile-guided optimization
5 Conclusion
6 References
A Configurations
B Workloads
  B.1 Schema
  B.2 Queries
  B.3 Execution plans
C Measured data
  C.1 Configuration 1.1 (-O1)
  C.2 Configuration 1.2 (-O2)
  C.3 Configuration 1.3 (-O3)
  C.4 Configuration 2.1 (-O3 -ip)
  C.5 Configuration 3.1 (-O3 -ip -prof_use, broad)
  C.6 Configurations 3.2-3.6 (-O3 -ip -prof_use, narrow)
  C.7 Configuration 4.1 (-O1 -ip)
D Result graphs
  D.1 Cycles
  D.2 Useful instructions
  D.3 Stalls
  D.4 Instruction-level parallelism

1 Introduction

As microarchitectures continue to develop, compilers play a growing role in utilizing the advanced architectures. This is especially the case with explicitly parallel instruction computing (EPIC) architectures like Intel Itanium 2, where more responsibility is put on the compiler to group and sequence instructions in order to exploit instruction-level parallelism. Similarly, as compilers incorporate increasingly sophisticated optimization mechanisms, their impact on application performance may grow.

The continued development of microprocessors should benefit database applications, which are becoming increasingly compute and memory bound. According to Ailamaki et al. (1999), however, studies have shown that faster processors do not improve database system performance to the same extent as they improve scientific workloads. Notably, some of these studies show that on average half the processor execution time or more is wasted on stalls or flushes in the processor.

We contribute to the body of studies on database systems' execution characteristics on modern processors by examining the execution characteristics of MySQL with different workloads on an Itanium 2 system. In addition, we take a small step in the examination of compilers' impact on the performance of database management systems. We experiment with the Intel C++ compiler on the MySQL source and examine how different compiler optimizations impact the performance of the database system. We also examine whether a type of optimization especially relevant to Itanium 2 – profile-guided optimization – can significantly change the execution time characteristics for any of our workloads.

The rest of this paper is organized as follows. Section 2 presents the background for our experiments in more detail: we briefly describe related work breaking up and examining execution times for database systems, and discuss the Itanium 2 architecture and the optimization options of the Intel C++ compiler. Section 3 describes our experimental setup, and section 4 presents the results of our experiments. Finally, we discuss the results and conclude in section 5.

2 Background

According to Ailamaki et al. (1999), several interesting studies have evaluated database workloads. The studies agree that DBMS behavior depends on the type of workload: decision-support system (DSS) or on-line transaction processing (OLTP) workloads. DSS-type workloads benefit more from out-of-order processors with increased instruction-level parallelism than OLTP workloads do. Finally, memory stalls are a major bottleneck for database workloads (Ailamaki et al., 1999). Ailamaki and her colleagues studied four anonymous commercial DBMSs on a Pentium II system using a workload consisting of range selections and joins running on a memory-resident database. Their results showed that, on average, half the execution time was spent in stalls, that 90% of the memory stalls were due to second-level data cache misses and first-level instruction cache misses, and that about 20% of the stalls were caused by subtle implementation details like branch mispredictions.

Mortensen (2002) carried out a study similar to Ailamaki et al. (1999) on the open-source DBMS Predator, also on an Intel Pentium II. Among other things he found that actual computation time rarely accounted for more than 40% of the execution time, and that when running queries with many random accesses to data, the dominant stall component became level 2 data cache misses.

We will experiment with compiling the MySQL source using different optimization options and examine MySQL's execution of a variety of workloads on an Itanium 2 system. We therefore give a brief description of the Itanium 2 processor in the next section, detailing how instructions flow in its core pipeline and how they may stall it. We then discuss how to measure performance and how to account for the cycles spent by the processor and break them into meaningful categories. Finally, we describe compiler optimizations commonly suggested by Intel for optimizing application performance with their compiler.

2.1 The Itanium 2 processor

The core pipeline of the Itanium 2 processor organizes instructions and data so they can be issued to the processor’s functional units in a synchronized fashion (Intel, 2002, p. 6). The pipeline is divided into a two-stage front-end and a six-stage back-end which are connected through an instruction buffer as illustrated in figure 1.

[Figure 1 shows the core pipeline: the front-end stages IPG and ROT, the instruction buffer (IB), and the back-end stages EXP, REN, REG, EXE, DET, and WRB.]

Figure 1: The core pipeline of the Itanium 2 processor is divided into a two-stage front-end and a six-stage back-end, which are connected through an instruction buffer. Adapted from Intel (2002, p. 6, figure 3-1).

The two ends of the pipeline work asynchronously. The front-end only stalls the back-end if it is unable to prepare a sufficient flow of instructions for the back-end to keep execution going. Instructions are packaged in bundles of three, and the instruction buffer can hold four bundle pairs, i.e. 24 instructions (cf. Intel, 2004b, p. 195). The front-end can fetch up to two bundles per cycle and the back-end can execute as many as two bundles per cycle (cf. Intel, 2004b, p. 25). That is, up to six instructions can be executed in parallel per cycle.

Instructions are explicitly grouped in parallel instruction groups that may extend over an arbitrary number of bundles. The Itanium 2 processor has several functional units of various types that allow several combinations of instructions to be executed per cycle, but for many combinations fewer than the maximum six instructions can execute in parallel. If a parallel instruction group contains more instructions than there are available functional units, the first instruction for which an appropriate unit cannot be found causes a split issue and breaks the parallel instruction group (Intel, 2004b, p. 25). The instructions after the split point stall one or more clock cycles, even if there are sufficient resources for some of them to execute, as the Itanium 2 processor issues instructions in-order. Once instructions are issued as a group, they proceed as a group through the pipeline. If one instruction has a stall condition, the whole group stalls, as do all instructions behind it in the pipeline (Intel, 2004b, p. 25).

The Itanium 2 processor provides four performance counters, a large number of monitorable events, and several advanced monitoring capabilities (cf. Intel, 2004b, chapters 10-11). The monitoring capabilities give insight into the work done and the flow of instructions in the pipeline, and can thus help characterize and optimize the behavior of applications. We will use the monitors to determine the performance of our workloads as well as account for the cycles they spend, as described next.

2.2 Performance monitoring

To determine the performance of our workloads we count the total number of cycles spent executing them. To detail the performance further, we examine how many cycles the pipeline does not carry out work and how many instructions are in fact executed. When a stall occurs in the back-end, a bubble is inserted in one location in the pipeline; when there is a flush, bubbles are inserted in all locations. During every clock, the back-end pipeline either has a bubble or retires one or more instructions (cf. Intel, 2004b, pp. 118-119). Knowing the number of bubbles leaving the pipeline and the number of retired instructions gives information on the number of cycles in which actual work is carried out, as well as on how much instruction-level parallelism is obtained. We are only interested in bubbles in the back-end of the pipeline, and ignore stalls in the front-end that do not succeed in stalling the back-end. In determining the amount of instruction-level parallelism we may exclude NOP instructions, which can be inserted by the compiler but do not represent any work.

The monitorable event cpu_cycles allows us to measure the number of cycles executed. With ia64_inst_retired and nops_retired we can measure the number of retired Itanium instructions and the number of retired NOP instructions. Finally, back_end_bubble gives us the number of bubbles inserted into the back-end pipeline when stalling or flushing.
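As an illustration of how such counts are collected in practice, the pfmon tool we use (see section 3.4) takes a list of event names on the command line; usefully, these four events fit the four hardware counters, so they can be measured in a single run. The invocation below is a sketch only: option and event spellings vary between pfmon versions, and the event names shown are our rendering of the Itanium 2 PMU names.

    # Sketch: count cycles, retired instructions, retired NOPs, and
    # back-end bubbles system-wide, at user level only, while the
    # workload runs (exact spellings depend on the pfmon version).
    pfmon --system-wide -u \
        -e CPU_CYCLES,IA64_INST_RETIRED,NOPS_RETIRED,BACK_END_BUBBLE_ALL \
        sleep 60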

2.2.1 Bubble accounting

The back-end can waste cycles due to five distinct mechanisms. The event back_end_bubble gives the overview, while five other back-end bubble accounting events exist to give more specific information about the bubbles. The events relate closely to the stages in the pipeline and are prioritized to mimic the operation of the pipeline. The priority increases towards the downstream end, which guarantees that the events are mutually exclusive and that only the most important cause (latest in the pipeline) is counted. Thus the counts of the five events sum up to the count of the parent event back_end_bubble.

To break up the execution time and the bubble cycles into more meaningful categories we follow a method suggested by Jarp (2002). In this, we investigate and group different subcounters of the overall bubble counters in order to get the categories shown in table 1. A difficulty we face is that one of the five bubble counters, detailing stalls in the back-end caused by the front-end (back_end_bubble.fe), cannot be broken down via subcounters. To overcome this problem we look at all front-end conditions that may stall the back-end and assume they do so in the same proportion. We count the pertinent front-end bubbles with the counter fe_bubble.allbut_ibfull, which excludes stalls in the front-end due to the instruction buffer being full, as these will surely not stall the back-end. The front-end bubbles must then be scaled down by the factor back_end_bubble.fe / fe_bubble.allbut_ibfull.
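As a worked example of the scaling, with hypothetical counts: if back_end_bubble.fe = 2 x 10^8, fe_bubble.allbut_ibfull = 5 x 10^8, and fe_bubble.imiss = 1 x 10^8, the instruction-miss bubbles attributed to the back-end are

\[
\text{fe\_bubble.imiss} \times \frac{\text{back\_end\_bubble.fe}}{\text{fe\_bubble.allbut\_ibfull}}
= 10^{8} \times \frac{2 \times 10^{8}}{5 \times 10^{8}} = 4 \times 10^{7} \ \text{cycles.}
\]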

2.2.2 Memory hierarchy

The Itanium 2 processor contains three levels of cache. The first level has a one-cycle latency and will not stall the pipeline if data is found there. Determining how the other levels of the memory hierarchy contribute to the data cache stalls again presents some challenges. Once more we use a model suggested by Jarp (2002) to determine the number of cycles stalled in order to access data at the different levels, see table 2. We count the number of accesses fulfilled by each level and multiply it with a cycle cost for accessing data at that level. The actual costs may vary, but we use the values proposed by Jarp (2002). At the first level the processor has separate data and instruction caches. At the second and third levels the cache miss counters include not only data accesses but also activity caused by instruction fetching. To filter out the instruction fetching activity, we adjust both counts with the ratio of data accesses in L2: l2_data_references.l2_all / l2_references.all.


D-cache stalls
  Formula: be_exe_bubble.grall - be_exe_bubble.grgr + be_l1d_fpu_bubble.l1d
  Cycles spent waiting for data to be retrieved from the memory hierarchy through the level 1 D-cache.

Branch misprediction
  Formula: be_flush_bubble.bru + fe_bubble.bubble* + fe_bubble.branch*
  Cycles lost recovering from mispredicted branches.

Instruction miss stalls
  Formula: fe_bubble.imiss*
  Cycles stalled because the next instructions are not in the instruction cache.

RSE stalls
  Formula: be_rse_bubble.all
  Cycles spent by the register stack engine spilling and filling registers during procedure calls.

FLP units
  Formula: be_exe_bubble.frall + be_l1d_fpu_bubble.fpu
  Cycles stalled in floating point units, either waiting for possible exceptions or for floating point register dependencies (waiting for data loaded from L2 or written back by an earlier instruction).

GR scoreboarding
  Formula: be_exe_bubble.grgr
  Cycles stalled due to general register dependencies; waiting for a result from a later stage in the pipeline.

Front-end flushes
  Formula: fe_bubble.feflush*
  Cycles lost due to flushes in the front-end.

Table 1: Grouping of counters into a few meaningful categories. Taken from Jarp (2002, p. 6). *) The front-end bubble counters must be scaled down with the reduction factor back_end_bubble.fe / fe_bubble.allbut_ibfull.

L2
  Formula: (l2_data_references.l2_all - l2_misses*) × 2
  Cycles spent accessing data in the level 2 cache.

L3
  Formula: (l2_misses* - l3_misses*) × 10
  Cycles spent accessing data in the level 3 cache.

Memory
  Formula: l3_misses* × 150
  Cycles spent accessing data from memory.

D-TLB
  Formula: l2dtlb_misses × 30
  Cycles spent accessing and maintaining the data translation buffers.

Table 2: Grouping of counters to detail stalls in the different levels of the memory hierarchy. Taken from Jarp (2002, p. 8). *) L2 and L3 misses should be adjusted by the ratio of accesses in the L2 that are data accesses.
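Combining table 2 with the data-access ratio above, the model's estimate of the data cache stall cycles can be written as a single expression. This is our consolidation of Jarp's formulas; the latencies 2, 10, 150, and 30 are his proposed costs, not values measured on our machine:

\[
r = \frac{\text{l2\_data\_references.l2\_all}}{\text{l2\_references.all}},
\]
\[
\text{stalls}_{\text{data}} \approx 2\,(\text{l2\_data\_references.l2\_all} - r \cdot \text{l2\_misses})
+ 10\,(r \cdot \text{l2\_misses} - r \cdot \text{l3\_misses})
+ 150 \cdot r \cdot \text{l3\_misses}
+ 30 \cdot \text{l2dtlb\_misses}.
\]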


2.3 Optimizing with the Intel C++ compiler

The Intel C++ compiler offers several compile options that may improve the performance of an application. The following four groups of options are emphasized repeatedly by Intel in its material on optimizing with the compiler (cf. Intel, 2004c,d):

• Automatic optimizations
• Interprocedural optimizations
• Profile-guided optimizations
• Parallelization optimizations

In the following sections we briefly describe the options in each of the four groups.

2.3.1 Automatic optimizations

The automatic optimizations category includes a set of optimization levels (0-3). Level 0 disables optimizations and should generally only be used in the early stages of application development, to be replaced by a higher setting when the developer knows the application is working correctly (Intel, 2004c, p. 3). Level 1 includes a set of optimizations whose names are listed in the man page of the compiler; in general, level 1 omits optimizations that tend to increase object size. This contrasts with level 2, which focuses on fast code and may increase code size significantly. Level 2 adds loop unrolling to the list of optimizations and enables software pipelining, a technique that tries to overlap loop iterations by dividing each iteration into stages with several instructions in each stage (cf. Intel, 2004c, p. 16). The last level, level 3, includes the same optimizations as level 2 and in addition enables more aggressive optimizations such as loop and memory access transformations like prefetching (cf. Intel, 2004c, p. 4).

2.3.2 Interprocedural optimizations

The Intel compiler offers interprocedural optimizations both within files and between files. Inline function expansion is one of the main optimizations performed by the interprocedural optimizer (cf. Intel, 2003, pp. 91-92). According to Intel (cf. 2004c, p. 4), interprocedural optimizations decrease the number of branches, jumps, and calls within code, and further reduce call overhead through function inlining. In addition, improved alias analysis leads to better loop transformations, limited data layout optimization may result in better cache usage, and interprocedural analysis of memory references allows registerization of more memory references and may reduce application memory accesses (Intel, 2004c, p. 4).

2.3.3 Profile-guided optimizations

Profile-guided optimization is an interesting feature of the Intel compiler. By building and running an instrumented version of an application, a profile of its use can be created. This profile can then be used to guide optimizations when producing a new executable version of the application. Profile-guided optimization moves frequently accessed code segments adjacent to one another and moves seldom-accessed code to the end of the module. This eliminates some branches and shrinks code size, resulting in more efficient processor instruction fetching (Intel, 2004c, p. 5). Profile-guided optimization also generates branch hints for the processor during the optimization process, which should enable better branch prediction (Intel, 2004c, p. 5). This is very relevant for the Itanium 2 processor, which has simpler branch prediction hardware than other processors. In addition, Intel (2003, p. 100) claims that the use of profile-guided optimization often enables the compiler to make better decisions about function inlining, thereby increasing the effectiveness of interprocedural optimizations.
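In outline, the process has three steps. The sketch below shows its shape with the flag spellings of the Intel 8.0 compilers; the file and directory names are hypothetical, and a real MySQL build would pass these flags through its build system rather than to a single icc invocation:

    # 1. Build an instrumented binary that records execution counts.
    icc -tpp2 -O3 -ip -prof_gen -prof_dir=./profdata -o app_inst app.c

    # 2. Run it on representative input; profile data lands in ./profdata.
    ./app_inst < training_input

    # 3. Rebuild, letting the compiler read the profile back in.
    icc -tpp2 -O3 -ip -prof_use -prof_dir=./profdata -o app app.c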

2.3.4 Parallelization optimizations

Another interesting feature of the Intel compiler is auto-parallelization. By analyzing the dataflow of a program's loops and generating multithreaded code for those loops that can be safely and efficiently executed in parallel, the optimizer enables the potential exploitation of the parallel architecture found in symmetric multiprocessor systems (cf. Intel, 2003, p. 138). The Intel compiler also recognizes industry-standard OpenMP directives, which give developers explicit control of how their application is multithreaded (Intel, 2004c, p. 7). However, these directives are not used in the MySQL source.

3 Method

To examine the execution of MySQL and the influence of compiler optimization, we compile the MySQL source with different optimization options. We then run different types of workloads on these compilations while measuring several performance metrics. The compile options, workloads, and performance measurements are described in the following subsections, along with more detailed information about the setup.

3.1 Compile options

As mentioned above, Intel seems to suggest especially four groups of compile options for optimizing application performance with their compiler. Accordingly, we wish to experiment with optimization levels, interprocedural optimizations, profile-guided optimizations, and auto-parallelization. Since optimization level 0 disables optimizations and is only intended for the early stages of application development, we experiment only with levels 1, 2, and 3.

We have had trouble getting Intel's compiler to work with some optimization options, as described further in section 3.4. In particular, we have not been able to get the compiler to work with interprocedural optimizations across files, and hence only test interprocedural optimizations within files. Similarly, we have not been successful in compiling with the auto-parallelization options and have had to drop them. The final group, profile-guided optimizations, is very relevant on Itanium 2, as it has simpler branch prediction hardware than other processors. We build profiles based on all our workloads, and also on individual categories in our set of workloads, described in the next section. This will show what kind of performance can be gained with a broad profile as well as with a narrow, specialized profile.

The configurations we compile are listed in table 3. As can be seen, we target the Itanium 2 platform (-tpp2) and use static linking (-static) in all compilations. We believe these two options to be favorable to performance, but will not test this belief. We choose optimization level three for the interprocedural and profile-guided optimization configurations, as initial tests of the optimization levels generally indicated this level to be the best. In marketing material on accelerating MySQL with the Intel compiler, Intel (2004a) uses an "aggressive" configuration consisting of optimization level one and interprocedural optimizations within files. As this configuration seems to be an attempt to get the best performance from MySQL, we add it to our set in order to see how it performs on our workloads.
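As an illustration, configuration 2.1 from table 3 can be applied to the MySQL 4.1 build along these lines. This is a sketch under the assumption of the standard configure-based build; the actual configure invocation takes more options than shown:

    # Point the MySQL build at the Intel compilers with the flags of
    # configuration 2.1 (icpc is the Intel C++ driver of that era).
    CC=icc CXX=icpc \
    CFLAGS="-tpp2 -static -O3 -ip" CXXFLAGS="-tpp2 -static -O3 -ip" \
        ./configure
    make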

3.2 Workloads

We examine MySQL's execution and the compiler optimizations' performance impact on a set of workloads consisting of different query types. Our choice of query types includes point, range, grouping, and joining queries. At the suggestion of Peter Zaitsev of MySQL AB, we also experiment with the EXPLAIN statement on queries joining many tables, to investigate the performance impact on plan search. The different query types constitute one dimension in our workload set. A second dimension, which we consider next, is data organization and indexing.

3.2.1 Data organization and indexes

MySQL offers different table types with different characteristics (cf. MySQL, 2004, p. 650). For our project we choose InnoDB, which is the transaction-safe standard table type.

Optimization Levels
  1.1: -O1
  1.2: -O2
  1.3: -O3

Interprocedural Optimizations (common: -O3)
  2.1: -ip

Profile-guided Optimizations (common: -O3 -ip -prof_dir=...)
  3.1: -prof_use   (broad profile based on running all our workloads)
  3.2: -prof_use   (narrow profile based only on single-point queries)
  3.3: -prof_use   (narrow profile based only on range queries)
  3.4: -prof_use   (narrow profile based only on grouping queries)
  3.5: -prof_use   (narrow profile based only on joining queries)
  3.6: -prof_use   (narrow profile based only on plan search queries)

Intel aggressive configuration
  4.1: -O1 -ip

All configurations additionally use -tpp2 -static.

Table 3: The configurations we compile.

Data cannot explicitly be clustered in InnoDB, but every InnoDB table has a special index called the clustered index, in which the data of the rows is stored (cf. MySQL, 2004, p. 691). If we define a primary key on a table, it will be used as the key in the clustered index; otherwise internally assigned row ids will be used. InnoDB only explicitly supports B-trees for indexes, but may on its own create hash indexes for pages in memory if it estimates this to be favorable to performance (cf. MySQL, 2004, p. 692). We will not explicitly try to test the performance impact of these internally created hash indexes. We will instead experiment with the different types of queries and statements mentioned previously, combined with (i) no indexes, (ii) clustered and (iii) non-clustered B-tree indexes. We will not discriminate between indexes of different datatypes, although there may be some differences in how they are handled. Similarly, we will not use multi-column indexes, nor will we consider covering indexes in the queries.

To have somewhat realistic data we create a database with the TPC-W schema and populate it with data according to the TPC-W specification (Tpc, 2002). The schema is illustrated in appendix B.1. The primary keys in the tables are used in clustered indexes, and we define non-clustered indexes for the foreign keys. To be able to construct all our workload queries, we define some additional indices: a unique non-clustered index on customer(c_uname), to be able to do single-point queries via a non-clustered index, and non-clustered indices on customer(c_birthdate) and orders(o_date).
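In terms of DDL, these extra indexes amount to statements of the following shape; the index names and database name are hypothetical choices of ours:

    mysql tpcw -e "CREATE UNIQUE INDEX idx_c_uname ON customer (c_uname);"
    mysql tpcw -e "CREATE INDEX idx_c_birthdate ON customer (c_birthdate);"
    mysql tpcw -e "CREATE INDEX idx_o_date ON orders (o_date);"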

3.2.2 Workloads

Combining the query types and index categories gives several combinations. The combinations we find interesting and create workloads for are shown in table 4 along with high-level descriptions of how we expect the different query types to execute in combination with the index categories.

Singlepoint
  Clustered index: Traverse index to find the record.
  Non-clustered index: Traverse index and touch data.

Range
  No index: Scan and extract records (or parts of records) matching the criteria.
  Clustered index: Traverse index to find the first occurrence matching the criteria and scan data to find the rest.
  Non-clustered index: Traverse index to find the first occurrence matching the criteria, then scan the index and touch data to find the rest.

Grouping
  No index: Scan data and sort using some internal algorithm.
  Clustered index: Scan data in sorted order.
  Non-clustered index: Scan index and touch data in order of the index.

Joining
  Clustered index: Index-nested loop, joining on fields in clustered indexes.
  Non-clustered index: Index-nested loop, joining on fields in non-clustered indexes.

Plan search
  We test plan search in a single case, using queries joining multiple tables and varying the join conditions to also test complex ranges, as suggested to us by Peter Zaitsev from MySQL AB.

Table 4: High-level description of how we expect the different query types to execute in combination with the index categories. The grid sketches the different cases we test with the different compile configurations.

Some of the tests may appear similar from a high-level view. For example, grouping queries require a sort of the records both with and without a non-clustered index. According to MySQL (2004, p. 378f), the sorting can however be done using an index, in which case the records are not sorted but rather read in the order dictated by scanning the index. Thus we find it reasonable to try grouping queries on fields both with and without indexes. MySQL apparently does not utilize different join algorithms like hash or sort-merge joins, but rather employs a uniform nested-loop-like approach, using indexes when available (see MySQL, 2004, pp. 356-358). Different types of joins are described in the documentation (ibid.), but these seem to indicate only how the appropriate records of a table can be selected, e.g. using an index or a full table scan. We choose to test equi-joins and only consider joining on fields in clustered indexes and joining on fields in non-clustered indexes.

The queries for the individual workloads are shown in appendix B.2. As can be seen, we use aggregation in the queries. For the grouping queries, we further use criteria on the aggregation to trick MySQL into sorting before dropping the records. This is all done to reduce the amount of data transferred to the MySQL client, which we will discuss later.

3.3 Measuring performance

We use the performance monitoring facilities of Itanium 2 to measure the occurrence of different events as described previously. The tool we use for our measurements cannot measure the work of the MySQL server alone, but measures the whole system; the work of other processes will therefore contribute to the measurements. Our profiling tool allows us to exclude time spent in kernel space from the measurements. The contribution from other processes should therefore be minimal, as our workloads take up almost all processing time and we do not actively run other processes on the system.

The Itanium 2 processor only has four performance counters, so we need to run our workloads several times to get all the data we need. This introduces further uncertainty, as counters from different runs may not sum precisely to the expected totals, as they would if we were able to measure all counts in a single run.

To easily automate the extensive measuring process we run the MySQL client locally on the same machine as the MySQL server process. To reduce the work of the client, which is also measured by our profiling tool, we use aggregation in the queries. We are interested in processor performance and therefore run our workloads with warm buffers. This reduces the number of context switches that would let other processes influence the measurements. Running with warm buffers and using aggregation limits how different parts of the source code are exercised – e.g. I/O code and code transferring data across the database application interface should be touched very little. It may be interesting to experiment with these parts of the code, but this is outside the scope of our experiments.

We run our experiment on a dual-processor system. We are not interested in determining the work distribution between the processors, and therefore simply measure the sum of occurrences of the different monitorable events on the two processors. This does not mean that the maximum number of instructions per cycle increases to 12, since we measure all the events the same way. That is, we measure not only the work carried out on both processors but also the cycles on both: when one cycle has passed in time, we measure two cycles, one from each processor. The summation gives us a single unified view of the system, similar to running the workloads on a single processor.

We measure basic performance counts for all the configurations running the different workloads, to examine if and how the compile options impact the performance of the workloads. As mentioned earlier, profile-guided optimization is very relevant for the Itanium 2 processor because it has simpler branch prediction hardware than other processors. We therefore examine the profile-guided optimization configurations further by measuring several events and breaking up the execution time as described previously. As the baseline for comparison we also measure the events and account for the execution time of the interprocedural optimization configuration.

The different narrow profile-guided configurations are based on profiles compiled by running only a few mutually exclusive workloads from our set, corresponding to workloads of the same query type. The narrow profile-guided configurations are only run on these workloads, and every workload is therefore only run on one narrow-profile configuration. It may be interesting to examine a configuration's potential performance degradation on workloads not used in its profile, but this is not our focus.

We obtain several measurements for most workloads by repeating the runs 20 times. Initial results showed that a few of the results would lie further apart than the rest, so we eliminate these outliers by removing the top and bottom ten percent of the measurements before determining the averages of the repeated measurements. One exception is the plan search workload, which we only run once, as we risk that MySQL will not generate new plans but rather reuse old plans on repetitions.
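Concretely, the trimming of the 20 repeated runs works out as follows (our restatement of the procedure): with the measurements sorted as x_(1) <= ... <= x_(20), ten percent at each end means discarding x_(1), x_(2), x_(19), and x_(20), and reporting

\[
\bar{x} = \frac{1}{16} \sum_{i=3}^{18} x_{(i)} .
\]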

3.4 Hardware, software, and practicalities

We run our tests on a dual-processor 900 MHz Itanium 2 system with 4 GB RAM running Linux kernel version 2.4.25. The MySQL source we use is version 4.1. For measuring the monitorable events of the processors we use the pfmon 2.0 profiling tool. We initially tried using OProfile instead, since that tool allows profiling of thread groups, but we were not able to get any consistent measurements with it. The pfmon tool also allows monitoring of thread groups, but only from pfmon version 3.0, which requires a 2.6 kernel to be installed on the machine. This was not possible in our test environment, since we were sharing the machine with other students and were not allowed to change the kernel to a new version.

To compile our configurations we had to use different versions of the Intel C++ compiler 8.0, as we experienced a number of problems with the compiler, such as crashes and generation of unusable code, i.e. executables that crash when run. In the end we used two different versions of the compiler, since we were unable to find a single version that was able to compile with all the options we wanted. We have listed the compile configurations, along with the compiler version used for each, in appendix A. Because of our problems with the icc compiler we have had to skip testing the auto-parallelization options and also the -ipo option, since we were unable to make these options work with any of the compiler versions we tried.

4 Results

In the following we present the results of our experiments. We start by examining the overall performance impact of the different compile optimizations. We then break down the reasons for stalls and discuss general execution characteristics of the database system. Finally, we study the profile-guided optimization option of the compiler and determine if and how the execution characteristics change with these optimizations.

In the graphs in this section we use a numbering of the workloads rather than their full descriptions, for compactness. We use the following numbers for the query categories: 1) singlepoint, 2) range, 3) grouping, 4) joining, and 5) plan search; and the following letters for the index categories: a) no index, b) clustered index, and c) non-clustered index (see also appendix B.2). In the text we use the descriptions and map them to the numbers.

To scale the bars in the graphs somewhat evenly when showing data for several workloads (which may have very different execution times), we normalize the values of the depicted variable. We do this by taking the average of its value across the configurations shown in the graph for a given workload and then dividing each of the workload's values by this average. This way the average height of the bars for any single workload is 1.
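Written out, for a workload w with measured value v_{w,c} under each of the k configurations shown in a graph, the plotted bar height is

\[
\tilde{v}_{w,c} = \frac{v_{w,c}}{\tfrac{1}{k} \sum_{c'=1}^{k} v_{w,c'}} ,
\]

so that the k bars of any single workload average to exactly 1, whatever the workload's absolute execution time.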

4.1 Performance impact of compiler optimizations

In this first part of our data analysis we focus on the overall performance of the different compilations of MySQL. That is, this section compares the running times of the different compilations on the individual workloads. The data from all the test runs can be seen in appendix C. For the performance discussion we use the graphs in figure 2 and, to a lesser extent, the graphs in appendices D.4 and D.3.

4.1.1 Optimization levels

Looking at the graphs in figure 2 we find the optimization levels in the first three bars. From these graphs it is clear that, in our tests, the third optimization level is the best performing of the batch. The heights of the bars indicate the number of clock cycles used during execution, and the -O3 bar is the lowest in all workloads except workload 4.c (joining, non-clustered index), where -O2 is a bit better than -O3. The difference between the optimization levels is not very big, though, with -O3 on average only 3.54% faster than -O1 and -O2. For most workloads the speedup with -O3 is under 5% and as such not significant, but for three workloads (range and grouping using non-clustered indexes, and plan search) the speedup is more than 5%.


Figure 2: The number of cycles, broken down into stalls and cycles in which actual computation is carried out. Also illustrates the minimum number of cycles required to do the work if 6 instructions per cycle could be achieved. The bars are scaled to fill the graphs, so the cycles cannot be compared between workloads, only within workloads.

The next question is why -O3 performs better. The answer probably lies in instruction-level parallelism. Looking at the graphs in appendix D.4, the -O3 bar is almost always a bit higher than the -O1 and -O2 bars, except for workload 2.b (range, clustered index), where -O2 is slightly better. This means that -O3 gets a little more work done in each clock cycle, which makes it finish its work a bit faster than the other optimization levels. Another explanation for why the third optimization level performs best could lie in the amount of stalling, but looking at the graphs in appendix D.3 this does not seem a reasonable explanation. In some of the workloads -O3 stalls more than the other two but, because of its higher instruction-level parallelism, it still executes faster. On average, though, -O3 stalls slightly less than the two others.


4.1.2 Interprocedural optimizations

In the test of interprocedural optimizations we only have two compilations, one using the -O1 and one using the -O3 optimization level. We choose to test with -O3 since this gave the best performance when only optimization levels were considered, and we also test the -O1 optimization level since an official Intel paper (Intel, 2004a) reports that this option gave the best performance of MySQL when compiling with Intel's C++ compiler.

Looking at the graphs in figure 2 we see the two compilations with interprocedural optimizations in the fourth and fifth bars. One thing that is clear from these graphs is that the compilation with -O3 performs better than the one with -O1 in almost all workloads, with workloads 2.b and 4.c (range, clustered index and joining, non-clustered index) as exceptions where they are comparable in speed. Plan search is also atypical, since there the -O1 compilation performs better than the -O3 one. Taken over all workloads, the compilation with -O3 gives a speedup of 2.92% on average. For most workloads the difference in performance between the two compilations is less than 5%, but for some, namely 2.c and 3.c (range and grouping, non-clustered index), the speedup is more than 5%. This is exactly the tendency seen in the test of optimization levels, which indicates that the optimizations done by -ip might not be significant in these compilations. This theory is further supported by the fact that the average speedup from the compilation using only -O3 to the compilation using -O3 and -ip is only 0.66%, when we disregard the plan search workload, which again shows atypical behavior. When taking plan search into account, the speedup is actually negative, -0.66%.

An interesting point is that the configuration using -O1 and -ip actually, in most cases, achieves a slightly better level of instruction-level parallelism than the other interprocedural optimized configuration (see the graphs in appendix D.4). But the data in appendix C show that the compilation with -ip and -O1 executes on average 7.73% more useful instructions than the other configuration. This, and the fact that the compilation using -O3 and -ip has slightly fewer stalls, makes the compilation with -O3 and -ip perform better.

4.1.3 Profile-guided optimizations

Again looking at the graphs in figure 2, it is clear that the best performing compilations overall are the two using profile-guided optimizations. There is only one exception, which is, as usual, in plan search, where the narrow PGO compilation is the worst performing of all the compilations.

The narrow compilations, trained only on the workloads within a single workload group, were an attempt to see if better optimizations would be gained if the profile used was more narrow, as opposed to the broad compilation, which was trained on all workloads. As can be seen in the graphs this was not the case, and the performance of the broad and narrow compilations is comparable in almost every workload. There are two exceptions: workload 2.b (range, clustered index), where the narrow configuration is best, and workload 5 (plan search), where the broad configuration is better. Because the two compilations are so much alike, we will only discuss the broad compilation further in this report.

The profile-guided optimization configurations are the first place where we see a really significant speedup in comparison to the other compilations. The best performing compilation when disregarding PGO is the one using -O3 and -ip, so this is the one we compare the PGO compilation to. The PGO compilation is also compiled with -O3 and -ip, and according to Intel's documentation PGO may increase the effect of interprocedural optimizations. On average the PGO compilation has an 8.05% speedup compared to the non-PGO compilation, with over half of the workloads showing a speedup of more than 5%.

So what makes PGO that much better than the best performing non-PGO compilation? The answer may be seen in the graphs in appendix D.3: again with plan search as the exception, the PGO compilations almost always have a considerably lower amount of stalling than the other compilations. The graphs in appendix D.4 also show that the PGO compilations have the highest instruction-level parallelism in all the workloads. This means that, even though the data show the broad PGO compilation on average executing 4.41% more useful instructions than the non-PGO compilation, the PGO compilation gets its work done faster, in fewer cycles.

4.2 Execution time characteristics

As discussed in the previous section, the number of cycles, the number of stalls, and the rate of instruction-level parallelism measured for each workload are influenced by the compiler optimizations. However, these performance parameters are also influenced by the individual types of workloads, and further exhibit some general tendencies. The average number of useful instructions retired per unstalled cycle is influenced by configuration as well as workload, but generally lies within the somewhat narrow interval from 1.28 to 2.4. The percentage of stalls varies more between than within workloads, but lies within 30-45% of the total number of cycles for most workloads. This also applies to the average stall percentage across all workloads, which is 37.34%.

To describe the stalls further we have collected data for several monitorable events while executing our workloads on the profile-guided optimization configuration based on a broad profile (PGO B) as well as the interprocedural optimization configuration (-O3 -ip) that serves as its base. This makes it possible to determine in more detail how the profile-guided optimizations impact the performance of the workloads, as well as to determine general execution characteristics. In the following we discuss the execution characteristics and postpone the former discussion to section 4.3.
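The two per-workload quantities quoted above, useful instructions per unstalled cycle and the stall percentage, derive from the counters introduced in section 2.2; written out, in our formulation of the definitions in the text:

\[
\text{useful instructions per unstalled cycle} =
\frac{\text{ia64\_inst\_retired} - \text{nops\_retired}}
     {\text{cpu\_cycles} - \text{back\_end\_bubble}},
\qquad
\text{stall percentage} = \frac{\text{back\_end\_bubble}}{\text{cpu\_cycles}} .
\]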

4.2.1 Cycle accounting

In figure 3 we have divided the total number of cycles into a few categories for all the workloads on the two above-mentioned configurations, according to the model of Jarp (2002).

Figure 3: Breakdown of the total number of cycles into a few categories detailing the stalls.

It is obvious from the figure that the plan search workload (5) is very different from the other workloads. There are variations between the other workloads, but generally data cache stalling is the single largest cause of stalls. For all the workloads on the two configurations, excluding plan search, the data cache stalls make up 47-74% of the stalls, or 10-36% of the cycles. Still excluding plan search, branch mispredictions and instruction cache misses are the next biggest causes of stalls. There are great variations between workloads, but branch mispredictions and instruction cache misses make up 10-24% and 3-27% of the stalls respectively, or 2-10% and 1-17% of the cycles. The last group of observable stalls, register stack engine stalls and floating-point unit stalls, also varies greatly. Excluding plan search, these stalls only make up 2-17% and 0-9% of the stalls, or as little as 0-9% and 0-3% of the cycles.

Focusing on the atypical plan search workload, these latter stall categories make up by far the largest part of the stalls. The RSE stalls may be due to long program paths, as the register stack engine stalls when spilling and filling registers, which is associated with procedure calls where registers are allocated and deallocated. The reason for the floating-point unit stalls is less obvious, but we suspect they may be associated with cost prediction in the optimizer.

4.2.2 Data cache stall accounting

The previously described model of Jarp (2002), for breaking the data cache stalls into the cycles used to access data at the different levels of the memory hierarchy, does not apply successfully to our data. When applying the model, the cycles spent at the different levels sum to a larger number than the total number of data cache stalls. The error is greatest for the plan search workload, where the sum is about 17 times larger than the number of data cache stalls. For the other workloads the sum ranges from about the same to twice the number of data cache stalls. The reason may be that different versions of the Itanium 2 processor exist with different caches, and presumably with different associated memory systems. It may also be costly to share the main memory between two processors, as in our setup. Our system may therefore have other costs associated with the different levels of the memory hierarchy than the costs used by Jarp (2002). In the following we will nevertheless use the model and discuss the distributions as if the proportions between the costs for the different levels of the hierarchy are still valid for our system.

Figure 4: The percentages of data cache stalls spent accessing data at the different levels of the memory hierarchy. While we have taken the sum of the components to be 100%, the sum does not equal the actual number of data cache stalls.


The distribution of cycles used to access data at the different levels of the memory hierarchy varies a lot between workloads (see figure 4). The main component of the data cache stalls fluctuates between the second level cache and main memory. An interesting observation is that for many workloads little time is spent accessing data in the third level cache; time is primarily spent accessing data either from the second level cache or from memory. This may indicate that when data is missing in the second level cache, it is seldom found at the third level. This does seem to be the case: for about half the workloads – those where the time spent accessing data in the third level cache is also very small – the hit rate for the L3 is less than 60%.

For some workloads most time is spent in the second level of cache, which indicates that data fits well in the cache and the cache is well utilized. This utilization seems to follow the selectivity or scanning behavior of the queries, though. For example, the grouping and joining workloads (3-4) spend a greater percentage of time accessing data in the second level cache than the singlepoint workloads. We have unfortunately not controlled or manipulated the selectivity of the range queries. However, we can expect scanning behavior in the no-index workload (2.a). In addition, the query plans (cf. appendix B.3) indicate that the optimizer expects a much greater number of rows for the no-index workload than for the other range workloads, and a much lower number of rows for the clustered index workload (2.b) than for the other two. This corresponds well with our suspicion and the data. The cache utilization we see therefore quite naturally follows how sequentially or randomly data is read.

Without knowledge of the inner workings of MySQL it is hard to interpret the results and conclude whether the caches can be utilized better. The large number of data cache stalls does draw attention to optimizing data access. It may be that the first level of cache can be utilized better or that the third level can be exploited. The fact that the workloads with little scanning behavior, where a lot of the total time is spent in the indexes, utilize the level two cache less than the other workloads may indicate that indices rather than data should be optimized for the second level cache. However, the lower cache utilization could as well be caused by the more random data access, and as such be hard to optimize.

4.3 Profile-guided optimization

In this section we compare the profile-guided optimization configuration based on a broad profile (PGO B) to its base, the interprocedural optimization configuration based on optimization level three (-O3 -ip), in order to examine the profile-guided optimizations’ impact on performance in more detail. In figure 5 we show the total number and the reasons for stalls for each workload on the two configurations. The bars are scaled to fill the individual graphs, so the amount of stalling cannot be compared across but only within workloads.


Figure 5: Shows the amount of stalling for the 2.1 (-O3 -ip) and 3.1 (PGO B) configurations divided into the reasons for stalling. The bars are scaled to fill the charts, which means that the amount of stalling cannot be compared between workloads but only between configurations within the same workload.

Observing the different workloads in figure 5, we see that the number of data cache stalls is reduced a little by the profile-guided optimization. Excluding the plan search workload, the fall in data cache stalls from the base configuration to the profile-guided optimization is only between 4% and 9%. The cycles wasted on branch misprediction are surprisingly not consistently reduced for all workloads; for just under half of the workloads the number of wasted cycles due to branch misprediction is even increased. This is most pronounced in the grouping workload on a non-clustered index key, where the increase in wasted cycles is as high as 32%. The most visible consistent reduction of stalls is the reduction of stalls due to instruction misses: excluding plan search, the reduction for the PGO configuration is 23-58%. Less visible but not less effective is the reduction of RSE stalls, 27-60%. The reduction of RSE stalls is also the primary single reason for the overall reduction of stalls in the plan search workload.

The reduction in instruction misses is presumably due to improved code locality, as the profile-guided optimization process moves frequently touched code together, while seldom-touched code is moved away from the rest of the code. The slight reduction in data cache stalls may also be due to code locality, as instructions and data share the level 2 and 3 caches: if less memory is used at these levels for instructions, it can be used for data instead.

The profile-guided optimization is supposed to put branch hints in the compilations to help the processor decide on branches. We are a bit surprised to see how little this helps and how it may even increase the number of cycles wasted on branch misprediction. An explanation may lie in our broad profile and the fact that some workloads execute for longer than others: the generated profile is trained more on some workloads, and the compilation gets optimized for these workloads while the other workloads pay the price. We are however skeptical of this explanation, as we have seen similar behavior for the narrow profile-guided optimized configurations.

The reduction of RSE stalls may be due to the fact that the profile-guided optimization enables the compiler to make better decisions about function inlining, as mentioned previously. Reducing the number of function calls reduces the number of registers allocated during execution, which may reduce the number of cycles the register stack engine spends spilling and filling registers.

With very few exceptions, the reduction of any single type of stall does not by itself produce a significant fall in the total number of cycles. However, the general reduction of stalls results in a fall in cycles of between 5% and 13% for most workloads, and an average fall in cycles of 5.71% across all workloads. The average decrease in cycles from the base configuration to the profile-guided optimization configuration is 8%, so the stalls only account for some of the decrease. The rest of the decrease is due to higher instruction-level parallelism: on average, the number of useful instructions retired per unstalled cycle is increased by 11%. The increase varies for the different workloads but falls within the interval 3-22%. Some of this increase is however absorbed by a greater number of useful instructions in the PGO configuration, on average an increase of 5.56%.

5 Conclusion

In this paper we examined the execution characteristics of MySQL with different workloads on an Itanium 2 system. In addition, we experimented with the Intel C++ compiler on the MySQL source and examined how different compiler optimizations impact the performance of the database system. Finally, we examined whether profile-guided optimization can significantly change the execution time characteristics of a MySQL database system.

We found that the behavior of MySQL on Itanium 2 is comparable to other database systems, in that a large part of the execution time is spent on stalls. However, the stall percentages for most of our workloads lie between 30% and 45%, which is somewhat less than for other systems (Ailamaki et al., 1999; Mortensen, 2002). The single largest stall component for our query workloads is data cache stalls, followed by branch misprediction and instruction cache misses. Like others, we may therefore point towards optimizing data access for better cache utilization to improve the general performance of the database system. Reducing branch mispredictions and instruction cache misses may be harder to do by hand, and this is also the focus of the profile-guided optimizations that we have studied.

With respect to the optimization levels, we found that the third level gives the best performance on our workloads. This came as a bit of a surprise, since we had not expected that the MySQL system would benefit from the software pipelining, prefetching, and loop unrolling activated at the second and third optimization levels. These optimizations did give a small performance boost, though, and in the end the -O3 option gave on average a 3.54% speedup compared to -O1 and -O2. Another interesting observation was that the interprocedural optimizations did not seem to give any performance speedup at all when used without profile-guided optimizations: the configurations compiled with -ip performed almost identically to those without.

The profile-guided optimizations gave the best performing compilation in our tests, performing on average 8.05% better than its closest rival. In examining the PGO options we tested both narrow profiles, trained only on workloads of the same type, and a broad profile trained on all the workloads. We found that the narrow and broad profiles had comparable performance. It therefore makes most sense to use a broad profile that performs well for several types of workloads rather than a narrow profile that only performs well for specific workloads. The reason the PGO compilations performed better was that they had higher instruction-level parallelism, and thus completed more instructions per cycle, and a lower percentage of stalled cycles. The stalls reduced by the profile-guided optimization are primarily stalls due to instruction cache misses and register stack engine stalls. These reductions are expected, as the profile-guided optimization improves code locality as well as function inlining. More surprisingly, we do not see a consistent reduction in the number of cycles wasted on branch misprediction; for some workloads we even see an increase. It may be interesting to try even narrower profiles and examine whether these would consistently reduce the cycles wasted on branch mispredictions.

There were also some optimizations we were not able to experiment with due to technical difficulties: the interprocedural optimizations across files and the auto-parallelization options. These options may be worth trying when the errors in the Intel C++ compiler have been corrected.


6 References

Ailamaki, A. et al. (1999). DBMSs on a modern processor: Where does time go? In Proceedings of the 25th VLDB Conference, (pp. 266-277). Edinburgh, Scotland: Morgan Kaufmann.

Intel (2002). Introduction to Microarchitectural Optimization for Itanium 2 Processors Reference Manual. Intel Corporation. Document number: 251464-001, downloadable from: http://www.intel.com/software/products/vtune/techtopic/Software_Optimization.pdf (08.06.2004).

Intel (2003). Intel C++ Compiler for Linux* Systems User's Guide. Intel Corporation. Document number: 253254-014, downloadable from: ftp://download.intel.com/support/performancetools/c/linux/v8/c_ug_lnx.pdf.

Intel (2004a). Accelerating one of the world's fastest databases. Intel C++ Compiler Case Study. Document number: 300461-002, downloadable from: http://www.intel.com/software/products/global/techtopics/mysql.pdf (08.06.2004).

Intel (2004b). Itanium 2 Processor Reference Manual: For Software Development and Optimization. Intel Corporation. Document number: 251110-003, downloadable from: http://www.intel.com/design/itanium2/manuals/251110.htm (05.06.2004).

Intel (2004c). Optimizing Applications with the Intel C++ and Fortran Compilers for Windows, Windows CE .NET, and Linux, Updated for Intel 8.0 Compilers. Intel Corporation. Document number: 300346-001.

Intel (2004d). Quick-Reference Guide to Optimization with Intel Compilers for Intel Pentium 4 and Intel Itanium Processor Families. Intel Corporation. Document number: 254349-002, downloadable from: http://www.intel.com/software/products/compilers/docs/qr_guide.pdf (08.06.2004).

Jarp, S. (2002). A methodology for using the Itanium 2 performance counters for bottleneck analysis. Technical report. Downloadable from: http://www.gelato.org/pdf/Performance_counters_final.pdf (05.06.2004).

Mortensen, B. B. (2002). Profiling the Predator query execution engine. Technical Report 2002-02, Department of Computer Science, University of Copenhagen. Downloadable from: http://www.diku.dk/publikationer/tekniske.rapporter/2002/02-02.pdf (05.06.2004).

MySQL (2004). MySQL Reference Manual. MySQL AB. Downloadable from: http://www.mysql.com/documentation/index.html (05.06.2004).

Tpc (2002). TPC Benchmark W (Web Commerce) Specification, Version 1.8. Transaction Processing Performance Council. Downloadable from: http://www.tpc.org/tpcw/spec/tpcw_V1.8.pdf (05.06.2004).

A Configurations

Table 5 shows the different configurations of compile options we have compiled the MySQL source with, along with the Intel C++ compiler version used for each.

Optimization Levels
  1.1: -tpp2 -static -O1                        (compiler 8.0.058_pl060)
  1.2: -tpp2 -static -O2                        (compiler 8.0.058_pl060)
  1.3: -tpp2 -static -O3                        (compiler 8.0.055)

Interprocedural Optimizations
  2.1: -tpp2 -static -O3 -ip -nolib_inline      (compiler 8.0.055)

Profile-guided Optimizations
  3.1-6: -tpp2 -static -O3 -ip -prof_use -prof_dir=...   (compiler 8.0.055)

Intel aggressive configuration
  4.1: -tpp2 -static -O1 -ip                    (compiler 8.0.058_pl060)

Table 5: The configurations we have compiled and the compiler versions we used.


B Workloads

B.1 Schema

Figure 6 shows the TPC-W database schema that we use for our workloads. The primary keys in the tables are used in clustered indexes, and we define non-clustered indexes for the foreign keys. To be able to construct all our workload queries, we define some additional indices: a unique non-clustered index on customer(c_uname), to be able to do single-point queries via a non-clustered index, and non-clustered indices on customer(c_birthdate) and orders(o_date).

Figure 6: Illustrates the TPC-W database schema which we use. Taken from the figure in Tpc (2002, p. 6). Bold type identifies primary and foreign keys, and the arrows point in the direction of one-to-many relationships between tables. The dotted lines and italics represent relationships that are related through a business rule in the specification.


B.2 Queries

Table 6 shows the different queries that we run on each compilation configuration. We execute each query several times with different random values for [rnd]. Letters are used in the query numbering to indicate the index categories, (a) no index, (b) clustered index, and (c) non-clustered index.

Point queries
  1.b: SELECT c_fname, c_lname FROM customer WHERE c_id = [rnd];
  1.c: SELECT c_fname, c_lname FROM customer WHERE c_uname = [rnd];

Range queries
  2.a: SELECT COUNT(o_c_id) FROM orders WHERE o_ship_date >= [rnd1] AND o_ship_date <= [rnd2];
  2.b: SELECT ... FROM orders WHERE o_id >= [rnd1] AND o_id <= [rnd2];
  2.c: SELECT ... FROM orders WHERE o_date >= [rnd1] AND o_date <= [rnd2];