Clustering and Reclustering HEP Data in Object Databases

Koen Holtman
CERN – EP division, CH - 1211 Geneva 23, Switzerland

We formulate principles for the clustering of data, applicable both to sequential HEP applications and to farming HEP applications with a high degree of concurrency. We make the case for the reclustering of HEP data on the basis of performance measurements and briefly discuss a prototype automatic reclustering system.
1 Introduction

As part of the CMS contribution to the RD45 [1] collaboration, database clustering and reclustering have been under investigation for about 1.5 years. The clustering of objects in an object database is the mapping of objects to locations on physical storage media like disk farms and tapes. The performance of the database, and of the physics application on top of it, depends crucially on having a good match between the object clustering and the database access operations performed by the physics application.

The principles for clustering discussed in this paper are based on, and illustrated with, a set of performance measurements. The performance measurements shown in this paper were all performed on a Sun Ultra Enterprise server, running SunOS Release 5.5.1, with hard disks in a SPARCstorage array (no striping, no other RAID-type processing, 2.1-GB 7200-rpm fast-wide SCSI-2, Seagate ST-32550W). These disks can be considered typical for the high end of the 1994 commodity disk market. All performance results were cross-validated on at least one other hardware/OS configuration, most results on at least two other configurations. The object database system used was always Objectivity/DB version 4 [2].

2 HEP data clustering basics

Most I/O intensive physics analysis systems, no matter what the implementation method, and no matter whether tape or disk based, use the following simple principles to optimise performance:

1. Divide the set of all events into fairly large chunks (in most current systems a chunk is a run or a part of a 'physics stream' [3]).
2. Implement farming (both for disks and CPUs) at the chunk level.
3. Make sure that (sub)jobs always iterate through the events in a chunk in the same order.
4. Cluster the event data in a chunk in the iteration order.

Though object databases make it perhaps easier than ever to build physics analysis systems which do not follow these principles, we believe that they are currently still the most viable basis for designing a performant production system.

Principle 1 above, dividing the event set into chunks, involves coarse-grained clustering decisions: strategies like dividing events into 'physics streams' [3] are often used here, and newer strategies are a topic of active research [4]. At the chunk level, reducing tape mounts is a very important goal, and an important constraint is that it is not feasible to recluster data often.
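As a minimal illustration, the following Python sketch (ours, not taken from any production system; all names, sizes and the stand-in read function are assumptions) shows how the four principles fit together: events are divided into chunks, each chunk is farmed out to its own subjob, and every subjob iterates through its chunk in one fixed order, which is also the order in which the data are clustered on disk.

```python
# Sketch of principles 1-4: chunked events, farming at the chunk level,
# and a fixed per-chunk iteration order matching the clustering order.
from multiprocessing import Pool

CHUNK_SIZE = 1000  # assumed chunk size; real chunks would be runs or stream parts

def make_chunks(event_ids):
    """Principle 1: divide the event set into fairly large chunks."""
    return [event_ids[i:i + CHUNK_SIZE]
            for i in range(0, len(event_ids), CHUNK_SIZE)]

def read_event(event_id):
    # Stand-in for reading one event's objects from the database.
    return 1

def process_chunk(chunk):
    """Principle 3: always iterate through a chunk in the same (stored) order.
    Because the data are clustered in that order (principle 4), the reads
    on disk are sequential."""
    total = 0
    for event_id in chunk:          # fixed iteration order
        total += read_event(event_id)
    return total

if __name__ == "__main__":
    chunks = make_chunks(list(range(10000)))
    with Pool(4) as pool:           # principle 2: farm at the chunk level
        results = pool.map(process_chunk, chunks)
    print(sum(results), "events processed")
```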
Most of this paper deals with the refinement of principle 4 above, that is, with clustering and reclustering decisions at the sub-chunk level. At this level, an important goal is to achieve near-sequential reading on the disk or disk farm, and frequent reclustering is feasible as a strategy for optimising performance and reducing disk space occupancy. The clustering techniques discussed in this paper can be used with both the Objectivity/DB [2] and Versant [5] object databases, though Objectivity offers far more direct and convenient support for them. Note that clustering and reclustering are not problems which are specific to commercial object databases. At its core, the clustering problem is one of reducing disk seeks, tape seeks and tape mounts, and this problem exists equally in physics analysis systems not based on object databases, even though other systems may use different terminology to describe it.

3 Type-based clustering
The most obvious way to refine principle 4 above is to cluster data as shown in Fig. 1. For each event in the chunk, the event data is split into several objects of different types. For example, one type can hold all data for a single subdetector. Then, these objects are grouped by type into collections. Inside each collection, the objects for the different events are clustered in the iteration order. This way, a job which only needs one type of data per event automatically performs sequential reading over a single collection, which exactly contains the needed data, yielding the maximum achievable I/O performance.

Figure 1: Clustering of a chunk into collections. (For events 1 ... N, objects of the types Detector X, Detector Y, Detector Z, Reconstructed P's and Event summary tags are grouped into one collection per type.)

There are some performance pitfalls, however, for jobs which need to read two or more collections from the same disk or disk array. A job reading two collections has a logical object reading pattern as shown on the left in Fig. 2. To achieve near-sequential throughput for such a job, the logical pattern needs to be transformed into the physical pattern at the right in Fig. 2.

Figure 2: Reading two collections: logical pattern (left) and preferred physical pattern (right).

We found that this transformation was not performed by Objectivity/DB, the database on which we implemented our test system, nor by the operating system (we tested both SunOS and HP-UX), nor by the disk hardware, for various commodity brands. The result was a significant performance degradation, especially when reading more than two collections; see the solid line in Fig. 3.

Figure 3: One client reading multiple collections of 8 KB objects from a single disk. (Read performance in MB/s versus the number of collections, for 800 KB read-ahead, 160 KB read-ahead, and no read-ahead.)

We eliminated the performance degradation by extending the collection indexing/iteration class to read ahead objects into the database cache. This extension could be made without affecting the end-user physics analysis code. Measurements (see Fig. 3) showed that when 800 KB worth of objects were read ahead for each collection, the I/O throughput approached that of sequential reading (3.9 MB/s).
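The following Python sketch illustrates the idea behind this read-ahead extension. It is our own illustrative reconstruction, not the actual Objectivity/DB extension; the function names are invented and the buffer and object sizes are taken from the measurements in Fig. 3.

```python
# Per-collection read-ahead: instead of fetching one object at a time from
# each collection, a whole buffer of consecutive objects is fetched per
# collection, so the disk sees long near-sequential reads per collection.

READ_AHEAD_BYTES = 800 * 1024   # 800 KB per collection, as in Fig. 3
OBJECT_SIZE = 8 * 1024          # 8 KB objects, as in the measurements

def read_ahead_iter(collection, object_size=OBJECT_SIZE,
                    buffer_bytes=READ_AHEAD_BYTES):
    """Yield objects one by one, but fetch them from storage in large
    sequential batches (the read-ahead buffer)."""
    batch = max(1, buffer_bytes // object_size)
    for start in range(0, len(collection), batch):
        buffered = collection[start:start + batch]  # one near-sequential read
        for obj in buffered:
            yield obj

# A job reading several collections in step, one object per event from each:
all_collections = [[(c, e) for e in range(5000)] for c in range(3)]
iterators = [read_ahead_iter(c) for c in all_collections]
for event_objects in zip(*iterators):
    pass  # analyse the objects belonging to one event
```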
Keeping all collections on different disks would of course be an alternative to the read-ahead optimisation. That approach would however create a load balancing problem: for optimal performance one has to make sure that all disks are kept busy, even for jobs which only read one or a few collections. The problem can be solved to some extent by mapping collections in different chunks to disks as shown in Fig. 4.

Figure 4: Load-balancing arrangement as an alternative to using a read-ahead optimisation. (The type X, Y and Z collections of chunks 1, 2 and 3 are assigned to disks 1, 2 and 3 in a cyclically shifted order.)

This will produce load balancing for any number of collections, assuming that the subjobs running in parallel on each chunk are about equally heavy. A problem with this solution is that it requires a higher degree of connectedness between all disks and all CPUs. We therefore prefer to use the read-ahead optimisation: by devoting a modest amount of memory (given current RAM prices) to read-ahead buffering, we can keep the objects for one event together on the same disk, which gives us greater fault-tolerance and decreases the dependence on subjob scheduling.
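A small sketch of the Fig. 4 arrangement, under the assumption that the per-type collections are rotated cyclically across the disks from chunk to chunk (the exact mapping in Fig. 4 may differ):

```python
# Cyclic assignment of per-type collections to disks, so that a job reading
# one collection type per chunk still keeps all disks busy.

TYPES = ["X", "Y", "Z"]
N_DISKS = 3

def disk_for(chunk, type_index):
    """Shift the type-to-disk assignment by one disk for every chunk."""
    return (chunk + type_index) % N_DISKS

for chunk in range(3):
    layout = {f"type {t}": f"disk {disk_for(chunk, i) + 1}"
              for i, t in enumerate(TYPES)}
    print(f"chunk {chunk + 1}:", layout)
```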
4 Random database access

For small objects, a good clustering is more important than for large objects. This is illustrated by Fig. 5, which plots the ratio between the speed of sequential reading and that of random reading for different object sizes. Fig. 5 shows a curve for 1994 disks and one for disks in the year 2005, based on an analysis of hard disk technology trends [6]. The performance of sequential reading is the performance of the best possible clustering arrangement; that of random reading is the performance of the worst possible clustering arrangement. Fig. 5 therefore also plots the worst-case performance loss in the case of bad clustering.

Figure 5: Performance ratio between sequential and random reading. (Ratio versus average object size from 32 bytes to 64 KB, for 1994 disks and 2005 disks (est.).)

We see that currently, for objects larger than 64 KB, clustering is not that important: the performance loss for bad clustering is never more than a factor 2. The 2005 curve shows however that the importance of good clustering will increase in future.
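The shape of the curves in Fig. 5 can be understood from a simple disk model: a random read pays a seek and rotational latency per object, while a sequential read pays essentially only the transfer time. The sketch below uses assumed parameters for a 1994-class disk; it illustrates the trend and is not a reproduction of the analysis in [6].

```python
# Simple model of the sequential/random performance ratio per object size.
# Parameters are rough assumptions for a 1994-class commodity disk.

SEEK_PLUS_ROTATION_S = 0.015   # assumed ~15 ms per random access
TRANSFER_MB_PER_S = 5.0        # assumed sustained transfer rate

def seq_over_random_ratio(object_size_bytes):
    transfer_s = object_size_bytes / (TRANSFER_MB_PER_S * 1024 * 1024)
    random_s = SEEK_PLUS_ROTATION_S + transfer_s   # seek cost dominates when small
    return random_s / transfer_s

for size in (32, 1024, 8 * 1024, 64 * 1024):
    print(f"{size:>6} B objects: ratio ~ {seq_over_random_ratio(size):7.1f}")
```

For 64 KB objects the model gives a ratio of about 2, and for 32-byte objects a ratio in the thousands, consistent with the 1994 curve in Fig. 5.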
5 Selective reading

Physics analysis jobs usually don't read all objects in a collection: they only iterate through a subset of the collection, corresponding to those events satisfying a certain cut predicate. We call this iteration through a subset selective reading. The selectivity is the percentage of objects in the collection which is needed by the job. In tests we found that, as the selectivity decreases, the throughput of selective reading drops rapidly, only to level out at the throughput of random reading. This is shown, for a collection of 8 KB objects, with an 8 KB database page size, in Fig. 6. In other tests we found that the curve in Fig. 6 does not change much as a function of the page size.

Figure 6: Selective reading of 8 KB objects. (Bandwidth in MB/s versus selectivity from 100% down to 0%; sequential: 5.4 MB/s, random: 0.7 MB/s.)

The curve in Fig. 6 has two distinct parts. In the part covering selectivity values from 100% to roughly 15%, the decrease in throughput exactly mirrors the decrease in selectivity. If we had sequentially read all objects, and then thrown away the unneeded ones, the job would have taken the same time. Thus, in this part of the curve, selective reading is useless as an optimisation device if the job is disk-bound. However, selective reading will decrease the load on the CPU, the cache, and (if applicable) the network connection to the remote disk. This reduction in load depends largely on the selectivity on database pages, not on objects; see [7] for a discussion of page selectivity.

In the part of the curve between 15% and 0%, selective reading is faster than sequential reading and then throwing data away. On the other hand, it is not faster than random reading.

We found that the boundary between the two parts of the curve, which is located at 15% in Fig. 6, depends on the average object size. This boundary is visualised in Fig. 7.

Figure 7: Selective reading performance boundary. (Boundary selectivity versus average object size from 32 bytes to 64 KB; above the boundary, selective reading is only as fast as sequential reading; below it, selective reading is faster than sequential reading.)

From Fig. 7 we can conclude that for collections of small objects, selective reading may not be worth the extra indexing complexity over sequential reading and throwing the unneeded data away. A corollary of this is that one could pack several small 'logical' objects into larger 'physical' objects without a loss of performance, even for high selectivities. For collections of large objects, a selective reading mechanism can be useful as a means of ensuring that the performance never drops below that of random reading.
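The location of the boundary can be estimated with a simple model of our own (an assumption, not the paper's measurement code): selective reading via random reads beats a full sequential scan once the selectivity drops below the ratio of random to sequential throughput. With the 8 KB figures from Fig. 6:

```python
# When does selective (random) reading beat a sequential scan-and-discard?
SEQ_MB_S = 5.4     # sequential throughput from Fig. 6
RAND_MB_S = 0.7    # random-read throughput from Fig. 6

def selective_is_faster(selectivity):
    """selectivity in [0, 1]: fraction of the collection's objects needed.
    Sequential scan of the whole collection:  size / SEQ_MB_S
    Random reads of only the needed fraction: selectivity * size / RAND_MB_S
    (the collection size cancels out of the comparison)"""
    return selectivity / RAND_MB_S < 1.0 / SEQ_MB_S

boundary = RAND_MB_S / SEQ_MB_S
print(f"boundary selectivity ~ {boundary:.0%}")   # ~13%, close to Fig. 6's 15%
for s in (0.50, 0.15, 0.05):
    print(f"{s:.0%}: selective reading faster? {selective_is_faster(s)}")
```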
6 Reclustering

To avoid the performance degradation at low selectivities which we discussed above, reclustering could be used. Reclustering, re-arranging the objects in the database, is an expensive operation, but just letting the performance drop with decreasing selectivity can easily be more costly, especially for collections in which objects are small. The simplest form of reclustering is to copy only those objects which are actually wanted in a particular analysis effort to a new collection at the start of the effort (a minimal sketch of this is given below). The creation of a data summary tape or an ntuple file is an example of this simple form of reclustering.

Much more advanced forms of reclustering are feasible in a system based on an object database. Automatic reclustering, in which the system reacts to changing access patterns without any user hints beforehand, is feasible whenever there are sequences of jobs which access the same event set. We have prototyped an automatic reclustering system ([8], [9]) which performs reclustering transparently to the user code, can optimise clustering for four different analysis efforts at the same time, and keeps the use of storage space within bounds by avoiding the duplication of data.
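The sketch below illustrates the simple form of reclustering mentioned above. It is an illustration only, not the prototype of [8] and [9]; the cut predicate and data layout are invented.

```python
# Simplest form of reclustering: copy the objects selected by a cut predicate
# into a new, densely clustered collection at the start of an analysis effort.

def recluster(collection, cut):
    """Return a new collection holding, in iteration order, only the objects
    that pass the cut. Later jobs read this collection sequentially instead
    of reading the original collection selectively."""
    return [obj for obj in collection if cut(obj)]

events = [{"id": i, "pt": i % 100} for i in range(10000)]  # invented events
hot = recluster(events, lambda e: e["pt"] > 90)            # invented cut
print(len(hot), "of", len(events), "objects copied to the new collection")
```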
We refer the reader to [8] for a discussion of the architecture of our automatic reclustering system. Fig. 8 illustrates its performance under a simple physics analysis scenario in which 40 subsequent jobs are run, with a new cut predicate being added after every 10 jobs.
Figure 8: Performance of 40 subsequent jobs without (left) and with (right) automatic reclustering. Each pair of bars represents a job: the black bar represents the number of objects accessed, the grey bar the (wall clock) job run time. (Axes: job size and run time versus subsequent jobs; the right-hand plot also marks the batch reclustering operations.)
Reclustering is an important research topic in the object database community (see [8] for some references). However, this research is directed at typical object database workloads like CAD workloads. Physics analysis workloads are highly atypical: they are mostly read-only, transactions routinely access millions of objects, and, most importantly, the workloads lend themselves to 'streaming' type optimisations. It is conceivable that vendors will bundle general-purpose automatic reclustering systems with future versions of object database products, but we do not expect that such products will be able to provide efficient reclustering for physics workloads. As far as reclustering is concerned, physics analysis is too atypical to be provided for by the market. Therefore, we conclude that the HEP community will have to develop its own reclustering systems.

7 Conclusions

We have discussed principles for the clustering and reclustering of HEP data. The performance graphs in this paper can be used to decide, for a given physics analysis scenario, whether
certain clustering techniques can be ignored without too much loss of performance, whether they need to be considered, or whether they are indispensable.

We have shown performance measurements mainly for the single-client single-disk case. In additional performance tests ([6], [10]) we have verified that the techniques described above are also applicable to a system with disk and processor farming. Specifically, if a client is optimised to access the database with a good clustering efficiency, then it is possible to run many such clients concurrently, all accessing the same disk farm, without any significant performance degradation. Furthermore, the operating system will ensure that each client gets an equal share of the available disk resources. For a detailed discussion of the scalability of farming configurations with hundreds of clients, we refer the reader to [10].

References

[1] RD45, A Persistent Storage Manager for HEP. http://wwwcn.cern.ch/asd/cernlib/rd45/
[2] Objectivity/DB. http://www.objy.com/
[3] D. Baden et al., Joint DØ/CDF/CD Run II Data Management Needs Assessment, CDF/DOC/COMP UPG/PUBLIC/4100, DØ Note 3197, March 20, 1997.
[4] Grand Challenge Application on HENP Data. http://www-rnc.lbl.gov/GC/
[5] The Versant object database. http://www.versant.com/
[6] K. Holtman, Prototyping of CMS Storage Management, CMS NOTE/1997-074.
[7] The RD45 collaboration, Using an Object Database and Mass Storage System for Physics Analysis, CERN/LHCC 97-9, 15 April 1997.
[8] K. Holtman, P. van der Stok, I. Willers, Automatic Reclustering of Objects in Very Large Databases for High Energy Physics, Proc. of IDEAS '98, Cardiff, UK, p. 132-140, IEEE 1998.
[9] Reclustering Object Store Library for LHC++, V2.1. Available from http://wwwcn.cern.ch/~kholtman/
[10] K. Holtman, J. Bunn, Scalability to Hundreds of Clients in HEP Object Databases, Proc. of CHEP'98, Chicago, USA.