Indexing Blocks to Reduce Space and Time Requirements for Searching Large Data Files

Tzuhsien Wu, Shyng Hao and Jerry Chou†
Computer Science Department, National Tsing Hua University, Taiwan
† Corresponding author: [email protected]

Bin Dong and Kesheng Wu
Scientific Data Management Group, Lawrence Berkeley National Laboratory, USA
Abstract—Scientific discoveries increasingly rely on the analysis of massive amounts of data generated from scientific experiments, observations, and simulations. The ability to directly access the most relevant data records, without sifting through all of them, becomes essential. While many indexing techniques have been developed to quickly locate selected data records, the time and space required to build and store these indexes are often too expensive to meet the demands of in situ or real-time data analysis. Existing indexing methods generally capture information about each individual data record; however, when reading a data record, the I/O system typically has to access a block or a page of data. In this work, we postulate that indexing blocks instead of individual data records can significantly reduce index size and index building time without increasing the I/O time for accessing the selected data records. Our experiments using multiple real datasets on a supercomputer show that block index can reduce query time by a factor of 2 to 50 over existing methods, including SciDB and FastQuery. Moreover, the size of the block index is almost negligible compared to the data size, and index building can proceed at nearly the peak I/O speed.

Keywords—Scientific data, Block index and query, Parallel I/O
I. INTRODUCTION
Massive amounts of scientific data are generated from scientific experiments, observations, and simulations. The size of these datasets currently ranges from hundreds of gigabytes to tens of petabytes [1]. For example, the Intergovernmental Panel on Climate Change (IPCC) is in the process of generating tens of petabytes (10^15 bytes) for its fifth assessment report (AR5; more information is available at http://www.ipcc.ch/), but critical information related to important events such as hurricanes might occupy no more than a few gigabytes (10^9 bytes). Therefore, the ability to access only the necessary data records, instead of all of them, can significantly accelerate data analysis operations [20], [19]. The best-known technology for locating selected data records in a large dataset is indexing [17, Ch. 6], such as the variants of the B-tree [7] used in DBMSs, bitmap indexes [22], and inverted indexes. However, the size of the indexes produced by these techniques is often close to or even larger than the size of the original datasets. As the volume of datasets continues to grow, indexes will also consume more computing time and storage space, and, most
importantly, more I/O bandwidth to access. The problem is further exacerbated by the increasing gap between I/O and compute performance [12], and by the growing demand for real-time or in situ data analysis [21], [16]. Recently, there have been attempts [14], [13] to address the problem by applying sorting and compression to reduce the size of indexes. While data reorganization effectively improves query performance, data access time, and compression ratio, it also suffers from several limitations: (1) duplicated data must be maintained and managed; (2) sorting on a single variable only benefits queries involving the sorted variable, but not the other variables; and (3) data query is only one part of a data analysis workflow, so reorganizing and encoding data may have severe impacts on other processing steps.

In this work, we take a different approach based on the I/O characteristics of storage systems. We propose a new indexing technique called "block index". The idea is motivated by the fact that storage systems are designed and optimized to read consecutive data. Thus, the read time of an individual record can be just as long as the read time of a data block. Furthermore, reading one block at a time greatly reduces the number of I/O requests compared to random accesses to individual data records. Therefore, instead of building indexes on a per-record basis, we propose to build indexes and read data from file on a per-block basis. Our experimental evaluation shows that this co-design of indexing and I/O access method can significantly reduce the space and time requirements of the index, while delivering comparable or even better query performance compared to existing solutions. The contributions of this work are summarized as follows.

• Propose block index, which exploits the I/O characteristics of storage systems to significantly reduce index size without sacrificing query performance.
• Implement block index in parallel, and optimize its performance by using an adaptive dynamic scheduling strategy for load balancing.
• Conduct experimental evaluations using multiple scientific datasets on a supercomputer.
• Demonstrate that block index can reduce query time by a factor of 2 to 50 with negligible index size and minimal index building time in practice.
The rest of the paper is structured as follows. Section II introduces the approach and parallel implementation of block index. The experimental setup and results are presented in Section III. Finally, Section IV discusses related work, and Section V concludes the paper.
TABLE I. DATASET DESCRIPTIONS. DIFFERENT NUMBERS OF PROCESSES WERE USED FOR EVALUATION ACCORDING TO THE FILE SIZE.

dataset     source           #records      file size   #proc
VPIC        plasma physics   188 billion   752 GB      1728
PTF         cosmology        900 million   3.75 GB     64
CAM         climate          120 million   457 MB      16
Synthetic   uniform random   1.7 billion   6.8 GB      64
II. APPROACH
A. Block Index

The goal of this work is to enable efficient range queries (e.g., "temperature > 100") with position and value retrieval while minimizing the time and space requirements of the index. To achieve this goal, we propose a new indexing technique called "block index", which builds the index at the level of blocks of consecutive records instead of at the level of individual records. A block index can be designed in many forms, as long as the index is built with respect to groups of consecutive data records. Two fundamental design decisions are (1) how to partition the dataset into blocks, and (2) what index information should be recorded for each block. For this study, we developed a block index method in its simplest form as a first attempt to prove the feasibility of block indexes. Our proposed method partitions the dataset into non-overlapping, fixed-size blocks with a user-specified block size. The index is built by recording only the maximum and minimum values of each block. When a query arrives, the index is loaded and compared against the range constraints of the query. If the range of a block intersects the query range, the block is read from file to perform a candidate check on its individual records. Finally, the values of the matching records are returned to the user.

Our block index strategy has the following advantages: (1) Both the query and indexing procedures involve only sequential scans of data or indexes, and the minimum I/O size can be controlled by the block size. Thus, the I/O accesses of block index can be handled efficiently by the underlying file system. (2) The computation time for recording the max and min values of each block is negligible, so the whole indexing process can be as fast as sifting through the dataset. (3) The size of the index is inversely proportional to the block size. Thus, with a reasonable block size (i.e., larger than 1,000 records per block), the total index size is extremely small (0.1% of the data size), and the time for storing and loading indexes is negligible, as shown in our experimental results. (4) The block index can be stored as a plain array without any additional metadata. In contrast, other existing indexing techniques have to maintain extra file structures and metadata to serialize their complex index structures, which may cause inefficiencies and extra I/O workload [4]. (5) Finally, benefiting from its simple data access pattern and index structure, block index can be easily implemented and parallelized in practice, as described in the next subsection.

B. Parallel Implementation

Due to the simplicity of block index, it is rather straightforward to parallelize its indexing and query processes. Indexing has three steps: (1) load data, (2) build the index, and (3) store the index. Since the sizes of both the dataset and the blocks are fixed and known, processes can perform all three steps independently in parallel.
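To make the procedure in Section II-A concrete, the following sketch is a minimal serial illustration, not the authors' MPI/Parallel HDF5 implementation; the array, block size, and function names are assumptions. It records only the (min, max) of each fixed-size block, answers a one-sided range query with a boundary check over the block summaries, and then performs a candidate check on the records of the selected blocks.

```python
import numpy as np

def build_block_index(data, block_size):
    """Record only the (min, max) of each fixed-size block of consecutive records."""
    n_blocks = (len(data) + block_size - 1) // block_size
    mins = np.empty(n_blocks)
    maxs = np.empty(n_blocks)
    for b in range(n_blocks):
        block = data[b * block_size:(b + 1) * block_size]
        mins[b], maxs[b] = block.min(), block.max()
    return mins, maxs

def query_greater_than(data, mins, maxs, block_size, threshold):
    """One-sided range query 'value > threshold' using the block index."""
    hit_positions, hit_values = [], []
    for b in np.nonzero(maxs > threshold)[0]:   # boundary check: skip blocks that cannot match
        start = b * block_size
        block = data[start:start + block_size]  # read the whole block sequentially
        idx = np.nonzero(block > threshold)[0]  # candidate check on individual records
        hit_positions.extend(start + idx)
        hit_values.extend(block[idx])
    return hit_positions, hit_values

# usage: a synthetic array, 1,000 records per block, selectivity around 1e-4
data = np.random.rand(1_000_000)
mins, maxs = build_block_index(data, block_size=1000)
positions, values = query_greater_than(data, mins, maxs, 1000, threshold=0.9999)
```

For this query form only the per-block maxima are consulted during the boundary check; the minima serve the other one-sided form and two-sided ranges.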
Furthermore, because the computation and I/O time are identical for each block, the load can be balanced by evenly distributing the data blocks among processes. Specifically, we use MPI and the Parallel HDF5 library to perform all of these I/O operations collectively to achieve the best performance. Query, on the other hand, has five steps: (1) load the index, (2) boundary check, (3) gather and schedule the selected blocks, (4) candidate check, and (5) collect the results. Similar to indexing, steps 1, 2, and 4 can be done in parallel. However, the workload may not be well balanced among processes, because the number of selected blocks returned by the boundary check in step 2 can vary among processes. As a result, distributing data blocks evenly among processes does not guarantee a balanced load. To solve this problem, we introduce step 3 to collect the selected blocks that require a candidate check and redistribute them at runtime to balance the load and minimize query execution time.

To address this load-balancing issue, we implemented an adaptive dynamic scheduling algorithm that combines the strengths of both static and dynamic scheduling. Treating the candidate check of a selected block as a task, a static scheduling algorithm evenly partitions tasks among processes and sends a list of block IDs of the assigned tasks to each process before the next step. In contrast, a dynamic scheduling algorithm schedules only one task to a process at a time; a worker dynamically requests the next task after it completes its previously assigned one. Our adaptive dynamic scheduling algorithm is controlled by a parameter θ, which denotes the percentage of the selected blocks that are scheduled evenly and statically, as in the static scheduler; the rest of the blocks are scheduled dynamically. Furthermore, instead of assigning a fixed number of blocks (the chunk size) to a worker at a time, we use an exponential back-off strategy to decrease the scheduling chunk size over time. The chunk size decreases every time a new job is assigned, so a smaller chunk size is used toward the end to balance the load more evenly. At the same time, the exponential back-off mitigates the scheduling overhead between the master and workers, and the larger chunk sizes used at the beginning avoid a random access pattern.
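The following sketch is a simplified, single-process illustration of this adaptive strategy with assumed names and default parameters, not the authors' MPI master/worker code. A fraction θ of the selected blocks is partitioned statically up front, and the remainder is handed out in chunks whose size decays exponentially after every assignment (here simulated by cycling over workers instead of responding to whichever worker finishes first).

```python
def adaptive_schedule(selected_blocks, n_workers, theta=0.5, backoff=0.5, min_chunk=1):
    """Return per-worker lists of block IDs: a static portion (theta) plus
    dynamically assigned chunks whose size decays exponentially."""
    assignments = [[] for _ in range(n_workers)]

    # static phase: evenly partition the first theta fraction of the blocks
    n_static = int(len(selected_blocks) * theta)
    for i, blk in enumerate(selected_blocks[:n_static]):
        assignments[i % n_workers].append(blk)

    # dynamic phase: hand out chunks that shrink by the back-off factor
    # after every assignment (round-robin stands in for worker requests)
    remaining = list(selected_blocks[n_static:])
    chunk = max(min_chunk, len(remaining) // n_workers)
    worker = 0
    while remaining:
        assignments[worker].extend(remaining[:chunk])
        del remaining[:chunk]
        chunk = max(min_chunk, int(chunk * backoff))  # exponential back-off of chunk size
        worker = (worker + 1) % n_workers
    return assignments

# usage: 1,000 selected blocks, 8 workers, half of the blocks scheduled statically
plan = adaptive_schedule(list(range(1000)), n_workers=8, theta=0.5)
```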
III. EXPERIMENTAL EVALUATIONS
A. Experimental setup

We conducted our experiments on Edison, a large supercomputer at NERSC. It has 5,576 compute nodes, each with two 12-core Intel Ivy Bridge 2.4GHz CPUs and 64GB of memory. Its storage is provided by a Lustre parallel file system with 144 OSTs and a peak performance of 72 GB/s. We evaluated our indexing technique using the four datasets summarized in Table I. All the datasets were stored on the Lustre parallel file system with the default stripe settings, except that
TABLE II. INDEX SIZE MEASURED BY THE OUTPUT INDEX FILE SIZE DIVIDED BY THE INPUT DATA FILE SIZE.

dataset     SciDB   FQ-P2   FQ-P3   FQ-P4   BI-1KB   BI-64KB
VPIC        1.03    0.29    0.80    1.72    0.0078   0.0012
PTF         1.10    0.72    1.47    2.40    0.0078   0.0012
CAM         0.12    1.94    2.91    3.13    0.0078   0.0012
Synthetic   1.09    0.90    1.02    1.04    0.0078   0.0012

TABLE III. INDEX THROUGHPUT ON LUSTRE FILE SYSTEM.

dataset     SciDB      FQ-P2      FQ-P3      FQ-P4      BI
VPIC        601MB/s    784MB/s    674MB/s    456MB/s    50GB/s
PTF         486MB/s    135MB/s    53MB/s     35MB/s     6.67GB/s
CAM         51.3MB/s   15.3MB/s   12.8MB/s   12.6MB/s   1.68GB/s
Synthetic   410MB/s    149MB/s    115MB/s    58MB/s     6.67GB/s
the maximum stripe count of 144 was used for the largest dataset, VPIC, to obtain better I/O performance and shorter processing time.

We present the performance study using the most common one-sided range query, which has the general form "Var > threshold". The value of threshold is adjusted to meet four different query selectivities from 10^-3 to 10^-6. The values of the selected records must also be retrieved as query results. In future work, we would like to further analyze and evaluate the performance of block index on more complex query forms, such as multi-variable or two-sided queries.

We compare block index with FastQuery [6], SciDB [3], and Scan. We use BI-x to denote the results of block index with block size x, and use 130KB as the default block size according to our performance analysis on Lustre presented in Section III-D. FastQuery is a parallel implementation of FastBit, a well-known compressed bitmap indexing technique. We use FQ-Px to denote the FastQuery results with binning precision x. A higher precision implies that more bins are used to build the indexes, so query time tends to be faster, but the index size is also larger. SciDB is a scientific database designed for the array data model with its own distributed data management system. It uses run-length compression to store data and does not provide additional indexing features. The largest scale we could set up for SciDB at NERSC is 64 instances/processes. For comparison, we use the size of the run-length compressed data and the time of loading data into the database as the space and time requirements of SciDB. Finally, Scan is a parallel implementation that answers a query simply by sifting through the dataset and keeping the matched data records. Scan is used as a baseline for performance comparison; its query throughput reflects the peak I/O throughput of reading a dataset from the storage system.

B. Indexing Comparison

This subsection demonstrates that block index can significantly reduce the space and time requirements of indexing by examining its index file size and index throughput over multiple datasets. The index file size comparison is shown in Table II. As observed, the index size of FastQuery ranges from 20% to 300% of the input data size. As the binning precision increases from 2 to 4, the index size grows significantly. The index size of CAM is the largest because its records have the widest range of values. For SciDB, we only observed a good compression ratio on the CAM dataset; the other three datasets have higher randomness and essentially cannot be compressed by SciDB. In comparison, the index size of block index is extremely small
compared to the input data size or the FastQuery index size. Since our current implementation only records the max and min values of each block, as shown in the table, even with a block size of 1KB, the index size is still less than 1% of the data size. The index size could be further reduced by choosing a larger block size, such as 64KB or even 1MB. Therefore, the index size of block index is essentially negligible.

Benefiting from the small index size, the index throughput of block index is also much higher, as shown in Table III. The block size setting has limited impact on the index throughput when the block size is larger than 1KB, so block index is simply denoted as BI in the table. In fact, because the time for computing and writing indexes is negligible compared to the time for reading the input data, we found that block index always achieves an index throughput close to the peak I/O throughput. For instance, VPIC has the highest index throughput because we used 1728 processes to index the dataset across 144 Lustre storage targets (OSTs). The observed 50GB/s throughput is already close to the 72GB/s peak I/O throughput reported for the system. The index throughput of the other three datasets is less than 10GB/s, because those files are spread across a limited number of OSTs by default, and far fewer processes were used for indexing. As observed, the index throughput of CAM is about 4 times lower than that of PTF, because CAM uses 16 processes while PTF uses 64. In contrast, the index throughput of FastQuery is heavily impacted by its index size, and it often spends 30-60% of the time building bitmap indexes and writing them to file. As a result, the index throughput of FastQuery is significantly lower than that of block index. Although SciDB has higher throughput than FastQuery, it is still much lower than block index due to data copying and the less efficient performance of its storage system compared to Lustre. Therefore, block index does have minimal time and space requirements for indexing datasets, and its index throughput is bounded only by the I/O throughput of sifting through the data on the storage system.

C. Query comparison

This subsection demonstrates that block index can achieve comparable or even better query performance than other techniques. Figure 1 shows the query throughput of each indexing technique under different query selectivities. We can observe that the throughput of Scan is independent of query selectivity. In fact, the query throughput of Scan is equal to the peak I/O throughput of reading the datasets, and thus can be treated as a comparison baseline. First, for SciDB, its throughput is the lowest in general because it stores data in its own storage system, which has much worse I/O performance than Lustre. As evident from the throughput of the PTF and Synthetic datasets, both Scan and SciDB use 64 processes, but SciDB is almost 10 times slower than Scan. Even in the case of the CAM dataset shown in Figure 1(b), a parallel Scan using 16 processes is still faster than a 64-instance SciDB. We also observed that the throughput of SciDB is lower when query selectivity increases, due to the additional time for retrieving the selected data. The results of SciDB for the VPIC dataset are not shown in Figure 1(a) and Figure 2(a), because its query throughput is limited by the 64 instances at 5GB/s.
Fig. 1. Query throughput comparison for (a) the VPIC dataset, (b) the CAM dataset, (c) the PTF dataset, and (d) the Synthetic dataset. Each panel plots query throughput (GB/s) against query selectivity (1E-03 to 1E-06) for BI, FQ-P2, FQ-P3, FQ-P4, Scan, and SciDB. The throughput is computed as the data size divided by the query response time. Higher throughput implies faster query time and better performance. As observed, block index always performs the best except for the CAM dataset with the smallest selectivity.

Fig. 2. Query speedup comparison measured as the query time relative to block index, for (a) the VPIC dataset, (b) the CAM dataset, (c) the PTF dataset, and (d) the Synthetic dataset, under query selectivities from 1E-03 to 1E-06. It shows block index can achieve a significant query speedup.
Next, for FastQuery, its query throughput increases significantly as the query selectivity decreases, regardless of binning precision. This is because when the query selectivity increases, the amount of data that needs to be retrieved grows significantly. Unfortunately, FastQuery retrieves these randomly scattered data records from file individually, so its throughput can be even worse than Scan due to inefficient random-access I/O behavior. Hence, only in the case of the CAM dataset with 10^-6 selectivity did FastQuery achieve the best query performance. We also observed that a higher binning precision can reduce the number of selection candidates, so FQ-P4 often has higher throughput than FQ-P2 and FQ-P3. However, FQ-P4 has a higher constant query overhead for loading bitmap keys and offsets from each bin; therefore, FQ-P3 and FQ-P2 actually performed better than FQ-P4 on the small CAM dataset.

Lastly, block index clearly achieved the best query performance across all the cases, except the one with the least amount of selected data, the CAM dataset with 10^-6 selectivity. As shown in the figure, block index can take advantage of indexing and deliver much higher query throughput when the query selectivity becomes smaller, while Scan and SciDB cannot because they do not have an index. At the same time, when the query selectivity becomes larger, block index does not suffer from the random access overhead of FastQuery, and its worst-case performance is bounded by Scan. Figure 2 further shows the query speedup of block index by plotting the query time relative to block index. As shown, the improvement over Scan gradually increases as selectivity decreases. The improvement over SciDB is always significant, between a factor of 16 and 66. Finally, the improvement over FastQuery shrinks as selectivity decreases, but block index still delivered comparable performance at the smallest selectivity, and a 2 to 4 times speedup was reached at the larger selectivities.
D. Block size analysis

Finally, we experimentally analyze the impact of the block size setting on query performance. Table IV summarizes the query time and I/O statistics as the block size varies from 65KB to 64MB. Clearly, a block size of 130KB gave the best query performance, while the query time gradually increases with either smaller or larger block sizes. When the block size increases, the number of read operations decreases because each block can contain more hits; however, more data records must be read for the candidate check. Thus, the I/O bandwidth of the read operations increases due to the larger I/O size and the smaller number of read requests. For instance, the I/O bandwidth at a block size of 64MB is 2.6 times higher than at 65KB, but the amount of data read from file is more than 3 times larger. In fact, at 64MB almost all of the data was read from file, and the query time was close to that of scanning the whole dataset. Therefore, the block size setting clearly presents a performance trade-off between data read size and data read bandwidth on Lustre. In particular, we found the Lustre file system in our experimental environment to be highly scalable and relatively well optimized for small I/O requests; thus, using a smaller block size with a smaller Lustre stripe size can achieve better performance. However, the smallest Lustre stripe size allowed on our system is 65KB, and any block size setting smaller than that degrades query performance significantly.
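As a rough illustration of this trade-off, consider a back-of-envelope model that is our own assumption, not taken from the paper: if hits are scattered uniformly at random with selectivity s, a block of n records contains at least one hit with probability 1 - (1 - s)^n, so the expected fraction of blocks that survive the boundary check, and hence the data volume read, grows quickly with block size even though each read becomes more efficient. Real datasets with clustered values typically touch fewer blocks than this model predicts.

```python
# expected fraction of blocks read under a uniform-random-hit assumption (illustrative only)
def expected_read_fraction(selectivity, records_per_block):
    return 1.0 - (1.0 - selectivity) ** records_per_block

for kb in (65, 130, 1024, 8192, 65536):        # block sizes in KB; 8-byte records assumed
    n = kb * 1024 // 8
    print(f"{kb:>6} KB: {expected_read_fraction(1e-5, n):.3f} of the blocks selected")
```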
TABLE IV. QUERY TIME AND I/O STATISTICS UNDER DIFFERENT BLOCK SIZE SETTINGS.

block size   # of read operations   read size (% of data size)   read bandwidth (GB/s)   query time (sec)
65KB         7,043,113              30.3                          12.14                   2.44
130KB        4,157,262              35.8                          14.36                   1.96
1MB          834,584                55.2                          20.63                   3.27
8MB          161,456                85.5                          31.33                   10.69
64MB         23,482                 99.5                          32.07                   13.85

IV. RELATED WORK
The B-tree is a widely used data structure and index in databases and file systems that can sharply reduce the number of disk reads [2]. Inspired by the B-tree, similar structures and indexes, such as the B+tree [9] and the R-tree [10], were later designed. These hierarchical data structures depend on their internal data organization to answer users' queries quickly. On the other hand, hash functions and bitmaps are two typical indexes for accelerating data search that do not depend on the data organization on disk. A hash function [11] usually works directly on the raw values of the data. The bitmap index [18], on the other hand, converts raw data values into bit vectors and performs fast bitwise operations to find the wanted records. Several recent efforts have been made to reduce index size. FastBit [22] reduces the bitmap index size using the WAH compression method. ISABELA-QA [14] applies B-spline curve fitting to compress data, and guarantees a user-specified point-by-point compression error bound by storing the relative errors between estimated and actual values. DIRAQ [13] exploits the redundancy of significant bits in the floating-point encoding among similar values, and uses a compressed inverted index to reduce index size.
V. CONCLUSION & FUTURE WORK
This paper proposes an indexing technique called "block index". It exploits the I/O characteristics of storage systems to significantly reduce index size without sacrificing query performance. It is implemented in parallel using MPI and the HDF5 library, and evaluated on a supercomputer with a Lustre file system using multiple real scientific datasets. The results show that block index can reduce query time by a factor of 2 to 50 compared to existing approaches, including FastQuery, SciDB, and Scan. Moreover, the size of the block index is almost negligible compared to the data size, and indexing can proceed at nearly the peak I/O speed. We believe block index can be a promising indexing technique for in situ and real-time data analysis. In the future, we plan to integrate block index with existing scientific data management software, such as ADIOS [15] and SDS [8], as well as the in-memory query system developed in our previous work [5]. We would also like to further investigate other indexing methods on blocks instead of records.

REFERENCES

[1] IPCC Fifth Assessment Report. http://en.wikipedia.org/wiki/IPCC_Fifth_Assessment_Report.
[2] R. Bayer and E. McCreight. Organization and maintenance of large ordered indices. In SIGMOD Workshop on Data Description, Access and Control, pages 107–141, 1970.
[3] P. G. Brown. Overview of SciDB: Large Scale Array Storage, Processing and Analysis. In SIGMOD, pages 963–968, 2010.
[4] H.-T. Chiu, J. Chou, V. Vishwanath, S. Byna, and K. Wu. Simplifying index file structure to improve I/O performance of parallel indexing. In ICPADS, pages 576–583, Dec 2014.
[5] H.-T. Chiu, J. Chou, V. Vishwanath, and K. Wu. In-memory query system for scientific datasets. In ICPADS, 2015.
[6] J. Chou, K. Wu, O. Rübel, M. Howison, J. Qiang, Prabhat, B. Austin, E. W. Bethel, R. D. Ryne, and A. Shoshani. Parallel index and query for large scale data analysis. In SC, 2011.
[7] D. Comer. Ubiquitous B-tree. ACM Comput. Surv., 11(2):121–137, June 1979.
[8] B. Dong, S. Byna, and K. Wu. SDS: A Framework for Scientific Data Services. In PDSW, pages 27–32, 2013.
[9] D. Giampaolo. Practical File System Design with the Be File System. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1st edition, 1998.
[10] A. Guttman. R-trees: A dynamic index structure for spatial searching. SIGMOD Rec., 14(2):47–57, June 1984.
[11] S. J. Karpen. Design and implementation of a real time information storage and retrieval system. In Proceedings of the 1971 26th Annual Conference, ACM '71, pages 37–66, New York, NY, USA, 1971. ACM.
[12] S. Klasky, H. Abbasi, et al. In Situ Data Processing for Extreme-Scale Computing. In SciDAC, July 2011.
[13] S. Lakshminarasimhan, D. A. Boyuka, S. V. Pendse, X. Zou, J. Jenkins, V. Vishwanath, M. E. Papka, and N. F. Samatova. Scalable in situ scientific data encoding for analytical query processing. In HPDC, pages 1–12, 2013.
[14] S. Lakshminarasimhan, J. Jenkins, et al. ISABELA-QA: Query-driven analytics with ISABELA-compressed extreme-scale scientific data. In SC, pages 1–11, Nov 2011.
[15] J. F. Lofstead, S. Klasky, K. Schwan, N. Podhorszki, and C. Jin. Flexible IO and integration for scientific codes through the adaptable IO system (ADIOS). In CLADE'08, pages 15–24, 2008.
[16] K.-L. Ma. In situ visualization at extreme scale: Challenges and opportunities. IEEE Computer Graphics and Applications, 29(6):14–19, Nov 2009.
[17] A. Shoshani and D. Rotem, editors. Scientific Data Management: Challenges, Technology, and Deployment. Chapman & Hall/CRC Press, 2010.
[18] I. Spiegler and R. Maayan. Storage and retrieval considerations of binary data bases. Inf. Process. Manage., 21(3):233–254, Aug. 1985.
[19] K. Stockinger, E. W. Bethel, S. Campbell, E. Dart, and K. Wu. Detecting Distributed Scans Using High-Performance Query-Driven Visualization. In SC. IEEE Computer Society Press, Nov. 2006.
[20] K. Stockinger, J. Shalf, W. Bethel, and K. Wu. Query-driven visualization of large data sets. In IEEE Visualization 2005, Minneapolis, MN, October 23-28, 2005, page 22, 2005.
[21] T. Tu, H. Yu, et al. Remote runtime steering of integrated terascale simulation and visualization. In SC HPC Analytics Challenge, 2006.
[22] K. Wu, S. Ahern, et al. FastBit: Interactively searching massive data. In SciDAC, 2009.