CDF ARCHIVAL OF LARGE-SCALE ITS DATA FOR EFFICIENT ARCHIVAL, RETRIEVAL, AND PORTABILITY
Nirish Dhruv Department of Computer Science, University of Minnesota Duluth, Duluth Minnesota, 55812. Phone (218) 728 4997,
[email protected]
Taek M Kwon, Ph.D (Corresponding Author) Professor, Department of Electrical and Computer Engineering, University of Minnesota Duluth, Duluth Minnesota, 55812. Phone (218) 726-8211, Fax (218) 726-7267,
[email protected]
Siddharth A Patwardhan Department of Computer Science, University of Minnesota Duluth, Duluth Minnesota, 55812. Phone (218) 728 3273,
[email protected]
Eil Kwon, Ph.D. Office of Traffic Engineering and ITS, 395 John Ireland Blvd, MS 725, St. Paul, MN, 55155. Phone (651) 284-3506, Fax (651) 205-4526,
[email protected]
Submission Date: 7/30/2002 Word Count: 6700
CDF Archival of Large-Scale ITS Data for Efficient Archival, Retrieval, and Portability

Abstract

Today's ITS sensor networks, such as Road Weather Information Systems (RWIS) and traffic sensor networks, typically generate very large amounts of data. As a result, archiving, retrieval, and exchange of ITS sensor data for planning and performance analysis are becoming increasingly difficult; there is thus a need for a new open ITS archive architecture that is compact and exchangeable and allows efficient, fast retrieval of large amounts of data. This paper proposes an archive system that can meet the present and future needs of large-scale ITS data archiving. This architecture is referred to as the Common Data Format (CDF) and was developed by the National Space Science Data Center (NSSDC). CDF is a free, portable, self-describing, open data abstraction designed for archiving and managing large-scale array data. This paper introduces CDF and presents its archival and retrieval performance using the Minnesota Department of Transportation's (Mn/DOT's) 30 sec traffic data collected from about 4,000 loop detectors around Twin Cities freeways. For comparison, the same data was archived using a commercially available relational database and evaluated for its archival and retrieval performance; the results are presented. The paper concludes with a list of reasons why CDF is a good fit for the archiving, retrieval, and exchange of large-scale ITS data.
INTRODUCTION

One of the most important components in today's Intelligent Transportation Systems (ITS) is monitoring the overall transportation system through a large-scale network of transportation sensors. Such sensor networks include Road Weather Information Systems (RWIS), Weigh-In-Motion (WIM) networks, and traffic detector networks. Individual sensor data is typically streamed to a central office where it is archived and/or displayed for analysis and monitoring purposes. ITS sensors operate continuously, 24 hours a day, 7 days a week, year after year, and typically cover large areas such as an entire state. As a result, a massive amount of data accumulates over the years in most statewide ITS sensor networks. For better utilization of presently collected data, the Archived Data User Service (ADUS), a part of the ITS program, suggested that these data could be used beyond present operational and monitoring uses, e.g., for planning and research (1). However, the ever increasing deployment of transportation sensors has grown the data to a staggering size, preventing traditional archiving, analysis, and exchange of data. Recently, the U.S. DOT recognized and defined this as an urgent problem, and began promoting Federal and local research programs addressing the archiving and multi-agency use of data generated from ITS applications (1). These efforts have been promoted under a new paradigm: "improve transportation decisions through the archiving and sharing of ITS data (1)." For further discussion of data archiving issues, please refer to references (1-3).

In most of today's ITS implementations, sensor data has been archived using a flat file format. The flat file format is extremely simple and has its own benefits. However, such simplicity becomes a barrier to efficiency when one tries to retrieve arrays of spatially and/or temporally correlated data from the archived flat files. In such cases, applications may need to open files one by one, sequentially search for and retrieve only a few values from each opened file, and move on to the next set of files. Because I/O operations are the slowest part of computing,
such a retrieval process is extremely inefficient, making large-scale data analysis difficult. A desirable property of a large-scale ITS archive would be rapid random access to data in any relation, temporal or spatial. Flat files do not meet this requirement. A possible solution for random access of data is to develop archives using a Relational Database Management System (RDBMS). An RDBMS allows retrieval of data from any location using simple queries if the tables are flexibly designed. Recently, the California Department of Transportation (Caltrans) implemented such a system (called the Performance Measurement System (PeMS)) for archiving 30 sec loop detector data and successfully created an on-line performance monitoring system (4,5). Unfortunately, any large-scale archive based on an RDBMS is expensive, requiring high-powered computer systems and network connections, which was the case in PeMS. Furthermore, archives created using an RDBMS are not directly portable (exchangeable) unless the same type of RDBMS engine is used at both ends or a text-based flat format is used. Also, an RDBMS in general creates a very large archive file due to the heavy overhead it adds, which is not desirable for ITS data archiving.

We believe that an archive for large-scale ITS data should have the following properties: (i) the size of the archive should be compact and small; (ii) retrieval of large amounts of data should be fast; (iii) the archive should be easily transportable between different operating systems and computing systems; (iv) data should be accessible at any location in the archive; (v) it must be easy to use and manage the data; (vi) it must have low initial investment and maintenance costs; (vii) it must be an open architecture/open standard that can last; (viii) it must be capable of self-description of the data (metadata); (ix) the archive must be supported by many analysis tools and multiple vendors.

The Common Data Format (CDF) archive developed by the National Space Science Data Center (NSSDC) satisfies the above list of desirable properties (6-10). CDF is free and is an open standard. It is designed for manipulating large multi-dimensional data, as is the case with large-scale ITS sensor data. CDF is often referred to as a self-describing data format because it supports metadata. CDF employs various data compression algorithms and generates archives smaller than the raw binary data. CDF is available for many different computing environments, including mainframes; thus CDF-created data can be exchanged between application programs running under different operating systems. CDF is supported by many public domain visualization and analysis tools as well as by commercial software packages.

In this paper, we describe CDF and show its application to traffic data archival and retrieval. The traffic data was provided by the Traffic Management Center (TMC), a division of the Minnesota Department of Transportation (Mn/DOT). Mn/DOT TMC collects traffic data (volume and occupancy) at a 30 sec interval from about 4,000 loop detectors installed at half-mile spacing on metro freeways in and around the Twin Cities. Using this traffic data, we created a CDF archive for one year (2001) for an archival and retrieval performance study. For comparison, we also created an archive using a commercially available RDBMS. Both archives allow random access to any location in the archived data, but we found significant performance differences. These results are discussed in the paper. We also discuss why CDF is a good fit for large-scale ITS data archiving.
OVERVIEW OF COMMON DATA FORMAT (CDF) (6-10)

Background

The irony of the term "FORMAT" is that the actual data format of CDF is completely transparent to the user. The user is completely free from the burden of knowing the internal CDF data format. Moreover, programmers do not need to perform low-level I/O operations to read and decode the data file; this is all performed by the CDF library. The data is accessible through a consistent set of interface routines (known as the "CDF Interface").

The development of CDF arose out of the recognition by the National Space Science Data Center (NSSDC) of the need for a class of data models matched to the structure of scientific data and the applications (i.e., statistical and numerical methods, visualization, and management) they serve. CDF was initially developed for the NASA Climate Data System at NSSDC in a mainframe computing environment. Three main requirements drove its development: first, facilitating the ingestion of data sets into CDF; second, using standard common terminology (metadata) to describe the data sets; and third, developing higher-level applications. The result was a self-describing data abstraction for the storage and manipulation of multidimensional data. CDF is not really a format but a scientific data management package (known as the "CDF Library"), which allows programmers and application developers to manage and manipulate scalar, vector, and multi-dimensional data arrays. CDF offers C, FORTRAN, Java, and Perl APIs. The advent of the CDF Java APIs significantly benefits the CDF user community, since CDF applications can now be written in the platform-independent Java language and run on any Java-supported platform (Java is supported on virtually all platforms today). The CDF software package is free and is used by hundreds of government agencies, universities, and private and commercial organizations, as well as independent researchers, at both national and international levels. CDF has been adopted by the International Solar-Terrestrial Physics (ISTP) project as well as the Central Data Handling Facilities (CDHF) as their format of choice for storing and distributing key parameter data.
Conceptual Organization

CDF is a self-describing data abstraction for the storage and manipulation of multidimensional data in a discipline-independent fashion. Data abstraction means that CDF provides a conceptual view of the data and hides the actual physical format. It provides a way of generalizing the data model and makes possible the specification of a uniform interface for manipulating data sets. The data abstraction allows future extensibility and provides conceptual simplicity while isolating machine and device dependence.

The contents of a CDF fall into two categories. The first is a series of records comprising a collection of variables consisting of scalars, vectors, and n-dimensional arrays. The second is a set of attribute entries (metadata) describing the CDF in global terms or specifically for a single variable. This dual function of CDF is what provides its "data set independence." An important element of the CDF data abstraction at the conceptual level is the "virtual" dimensional layer, which allows data objects that share a subset of the overall CDF dimensionality to be projected into the full dimensional space.

CDF can handle data sets that are inherently multidimensional in addition to data sets that are scalar. To do this, CDF groups data by "variables" whose values are conceptually organized into arrays. The dimensionality of these variable arrays depends on the data and is specified by the user when the CDF or variable is created. For scalar data, for example, the arrays of values would be 0-dimensional (i.e., a single value), whereas for gray-scale image data the
array would be 2-dimensional. Similarly, the array for volume data would be 3-dimensional. CDF allows users to specify arrays of up to 10 dimensions. The array for a particular variable is called a "variable record." A collection of arrays, one for each variable, is referred to as a "CDF record." A CDF can, and usually does, contain multiple CDF records. Two types of variables may exist in a CDF: rVariables (where r stands for regular) and zVariables (where z doesn't stand for anything special). Every rVariable in a CDF must have the same number of dimensions and the same dimension sizes. zVariables may have different numbers of dimensions and/or sizes. A CDF may contain both rVariables and zVariables. There is no single "correct" way to store data in a CDF. Within the confines of the variable array structure, the user has complete control over how the data values are stored in a CDF, depending on how the user wishes to organize the data.
CDF Library

The CDF library is a flexible and extensible software package that gives the user many options for creating and accessing CDFs.
File Format Options

The CDF library gives the user the option to choose between two file formats in which to store the data and metadata. The first option is the traditional CDF multi-file format. The .cdf (dotCDF) file contains all of the control information and metadata for the CDF. In addition to the .cdf file, a file exists for each variable in the CDF and contains only the data associated with that variable. The second option is the single-file format, the default when a CDF is created. The whole CDF consists of only a single .cdf (dotCDF) file, which contains the control information, metadata, and the data values for each of the variables in the CDF. Both formats allow direct access.

The advantage of the single-file format is that it minimizes the number of files one has to manage and makes it easier to transport CDFs across a network. The organization of the data within the single file may, however, become somewhat convoluted, slightly increasing the data access time. The multi-file format, on the other hand, clearly delimits the data from the metadata and is organized in a consistent fashion within the files. Updating, appending, and accessing data are also done with optimum efficiency. Certain restrictions apply to multi-file CDFs: compression is not allowed for the CDF or any of its variables; sparse records or arrays for variables are not allowed; preallocation of records or blocks of records is not allowed; and for each variable, the maximum written record is the last allocated record.
Compression

Compression may be specified for a single-file CDF by instructing the CDF library to compress it as it is written to disk. This compression occurs transparently to users: when a compressed CDF is opened, the CDF library automatically decompresses it, so an application does not even have to know that a CDF is compressed. Any type of access is allowed on a compressed CDF. When a compressed CDF is closed by an application, it is automatically recompressed as it is written back to disk. The individual variables of a CDF can also be compressed. The CDF library transparently handles the compression and decompression of the variable values; the application does not have to know that a variable is compressed as it accesses the variable's values.

The CDF library supports several different compression algorithms: run-length encoding, Huffman compression, adaptive Huffman compression, and Gnu ZIP compression (Lempel-Ziv coding). For Gnu ZIP (or GZIP) compression, users can select one of nine different levels of compression. When compression is specified for a CDF or one of its variables, the compression algorithm to be used must be selected. There are tradeoffs between the different compression algorithms regarding execution performance and disk space savings, and the nature of the data in a CDF (or variable) will affect which compression algorithm is best.
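As a rough illustration of how these options are exposed, the sketch below uses the CDF library's C Internal Interface (CDFlib()) to request the single-file format and GZIP level-4 compression at creation time. This is a minimal sketch, not the code used in this study; the file name, the one-dimensional layout, and the omitted error handling are illustrative assumptions.

#include "cdf.h"

void create_compressed(void)
{
    CDFid     id;
    CDFstatus status;
    long dimSizes[1]  = { 2880 };  /* illustrative rVariable dimensionality */
    long gzipParms[1] = { 4 };     /* GZIP compression level (1-9)          */

    /* Create the CDF, select the single-file format, and ask the library
       to GZIP-compress the file transparently as it is written to disk.   */
    status = CDFlib(CREATE_, CDF_, "traffic", 1L, dimSizes, &id,
                    PUT_, CDF_FORMAT_, SINGLE_FILE,
                    PUT_, CDF_COMPRESSION_, GZIP_COMPRESSION, gzipParms,
                    NULL_);
    if (status != CDF_OK) { /* inspect the status code */ }
}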
Sparseness

Two types of sparseness are allowed for CDF variables: sparse records and sparse arrays. Sparse records are presently available; sparse arrays won't be available until a future CDF release. When a variable is specified as having sparse records, only those records actually written to that variable are stored in the CDF. For variables without sparse records, by contrast, every record preceding the maximum written record is stored in the CDF. For example, if only the 1000th record were written to a variable without sparse records, the 999 preceding records would also be written using a pad value. If sparse records had been specified for the variable, only the 1000th record would be stored in the CDF, saving a considerable amount of disk space. Sparse records are ideal for variables containing gaps of missing data.
Metadata Attributes

Another important component of CDF is the metadata. Metadata values consist of user-supplied descriptive information about the CDF and the variables in it. Attributes are divided into two categories: attributes of global scope (gAttributes) and attributes of variable scope (vAttributes). gAttributes describe the CDF as a whole, while vAttributes describe some property of each variable (rVariables and zVariables) in the CDF. Any number of attributes may be stored in a single CDF. The term "attribute" is used when describing a property that applies to both gAttributes and vAttributes.

gAttributes can include any information regarding the CDF and all of its variables collectively. Such descriptions could include a title for the CDF, data set documentation, or a CDF modification history. A gAttribute may contain multiple entries (called gEntries). An example would be a modification history kept in the optional gAttribute MODS. This attribute could be specified at CDF creation time, with a gEntry made recording the creation date. Any subsequent changes to the CDF, including additional variables, changes in min/max values, or modifications to variable values, could be documented by writing additional gEntries to MODS.

vAttributes further describe the individual variables and their values. Examples of vAttributes include a field name for the variable and its valid minimum and maximum. Further examples include the units in which the variable data values are stored, the format in which the data values are to be displayed, a fill value for errant or missing data, and a description of the expected order of data values: increasing or decreasing (monotonicity). The entries of a vAttribute correspond to the variables in the CDF: each rEntry corresponds to an rVariable and each zEntry corresponds to a zVariable.
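For concreteness, below is a hedged sketch of writing one gAttribute and one vAttribute through the C Standard Interface; the attribute names and values are illustrative, not those of the archive built in this study.

#include <string.h>
#include "cdf.h"

void write_attrs(CDFid id)   /* id: an already created/opened CDF */
{
    long attrNum;
    static char title[] = "Mn/DOT TMC 30 sec traffic data";

    /* gAttribute describing the CDF as a whole (gEntry 0 holds a title). */
    CDFattrCreate(id, "TITLE", GLOBAL_SCOPE, &attrNum);
    CDFattrPut(id, attrNum, 0L, CDF_CHAR, (long)strlen(title), title);

    /* vAttribute whose rEntry for the Volume rVariable records its units;
       for vAttributes the entry number is the rVariable number.          */
    CDFattrCreate(id, "UNITS", VARIABLE_SCOPE, &attrNum);
    CDFattrPut(id, attrNum, CDFvarNum(id, "Volume"), CDF_CHAR, 8L, "vehicles");
}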
Mn/DOT TMC TRAFFIC DATA ARCHIVE

For many years, the Traffic Management Center (TMC), a division of Mn/DOT, has collected traffic data from loop detectors embedded in metro freeways in and around the Twin Cities. This raw data consists of volume (number of vehicles, sometimes called "flow") and occupancy (percentage of time a detector is "occupied"). The data is collected at a 30-second
interval from about 4,000 loop detectors, seven days a week, all year round. The collected data is packaged daily into a single zip file and archived. The same data is also loaded onto the UMD (University of Minnesota Duluth) Data Center FTP server. (This data center is a part of the Transportation Data Research Laboratory (TDRL), a division of the Northland Advanced Transportation Systems Research Laboratories (NATSRL).) The file name follows "yyyymmdd.traffic", where yyyy is the four-digit year, mm the two-digit month, and dd the two-digit day; for example, data for May 8, 2000 has the file name "20000508.traffic". This file is a zip-compressed archive that contains about 8,000 individual files. When uncompressed, it produces a directory of about 43 MB; zip-compressed, its size is about 13 MB. The directory of zipped files for an entire year is around 5 GB.
Unified Traffic Data Format (UTDF)

Mn/DOT TMC's earlier file format required complicated bit-field manipulation, which made it harder to develop data analysis tools. The Unified Traffic Data Format (UTDF) is a new, simplified Mn/DOT traffic data format devised to eliminate this problem by storing all data as either 8-bit or 16-bit binary integers. Unzipping a "*.traffic" (UTDF) file creates about 8,000 files: half are volume files named "###.v30" and the other half are occupancy files named "###.o30" or "###.c30", where ### corresponds to the detector ID number.
Volume Format

The volume files (*.v30) in UTDF are flat binary files of 2,880 bytes each, collected from a single loop detector over 24 hours, from 00:00:00 to 23:59:30, at a 30 second interval. Each byte is an 8-bit signed volume for the corresponding 30-second period of the day. A negative value indicates missing data or an error condition such as a communication error.
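A minimal C sketch of reading one such file is shown below, assuming only the layout just described (2,880 signed bytes, negatives flagging bad slots); the file name is illustrative.

#include <stdio.h>

int main(void)
{
    signed char vol[2880];             /* one byte per 30 sec slot        */
    long daily = 0;
    int  i;
    FILE *fp = fopen("100.v30", "rb"); /* illustrative detector file name */

    if (fp == NULL)
        return 1;
    if (fread(vol, 1, 2880, fp) != 2880) {
        fclose(fp);
        return 1;
    }
    fclose(fp);

    for (i = 0; i < 2880; i++)
        if (vol[i] >= 0)               /* negative = missing or error     */
            daily += vol[i];
    printf("daily volume count: %ld\n", daily);
    return 0;
}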
Occupancy Format

There are two types of occupancy files in UTDF. The first type has file extension .o30 and a format very similar to that of the volume files, except that each value is a 16-bit signed integer; each file is 5,760 bytes long (2880 * 2). The occupancy values are fixed-point integers ranging from 0 to 1000 (in units of a tenth of a percent). As with the volume files, a negative value indicates missing or faulty data. The 16-bit values are stored in high-byte-first (big-endian) order. The .c30 files are recorded in "scans" and are more precise than the .o30 files. A scan is defined as 1/60 second, so the valid range is 0 to 1800 (30 seconds * 60 scans/second). The older .o30 files are in 1/10th percent occupancy, so they range from 0 to 1000; that is the only difference between the two file formats. To get numbers in the range 0 to 100, scan data is divided by 18 and occupancy data by 10. Any data outside the valid ranges is considered "bad".
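The byte-order and scaling rules above translate into a few lines of C; the helper names below are illustrative, not part of UTDF.

/* Decode one 16-bit occupancy value stored high byte first. */
short decode16(const unsigned char *p)
{
    return (short)((p[0] << 8) | p[1]);
}

/* Convert a raw value to percent occupancy (0-100); -1.0 marks bad data. */
double to_percent(short raw, int is_scan_file)
{
    if (is_scan_file)                          /* .c30: 0-1800 scans      */
        return (raw >= 0 && raw <= 1800) ? raw / 18.0 : -1.0;
    return (raw >= 0 && raw <= 1000) ? raw / 10.0 : -1.0;  /* .o30 tenths */
}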
CDF TRAFFIC ARCHIVE

The data component of CDF is organized into arrays for the individual variables; CDF can accommodate any type of data that can be organized into arrays. Two types of variables are supported: rVariables and zVariables. All rVariables have the same dimensionality (number of dimensions and dimension sizes). zVariables are similar to rVariables in all respects except that
each zVariable can have a different dimensionality. zVariables are used when the number of variables is large and storing them all as rVariables would waste storage space.

Mn/DOT traffic data can be constructed as a large two-dimensional array, as shown in Table 1. Each row is the equivalent of a single record in a relational database. In CDF, four rVariables can be allocated, one for each column: Detector ID, Time, Volume, and Occupancy. This table and variable allocation were purposely designed this way so that the CDF archive's performance could later be compared with an RDBMS built on the same table structure. Notice from Table 1 that every detector has the same number of time entries (one every thirty seconds, making 2,880 per day), and that the value of Detector ID remains the same for all 2,880 entries of a detector. In CDF, this repetition can be removed without indexing into another variable. For each rVariable, there are variances associated with the array dimensions as well as the records. "Record variance" indicates whether or not an rVariable has unique values from record to record in the CDF. Detector ID changes from record to record, so its record variance is [TRUE]. Time repeats its values from record to record, so its record variance is [FALSE]. The Volume and Occupancy values change from record to record, so they are record variant. These settings are summarized in Table 2.

When the record and dimension variances are defined as described, the amount of physical storage needed for the CDF is drastically reduced. The 1-dimensional arrays in Table 1 are not physically stored for each rVariable in every CDF record. Instead, the physical storage consists of just one Detector ID value per CDF record, a single 1-dimensional array of values for the Time rVariable (in only the first CDF record), and a full 1-dimensional array of values for Volume and Occupancy in each CDF record. The conceptual view of the CDF, however, remains the same as Table 1.

With the traffic data structure defined, CDF archives can be created using the API calls CDFcreate(), CDFvarCreate(), and CDFvarPut(), as sketched below. CDF offers data compression options in the archiving process. To see what file size each compression option produces, we ran a single day's traffic data (2/11/2001) through the different options; the results are summarized in Table 3. Notice from the table that GZIP compression provides nine different levels of compression. Among the available options, the GZIP algorithm produced the highest compression rate. We further tested the whole year and found that the CDF size for a single day varied between 5 MB and 17 MB, depending on how much data was missing in the traffic data. This size could have been reduced further had we used an integer data type for the occupancy instead of the float data type.
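The sketch below illustrates this design with the Standard Interface calls named above: one CDF record per detector-day, each rVariable a 1-dimensional array of 2,880 slots, with the Table 2 variances. It is a minimal sketch; the file name, data types, and the omitted error checks and data-loading loop are illustrative assumptions.

#include "cdf.h"

void create_day_archive(void)
{
    CDFid id;
    long dimSizes[1] = { 2880 };  /* 30 sec slots per detector-day        */
    long dimVarys[1];
    long detNum, timeNum, volNum, occNum;

    /* The library appends ".cdf" to the file name it is given.           */
    CDFcreate("20010211", 1L, dimSizes, HOST_ENCODING, ROW_MAJOR, &id);

    /* Record and dimension variances follow Table 2.                     */
    dimVarys[0] = NOVARY;  /* Detector ID: one value per CDF record       */
    CDFvarCreate(id, "DetectorID", CDF_INT4, 1L, VARY, dimVarys, &detNum);
    dimVarys[0] = VARY;    /* Time: the same 2,880 stamps in every record */
    CDFvarCreate(id, "Time", CDF_INT4, 1L, NOVARY, dimVarys, &timeNum);
    CDFvarCreate(id, "Volume", CDF_INT4, 1L, VARY, dimVarys, &volNum);
    CDFvarCreate(id, "Occupancy", CDF_REAL4, 1L, VARY, dimVarys, &occNum);

    /* Values are then loaded with CDFvarPut() (one value per call) or,
       for a whole 2,880-element array at once, with CDFvarHyperPut().    */
    CDFclose(id);
}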
RETRIEVAL TEST

Retrieval Task Design

The retrieval test was designed as a simplified version of one of the Mn/DOT applications, which requires computing daily traffic counts at stations. Mn/DOT has defined about 492 stations (short-duration count stations) around the Twin Cities metro freeway system for estimating annual average daily traffic (AADT). Each station consists of a set of loop detectors (two to eight) that collect data at certain locations of a freeway. The retrieval task is to generate the daily volume count for the 492 stations; the output is a text file listing all the stations followed by their total daily volumes for that day. Since the detectors in most stations are not defined sequentially, this task requires random access to detector data. This retrieval task was used to compare the retrieval performance of CDF against an RDBMS.
Archival and Retrieval Using RDBMS

The designed retrieval task requires random access to the traffic data, so directly comparing the retrieval performance of the flat file format against CDF would be unfair, since CDF is designed for random access. We therefore compared the retrieval performance of the CDF archive with an RDBMS, which, like CDF, allows random access to any data in a database table. For the database engine, we chose Microsoft® SQL Server 2000, since it has been touted for e-commerce, line-of-business, and data warehousing solutions. The database table for Mn/DOT's 30 sec traffic data was created using the Table 1 format, which was also used in creating the CDF archive for comparison. This corresponds to the following SQL statement:

CREATE TABLE [dbo].[traffic] (
    detectorID int NOT NULL,
    timeID int NOT NULL,
    Volume int NULL,
    Occupancy float NULL
)

For the time stamp (named timeID) in the database table, an integer data type was used instead of the datetime data type, to reduce the size of the table and to increase retrieval efficiency. This table structure may not be the most efficient way to design the database table, but it is the correct way if we wish to allow random access to any detector at any time, in the same manner as the CDF archive. The retrieval was done using the following query statement, embedded in a Microsoft® Visual Basic program, where X is determined from the detectors in the station:

SELECT Volume FROM Traffic WHERE (detectorID = X) ORDER BY timeID

For the database interface, Open Database Connectivity (ODBC) with Microsoft Visual Basic code was used to retrieve the data and compute the retrieval task. The ODBC interface may not be the most efficient way to retrieve data, but it is accepted as an industry standard and is commonly used for application interfaces.
Retrieving CDF Data Using APIs

The CDF API provides a wide range of functions/procedures for retrieving data from CDF files. The CDF standard interface consists of functions to access the CDF, including single-value access and hyper access. Hyper access means reading more than a single element in a single call; it is mostly used to access a large set of values with a single command. The CDF retrieval for the designed task was done using the CDFvarGet() call.
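A hedged sketch of the inner loop of this task is shown below: it sums one detector's daily volume with single-value CDFvarGet() calls. The mapping from detector ID to CDF record number is assumed to have been done by the caller, and error checking is omitted.

#include "cdf.h"

long sum_detector_day(CDFid id, long rec)  /* rec: the detector's CDF record */
{
    long indices[1];
    long volNum = CDFvarNum(id, "Volume");
    long daily  = 0;
    int  v, slot;

    for (slot = 0; slot < 2880; slot++) {
        indices[0] = slot;                 /* one 30 sec slot at a time     */
        CDFvarGet(id, volNum, rec, indices, &v);
        if (v >= 0)                        /* skip missing/error values     */
            daily += v;
    }
    /* A hyper read (CDFvarHyperGet) could fetch all 2,880 values at once. */
    return daily;
}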
Archiving/Retrieval Performance Comparison Between CDF and RDBMS

The archive and retrieval test cases for CDF were performed on a standard IBM-clone machine with a 1 GHz Intel processor and 384 MB of RAM running on a Windows 2000 workstation
platform. The RDBMS was installed on a Dell® PowerEdge server with dual 1 GHz Intel processors and 1 GB of RAM. The data was retrieved using a 1 GHz PC connected to the database server through a 100 Mbps Ethernet switch. For simplicity, the comparison was performed using one day (2/11/2001) of traffic data; the results are summarized in Table 4. A one-year performance estimate may be obtained by simply multiplying the single-day figures by 365.
Archiving Performance

• Archival Speed: Archival speed is defined here as the inverse of the execution time required to create one day's archive from the UTDF (zipped binary) file in CDF or the RDBMS. Creating a CDF file from a UTDF file averaged around 5 minutes, while creating the RDBMS archive in SQL Server from UTDF averaged a little over 6 hours for the single-day test data. The number of records (rows) loaded into the RDBMS was about 11,520,000 (4,000 * 2,880). In archival speed, CDF was 72 times faster than the RDBMS (see Table 4).

• Archive Size: Archive size refers to the amount of secondary storage (hard-disk) space required to store one day's traffic data. The CDF file created with the GZIP level-4 compression option was about 16 MB. With GZIP level 9, the highest compression rate in CDF, the size could be reduced further (see Table 3), although this also increases retrieval time. When the same data was stored in the RDBMS, it created a 370 MB file. In archive size, CDF was 23 times smaller than the RDBMS (see Table 4).
Retrieval Performance

The defined retrieval task was executed using both CDF and the RDBMS. For the same task, CDF took around 2 minutes to generate the final output file, while the RDBMS took around 2 hours to generate the same output file. In this test, CDF retrieval was about 60 times faster than the RDBMS.
Summary and Comments on Performance

For the same traffic data with the same data structure, CDF archival was about 72 times faster than the RDBMS, the CDF archive was about 23 times smaller, and CDF retrieval was about 60 times faster. This clearly suggests that CDF is the better choice. However, this performance comparison applies only to the specific database setup used in this study, as described in the previous sections, which strictly followed the same table structure; many other factors in the database design or setup could affect the outcome. Although cost was not a formal criterion, it provides additional insight into the suitability of an RDBMS for traffic archiving. In our experiments, a regular PC workstation was too small for archiving the traffic data with the RDBMS, so a bigger machine (a multiprocessor server) with very large RAM (1 GB) was used. The cost of the PC used for CDF was $1,200, while the cost of the server used for the RDBMS was $19,000, about 16 times more expensive. Adding the software cost of the SQL Server Enterprise edition makes the difference even more significant. Since the CDF package itself is free, no further comparison is necessary.
CONCLUSIONS

In this paper, we proposed CDF as a tool for large-scale ITS data archiving. Many properties of CDF related to ITS data archiving were discussed. Mn/DOT's 30 sec traffic data was used to demonstrate the archival and retrieval efficiency of CDF. An RDBMS archive was constructed from the same data and compared against the archival and retrieval performance of CDF. In all aspects we considered, CDF was clearly the better choice. In summary, CDF is a good fit for archiving large-scale ITS sensor data because:

• CDF is an open standard and has existed for a long time.
• CDF is free (no licensing is required).
• CDF files are portable, i.e., they can be used on any type of machine that supports CDF.
• CDF is a self-describing data abstraction (data is described through metadata).
• The data format of CDF is transparent to users, i.e., users do not need to know the internal data formats of CDF to use the data. Users simply need to know what data they need to retrieve.
• CDF compresses data internally, creating small archives.
• Data compression is transparent to users, i.e., users of CDF files do not need to know whether the files are compressed or not.
• CDF was designed for efficiently managing large-scale multi-dimensional data.
• CDF allows random access to any part of the stored data.
• CDF files are used by many scientific visualization and analysis packages (commercial and non-commercial). Data analysts may use any of these available tools if the data is packaged into CDF.
• CDF files can be shared between applications and users.
A competing scientific data package, the Hierarchical Data Format (HDF), also exists (see the web site given in (12)). HDF is a powerful data tool that can be used for manipulating hierarchical types of data. On the other hand, HDF was developed more recently and is still evolving; as a result, it is less widely used and is supported by fewer tools and organizations than CDF. Studying HDF is one of our on-going research topics, and the outcome will be reported in the future.
ACKNOWLEDGEMENTS This research was supported in part by the Northland Advanced Transportation Systems Research Laboratories (NATSRL) at the University of Minnesota Duluth and the Minnesota Department of Transportation.
REFERENCES
1. U.S. DOT ITS, Archived Data User Service (ADUS), "ITS Data Archiving: Five-Year Program Description," U.S. DOT, ADUS Program, March 2000.
2. Margiotta, R., ITS as a Data Resource: Preliminary Requirements for a User Service. Report FHWA-PL-98-031, Federal Highway Administration, Washington, DC, April 1998.
3. Turner, S.M., Eisele, W.L., Gajewski, B.J., Albert, L.P., and Benz, R.J., ITS Data Archiving: Case Study Analysis of San Antonio TransGuide Data. Report FHWA-PL-99-024, Federal Highway Administration, Texas Transportation Institute, College Station, Texas, August 1999.
4. Chen, C., Petty, K., Skabardonis, A., Varaiya, P., and Jia, Z., "Freeway Performance Measurement System: Mining Loop Detector Data," Transportation Research Record 1748, Paper No. 01-2354.
5. Varaiya, P., "Freeway Performance Measurement System: Final Report," University of California Berkeley, California PATH Working Paper UCB-ITS-PWP-2001-1.
6. National Space Science Data Center, CDF User's Guide, Version 2.7, April 2, 2002.
7. Goucher, G.C. and Mathews, S.S., "A Comprehensive Look at CDF," NSSDC/WDC-A-R&S 94-07, NASA/Goddard Space Flight Center, August 1994.
8. Treinish, L.A., "Data Structures and Access Software for Scientific Visualization," A Report on a Workshop at Siggraph '90, Computer Graphics, 25, No. 2, April 1991.
9. Treinish, L.A. and Goucher, G.W., "A Data Abstraction for the Source-Independent Storage and Manipulation of Data," National Space Science Data Center Technical Paper, NASA/Goddard Space Flight Center, August 1988.
10. Treinish, L.A. and Gough, M.L., "A Software Package for the Data-Independent Storage of Multi-Dimensional Data," EOS Transactions, American Geophysical Union, 68, pp. 633-635, 1987.
Web References
11. CDF: http://nssdc.gsfc.nasa.gov/cdf/cdf_home.html
12. HDF: http://hdf.ncsa.uiuc.edu/
List of Tables
Table 1: 30 sec Traffic Data in a Table Form
Table 2: Variances Specification for Traffic Data CDF
Table 3: File Sizes of CDF with Different Compression Options
Table 4: Summary of Archival/Retrieval Performance Test Results
Table 1. 30 sec Traffic Data in a Table Form

Record Number | Detector ID | Time     | Volume | Occupancy
1             | 2           | 00:00:00 | 10.0   | 5.0
              |             | 00:00:30 | 1.0    | 2.0
              |             | 00:01:00 | 2.0    | 1.0
              |             | ...      | ...    | ...
              |             | 23:59:30 | 2.4    | 3.0
2             | 3           | 00:00:00 | 2.0    | 1.0
              |             | 00:00:30 | 3.0    | 2.5
              |             | 00:01:00 | 2.0    | 1.0
              |             | ...      | ...    | ...
...           | ...         | ...      | ...    | ...
3999          | 4182        | 00:00:00 | 0.0    | 0.0
              |             | ...      | ...    | ...
4000          | 4183        | 00:00:00 | 3.0    | 4.5
              |             | 00:00:30 | 2.0    | 3.0
              |             | ...      | ...    | ...
              |             | 23:59:30 | 2.0    | 1.0
Note: Detector IDs are not necessarily sorted as shown above, nor do they have a sequential relation to their locations. No specific rules exist for naming the detector IDs, which are presently managed by the Mn/DOT TMC.
Table 2. Variances Specification for Traffic Data CDF

rVariable   | Record Variance | First Dimension Variance
Detector ID | TRUE            | FALSE
Time        | FALSE           | TRUE
Volume      | TRUE            | TRUE
Occupancy   | TRUE            | TRUE
Table 3. File Sizes of CDF with Different Compression Options

Level | Uncompressed CDF | RLE     | Huffman | Adaptive Huffman | GZIP
1     | 67.6 MB          | 45.4 MB | 36.3 MB | 29.8 MB          | 18.7 MB
2     | -                | -       | -       | -                | 17.8 MB
3     | -                | -       | -       | -                | 16.9 MB
4     | -                | -       | -       | -                | 16.6 MB
5-8   | -                | -       | -       | -                | ...
9     | -                | -       | -       | -                | 15.3 MB
Note: The above table was created using Mn/DOT's 30 sec data (about 4,000 detectors) for 2/11/2001. MB stands for megabytes.
Table 4. Summary of Archival/Retrieval Performance Test Results

               | CDF (GZIP Level 4**) | RDBMS   | Binary Uncompressed
Archival Time  | 5 minutes            | 6 hours | N/A
Archive Size   | 16.6 MB              | 370 MB  | 40 MB*
Retrieval Time | 2 minutes            | 2 hours | N/A
* This is the data size only. The actual size in the Windows 2000 file system is 44 MB due to the allocated cluster size.
** Users can select nine different levels of GZIP compression in CDF. Level 4 was used for the above test runs.