High Availability Archive for Time-Based Data Tobias Haag, Thomas Meier and Izudin Dzafic
Abstract An approach for archiving time-based data is presented. The basic idea is to split the data intelligently into separate time-dependent areas having different logical and physical properties. The approach is especially useful and sustainable when dealing with just a few but very big data tables, when all-time availability becomes a principal issue, or when a fast restore is needed in case of a disaster recovery, even if the database reaches sizes over one TB.

Keywords Archive systems · Data storage · Backup and restore · Power systems · Huge database · Oracle · Short Term Area · Mid Term Area · Long Term Area
1 Introduction

Historical data collected in industry processes is generally becoming more and more important for direct operational work, optimization decision support systems [1], disturbance studies and even quantitative risk analysis [2]. Over the past years the number of measurements has increased significantly, and so has the total amount of data to be stored.

T. Haag (corresponding author) · T. Meier · I. Dzafic, Siemens AG, Humboldtstr. 59, 90459 Nuremberg, Germany
e-mail: [email protected]
T. Meier, e-mail: [email protected]
I. Dzafic, e-mail: [email protected]
James J. (Jong Hyuk) Park et al. (eds.), Computer Science and Convergence, Lecture Notes in Electrical Engineering 114, DOI: 10.1007/978-94-007-2792-2_20, © Springer Science+Business Media B.V. 2012
Fig. 1 Logical and physical data storage (the Short Term Area resides on RAID 1/10; data is moved to the Mid Term Area on RAID 5 and exported to Long Term Area files, from which Mid Term data can be reloaded or restored)
Archiving terabytes of time-dependent data leads to some technical problems. Usually in such systems a huge amount of dynamic data is stored with a high throughput of up to several thousand records per second. Terabytes of numerical values, status data and system messages, all of them addressed by timestamp, might be stored in just a few big data tables. Static data like configurations and master data, however, is stored in many tables, but its total volume is more or less negligible for the overall size of the database. This is also true for Power Systems, where the described solution is already in use.

On the one hand, recently archived data is of greatest interest. For example, a Power System operator uses trending based on very recently archived values in order to compare data and to control the power network. If necessary, this data must be edited manually. Data availability is vital and very time critical. On the other hand, data must also be kept online for up to a few years in order to support reporting and other significant investigations. The corresponding queries are usually less time critical than the queries of the operators, but must be available for monthly or yearly reports. Manual changes, however, are quite unlikely for non-recent data.

Finally, old data must still be available for several years for legal reasons; 10 years and more are quite usual. It is commonly accepted that this old data can be exported to files and stored elsewhere, for example on tapes. If necessary, it must be reloaded into the archive system first.

Another requirement is that the database must be available again after a short time in case of a failure, even if the database is restored from a backup. One way to speed up recovery is to reduce the size of the backups. Since all data must still be backed up, the data which shall be available soon has to be backed up in a different way than older data.
For Power Systems this means that the data of, for example, the last three months has to be available quickly, so that the system can store new values into the database again. The restore of older data may take more time, since it is not critical for operational use cases. Because the old data has to be exported to files for offline archiving anyway, these files can also be used as a backup (Fig. 1). For the outlined solution the database is split into three parts (logical locations):

1. Short Term Area: contains the most recent data.
2. Mid Term Area: contains older data. It is still online, but the data is considered to be read-only. The Mid Term Area should be stored on separate disks.
Fig. 2 Entity-relationship diagram: archive tables. ARCHIVE_TABLE_GROUP (PK TABLE_GROUP_ID; TABLE_GROUP_NAME, TRANSFER_RANGE_HOURS, EXPORT_DIRECTORY, IMPORT_DIRECTORY, PARTITION_RANGE_UNIT) has ARCHIVE_TABLE children (PK TABLE_ID; FK TABLE_GROUP_ID; TABLENAME, TABLESPACE, LOCATION, TIME_STAMP_FROM, TIME_STAMP_TO, PARTITION_RANGE), which in turn have STORAGE_UNIT children (PK STORAGE_UNIT_ID; FK TABLE_ID; PARTITION_NAME, TIME_STAMP_FROM, TIME_STAMP_TO, TRANSFER_STATUS, VERSION).
3. Long Term Area (LTA): consists of data that has been exported to external files. It has to be reloaded if a user wants to access it. Because the files are already created when moving data from the Short Term Area to the read-only Mid Term Area, they incidentally also constitute a backup of the Mid Term Area.

This split implies some severe logical consequences. As the data is no longer stored in one single relational table, its actual (and time-varying) location must be known for access. This task is solved by an API, which defines an interface and returns the data from the correct tables. The API can also initiate a reload of Long Term data in case the requested time range is stored completely or partly in the Long Term Area. Since the data of the Mid Term Area was already exported to Long Term Area files, modifying data in the Mid Term Area can no longer be allowed; otherwise these changes would get lost in case of a recovery. Because the API is also used for writing/editing data, it must also enforce this important restriction.

The Mid Term Area can be stored on a RAID 5 ("RAID" = Redundant Array of Independent Disks) because there is mainly read access, except for the movement of data from Short to Mid Term and from Mid to Long Term. The Short Term Area should be stored on a RAID 10 or RAID 1 for best read and write performance.
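The location lookup the API performs can be sketched as a query against the bookkeeping tables of Fig. 2. The table and column names follow the diagram; the query itself and the LOCATION values are illustrative assumptions, not the actual HIS implementation:

```sql
-- Hypothetical sketch: find the physical tables holding data for one
-- archive group within a requested time range. LOCATION tells the API
-- whether the rows are in Short Term, Mid Term, or only in Long Term files
-- (in which case a reload has to be initiated first).
SELECT t.TABLENAME,
       t.LOCATION,            -- assumed values: 'SHORT', 'MID', 'LONG'
       t.TIME_STAMP_FROM,
       t.TIME_STAMP_TO
  FROM ARCHIVE_TABLE t
  JOIN ARCHIVE_TABLE_GROUP g ON g.TABLE_GROUP_ID = t.TABLE_GROUP_ID
 WHERE g.TABLE_GROUP_NAME = :group_name
   AND t.TIME_STAMP_FROM < :range_end      -- time ranges are left-justified:
   AND t.TIME_STAMP_TO   > :range_start;   -- the end time is excluded
```

The overlap condition mirrors the left-justified time ranges described above: a table qualifies when its half-open interval [TIME_STAMP_FROM, TIME_STAMP_TO) intersects the requested range.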
2 Algorithm of Data Transfer

The described solution is implemented based on Oracle Enterprise Edition [3] with the Partitioning Option [4]. Although Oracle already provides all necessary basic functionality, the actual solution must still be implemented; currently Oracle does not directly offer any feature like a "Short, Mid or Long Term solution". As the API and also the internal management of the data must know where the data is located, bookkeeping tables store the necessary information relationally (Fig. 2). The information is split into the following parts:

1. Data archives can be spread over Short, Mid and Long Term. The table group joins these areas logically together. Therefore every data archive is assigned to one table group consisting of Short, Mid and Long Term parts.
Fig. 3 Entity-relationship diagram: long term archiving. STORAGE_UNIT (PK STORAGE_UNIT_ID; FK TABLE_ID; PARTITION_NAME, TIME_STAMP_FROM, TIME_STAMP_TO, TRANSFER_STATUS, VERSION) is linked via STORAGE_UNIT_IN_LTA_UNIT (PK/FK STORAGE_UNIT_ID, LTA_UNIT_ID) to LTA_UNIT (PK LTA_UNIT_ID; FK TABLE_GROUP_ID; TIME_STAMP_FROM, TIME_STAMP_TO, EXP_TIME_STAMP_FROM, EXP_TIME_STAMP_TO, USER_COMMENT, STATUS), which owns LTA_FILE entries (PK LTA_FILE_ID; FK LTA_UNIT_ID; FILE_NAME, ARCHIVED, RESTORE_STATUS). LTA_RELOAD_JOB (PK LTA_JOB_NO; STATUS, NAME, TIME_STAMP_FROM, TIME_STAMP_TO, USER_ID, EXPIRATION_DATE, RESTORE_FLAG, CANCEL_FLAG, IMP_DIRECTORY) references LTA units via LTA_UNIT_IN_JOB (PK/FK LTA_JOB_NO, LTA_UNIT_ID).
2. The underlying data table. In case of the Long Term Area this can also be a "logical" table, which does not exist physically in the relational database; it is nothing but a bookkeeping entry. Thus a table group consists of the following tables stored in different physical locations:

(a) one physical Short Term Area table
(b) one physical Mid Term Area table
(c) one "logical" Long Term Area table (just a bookkeeping entry)
(d) 0 to n only temporarily existing intermediate transfer tables
(e) 0 to n real reload tables, if data was reloaded from LTA to the online system
3. Each table consists of 0 to n storage units, which reflect the Oracle partitions of the Short Term and Mid Term tables. The time range of the data contained in a particular storage unit is recorded. This time range is left-justified (the end time is excluded), because Oracle's range partitioning is used and the partitions are created with the "less than" statement. A storage unit's status changes during the move from Short Term to Long Term. The status must be recorded in great detail in order to enable the process to resume in any situation without manual actions.

4. The files of the Long Term Area (LTA) are grouped in LTA units. One LTA unit consists of one or more files and one or more storage units. Usually one LTA unit consists of one file only, but it is possible to limit the size of the LTA files. Also Oracle's parallel export option could be used if an LTA unit consists of more than one file (Fig. 3).

The whole mechanism works in the background. All background jobs are implemented in PL/SQL and started by the Oracle Scheduler [5, 6]. The following steps have to be executed regularly:
1. The partitions for the Short Term tables must be created and the information must be stored in the storage units table. As new values in Power Systems get the current timestamp, it is not necessary to create too many partitions in advance. Nevertheless, partitions for at least some days should be available, so there is enough time to react to potential technical problems (for example no space left on disk) without disturbing the running applications.

2. The partitions must be analyzed for better performance. The statistics on the tables/partitions are used by the Oracle optimizer to get the best read performance. Analyzed but empty partitions which are filled up later decrease the performance until the partitions are analyzed again. A separate background job is used for statistics gathering; since it is known when the data is stored, it guarantees that only filled partitions are analyzed.

3. The data must be moved from the Short Term Area to the Mid Term Area as soon as the Short Term retention period is reached. This is the core functionality and it must be done in the following steps:

(a) Merge partitions into bigger partitions. This step is optional, but it helps to reduce the number of partitions in Mid Term. It can be used to increase the retention period of the Mid Term, because Oracle restricts the maximum number of partitions in one table (about 32,000 partitions). Experience shows, however, that access to partitioned tables already slows down if the table has more than 10,000 partitions.

(b) Create a table with the exact physical layout of the partition. This table will be used later on for the export to files and also for the move to Mid Term. There has to be exactly one table for each partition which has to be moved. For easier maintenance it gets the same name as the partition.

(c) Exchange the data between the partition and the table.
With the command "ALTER TABLE … EXCHANGE PARTITION …" this is only an internal bookkeeping change within Oracle: the extents of the partition are assigned to the table and vice versa. This generates nearly no entries in the redo logs and therefore also none in the archived redo logs, which helps to reduce the space required for archived redo logs during the transfer from Short Term to Mid Term.

(d) The Long Term archive files have to be created now. This is just an export of the transfer tables using the Oracle Data Pump export [7]. Data Pump has a PL/SQL interface [8], which makes it easy to use in background jobs driven by the Oracle Scheduler. The export has to update the LTA unit information. Creating the LTA files has to be finished before moving the data to the Mid Term, otherwise there would be no backup of this data.

(e) The next step is to move the transfer tables to the Mid Term. This is a physical move using the Oracle "ALTER TABLE … MOVE …" command. This statement also produces nearly no entries in the redo logs. It may decrease the fragmentation of the data segment, because the originally inserted data might not have been inserted in index order.
(f) When the tables are moved, the partitions must be created on the Mid Term table.

(g) Afterwards the Oracle "EXCHANGE" command is used to exchange the data from the table into the partition in Mid Term. If there are indexes on the tables, they have to be rebuilt then, otherwise they will not be usable. In case of index-organized tables the rebuild is not necessary.

(h) Drop the empty transfer tables.

(i) Analyze the transferred partitions in Mid Term, as the optimizer statistics get lost during the transfer and would also no longer be valid, because the move command results in an optimized storage of the data by Oracle.

4. When the retention period in Mid Term is reached, the partitions have to be dropped. As the Long Term archive files are already created, only the bookkeeping must be updated to record that the data is now in Long Term.
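The partition management of step 1 and the transfer steps (b)-(h) above can be sketched as the following DDL sequence. All table, partition and tablespace names are illustrative assumptions; the actual HIS schema is not shown in the paper:

```sql
-- Step 1: pre-create Short Term partitions with "less than" bounds
-- (left-justified ranges, end time excluded).
ALTER TABLE MEAS_SHORT_TERM
  ADD PARTITION P20120103 VALUES LESS THAN (TIMESTAMP '2012-01-04 00:00:00');

-- (b) Transfer table with the same structure as the partition,
-- named like the partition for easier maintenance.
CREATE TABLE P20120101 AS
  SELECT * FROM MEAS_SHORT_TERM WHERE 1 = 0;

-- (c) Swap extents between partition and table: pure bookkeeping
-- inside Oracle, almost no redo is generated.
ALTER TABLE MEAS_SHORT_TERM
  EXCHANGE PARTITION P20120101 WITH TABLE P20120101;

-- (d) Export the transfer table to an LTA file (shown here as a Data Pump
-- command line; HIS uses the DBMS_DATAPUMP PL/SQL interface instead):
--   expdp ... TABLES=P20120101 DUMPFILE=lta_20120101.dmp

-- (e) Physically move the table to the Mid Term tablespace.
ALTER TABLE P20120101 MOVE TABLESPACE MID_TERM_TS;

-- (f)+(g) Create the matching Mid Term partition and exchange again.
ALTER TABLE MEAS_MID_TERM
  ADD PARTITION P20120101 VALUES LESS THAN (TIMESTAMP '2012-01-02 00:00:00');
ALTER TABLE MEAS_MID_TERM
  EXCHANGE PARTITION P20120101 WITH TABLE P20120101;

-- (h) Drop the now empty transfer table.
DROP TABLE P20120101;

-- (i) Re-gather optimizer statistics for the transferred partition.
BEGIN
  DBMS_STATS.GATHER_TABLE_STATS(
    ownname => USER, tabname => 'MEAS_MID_TERM', partname => 'P20120101');
END;
/
```

Note that both EXCHANGE PARTITION statements operate on metadata only, which is why the transfer produces almost no redo regardless of the amount of data moved.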
3 Backup and Restore

Some additional commands are necessary for backup and restore to make this approach work [9, 10]. When the tablespaces for the Mid Term are created, they have to be excluded from the backup. RMAN's command "CONFIGURE EXCLUDE FOR TABLESPACE …" excludes a tablespace from all backups made after execution of this command. When a restore is necessary, the Mid Term tablespaces must be dropped first, otherwise RMAN will try to restore them. This can be done with the SQL command "ALTER DATABASE DATAFILE '…' OFFLINE FOR DROP" before the restore command. The tablespaces must also be excluded at recovery: "RECOVER DATABASE SKIP TABLESPACE …". After having successfully restored the Short Term, the Mid Term tablespaces must be dropped and recreated. Finally the configuration to exclude the tablespace from the backup must be executed again, because this configuration was lost with the drop of the tablespace. A PL/SQL procedure recreates the Mid Term tables and updates the internal bookkeeping tables.

Since, logically, the overall time range covered by Mid Term must be contiguous for the API, the restore has to start with the newest data and continue from newest to oldest, which is an important restriction of the Mid Term restoration. With this approach the normal move from Short Term to Mid Term can already be done even if the Mid Term is not yet completely restored. If the restore takes long, it is possible that some data has already reached its retention period and thus no longer has to be restored.

Because range partitions are used and created with the "less than" statement, the partitions have to be split during the restore of the Mid Term. This is supported by Oracle with the "ALTER TABLE … SPLIT PARTITION …" command. Because a split of partitions causes an internal renumbering of the partitions by Oracle, this approach slows down the more data is restored in Mid Term and the more often the split command has to be executed. Therefore all necessary partitions are already created when recreating the Mid Term. In case the restore of Mid Term takes some time and the retention period of an empty partition is reached, the partition is simply dropped by the normal background process and the status of the storage unit is updated.
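The backup and restore handling described above relies on a handful of RMAN and SQL commands. The tablespace, datafile and partition names below are placeholders, and the command sequence is a sketch of the procedure, not the exact HIS scripts:

```sql
-- Exclude the read-only Mid Term tablespace from all future backups (RMAN):
CONFIGURE EXCLUDE FOR TABLESPACE MID_TERM_TS;

-- Before a restore, take the Mid Term datafiles offline so that RMAN
-- does not try to restore them (SQL):
ALTER DATABASE DATAFILE '/oradata/his/mid_term_01.dbf' OFFLINE FOR DROP;

-- Restore the rest of the database and skip the tablespace during
-- recovery as well (RMAN):
RESTORE DATABASE;
RECOVER DATABASE SKIP TABLESPACE MID_TERM_TS;

-- During the refill of Mid Term from LTA files, pre-created partitions
-- avoid repeated (and increasingly slow) splits such as:
ALTER TABLE MEAS_MID_TERM SPLIT PARTITION P_MAX
  AT (TIMESTAMP '2012-01-02 00:00:00')
  INTO (PARTITION P20120101, PARTITION P_MAX);
```

After the Mid Term tablespace is recreated, the CONFIGURE EXCLUDE command has to be issued again, since the exclusion setting is lost with the dropped tablespace.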
4 Restore/Reload of Mid Term Area

Reload and restore are mostly automated. The only exception is that the necessary files are not loaded automatically from a backup medium like tape; some user interaction is still needed. The necessary files are listed by the restore/reload. The load could also be automated, but as different archiving solutions are available, a generic method does not exist. The reload of Long Term data and the restore of the Mid Term Area are nearly the same; just a few more steps are necessary in case of a restore. The following steps are executed during restore/reload:

1. Start of the reload or restore. The time range and the type of data must be specified; in case of a reload this information is already passed by the API. Internally a reload/restore job is created. The jobs are stored in tables, so it is possible to check their status in different sessions. A background job, which is implemented so that it can be restarted in case of any issues, takes over the work.

2. First the time range of the reload/restore is adapted to fit the partition size, because only partitions as a whole can be imported.

3. Now the files to import are identified using the LTA bookkeeping tables. As long as the necessary files are on disk, the reload/restore imports them step by step. If one or more files are missing, the background job lists them so that the user can see them through the UI. As long as the reload/restore job is not cancelled, it retries importing the files after a configured time.

4. In case of a reload, the job is finished after the import. The background job inserts an entry in ARCHIVE_TABLE for each imported table (which was a partition before), so the API can find the reloaded data.

5. In case of a restore, the reloaded table is exchanged with the corresponding partition in Mid Term. The reloaded table is dropped afterwards.
The status of the storage unit is then updated to indicate that the data is available in Mid Term.

At each start of a reload job, the user can specify how long the reloaded data should be kept in the online system. This date can be modified as long as the data is online. It may happen that the same storage unit is part of more than one reload job; in this case the data is only unloaded when all reload jobs have expired. The unload itself is just a drop of the table and an update of the bookkeeping table. The expiration of reload jobs is checked by the same background job which also executes the reload.

Furthermore, a separate tablespace is used in order to limit the maximum amount of reloaded data. This tablespace is not auto-extendable, so the maximum space used for reload is already allocated at installation. If the required space was underestimated, the tablespace size can still be modified using Oracle commands.

Table 1 Database sizes of two productively working reference systems

Data inflow  Online retention period (months)   Size (GB)
(MB/min)     Total   Short Term   Mid Term      Short Term   Mid Term
8            24      3            21            980          6,800
3.5          4       2            2             300          300
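All of the background processing described in Sects. 2-4 is driven by the Oracle Scheduler [5, 6]. A minimal job definition might look like the following; the job name, interval and procedure name are assumptions for illustration:

```sql
-- Hypothetical sketch: run the restartable transfer/reload background
-- procedure every 10 minutes via the Oracle Scheduler.
BEGIN
  DBMS_SCHEDULER.CREATE_JOB(
    job_name        => 'HIS_TRANSFER_JOB',
    job_type        => 'STORED_PROCEDURE',
    job_action      => 'HIS_PKG.RUN_BACKGROUND_WORK',
    repeat_interval => 'FREQ=MINUTELY;INTERVAL=10',
    enabled         => TRUE);
END;
/
```

Because the worked-off state is kept in the bookkeeping tables rather than in the job itself, a crashed run simply resumes where it left off on the next scheduled execution.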
5 Result

The Short/Mid Term solution provides the flexibility to fulfill different customer needs while also assuring a stable overall workload and an acceptable performance, even if Short and Mid Term data accesses are mixed. Table 1 shows two exemplary configurations which have already been in productive use for more than 2 years. The configurations differ due to different use cases: the first system needs data over two years for some reports, which is not required for the second system (Table 1).
6 Conclusion and Risks

This paper presented an approach for reducing the restore time of a large database. The goal was to provide an algorithm which

• makes it possible to restore the most recent data fast;
• does not accept data loss and ensures high data consistency of the Short and Mid Term Area (by moving complete partitions and providing detailed bookkeeping and an intelligent API for data retrieval);
• can be automated;
• uses only Oracle features.

The current version of Siemens' Historical Information System (HIS) already contains all these features, implemented and proven in practice. Very exact bookkeeping is essential for backup, restore and data retrieval by the API. Any failure or even any inaccuracy (e.g. concerning left- or right-justified time ranges) can
cause significant data loss, either logically (data retrieval by the API) or even physically (backup/restore).

By accessing the intermediate transfer tables if necessary, the API is basically designed to provide data access even while the Short to Mid Term data transfer is in progress. However, it cannot be guaranteed in all potential circumstances that data can be read without any gap unless the API logically locks the transfer. HIS intentionally avoids any such lock, because simply reading data must never be able to block the important (background) data transfer.

The coded algorithm is critical with regard to transactional locking issues, error handling and the ability to restart broken jobs. Since Oracle's SQL data definition language commands like "ALTER …" terminate currently open transactions, no regular rollback is possible anymore, and any intermediate changes in the bookkeeping tables become persistent. In case of failure all changes must be undone explicitly. This is why any real implementation must be checked very carefully and tested intensively.
References

1. Niu CL, Yu XN, Li JQ, Sun W (2005) The application of operation optimization decision support systems based on data mining in power plant. In: Proceedings of 2005 international conference on machine learning and cybernetics
2. Rastogi UJ, Srinivas VK, Guptan RG (2010) Databases and their analysis for applications in quantitative risk analysis of NPP. In: 2nd international conference on reliability, safety and hazard (ICRESH), 2010
3. Oracle Database, http://www.oracle.com/technetwork/database/enterprise-edition/overview/index.html
4. Oracle Partitioning, http://www.oracle.com/technetwork/database/options/partitioning/index.html
5. Oracle Scheduler Concepts, http://download.oracle.com/docs/cd/E11882_01/server.112/e17120/schedover.htm#i1106753
6. Oracle Scheduler: Scheduling of Jobs and Administration, http://download.oracle.com/docs/cd/E11882_01/server.112/e17120/scheduse.htm#i1033533, http://download.oracle.com/docs/cd/E11882_01/server.112/e17120/schedadmin.htm#i1007589
7. Oracle Data Pump, http://download.oracle.com/docs/cd/E11882_01/server.112/e16536/part_dp.htm#i436481
8. Oracle Data Pump PL/SQL Interface, http://download.oracle.com/docs/cd/E11882_01/appdev.112/e16760/d_datpmp.htm#i1007277
9. Oracle Backup and Recovery User's Guide, http://www.oracle.com/pls/db112/to_pdf?pathname=backup.112/e10642.pdf
10. Oracle Backup and Recovery Reference, http://www.oracle.com/pls/db112/to_pdf?pathname=backup.112/e10643.pdf