Persistent Data Structure Library for C++ Applications
K. A. T. A. Jayasekara and Sanath Jayasena
Department of Computer Science & Engineering, University of Moratuwa, Katubedda, Moratuwa, Sri Lanka
[email protected];
[email protected]
Abstract
The Persistent Data Library (PDL) manages object persistence in C++ applications. PDL abstracts the persistence machinery and provides an easy programming environment for the programmer. It offers a set of data structures that handle persistence transparently. The data structures in PDL are quite similar to those in the C++ Standard Template Library (STL), but the STL does not provide functionality to make data persistent. Such a library is beneficial when implementing fault tolerance in state-based applications. State-based applications need to checkpoint data periodically and, in case of a failure, need fast recovery of that data. In developing such applications, each time a new state is introduced to the system the programmer must write code to serialize and de-serialize the data. The PDL framework lets the programmer write less serialization and de-serialization code. Because of the direct memory-dumping technique PDL uses, the time taken to write data to disk and to recover data from storage is minimized.
1. Introduction
Most modern applications are developed using object-oriented programming languages, in which application-specific data are represented as objects. Data encapsulated within objects needs to be written to a non-volatile medium for several reasons: 1. the data is needed for later use; 2. it is a method of exchanging information between processes; 3. it allows recovery to the last known state after a failure. These mechanisms are particularly useful when introducing fault tolerance to applications that maintain a state. In passive replication, fault tolerance is achieved by duplicating the process in another instance, but the duplicated instance is not active: only the primary process serves requests or does the actual work.
If the primary process fails, the secondary process takes over. If the running application or process has a state, that state must be the same in the secondary process at the time of takeover. The primary process therefore periodically writes its state to a shared storage, and the secondary process updates its state by reading that storage. As far as business functionality is concerned, a state is simply the data the application is interested in at a given time. The time taken to make the primary process's state persistent is crucial, since its primary objective is to serve client requests (i.e., it should not burn a lot of processing power in making the state persistent). In addition, the secondary process should read states quickly to make the failure transparent to the client. A common method of writing states in C++ programs is object serialization and de-serialization: in serialization, object data is converted to a byte stream, and in de-serialization, the byte stream is converted back to an object in memory. In the long run this method can lead to errors. For example, each time the programmer adds a new member to an existing class, he/she needs to update the serialization/de-serialization logic; similarly, if the programmer introduces a new class, he/she needs to modify the serialization/de-serialization logic. It is possible that the programmer forgets to write the serialization or de-serialization logic for newly added attributes or classes. Such programming errors can be hazardous because they are not traceable at compile time, and they usually lead to application crashes. In applications, states are usually maintained within object collections (arrays, lists, etc.). Persisting the state involves iterating over each collection, each object within a collection and each member within an object, serializing them and writing them to persistent storage. Since there are a lot of iterations, this
technique consumes a considerable amount of CPU time. There have been several attempts to hide persistence behavior from the usual programming routine, but most of them provide persistence at the object level rather than at the data structure level, and some of those solutions therefore do not provide data deletion capability. Some solutions, or their underlying concepts, cannot be applied directly to applications written in C++. This paper proposes a framework that can be used to support persistence in distributed applications. The presented framework alleviates the above-mentioned drawbacks and makes the serialization and de-serialization functionality transparent to the programmer. It also reduces the iterations over collections and object members by using a memory dumping technique. Our proposed framework exposes a set of data structures quite similar to the data structures in the Standard Template Library (STL), but with persistence capability. The proposed framework is also capable of handling data deletion. The rest of the paper is structured as follows: Section 2 reviews earlier related work, while Section 3 sketches our new framework. In Section 4 we discuss a vital concept in the framework, heap management. How persistence is achieved is discussed in Section 5. PDL data structures are introduced in Section 6. Experimental results are reported in Section 7. Finally, Section 8 concludes the paper. Note that in this paper the term "persist" is also used as a verb, meaning "making data persistent".
2. Related Work
Not much literature could be found in the area of data persistence, and some of it discusses persistence in a different context. Studying that work was nevertheless useful, as it gave us an understanding of persistence in different contexts. Hibernate [5] is one of the most widely used data persistence frameworks. One could say it is more than a data persistence framework, since it also supports transaction control. Hibernate is primarily used in J2EE (Java 2 Platform, Enterprise Edition) [1] applications and is now also available for .NET [4] applications. Hibernate makes use of persistent objects, commonly known as POJOs ("Plain Old Java Objects"), along with XML documents that map the persistent objects to the database layer, and it uses Java reflection to map POJOs to database objects. Hibernate is specific to database persistence; it does not support persistence at a data structure level, and it is not usable in C++ applications.
Jens-Uwe Dolinsky and Thorsten Pawletta introduced a lightweight class library to manage object persistence in C++ applications [2]. Their solution abstracts the persistence mechanism, but it is not implemented at the data structure level and therefore does not use heap management. A method of providing persistence by overloading the "new ()" operator was introduced in [3]. Its main focus was to make data persistent in an object-oriented database, and several flavors of the overloaded "new ()" operator were used to achieve persistence. Again, this method addresses the persistence problem at the object level; it does not provide persistence at a data structure level.
3. Overview of Proposed Approach
PDL provides an application library and an API that can be used to make data persistent easily. PDL includes the main data structures that we use day to day; the current implementation has a stack, a queue, a map and a list, and the library can be extended with other data structures as well. The API mainly describes how to access the memory management module and how to use the data structures.
Figure 1. High-level PDL architecture.
As shown in Figure 1, PDL consists of three main subcomponents. If the application wishes to persist certain objects, they must be created through the PDL memory manager, and the application accesses the PDL data structures through the API. In the application, persistence is implemented at the class level. A class that wishes to persist its attributes must override the "new ()" operator and must implement a method called "DeAllocateMemory". In addition to the type, the programmer needs to pass a reference to the memory manager when invoking the "new ()" operator. The "DeAllocateMemory" method is called when deleting an object, and it also receives a reference to the memory manager. These rules are enforced through C++
inheritance. The following example shows how the "new ()" operator and the "DeAllocateMemory" method are implemented in a persisting object class.

class TestItem : public PDLObject
{
public:
    int i_TestNumber;

    TestItem() {}
    TestItem(int _iTestNumber) : i_TestNumber(_iTestNumber) {}
    TestItem(const TestItem& _rTestItem) : i_TestNumber(_rTestItem.i_TestNumber) {}

    void* operator new (size_t _size, MemoryManager* _pManager)
    {
        TestItem* pself = (TestItem*)(_pManager->AllocateMemory(sizeof(TestItem)));
        pself->i_TestNumber = 0;
        return pself;
    }

    void DeAllocateMemory(MemoryManager* _pManager)
    {
        _pManager->DeAllocateMemory((char*)this, sizeof(TestItem));
    }

    virtual ~TestItem() {}
};
The proposed API is quite similar to the Standard Template Library API; data structures are created and manipulated in much the same way as their STL counterparts. The following example shows how a PDL list is used.

MemoryManager oMM;
PDLList<TestItem>* pList = new (&oMM) PDLList<TestItem>;
for (int i = 0; i < 100; ++i)
{
    TestItem oItem(i);
    pList->push_back(&oMM, oItem);
}
pList->DeCompose(&oMM);
oMM.DumpMemory("Test1.dmp");
As the above example shows, PDL can make the data in a collection persistent at any time by calling the memory manager's "DumpMemory" method. The following example shows how data can be loaded back into the data structures.

MemoryManager oMMLoad(true);
oMMLoad.LoadMemory("Test1.dmp");
PDLList<TestItem>* pListLoaded = (PDLList<TestItem>*)oMMLoad.GetBasePointer();
pListLoaded->Compose(&oMMLoad);
assert(pListLoaded != NULL);
int i = 0;
PDLList<TestItem>::PDLList_Iterator_t iteLoaded = pListLoaded->begin();
for (; iteLoaded != pListLoaded->end(); ++iteLoaded)
{
    assert(i == (*iteLoaded).i_TestNumber);
    ++i;
}
When loading data, the application has to call the memory manager's "LoadMemory" method. Afterwards, the application can iterate over the data structures in the loaded file.
4. Heap Management
In PDL, the memory manager plays a vital role. All objects that need to be persisted must be created inside the PDL memory manager; that is, all such objects must be created on the heap using the "new ()" operator. This heap memory area is not managed by the programmer, however, but by the PDL memory manager.
4.1. Memory Blocks
At instantiation, the memory manager allocates a block of memory from the heap (internally PDL uses the "malloc" function to allocate memory), and this initial block size is configurable. Persisting objects are created within this allocated memory block, and allocated and free memory inside the block is tracked using a linked list. If an application outgrows the initially allocated memory block, the memory manager allocates another block, whose size is "K" times the previous block size. "K" is a configurable constant; it is up to the programmer to decide its value, and the choice also depends on the size of the initially allocated block. If an application instantiates many persisting objects and the initial block size is small, the value of "K" can be made higher; if an application uses few objects and the initial block size is large, "K" should be lower. Figure 2 shows how multiple memory blocks are arranged. This process of allocating a new memory block continues whenever the memory manager runs out of space in a particular block. The memory manager maintains the allocated memory blocks in a list. The start address of each block, the size of the block, the free memory arrangement within the block and the size of the largest free node within the block are maintained as "meta" information. This meta information is used when deleting instantiated objects, when allocating new memory and when making data persistent.
Figure 2. How multiple memory blocks are arranged.
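To make the block arrangement concrete, the following is a minimal sketch of a block list that grows by a factor of K, written under our own assumptions; the type and member names (MemoryBlock, BlockList, Grow) are ours for illustration and are not part of PDL's published interface.

#include <cstdlib>
#include <vector>

// One heap block obtained via malloc, plus the meta information described above.
struct MemoryBlock
{
    char*       pBase;        // start address of the block
    std::size_t size;         // total size of the block
    std::size_t largestFree;  // size of the largest free node (meta information)
};

class BlockList
{
public:
    BlockList(std::size_t initialSize, std::size_t growthFactorK)
        : k_(growthFactorK)
    {
        AddBlock(initialSize);
    }

    // Called when no existing block can satisfy a request:
    // the new block is K times the size of the previous one.
    MemoryBlock& Grow()
    {
        return AddBlock(blocks_.back().size * k_);
    }

private:
    MemoryBlock& AddBlock(std::size_t size)
    {
        MemoryBlock b;
        b.pBase = static_cast<char*>(std::malloc(size));
        b.size = size;
        b.largestFree = size;   // a freshly allocated block is entirely free
        blocks_.push_back(b);
        return blocks_.back();
    }

    std::vector<MemoryBlock> blocks_;
    std::size_t k_;
};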
4.2. Layout inside a single memory block
For each memory block, the memory manager keeps track of allocated and free memory, organized as a linked list within the block. The operating system uses a similar methodology to manage allocated and free memory [6].
Figure 3. Memory management with linked lists.
Figure 3 shows how memory is managed using linked lists. Each node represents an allocated or free memory area and contains the following information about that area: 1. whether the memory is allocated or free, 2. the start address, 3. the size of the allocated or free area. "A" represents allocated memory and "F" represents free memory. In the following sections we refer to this linked list as the "allocated-free linked list".
4.3. Memory allocation
The memory manager is asked to allocate a given amount of memory (the "AllocateMemory" function). It first selects a suitable memory block; to do so, it traverses the block list until it finds a block with enough free space, deciding whether a particular block has sufficient space by comparing the requested size with the size of the largest free node within that block. After selecting a suitable block, the memory manager traverses the allocated-free linked list until it finds a free area that can accommodate the requested size. When such a free area is found, the memory manager modifies the allocated-free linked list and returns the assigned start address. Several cases have to be considered when modifying the allocated-free linked list; here we discuss only the most primitive case, as the other cases are handled in a similar way. Assume we have a memory block of size M and the memory manager is requested to allocate X bytes.
Figure 4. Linked list node structure after allocating memory.
Before any memory is allocated, the allocated-free linked list has a single free node of size M. Figure 4 shows the node structure after allocating X bytes. As the figure shows, allocating memory only involves creating a new node in the list and returning the start address of the allocated memory (M - X).
4.4. Memory de-allocation
The memory manager is asked to de-allocate memory, given the start address and size of the memory to de-allocate. By comparing the base addresses of the memory blocks with the given start address, the memory manager determines from which block the memory has to be de-allocated. After finding the relevant block, it traverses the allocated-free linked list within that block, locates the corresponding node and modifies it so that the de-allocation is reflected.
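To make the preceding description concrete, the following is a rough sketch of an allocated-free linked list node and a first-fit allocation over it. The node layout and function below are our reconstruction for illustration only, not PDL's actual implementation; error handling and node coalescing on de-allocation are omitted.

#include <cstddef>

struct AllocNode
{
    bool        allocated;   // true = "A" (allocated), false = "F" (free)
    char*       start;       // start address of this area
    std::size_t size;        // size of this area in bytes
    AllocNode*  next;
};

// First-fit allocation within one block: find a free node large enough,
// carve the requested size out of its tail and mark that part as allocated.
char* AllocateFromBlock(AllocNode* head, std::size_t requested)
{
    for (AllocNode* n = head; n != nullptr; n = n->next)
    {
        if (!n->allocated && n->size >= requested)
        {
            n->size -= requested;   // shrink the free node (may leave a zero-sized node)
            AllocNode* a = new AllocNode{true, n->start + n->size, requested, n->next};
            n->next = a;
            return a->start;        // start address of the newly allocated area
        }
    }
    return nullptr;                 // caller would then grow the block list
}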
5. Persisting and Loading
5.1. Making data persistent
PDL can make the data in memory persistent by calling the memory manager's "DumpMemory" method.
PDL needs to make two kinds of data persistent: PDL-specific data (metadata) and application-specific data. Application-specific data is saved to disk as a file using a memory dump mechanism, and the PDL metadata is written in serialized form to the same file. Application-specific data is stored in the memory blocks. Before the memory inside the blocks is dumped, the memory addresses must undergo a process called "memory decomposition": the absolute addresses within the blocks are converted to values relative to each memory block's base address. The memory is written to disk after its address values have been decomposed.
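As an illustration of this decomposition step (a minimal sketch under our own assumptions; the function name and signature are not PDL's), a stored pointer can be rewritten as an offset from its block's base address before the block is dumped:

#include <cstdint>

// Replace the absolute pointer stored at *pPointerField with its offset
// from the base of the memory block the pointed-to object lives in.
void DecomposePointer(void** pPointerField, char* pBlockBase)
{
    std::uintptr_t offset =
        reinterpret_cast<std::uintptr_t>(*pPointerField) -
        reinterpret_cast<std::uintptr_t>(pBlockBase);
    *pPointerField = reinterpret_cast<void*>(offset);  // the offset is what gets dumped
}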
Figure 5. Address values before decomposition and after decomposition.
Figure 5 shows how pointer values change to their relative values after decomposition. Address values in the allocated-free linked lists are changed accordingly. The objective of this process is to identify address values within a block by an offset. During the memory dump these converted values are written to persistent storage.
5.2. Loading Data
When loading data that was previously made persistent, PDL starts reading from the beginning of the file. The beginning of the file holds the number of blocks; once PDL discovers the number of blocks, it starts to create memory blocks. By reading the metadata, PDL discovers, for each block, the amount of memory that needs to be allocated and the values inside the allocated-free linked list. When PDL finishes reading the node information it restores the internal structures (linked lists). PDL uses the "malloc" function to allocate the memory blocks, and the dumped memory is loaded into these newly allocated blocks. After loading, the relative addresses need to be replaced with the new addresses: every stored pointer is converted back to an absolute address, calculated by adding its offset value to the new block base address. The node information within the allocated-free linked lists also needs to be updated appropriately.
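The counterpart of the decomposition sketch above might look as follows (again with our own illustrative names, not PDL's API): after the dump has been read into a freshly allocated block, each stored offset is turned back into an absolute address by adding the new block's base address.

#include <cstdint>

// Turn the offset stored at *pPointerField back into an absolute address
// within the newly allocated block that now holds the loaded data.
void ComposePointer(void** pPointerField, char* pNewBlockBase)
{
    std::uintptr_t offset = reinterpret_cast<std::uintptr_t>(*pPointerField);
    *pPointerField = pNewBlockBase + offset;  // valid again in the new address space
}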
6. PDL Data Structures
PDL provides a set of data structures with persistence capability; lists, maps, multimaps, queues and sets are among them. In usage, these data structures are quite similar to STL data structures. They use the memory manager to store their internal data and use the methodology described above to persist and reload data.
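As a hypothetical usage sketch, a PDL map could be used in the same style as the PDLList example in Section 3. The class name PDLMap and the insert signature below are our assumptions, made by analogy with the list API shown earlier; the paper does not spell out the map interface.

MemoryManager oMM;
PDLMap<int, TestItem>* pMap = new (&oMM) PDLMap<int, TestItem>;  // assumed class name
for (int i = 0; i < 100; ++i)
{
    TestItem oItem(i);
    pMap->insert(&oMM, i, oItem);  // assumed signature, by analogy with push_back
}
pMap->DeCompose(&oMM);            // convert internal pointers to offsets
oMM.DumpMemory("MapTest.dmp");    // persist all memory blocks to disk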
7. Experimental Results
This section presents performance comparison results between certain STL data structures and PDL data structures. To write the data held in STL containers to disk, we used conventional serialization and de-serialization. The experiments were carried out on a machine with the following configuration: Operating System - Microsoft Windows XP Professional, Version 2002; CPU - Intel Pentium IV, 2.40 GHz; Memory - 256 MB RAM; Disk speed - 7200 RPM. The initial block size for PDL was 64 * 1024 bytes and the value of K was set to 2. We used a single class when instantiating objects for both STL and PDL containers. Below we compare the performance of persisting and loading data for PDL data structures and STL data structures.
Figure 6. Time to make data persistent in a list
According to the graph in Figure 6, the STL list and the PDL list take more or less the same time. When the number of data items is less than 27 x 10,000 the PDL list takes less time than the STL list, but beyond 27 x 10,000 items the PDL list's performance drops. This is mainly due to having more than one memory block. When the number of data items is below 27 x 10,000, PDL operates with only a single memory block, but when it reaches that margin PDL extends the memory by adding another block. Adding another block results in writing more metadata: the PDL memory manager has to iterate through the blocks and store each block's data as well as its metadata. Because of that, PDL's performance is slightly reduced.
Figure 7. Time to load data in a list.
According to Figure 7, when PDL operates with a single memory block it takes less time to load data into the list than the STL list does, but when the number of blocks increases its performance drops. When there are several memory blocks, the PDL memory manager needs to iterate through them and load them into memory; this process is slightly slower than conventional de-serialization. Thus each time a new memory block is created there is a peak in the PDL list's performance graph. When a new block is created, some extra information has to be written to disk, and when loading, that metadata has to be de-serialized and loaded into memory. Therefore, when there is more than one block, PDL's loading performance is slightly lower.
8. Conclusion
The presence of a framework that transparently handles persistence would be very useful in distributed C++ applications. Such a framework alleviates the need for the conventional serialization and de-serialization techniques, which are error prone. PDL transparently handles persistence to a great extent. The key idea behind PDL is to manage the heap and store application data as a memory dump; PDL acts as a layer between the application and the operating system memory. The programmer does not need to think about serialization and de-serialization at all. PDL can be used in a high-performance environment and can be used efficiently in an object-oriented programming environment. Furthermore, PDL has an extensible memory model. Persisting PDL data structures shows equal or better performance than standard serialization and de-serialization when there is a single memory block, but as the number of memory blocks increases, PDL data structure performance is slightly reduced. PDL data structures are similar to STL containers: both use the same algorithms, both are based on C++ templates, and both use partial specialization to reduce logic duplication in code.
10. References
[1] Oracle Corporation and/or its affiliates, "Java 2 Platform, Enterprise Edition (J2EE) Overview", http://java.sun.com/j2ee/overview.html, January 2010.
[2] Jens-Uwe Dolinsky and Thorsten Pawletta, "A lightweight class library for extended persistent object management in C++", Software - Concepts & Tools, 19:71–79, 2004.
[3] Mi Young Lee, Ok Ja Cho, and Dae Young Hur, "Method of providing persistence to object in C++ object oriented programming system", Patent Number 6275828, http://www.freepatentsonline.com/6275828.html, August 2001.
[4] Microsoft, "Microsoft .NET Framework", http://www.microsoft.com/NET/, January 2010.
[5] Red Hat Middleware, LLC, "Hibernate - powerful, high performance object/relational persistence and query service", https://www.hibernate.org/, January 2010.
[6] Andrew S. Tanenbaum, Modern Operating Systems, Prentice Hall, pages 190–262, 2001.