A Framework For User Driven Data Management

Mark Scott, Richard P. Boardman, Philippa A. Reed, Tim Austin, Steven J. Johnston, Kenji Takeda, Simon J. Cox

Faculty of Engineering and the Environment, University of Southampton, SO17 1BJ, United Kingdom
Abstract

Scientists within the materials engineering community produce a wide variety of data, with datasets differing in size and complexity. Examples include large 3D volume densitometry (voxel) files generated by microfocus computed tomography (µCT) and simple text files containing results from a tensile test. Increasingly, there is a need to share this data as part of international collaborations. The design of a suitable database schema and the architecture of a system that can cope with the varying information is a continuing problem in the management of heterogeneous data. We present a model flexible enough to meet users' diverse requirements. Metadata is held in a database whose design allows users to control their own data structures. Data is held in a file store which, in combination with the metadata, gives huge flexibility. Using examples from materials engineering we illustrate how the model can be applied.

Keywords: Digital libraries, Document management, Heterogeneous databases, Scientific databases, Dissemination, Data models

1. Introduction
1.1. Guiding Materials Datasets
With the volume of scientific research data increasing rapidly [1, 2], and institutions and funding councils spending large amounts of money on the resources required to produce it, proper curation and dissemination could have huge advantages for future work. The Materials Data Centre (MDC) [3] is a UK Government (JISC) funded project to establish a repository promoting data capture and management in the engineering materials domain [4, p. 5]. Responses to a questionnaire distributed at the start of the project identified the use of many types of data in materials engineering [4]. Some engineers need to store highly structured data, such as MatDB [5], a standard schema able to represent data from tensile, creep and fatigue tests as XML. Others just want a large data store where they can save raw, unprocessed image data until they are ready to use it. A few require the ability to create information sheets about the materials they use and link them to the results of experiments, and some just want to store raw CSV files containing their results in order to support findings in a paper. The variety of dataset types, sizes, metadata and relationships between datasets creates a complex storage, management and retrieval problem. The model we present in this paper, created for the Materials Data Centre project and aimed at materials engineers, addresses these issues by permitting storage of heterogeneous data, accompanying metadata and relationships.
The data from the following materials engineering tests were identified for storage in a prototype system. The references provided are to materials engineering research produced by engineers (the prospective users of the Materials Data Centre), giving concrete examples of the data they produce and how it is used in their work.

• Tensile test [6, 7]
• S-N fatigue test [8, 9]
• Fatigue crack growth [10, 11]
• Impact/toughness test [12, 13]
• Fractography [14, 15]

The datasets generated by these tests are diverse in size and complexity. Results produced by tensile and fatigue tests can be as simple as comma separated numeric data, while fractography uses computed tomography and produces files containing 3D voxel densitometric data which can be 32 GB each, and this is likely to increase as technology improves. The relationships between these tests are also important, as can be seen in Figure 1. For example, the results of a tensile test can be used to calculate how a crack will grow in a fatigue crack growth test, and a material's composition and other particulars are useful when performing or reviewing the results of most tests.
∗ Corresponding author. Email address: [email protected] (Mark Scott)

Preprint submitted to Elsevier, March 10, 2014
1.2. Architecture

To fully support the wide ranging data types and relationships, we combined a file system with a database for holding metadata. A synchronisation service keeps the metadata held in the database consistent with the file system, and the database allows users to create their own data structures without affecting others or requiring an administrator. Choosing a file store as the base of the design ensures flexibility and reliability, allows the system to seamlessly take advantage of developments in file system technologies, and lets it benefit from more advanced file systems such as a distributed file system. The final system we present provides the following features:

• Storage and sharing of any data files, limited in size only by the file system
• The ability for users to define their own data structures, with nesting, for situations where the data is not defined in a data file
• Addition of metadata to a dataset
• Support for the relationships between datasets
• Predefined metadata that can be created as a template, and metadata that can be copied between datasets
• Features that encourage use of the system, such as dataset recommendations, blogs, wikis and message boards
• Plugins, allowing the system to provide customised reports and tools depending on the data

In section 2 we discuss some of the related work that has influenced this architecture. Then we analyse some of the decisions behind the choice of technologies before giving detail regarding the structure of the database and how it can be used for flexible data management. In section 6 we show some examples of how the model has been used in the MDC project. Finally, we analyse our findings and draw some conclusions.

[Figure 1: Relationships between the different types of test data, as represented in the Materials Data Centre. Material classification (heat treatment, composition, microstructure) feeds the tensile test, S-N fatigue data, long crack fatigue/threshold/growth and fractography; for example, the tensile test supplies the σy → ry calculation and defines stress levels for fatigue testing.]

2. Related work

Systems with such wide ranging sets of data have been attempted in the biomedical domain. Entity-Attribute-Value (EAV), a method of structuring data in a database described in section 3.1.2, was used in the early 1980s for medical record management, where doctors could define their own data dictionaries to track diagnoses and treatments of patients [16]. The history of EAV is comprehensively described in [17]. EAV was later augmented with the EAV with classes and relationships (EAV/CR) approach, which permits the modelling of complex objects as well as management of relationships between those objects [18, 19, 20].

Work in other research domains, such as chemistry with eCrystals [21] and biology with OME [22], indicates that a Materials Data Centre would be valuable, but the breadth of data types makes its architecture a challenge. Technologies such as EPrints [23] and DSpace [24] also contain many of the desired features because they permit storage of publications and can be adapted for data. This was shown by the eBank project, which established the eCrystals repository to manage and disseminate metadata relating to crystal structures, and which investigated linking datasets from Grid/Cloud-enabled experiments to open data archives and through to peer-reviewed articles using aggregator services. This allowed crystal structure data to be provided with a paper for a reader to check validity.

The LabTrove electronic notebook system [25] allows users to connect sources of data to a blog. Automatic publishing of data from a testing machine might also be possible, as the project has begun looking at monitoring a file share and automatically creating blog entries. The system is mature and could be used to encourage the sharing of data within an institution.

The approach taken by the BioSimGrid project [26] to promote data sharing between biochemists used a distributed file store—specifically SRB [27]—combined with a relational database. Data and files were replicated across participating sites in Europe and the United States. The structure of the BioSimGrid database was solely for biomolecular simulation metadata; MDC, in contrast, will be used to store very general data. Similar architectures have been created in the Earth Science and Environment domains [28], built upon distributed collaborative technologies [29]. The Atmospheric Data Discovery System [30] showed the convenience of using a portal to browse stored data before downloading it, saving on time and bandwidth. The
system, which processes and indexes meteorological data in BUFR format from the atmospheric science domain, permitted users to publish the BUFR metadata, query and search over the datasets, and browse the metadata through a web browser before downloading.

The Human Genome Project was a 15-year international project to identify all nucleotides in human chromosomes, with up to 100 000 genes each having up to 1 million nucleotides [31]. GenBank [32], the European Molecular Biology Laboratory Nucleotide Sequence Database (EMBL) [33] and the DNA Data Bank of Japan [34] are public databases containing nucleotide sequences and bibliographic metadata for more than 300 000 organisms, together including 255 billion nucleotide bases [32]. The EMBL database at present is 1144 GB [35], stored as text data. Newer human genome projects, such as the Ensembl system [36], are moving to relational databases to improve the performance of their analyses. The Utah Center for Human Genome Research developed a database which allowed relationships between objects to be stored as data in the database, in contrast to modelling these relationships in the database schema. This had the advantage of reducing the number of schema changes to the database as the relationships evolved [31].

The Large Hadron Collider (LHC) [37] is a two-ring superconducting hadron particle accelerator and collider in a 26.7 km tunnel at CERN. It has four large detectors and two smaller ones, each looking for different events. The LHC produces massive amounts of data; the different experiments of the LHC have to cope with gigabytes per second [38]. For example, the ALICE experiment [39] is required to process data at 1.25 GB/s, transforming the data into files and storing it in Permanent Data Storage (PDS) ready for processing [40].
Oracle relational databases are also used to support the LHC, recording data such as detector conditions, calibration and geometry, and are fundamental to supporting the separate data collection processes; these alone have 300 TB of disk storage [41]. A support system, the ALICE Electronic Logbook [42], permits the ALICE project's users to record metadata about their activities. The user can choose from predefined fields and can generate reports, with the ability to attach any type of file.

Some of these approaches and technologies have provided inspiration for the final design. For example, eCrystals produces a single summary page for each sample that has been entered. The data blogging concept used with LabTrove is useful, and is one of the reasons Microsoft SharePoint was considered, because it provides blogs, wikis and message boards useful for groups of users to coordinate. The extra layer above EAV that EAV/CR introduces provides a structure that is not present with plain EAV, and this helps to avoid some of the disadvantages of EAV; a similar approach has been taken with the MDC's database in order to have flexibility but also some structure. Decisions on which technologies to use are discussed further in section 3.
3. Choice of Technology

Before designing the MDC, existing technologies were investigated to determine whether anything could already manage these tasks. A requirements capture exercise with prospective MDC users informed the key requirements; this is discussed in more detail in section 6.

3.1. Relational database systems

Relational databases are the most popular approach for recording structured data. At the most basic level they can store numbers and text compartmentalised into fields. Related fields make up records, and related records are grouped into tables. Data in tables can then be associated through relationships. This model was first proposed in [43]. Some database systems can also store more complex items, such as the contents of a file.

3.1.1. Normalisation

Guidelines exist for designing a database to ensure its efficient operation. There are six sets of criteria, known as normal forms, and applying them is known as normalisation. Following the guidelines can help reduce repeated data and improve performance. Each normal form is cumulative, so a database has to fulfil the criteria of, for example, both the first and second normal forms (1NF and 2NF) in order to be considered 2NF [44].

Materials engineers' research data can take the form of very structured data once it is published, but other data can be very unstructured. For less structured data, there are some other approaches.

3.1.2. Entity-Attribute-Value

Entity-Attribute-Value (EAV) is one alternative to an entirely normalised approach for structuring data in a relational database. EAV, sometimes called name-value pairs, open schema or schema-less design [45], uses a new record for each attribute, as opposed to needing a dedicated field. The flexibility provided by this approach means that whenever a new property is needed it can be added without altering the database schema.
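As a minimal sketch of the EAV pattern (using SQLite, with illustrative table and column names rather than the MDC's actual schema), each property becomes a row instead of a column:

```python
import sqlite3

# Illustrative EAV table: one row per (entity, attribute, value) triple.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE eav (
        entity_id INTEGER NOT NULL,   -- the thing being described
        attribute TEXT NOT NULL,      -- property name
        value     TEXT                -- property value, stored as text
    )
""")

# New properties need no schema change: just insert more rows.
rows = [
    (1, "Material", "Ti-6Al-4V"),
    (1, "Test temperature", "23"),
    (2, "Material", "316L stainless steel"),
]
conn.executemany("INSERT INTO eav VALUES (?, ?, ?)", rows)

# Reassembling an entity's attributes requires pivoting rows back together.
attrs = dict(conn.execute(
    "SELECT attribute, value FROM eav WHERE entity_id = ?", (1,)))
print(attrs)
```

The query at the end illustrates the trade-off discussed below: the database no longer understands the data, so reconstructing an entity is the application's job.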
The advantages of EAV have led to some technologies which use key-value pairs to store data; these are discussed in the next section on structured storage. There are also disadvantages to EAV: it can be slow with large datasets; mandatory values and data types cannot be enforced; referential integrity is difficult to enforce (because the value of an attribute is a string with no constraints such as a foreign key); attribute names are difficult to keep consistent; and complicated queries are often needed to work with the data, because the database's built-in functions are no longer able to understand the data, moving the burden of work from the database engine, and administrator, to the developer. However, in certain circumstances it can be useful. The EAV model can help to keep the design of a database schema simple, and has been used when modelling heterogeneous data, or with databases with constantly changing schemas, when it becomes difficult and impractical to maintain a formal, rigid schema.

3.2. Structured storage

A number of technologies exist that permit storage of documents and tagging of metadata. Enterprise document management systems (which permit storage and tracking of documents) and web content management systems (which let users provide web page content and have the system automatically generate the web page) are examples. Both of these types are driven by the content, where the files or the web site are the basis of the system; the term content repositories is used to describe this approach. These systems take a variety of approaches to storing the data to provide flexibility to their users. Many of them move away from using a standard relational database, which is no longer a good fit—this is called structured storage or sometimes NoSQL [56, 57]. Some of the underlying data storage mechanisms include: document databases, where a document is a collection of fields with associated values and each document's set of fields is permitted to differ; graph databases, which can represent relationships between entities with the ability to record data as key-value pairs on the entities and the relationships; key-value stores, which can look up the value of something based on an identifier; and object databases, which are commonly used with object oriented languages and permit a more natural mapping of data objects held in memory to permanent storage. Examples of some of these systems are shown in table 1.

Table 1: Some of the structured storage systems

CouchDB [46]: Document database (JSON [47]). JSON permits strings, numbers and lists.
MongoDB [48]: Document database (BSON [49]). BSON adds to JSON by also permitting dates and binary data. The Drupal content repository [50] can use MongoDB for the storage of field data.
Neo4J [51, 52]: Graph database. Can store nodes and relationships between nodes. Properties of the nodes and relationships are stored as key-value pairs.
Midgard [53]: Content management framework. Parameters can be entered as a triplet of strings: parameter domain, parameter name and parameter value.
Jackrabbit [54]: Content repository. An implementation of the Content Repository for Java Technology (JCR) API Specification [55], an abstract model for accessing a content repository. Uses a hierarchy of nodes and properties. Nodes can have types, and these types specify the properties that are available for each node and any child nodes.

3.3. Repository systems

Repository systems such as EPrints and DSpace allow storing of publications in much the same way as a document management system. Among their features are security control, upload workflow and templates to ensure correct publication metadata is stored. Standards such as AtomPub [58], OAI-ORE [59], OAI-PMH [60] and SWORD [61] are supported.

The Atom Publishing Protocol, better known as AtomPub or APP, is a protocol based on HTTP [62] for creating or updating a web resource. SWORD builds on this standard to provide a method of depositing into a data repository. Rather than depositing an Atom document as with AtomPub, SWORD sends files. It also extends AtomPub by allowing additional parameters to be specified when depositing. It does not, however, include updating or deleting of resources (although this can be implemented if desired, as with AtomPub). Repositories such as EPrints have implemented the SWORD profile to permit uploading of data from clients supporting the standard.

The Open Archives Initiative Object Reuse and Exchange specification (OAI-ORE) makes it possible to represent a collection of resources, rather than just a single resource as in the case of a URI. OAI-ORE can represent a set of pages that go together, which helps search engines identify groups of web pages rather than having to use advanced heuristics. A resource map describes each resource in this aggregation, expressing the relationships and properties. It can be given in different formats such as RDF/XML and Atom XML [63].

The Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) provides a method of contacting a repository and querying it to find out about its records. Supported formats include Dublin Core [64], a standard way of structuring metadata.
3.4. Microsoft SharePoint

Microsoft SharePoint has some document management features and the advantage that it allows metadata to be stored; it also provides social networking features which might make the MDC more useful to its users. SharePoint provides a framework upon which to build an application. Out of the box it has powerful search capabilities, many of the desired social networking features such as blogs and wikis, user-friendly web page creation, and basic document management and workflow, and it can be extended with plugins known as web parts. Other products with similar functionality are available, but Microsoft SharePoint was the best supported at our institution.
3.5. Discussion
For systems where the data is clearly defined, the standard approach to designing a relational database system makes sense. For example, the results from a standardised tensile test and some information about the material tested would be suitable candidates for a relational database. However, the MDC is required to support the entire materials discipline, and possibly cross-discipline projects, so the types of metadata that will be needed can never be fully known in advance. Designing a rigid structure to cover all types of metadata, where it is not always clear how users will organise their research data or even what form it will take, is an unbounded task.

Many databases impose size limits on the files they store, such as the 2 GB maximum file size in Microsoft SQL Server 2005 and earlier, and the 4 GB limit in MySQL [65, p. 798]. Microsoft SharePoint also imposes this limit, a remnant of its use of SQL Server to store files. This proves problematic for larger data files, such as 3D voxel densitometric datasets. Even for databases without this limitation, file storage in a database introduces a level of complexity which could make recovery in the case of a disaster more difficult.

Many of the solutions provided by structured storage differ from the standard relational database in that they do not impose a rigid schema. They tend to use some form of key-value pair arrangement: documents held in a document database are made up of key-value data, and the properties stored in graph databases take the form of keys and values.

EAV is lightweight and can easily be converted to a key-value approach, as the attribute name can be used as the key to find the value. This maps nicely onto technologies such as Amazon SimpleDB and DynamoDB [66, 67], Microsoft Azure Tables [68] and Google AppEngine Datastore [69], and permits scaling to massive levels. Much of the metadata for this project can be stored with attribute and value fields, but to fully support our requirements we need to enhance this by adding extra fields and tables, making our final solution a relational database approach with an extended EAV table for storing the metadata. More details are given in section 4.2. The JCR model was perceived to be very flexible, but it was felt that EAV was a better fit for the MDC's requirements. Other technologies, such as Midgard, seemed appropriate too and could have been used for the MDC's metadata, keeping the model of separate file and metadata storage the same.

3.6. Summary

We have chosen a relational database to support the storage of metadata, with user supplied dataset metadata being stored in an EAV-inspired table. We need to store a limited amount of relational data to support the application, so using EAV in a relational database gives the simplest approach, in preference to a non-relational (e.g. NoSQL) approach. In addition, while we acknowledge that the model presented could be created using a NoSQL system, the Materials Data Centre is to be deployed within the University of Southampton and we have used existing infrastructure where possible.

We have chosen Microsoft SharePoint as the interface. The University of Southampton, like many other institutions, has central IT infrastructure using this product, so it is a good choice of interface for us. The decision not to use SharePoint for storing files means there is no significant advantage to choosing Microsoft SharePoint over other technologies—the interface could just as easily have been built in other systems. Other systems that could be considered have been discussed in section 3.5, as well as elsewhere in section 3.

4. Architecture

In this section, we present a model which uses a novel combination of a flexible, user specifiable and compoundable metadata database connected to a generic data store, in this case a file system. The system is aimed at capturing data at a research group level and encouraging sharing and discovery. Once data and metadata have been captured in such a system, it becomes much simpler to later preserve them in a data archive.

4.1. Materials Data Centre Overview

The final architecture of the Materials Data Centre consists of four components: a file system which permits users to store files, a metadata database that permits users to define their own data structures, a file system monitoring service that synchronises changes in the file system with the database, and an interface to manage datasets, their metadata and their relationships to other datasets. Figure 2 illustrates the overall system.

[Figure 2: Materials Data Centre data flow. Users copy files into the file system; a file system monitor synchronises between the file system and the database; users view files and add metadata through a web interface, or via a REST web service from an external system such as EPrints with EP2DC.]

Our implementation of the metadata database is hosted on a Microsoft SQL Server 2008 SP1 cluster managed by the University of Southampton's central IT service. The file monitoring service runs on a Microsoft Windows Server 2008 R2 virtual machine with 4 GB of RAM, using an NTFS file system for data storage. The Microsoft SharePoint site
and accompanying database is hosted on the same machine. The virtual machine itself is hosted using VMware ESX server by the central IT service. The database creation scripts, and code for the interface tools and file system monitor, can be downloaded from the project's CodePlex repository [70].

[Figure 3: Database schema model used in the Materials Data Centre, showing the Datasets, DataFiles, DataFileTypes, DatasetParameters, DatasetParameterLinks, DatasetLinks, DatasetCollections, DatasetCollectionDatasets, DatasetTypes and DatasetTypeDatasetTemplateMap tables, grouped into dataset data files, dataset metadata, dataset linking and metadata templating.]

4.2. Database Model for Datasets, Data Files and Their Metadata

The MDC is primarily a user driven system. Users need the ability to add parameters as they see fit, without concerning themselves with the design of the database; they just want to be able to create their list of parameters and the relationships between them. For this reason, and given the success of EAV-based databases in user driven systems in other projects, the flexibility of the EAV approach makes a lot of sense.

The database model shown in Figure 3 holds five types of data: datasets; data files; dataset metadata (the EAV table); dataset relationships; and metadata templates. The database and file system are kept synchronised by a file system monitor. When folders are created on disk, a dataset record representing the folder is created in the database. Folders can be used to store files, and these files are represented as data file records tied to each dataset. Datasets can be tagged with metadata, and relationships between datasets can be defined. Templates are used for holding predefined metadata which can then be used for populating a dataset's metadata.

Table 2: Fields in a parameter

Field              EAV        Description
DatasetID          Entity     ID of the dataset the metadata is linked to
Name               Attribute  Attribute name
Value              Value      Attribute value
Unit               Extension  Unit for the attribute value, for application layer validation
Type               Extension  Data type for the value, for application layer validation
IsCompulsory       Extension  As the Value field is nullable to permit optional parameters, this boolean field gives the ability for error checking
SourceParameterID  n/a        Optional parameter ID. If the parameter was copied from a template, this field records the original parameter's ID
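A minimal sketch of the DatasetParameters structure summarised in table 2, using SQLite; the column names follow the paper's schema, but the column types, the surrogate key and the validation routine are assumptions for illustration:

```python
import sqlite3

# Sketch of the extended EAV table from Table 2 (types and key are assumed).
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE DatasetParameters (
        DatasetParameterID INTEGER PRIMARY KEY,  -- assumed surrogate key
        DatasetID          INTEGER NOT NULL,     -- EAV entity
        Name               TEXT NOT NULL,        -- EAV attribute
        Value              TEXT,                 -- nullable: optional parameters
        Unit               TEXT,                 -- extension: unit of measurement
        Type               TEXT,                 -- extension: app-layer validation
        IsCompulsory       INTEGER NOT NULL DEFAULT 0,
        SourceParameterID  INTEGER               -- set when copied from a template
    )
""")
conn.execute(
    "INSERT INTO DatasetParameters "
    "(DatasetID, Name, Value, Unit, Type, IsCompulsory) "
    "VALUES (8, 'Test temperature', '23', 'degC', 'float', 1)")

# Application-layer validation: a compulsory parameter must have a value,
# and a declared type must parse.
def validate(row):
    name, value, type_, compulsory = row
    if compulsory and value is None:
        return False
    if type_ == "float":
        try:
            float(value)
        except (TypeError, ValueError):
            return False
    return True

rows = conn.execute(
    "SELECT Name, Value, Type, IsCompulsory FROM DatasetParameters").fetchall()
ok = all(validate(r) for r in rows)
print(ok)
```

Because Type and IsCompulsory live in ordinary columns, this validation happens in the application layer, exactly as the text describes; the database itself cannot enforce it.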
4.2.1. Datasets and Data Files

The dataset is the principal data construct in the system and is used to collect related data files and metadata. Its equivalent in the file system is a folder; during the course of the project, we have also used the terms 'container' and 'experiment' to mean the same thing. In the database, the Datasets table holds the name of the folder and an optional description. Information about the files in the folder is recorded in the DataFiles table. To give an example from materials engineering, a dataset could be a collection of files relating to an experiment, such as a fatigue test, together with associated metadata such as the last time the equipment was calibrated.

4.2.2. Dataset and Data File Typing

With the DatasetTypes and DataFileTypes tables, it is possible to type each dataset and data file, which helps the system identify related data and enables extra features such as the parameter templates described in section 4.2.4.

4.2.3. Hierarchical Dataset Metadata

The DatasetParameters table allows metadata about the dataset to be recorded using an extended EAV model. As with the standard EAV model, there is an entity, which is the dataset the parameter is linked to; an attribute, which is the parameter's name; and a parameter value. In addition to the EAV model, there are three other fields: Unit, Type and IsCompulsory. Unit permits recording the unit if the parameter is a measurement, while Type and IsCompulsory permit application layer validation to ensure that required values are not omitted and are of the correct format. The fields are summarised in table 2. In order for users to create more complex data structures, the model is extended further with the addition of the DatasetParameterLinks table to permit nesting of parameters. This enables powerful tagging capabilities.

4.2.4. Dataset Metadata Templates

Templates permit parameters for different dataset types to be predefined. These parameters are then copied into the dataset's list of parameters and, because the parameter copy is linked to the original, changes to the original or copy can be identified when necessary. Templates are stored in the Datasets table because they follow the same structure as an ordinary dataset. Template datasets have a special type that identifies them in the system as templates, but copying of parameters from a template to a dataset, from dataset to dataset or from template to template is possible because the system considers them equivalent. The DatasetTypeDatasetTemplateMap table, in combination with the DatasetTypes table, links templates to types of dataset.
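The template copy mechanism can be sketched as follows; the in-memory dictionaries stand in for database rows, and the field names other than SourceParameterID are illustrative:

```python
# Copy template parameters into a dataset, recording each copy's
# SourceParameterID so later changes to template or copy can be detected.
template_params = [
    {"ParameterID": 101, "Name": "Material", "Value": None},
    {"ParameterID": 102, "Name": "Load ratio", "Value": None},
]

def copy_parameters(params, next_id):
    copies = []
    for p in params:
        copy = dict(p)
        copy["ParameterID"] = next_id
        copy["SourceParameterID"] = p["ParameterID"]  # link back to original
        copies.append(copy)
        next_id += 1
    return copies

dataset_params = copy_parameters(template_params, next_id=201)

# The link allows divergence from the template to be identified later.
originals = {t["ParameterID"]: t for t in template_params}
diverged = [c["Name"] for c in dataset_params
            if c["Value"] != originals[c["SourceParameterID"]]["Value"]]
print(diverged)
```

Because the copy keeps only a pointer to its source, templates, datasets and copies between them all share one structure, matching the paper's observation that the system treats them as equivalent.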
4.3. Data Store In our implementation of this model, we used a Microsoft Windows file server and created a Samba share for users to upload files to a file system formatted as NTFS. The choice of file system could greatly affect the flexibility of the system; for example, file size could be one issue if the incorrect file system was chosen for the data store. By changing the file system, it is possible to introduce additional features such as file replication without affecting the capabilities of the file system monitor, and by making the entry point to the system a file share, it is possible to use other tools that can write to file shares and the MDC will recognise them as new datasets. Conversely, the metadata in the database can be used to identify datasets that can then be read back via the file system. We have already discussed how the database records metadata related to files; in the following section, we discuss how a file synchronisation service keeps the metadata in the database up to date.
4.2.5. Dataset Relationships There are three ways of linking datasets. The first is through a direct link in the DatasetLinks table which associates one dataset with another. The second option is to create a new collection in the DatasetCollections table and add a list of related datasets to the collection, stored in the DatasetCollectionDatasets table. Finally, the value of a parameter might refer to another dataset by its identifier and the user interface will recognise this and render it appropriately. For example, using the string ‘Dataset 8’ inside a parameter would produce a link to the dataset with the ID of 8. 4.2.6. Database Tables There follows a summary of the tables in the database. The relationships between them are shown in Figure 3:
4.4. Synchronisation between the file system and database With files stored in the file system on a server, and metadata about those files in a database, we introduced a synchronisation service to update the database to ensure that users can see their files in the MDC’s interface almost immediately, that associated dataset and file metadata can be kept even after the original files have been deleted, and even if the file share is not available associated metadata can still be viewed in the MDC’s interface. A file system monitor identifies when folders are created and enters a matching record in the Datasets table of the database. It also detects file adds, renames, deletes and updates the DataItems table. A scheduled task ensures that missed file system events (such as when the monitoring service is not running) are also picked up. Using the file’s time stamps, SHA1 hash, name and path, the service can identify missed file adds, deletes, copies and renames. Locking and contention is tackled through exception handling in the file system monitor so this is minimally intrusive to the user.
Datasets: A collection of data files and metadata. This equates to the entity in the EAV model.
DataFiles: A list of the data files known to the system.
DatasetParameters: Each parameter can contain Name, Value, Unit and Type fields for tagging a dataset with metadata. The IsCompulsory field controls whether a value is required in the parameter's Value field.
DatasetParameterLinks: Permits nesting of metadata.
DatasetTypes: The types of dataset known to the system.
DataFileTypes: The types of data file known to the system.
DatasetTypeDatasetTemplateMap: A list of datasets that can be used as templates for datasets of a specified type.
DatasetLinks: Contains the IDs of two datasets, providing a direct link between them.
DatasetCollections: Collections of datasets to group datasets together.
DatasetCollectionDatasets: The datasets in a collection.

4.4.1. Tracking folder changes while offline

The file system monitor communicates with the database to keep folder and data file names synchronised. When it starts up, it reads the dataset's ID from an XML file stored in a hidden directory (called '.dataset') in each folder to identify the exact record to read from the database, in preference to relying on the name of the directory in the file system, in case it was renamed while offline. It uses the dataset's ID to look up the recorded name and flag a folder rename. The following situations can be handled:

1. Folder renamed or moved: Compare the folder name with the name stored in the database.
2. Copied folder: If the XML file is loaded and the name in the database is different, the folder may be a copy rather than a move: if the old folder still exists, or an old folder holds an XML file with the same ID, the new folder is considered to be a copy; its XML file is renamed and a new one is created with a new dataset ID.
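The rename-versus-copy decision above can be sketched as follows. This is an illustrative sketch only: the XML layout of the hidden '.dataset' file and all helper and field names are assumptions, not the MDC's actual implementation.

```python
import os
import xml.etree.ElementTree as ET

def read_dataset_id(folder):
    """Read the dataset ID from the hidden XML file inside a folder.
    The file name and 'id' attribute are assumed for illustration."""
    tree = ET.parse(os.path.join(folder, ".dataset", "dataset.xml"))
    return int(tree.getroot().get("id"))

def classify_folder(folder_name, dataset_id, db_name_by_id, live_ids):
    """Decide whether a folder is unchanged, renamed, or a copy.

    db_name_by_id maps dataset ID -> folder name recorded in the database;
    live_ids is the set of IDs whose original folders still exist on disk.
    """
    if db_name_by_id.get(dataset_id) == folder_name:
        return "unchanged"
    if dataset_id in live_ids:
        # The original folder still exists, so this folder is a copy:
        # it needs a fresh XML file and a new dataset ID.
        return "copy"
    return "renamed"
```

A folder whose recorded name matches is left alone; one whose ID is still live elsewhere is treated as a copy, exactly as in case 2 above.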
Figure 4: Plugins at the data file level, including the CT data browser plugin, a metadata import plugin (from an XTEKCT file) and a thumbnail plugin to set an image as the thumbnail of the dataset.
4.4.2. Tracking file changes while offline

The file monitoring service cannot be running all of the time, so on start-up, and at regular intervals, all files and folders are checked for changes in case any were missed. The procedure we use is as follows:
4.5.1. Metadata entry, copying and templating

Once users have copied a dataset into the MDC, they can use the web interface to record information about the dataset, such as adding a description, a type and metadata. To make metadata entry easier, parameters can be copied between datasets, either from another dataset or from a template linked to the dataset's type.
1. Check for files updated: timestamps changed.
2. Check for unchanged files: file name, file size and SHA1 hash all the same.
3. Check for files renamed: file name changed only.
4. Check for files updated: file name the same, but file size, SHA1 hash or time stamps different.
5. Check for files deleted: any files still in the database that do not match the above criteria are considered deleted.
6. Check for files added: any files on the file system that have not been matched in the database using the criteria above are considered new.
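The steps above can be sketched as a single reconciliation pass. This is a minimal sketch under an assumed record shape: each file record is a dict with 'size' and 'sha1' keys, keyed by file name, rather than the MDC's actual database rows.

```python
import hashlib

def sha1_of(path):
    """Compute the SHA1 hash of a file, reading it in chunks."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()

def reconcile(db_files, disk_files):
    """Classify disk files against database records.

    db_files and disk_files map name -> {'size': int, 'sha1': str}.
    Returns a dict of lists: unchanged, renamed, updated, deleted, added.
    """
    result = {"unchanged": [], "renamed": [], "updated": [],
              "deleted": [], "added": []}
    unmatched_db = dict(db_files)
    unmatched_disk = dict(disk_files)

    # Unchanged: name, size and hash all agree.
    for name in list(unmatched_disk):
        rec = unmatched_db.get(name)
        if rec and rec == unmatched_disk[name]:
            result["unchanged"].append(name)
            del unmatched_db[name], unmatched_disk[name]

    # Renamed: same size and hash found under a different name.
    for name in list(unmatched_disk):
        for old_name, rec in list(unmatched_db.items()):
            if rec == unmatched_disk[name]:
                result["renamed"].append((old_name, name))
                del unmatched_db[old_name], unmatched_disk[name]
                break

    # Updated: same name, but different size or hash.
    for name in list(unmatched_disk):
        if name in unmatched_db:
            result["updated"].append(name)
            del unmatched_db[name], unmatched_disk[name]

    # Anything left in the database is deleted; anything left on disk is new.
    result["deleted"] = sorted(unmatched_db)
    result["added"] = sorted(unmatched_disk)
    return result
```

The ordering matters: exact matches are removed first so that a rename is never misreported as a delete plus an add.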
4.5.2. Dataset relationships and recommendations

As discussed in section 4.2.5, users can create relationships between datasets, which helps them to manage their data. The linking and typing of datasets also lets the system recommend related work to the user, which could significantly improve the dissemination of datasets. By looking at the datasets that a dataset links to, the system can suggest other datasets that also link to those targets. Using a materials engineering example, if an engineer links an experiment to a material information sheet, the system suggests other experiments that also used that material. It can also suggest datasets of the same type (such as other fatigue tests).
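The shared-target recommendation idea can be sketched over pairs from the DatasetLinks table. The (from_id, to_id) representation below is an assumption for illustration, not the MDC's query code.

```python
from collections import Counter

def recommend(dataset_id, links, count=5):
    """Suggest datasets that link to the same targets as dataset_id.

    links is a list of (from_id, to_id) pairs, as in the DatasetLinks table.
    Datasets sharing more link targets score higher.
    """
    targets = {b for a, b in links if a == dataset_id}
    scores = Counter()
    for a, b in links:
        if a != dataset_id and b in targets:
            scores[a] += 1  # another dataset sharing a common target
    return [ds for ds, _ in scores.most_common(count)]

# Example: experiments 1, 2 and 3 all link to material sheet 10, so
# datasets 2 and 3 are recommended alongside dataset 1.
links = [(1, 10), (2, 10), (3, 10), (3, 11), (4, 99)]
```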
4.4.3. Tracking file changes while online

On a file system event, we trigger a check of just the dataset that changed. Rather than checking only the file named in the event, we check the entire dataset using the same criteria as in the offline case. This ensures we do not miss file renames if the generated event does not reflect exactly what happened, or if other events were missed.
4.5.3. Plugins

There are two types of plugin. The first is a dataset-level plugin, such as the report shown in Figure 6 and Figure 7. The second is a file-level plugin, such as the CT data browser shown in Figure 8 or a metadata import plugin. File-level plugins are activated by selecting them from a dropdown list next to each file's name, as shown in Figure 4. The interface features specific to materials engineers are discussed further in the use cases in section 6. In the next section, we discuss some of the tests performed to assess the reliability and performance of the system.
4.4.4. Ensuring consistency

To ensure the database remains consistent if events are missed, we trigger at regular intervals a complete check of all folders on disk for comparison against the database, as well as a check of the files and datasets in the database against what is on the file system.

4.4.5. Reducing the time for checking large datasets

Generating a SHA1 hash can be expensive for large datasets such as a CT scan. As we rescan the entire dataset every time any file changes, and regularly recheck all files in all datasets, we only regenerate the SHA1 hash if any of the following have changed: file name, last write time, creation time or file size, or if the stored SHA1 hash was generated earlier than the last write time. If all of the criteria match, we use the SHA1 hash stored in the database. This drastically improves the speed of checking a large dataset.
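The short-circuit in section 4.4.5 can be sketched as follows; the record field names are assumptions, not the MDC's actual column names.

```python
import hashlib
import os

def file_hash(path, record):
    """Return the file's SHA1, recomputing only when metadata has changed.

    record holds values captured at the last scan: name, size, mtime,
    ctime, sha1 and sha1_time (when the stored hash was generated).
    """
    st = os.stat(path)
    unchanged = (
        record.get("name") == os.path.basename(path)
        and record.get("size") == st.st_size
        and record.get("mtime") == st.st_mtime
        and record.get("ctime") == st.st_ctime
        and record.get("sha1_time", 0) >= st.st_mtime
    )
    if unchanged:
        return record["sha1"]  # cheap path: trust the stored hash
    # Expensive path: re-read the file and recompute the hash in chunks.
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()
```

For a multi-gigabyte CT scan, the cheap path turns a full re-read into a single `stat` call per file.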
5. Performance and Reliability Testing

5.1. Tests of the synchronisation service

To test the synchronisation service we set up a number of tests, as shown in table 3. We were looking to prove that the service recorded the correct hash in its database and that the synchronisation service did not affect the data being stored. This was achieved using a set of bash scripts on an Apple iMac running Mac OS X 10.6.8, connected to the MDC via Samba. The scripts created folders and files containing random data and checked the database records to
4.5. Materials Data Centre Interface

Some of the features of the MDC's interface are discussed in this section.
increase in the time taken to read data when the file synchronisation service is monitoring a share. The overhead of hashing files and regularly checking the data does introduce a performance penalty. This data was calculated using one run of test number 5 in table 3, which generates one thousand 1 MB files containing random data. Note that this uses the test scripts, which pipe all data written and read through openssl to generate a SHA1 hash. We used the median rather than the mean because there were some outlying data points which skewed the overall results (on both the MDC and normal share tests). For transparency, we also show all the data we collected, as well as the moving median. The prototype system and the testing system are not in the same building and are on shared networks; bottlenecks on the network, the virtual machine host, or even background tasks on the test client could be a factor.
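The moving median used to smooth these results can be sketched as below. This is a simple sliding-window implementation for illustration, not the code used to produce Figure 5.

```python
def moving_median(values, w):
    """Median over a sliding window of width w (shorter at the start).

    Unlike a moving mean, a moving median is barely affected by the
    occasional outlying data point."""
    out = []
    for i in range(len(values)):
        window = sorted(values[max(0, i - w + 1): i + 1])
        n = len(window)
        mid = n // 2
        if n % 2:
            out.append(window[mid])
        else:
            out.append((window[mid - 1] + window[mid]) / 2.0)
    return out
```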
Figure 5: Performance results. Time taken (ms) to read and write a 1 MB file from the MDC share and a normal share, shown with moving medians (w = 100) over 1000 test runs.
5.3. Testing the metadata database

Parameters are nested by creating a record in the database linking two parameters. The depth of nesting is limited only by the number of records that can be created in a table, but in practice will depend on the method of interfacing with the MDC. To discover the limitations of the metadata component of the MDC, we created nested parameters up to 200 levels deep. Generation of the parameter editing web page remained under 5 seconds for up to 100 levels, with the simplest datasets taking under a second; at 200 levels, generation of the parameter editing page took 19 seconds. The browsing experience in Safari 5.0.5, Firefox 3.6.18 and Internet Explorer 8 also suffered, as these browsers did not cope well with so many nested tables. For our users, metadata depth is not expected to exceed five levels, but optimisation steps will have to be taken if requirements change.
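The nesting structure can be illustrated with (parent, child) link pairs mirroring the DatasetParameterLinks table. The sketch below only measures the depth implied by a set of links; the pair representation is an assumption for illustration.

```python
def max_depth(links):
    """Return the deepest nesting level implied by (parent, child) links."""
    children = {}
    for parent, child in links:
        children.setdefault(parent, []).append(child)
    # Roots are parents that never appear as a child.
    roots = set(children) - {c for kids in children.values() for c in kids}

    def depth(node):
        kids = children.get(node, [])
        return 1 + max((depth(k) for k in kids), default=0)

    return max((depth(r) for r in roots), default=0)

# A chain of 200 links nests one parameter inside another 200 times;
# chains of this depth are what made page generation slow in the tests.
chain = [(i, i + 1) for i in range(200)]
```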
verify the synchronisation service had operated correctly. Some statistics from the tests are shown in table 4. We created a file containing random data, hashing the data at the same time as it was generated, to verify that the MDC generates the correct metadata and does not affect the file in the file share. The command used to generate the random data was as follows:

dd if=/dev/urandom bs=1m \
    count=$TESTFILESIZE 2>/dev/null \
    | tee $TESTFILEPATH | openssl sha1

By piping the output through tee, the data could be written to the MDC at the same time as the SHA1 hash was generated, ensuring the data remained intact during the transfer. We used bsqldb (from version 0.92.79 of the FreeTDS tools, installed from MacPorts) to query the SQL Server database to find the record in the database, sleeping while the synchronisation service generated the hash. Finally, we checked the hash of the file in the MDC file share to ensure the contents of the file were written correctly and were not altered by the synchronisation service or during transfer to the server. By running the tests in table 3 we show that the service is reliable and can cope with differing sizes of datasets and files.
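The dd | tee | openssl pipeline can also be expressed as a single-pass write-and-hash, which is what the test scripts rely on: the client's hash of the outgoing bytes is later compared with the hash the synchronisation service records. This is an illustrative Python analogue, not the project's test code.

```python
import hashlib
import os

def write_and_hash(path, chunks):
    """Write chunks of bytes to path, returning the SHA1 of what was written.

    The same bytes go to the (monitored) file share and into the hash,
    mirroring the dd | tee | openssl sha1 pipeline."""
    h = hashlib.sha1()
    with open(path, "wb") as f:
        for chunk in chunks:
            f.write(chunk)   # written to the share
            h.update(chunk)  # hashed as it streams past
    return h.hexdigest()
```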
6. Use Cases

In this section, we demonstrate how the model discussed in section 4 has been used to manage materials data. We show how information about the nickel-based N18 material has been stored, then a tensile test of material FV566, followed by a CT scan of a sample of another material. We then discuss the EP2DC project, which indicates that the system has potential for interoperability with other established systems.
5.2. Performance of the synchronisation service

Figure 5 shows the effect of installing the file synchronisation service on a server, using a simple moving median (w = 100) of 1000 test runs. Without the file synchronisation service, the median time to write a 1 MB file was 213 ms, and 56 ms to read it back (calculated over all 1000 test results). With the MDC service enabled, the median time to write a 1 MB file was 263 ms, and 59 ms to read it back, suggesting an increase of 23% in the time taken to write data and a 5%
6.1. Material information sheets

Information about the materials used by an engineer is frequently referenced during the course of their research, for example when performing a test such as the tensile test shown in section 6.2. To demonstrate how the model has been used to store information about materials, we show how material N18 from [71] was recorded in the system.
Table 3: Synchronisation service tests

Test description                      Runs    Files    Errors
5 files: 1, 5, 10, 20 and 100 MB      200     1000     0
5 files of a random size              …       …        …

… 10)
{
    Console.WriteLine("Material '{0}' has {1} {2} chromium.",
        ds.Name, chromiumParameter.Value, chromiumParameter.Unit);
}
}
}
Listing 4: Find datasets with a specific property and perform some action on them (Python)
with MaterialsMetadataContext(connection_string) as dbcontext:
    # Find datasets with a specific property
    datasets = dbcontext.GetExperimentsWithMatchingParameter("Crystal system", "triclinic")
    # Convert matching datasets to an array and loop around them
    for ds in Enumerable.ToArray[Experiment](datasets):
        # Check if we've visited this dataset before
        visitedparameter = ds.LookupParameter("Test script executed date")
        if visitedparameter is None:
            # Perform some action on the dataset
            print ds.ID, ds.Name, "-", ds.GetUNCPath(dbcontext), ds.GetDetailsWebPage(dbcontext)
            # Record that we have visited this dataset to skip it next time
            ds.SaveParameterValue(dbcontext, "Test script executed date", time.strftime("%c"), create=True)
    dbcontext.SaveChanges()
Listing 5: Find recommended datasets related to dataset 16, list some information and create a link to them (Python)
with MaterialsMetadataContext(connection_string) as dbcontext:
    tensile_test_dataset = dbcontext.GetExperimentByID(16)
    related_datasets = dbcontext.GetRelatedExperiments(tensile_test_dataset.ID, count=5, findDeleted=False)
    print "Related datasets:"
    for ds in related_datasets:
        # Print dataset information, the dataset's UNC path and the dataset's web page
        print ds.ID, ds.Name, "-", ds.GetUNCPath(dbcontext), ds.GetDetailsWebPage(dbcontext)
        # Link dataset 16 to the recommended dataset
        dbcontext.CreateExperimentLink(tensile_test_dataset.ID, ds.ID, saveChanges=True)
Listing 6: Using LINQ instead of the API to search for data in the database (C#)
using (MaterialsMetadataContext dbcontext = new MaterialsMetadataContext(connection_string))
{
    int ctscanType = 2;
    var datafiles =
        // Retrieve datasets filtered by type
        dbcontext.Experiments.Where(ds => ds.ExperimentType.ID == ctscanType)
        // Retrieve all files in the matching datasets
        .SelectMany(ex => ex.ExperimentDataFiles)
        // Filter the files by name
        .Where(df => df.Filename.Contains(".xtekct"));

    foreach (ExperimentDataFile df in datafiles)
    {
        // Load dataset information from the database
        df.ExperimentReference.Load();
        // Parse file (for brevity, we assume the file exists and is valid).
        FileParser_XTEKCT filedata = new FileParser_XTEKCT(df.GetFullPath());
        foreach (KeyValuePair<string, string> pair in filedata.Properties)
        {
            // Create parameter
            ExperimentParameter newParam = new ExperimentParameter() { Name = pair.Key, Value = pair.Value };
            dbcontext.AddExperimentParameterToExperiment(df.Experiment.ID, newParam);
        }
        dbcontext.SaveChanges();
    }
}
Table 6: Encoding larger tables using the MDC's metadata database

(a) Using parameters to encode a more complex table

Parameter  Name                 Value
1          Column headings      —
1.1        Element              —
1.2        Wt %                 —
1.3        Element description  —
2          Row                  —
2.1        Element              Co
2.2        Wt %                 15.4
2.3        Element description  Cobalt
3          Row                  —
3.1        Element              Cr
3.2        Wt %                 11.1
3.3        Element description  Chromium

(b) The resultant table

Element  Wt %   Element description
Co       15.4   Cobalt
Cr       11.1   Chromium

8. Discussion

The model of file store, metadata database and example interface hosted in Microsoft SharePoint has been built as an output of this project and tested with a number of use cases: a material information sheet, a tensile test and 3D densitometry data. This section evaluates the system, looking at the metadata database, data store and interface components.

8.1. Feedback from user testing

The MDC system is being piloted by researchers in the materials engineering department. Most of the feedback involved requests for new reporting features or improvements to the user interface, such as:

• Reordering of metadata entries, especially on reports. The ability to reorder metadata entries will involve a modification to the database schema to record the parameters' order.

• Browsing datasets with large numbers of files or parameters can be slow. Improvements such as hiding files, browsing files by subdirectory (instead of listing all files in a dataset), and optimisation of the prototype code in some components (such as not listing all parameters, files, plugins and all other dataset information on the same page) would improve performance.

• Reordering of figures in reports. The reporting plugin could be adapted to allow this feature.

• The ability to create more complex tables in reports. One of the reports includes a simple two-row table consisting of one row of 'Name' entries and one row of 'Value' entries. Some users wanted more complex tables with multiple rows and multiple columns. A new interface component could allow users to encode more complex tables using parameters to represent the cells in the table; no changes to the database schema would be required. An example of how this might be encoded is shown in tables 6a and 6b.

The ability to incorporate features such as those requested by users in this section, and to adapt the system in response to feedback, demonstrates the flexibility of the architecture.

8.2. Metadata and data storage

Storing the dataset metadata in the way we have has permitted the wide-ranging data types required in the MDC to be supported with a very simple schema. Incidentally, we have also successfully employed the metadata storage approach to store configuration data for the MDC system, using templates to ensure the configuration datasets are created correctly.

Due to this very simple database structure, filtering and searching should be scalable even with large datasets, as complex queries with many joins are not involved; indices will assist with this. In some cases the EAV approach may introduce a performance penalty, but for a user-driven data repository, where searches are generally simple, any delays should not become a disadvantage. For intensive data analysis, a database structure tailored to the application is likely to give better performance. Modifications to the database may be necessary in order to improve the toolset of the MDC, such as the ordering of metadata parameters requested by users of the prototype.

The approach of storing the data in a file system provides the ability to seamlessly take advantage of developments in storage technology as they become available, such as cloud storage and improved local file systems. There is an impact on the performance of using a share monitored by the MDC file synchronisation service (5% on reading, 23% on writing), caused by the additional overhead of collecting metadata, including hashes of incoming data, as discussed in section 5.

Some of the disadvantages of using an EAV-inspired approach were discussed in section 3.1.2, including the inability to use mandatory values and data types and the difficulty of ensuring attribute names are used consistently. These issues are partially mitigated by a number of features discussed in section 4. The templating feature allows predefined metadata sets to be copied onto datasets, ensuring metadata is completed consistently, and by recording where parameters were copied from, it is possible to repair or highlight names that have been mistakenly changed. The additional fields we have included in the parameter table allow us to support mandatory values and data types at the interface level. However, it is important to note that the user is in control of their metadata, as this is a system for user-driven data management, and in early prototypes user testing showed that enforcing mandatory values or types frustrated users. The interface therefore ensures users are aware of any discrepancies but does not enforce anything, so they can always produce the reports they require or record the metadata needed for their situation, even at the cost of consistency. The effect of this is still unknown. User feedback has so far shown that users are content with the metadata features provided, other than the suggested improvements already discussed in section 8.1. It is postulated that parameter names will remain consistent for users who use the templates, but the system is flexible enough to allow the user to deviate from a template if required.
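The encoding in table 6a can be decoded mechanically back into the table of 6b. This sketch assumes the parameters arrive in numbering order as (number, name, value) tuples; the function name is ours, not part of the MDC.

```python
def decode_table(parameters):
    """Turn numbered parameters into (headers, rows), as in tables 6a/6b."""
    headers, rows = [], []
    current = None  # None while reading headers, else the row being filled
    for _, name, value in parameters:
        if name == "Column headings":
            current = None          # subsequent names are header cells
        elif name == "Row":
            current = []            # start a new table row
            rows.append(current)
        elif current is None:
            headers.append(name)    # header cells carry the name, no value
        else:
            current.append(value)   # data cells carry the value
    return headers, rows

# The parameter rows of table 6a:
params = [
    ("1", "Column headings", None),
    ("1.1", "Element", None),
    ("1.2", "Wt %", None),
    ("1.3", "Element description", None),
    ("2", "Row", None),
    ("2.1", "Element", "Co"),
    ("2.2", "Wt %", "15.4"),
    ("2.3", "Element description", "Cobalt"),
    ("3", "Row", None),
    ("3.1", "Element", "Cr"),
    ("3.2", "Wt %", "11.1"),
    ("3.3", "Element description", "Chromium"),
]
```

Decoding `params` reproduces the resultant table 6b, confirming that the cell-per-parameter encoding loses no structure.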
The linking and typing of datasets helps the system to recommend related work to the user, which has considerable potential to improve the dissemination of datasets. Allowing users to upload datasets via a file share simplifies dataset upload and has huge potential for capturing datasets in materials engineering and at the research-group level, making it easier to transfer them to a dataset archive at a later date. The approach taken is suitable for the storage of any data files and metadata that can be recorded against a set of files. We have shown this flexibility with a number of use cases, suggesting that the architecture of the MDC may be flexible enough to be used in disciplines other than materials engineering.

10. Future work

There are a number of advancements that could be made to the MDC, including the linking of multiple repositories and the use of replicated files to allow the infrastructure to scale and to improve the reliability of the system. Replicating files between data centres could be achieved with minimal modifications to the system, as the synchronisation service would pick up the incoming files. Possible approaches to the replication of the metadata include using the XML file stored with the files to transmit the metadata with the replicated files, or extending the service layer created for the EP2DC project so that a remote data centre can download the missing database records when it receives a file via replication. Adding direct support for protocols such as SWORD, OAI-ORE, OAI-PMH and OData would be sensible future work to improve the interoperability of the system. We have also begun investigating an RDF representation of the data and tools to increase the power of the queries that can be performed. Feedback on the prototype system from our materials users has generally concerned additional tools, new reports and ideas for improving the usability of the interface.
This paper discussed the architecture of the JISC-funded Materials Data Centre (MDC), a repository for capturing and managing data produced by materials engineers. The generic nature of the design has led to a new name for the project: the Heterogeneous Data Centre (HDC). Future work with the HDC will look at hosting medical data while continuing to support materials research.
8.3. Materials Data Centre Interface

The interface components allow users to manage the metadata surrounding their datasets, produce reports, execute plugins and create relationships between datasets, both for management and to obtain dataset recommendations. As a consequence of using Microsoft SharePoint as the underlying interface technology, users are also able to produce their own landing pages for their data by using the provided MDC web parts, by linking to an automatic report or, for advanced users, by creating a custom web part that presents the data in a meaningful way. Users can also manage their own project sites, including the ability to use blogs and wikis.

9. Conclusions

The combination of a flexible database for metadata and a file system for storing images, raw data and documents has been able to cope with many different types of materials data. Decoupling the file storage and metadata from the technology chosen for the interface introduces much more flexibility into the system, while the addition of a web service provides yet another entry point. We have shown the use of the metadata database and file system with a number of use cases and an example interface developed for Microsoft SharePoint, taking advantage of features such as user-created wikis. Users are not limited to the SharePoint interface, as shown by the EP2DC project, which produced a web service layer that permitted EPrints to upload data to and communicate with the MDC. Furthermore, any client that can connect to the file system is able to upload data, and the web service can be used by systems other than EPrints, such as a web browser via a POST request to the REST endpoint.
Acknowledgements

The MDC project team gratefully acknowledge funding from JISC. We appreciate the support given to us by Microsoft, Microsoft Research Connections and the EPrints team. We thank Kath Soady, Thomas Mbuya and Eleanor Hawkins for feedback and prototype testing, as well as for the provision of materials engineering data.

References

[1] T. Hey, A. Trefethen, The data deluge: An e-Science perspective, in: Grid Computing — Making the Global Infrastructure a Reality, Wiley and Sons, 2003, pp. 809–824.
[2] S. Johnston, H. Fangohr, S. J. Cox, Managing Large Volumes of Distributed Scientific Data, in: M. Bubak, G. D. Albada, J. Dongarra, P. M. Sloot (Eds.), Computational Science – ICCS 2008, Vol. 5103 of Lecture Notes in Computer Science, Springer Berlin Heidelberg, 2008, pp. 339–348. doi:10.1007/978-3-540-69389-5_39. [3] P. Reed, T. Austin, K. Takeda, Materials Data Centre, Project Proposal to JISC (April 2009). [4] C. Rimer, P. Reed, Materials Data Centre – report on user requirements, Tech. rep., University of Southampton (August 2009). [5] T. Ojala, H.-H. Over, Approaches in using MatML as a common language for materials data exchange, Data Science Journal 7 (2008) 179–195. [6] M. S. Ali, P. A. S. Reed, S. Syngellakis, Comparison of fatigue performance of HVOF spray coated and conventional roll bonded aluminium bearing alloys, Materials Science and Technology 25 (5) (2009) 575–581. doi:10.1179/174328408X322213. [7] G. Sha, X. Sun, T. Liu, Y. Zhu, T. Yu, Effects of Sc Addition and Annealing Treatment on the Microstructure and Mechanical Properties of the As-rolled Mg-3Li alloy, Journal of Materials Science & Technology 27 (8) (2011) 753 – 758. doi:10.1016/S1005-0302(11)60138-2. [8] T. O. Mbuya, I. Sinclair, A. J. Moffat, P. A. Reed, Analysis of fatigue crack initiation and S-N response of model cast aluminium piston alloys, Materials Science and Engineering: A 528 (24) (2011) 7331 – 7340. doi:10.1016/j.msea.2011.06.007. [9] S. Lee, Y. Lu, P. Liaw, L. Chen, S. Thompson, J. Blust, P. Browning, A. Bhattacharya, J. Aurrecoechea, D. Klarstrom, Tensile-hold low-cycle-fatigue properties of solid-solutionstrengthened superalloys at elevated temperatures, Materials Science and Engineering: A 504 (1–2) (2009) 64 – 72. doi:10.1016/j.msea.2008.10.030. [10] H. T. Pang, P. A. S. 
Reed, Microstructure variation effects on room temperature fatigue threshold and crack propagation in Udimet 720Li Ni-base superalloy, Fatigue & Fracture of Engineering Materials & Structures 32 (8) (2009) 685–701. doi:10.1111/j.1460-2695.2009.01366.x. [11] X. Wang, D. Yin, F. Xu, B. Qiu, Z. Gao, Fatigue crack initiation and growth of 16MnR steel with stress ratio effects, International Journal of Fatigue 35 (1) (2012) 10 – 15. doi:10.1016/j.ijfatigue.2011.05.007. [12] T. Morgeneyer, M. Starink, S. Wang, I. Sinclair, Quench sensitivity of toughness in an Al alloy: Direct observation and analysis of failure initiation at the precipitatefree zone, Acta Materialia 56 (12) (2008) 2872 – 2884. doi:10.1016/j.actamat.2008.02.021. [13] M. Fattahi, N. Nabhani, M. Vaezi, E. Rahimi, Improvement of impact toughness of AWS E6010 weld metal by adding TiO2 nanoparticles to the electrode coating, Materials Science and Engineering: A 528 (27) (2011) 8031 – 8039. doi:10.1016/j.msea.2011.07.035. [14] P. A. S. Reed, J. F. Knott, Investigation of the role of residual stresses in the warm prestress (WPS) effect. Part II – Analysis, Fatigue & Fracture of Engineering Materials & Structures 19 (4) (1996) 501–513. doi:10.1111/j.1460-2695.1996.tb00985.x. [15] D. R.-B. Aroush, E. Maire, C. Gauthier, S. Youssef, P. Cloetens, H. Wagner, A study of fracture of unidirectional composites using in situ high-resolution synchrotron X-ray microtomography, Composites Science and Technology 66 (10) (2006) 1348 – 1353. doi:10.1016/j.compscitech.2005.09.010. [16] W. W. Stead, W. E. Hammond, M. J. Straube, A chartless record — is it adequate?, Journal of medical systems 7 (2) (1983) 103–9. doi:10.1007/BF00995117. [17] V. Dinu, P. Nadkarni, Guidelines for the effective use of entityattribute-value modeling for biomedical databases., International journal of medical informatics 76 (11-12) (2006) 769–79. doi:10.1016/j.ijmedinf.2006.09.023. [18] P. M. Nadkarni, L. Marenco, R. Chen, E. Skoufos, G. Shepherd, P. 
Miller, Organization of heterogeneous scientific data using the EAV/CR representation., J Am Med Inform Assoc 6 (6) (1999)
478–493. [19] L. Marenco, N. Tosches, C. Crasto, G. Shepherd, P. L. Miller, P. M. Nadkarni, Achieving evolvable Web-database bioscience applications using the EAV/CR framework: recent advances, Journal of the American Medical Informatics Association 10 (5) (2003) 444–453. [20] M. Jäger, L. Kamm, D. Krushevskaja, H.-A. Talvik, J. Veldemann, A. Vilgota, J. Vilo, Flexible Database Platform for Biomedical Research with Multiple User Interfaces and a Universal Query Engine, in: H.-M. Haav, A. Kalja (Eds.), Databases and Information Systems V: Selected Papers from the Eighth International Baltic Conference, DB&IS 2008, IOS Press, 2008, pp. 301–310. doi:10.3233/978-1-58603-939-4-301. [21] S. J. Coles, J. G. Frey, M. B. Hursthouse, M. E. Light, A. J. Milsted, L. A. Carr, D. DeRoure, C. J. Gutteridge, H. R. Mills, K. E. Meacham, M. Surridge, E. Lyon, R. Heery, M. Duke, M. Day, An e-science environment for service crystallography — from submission to dissemination, Journal of Chemical Information and Modeling 46 (3) (2006) 1006–1016. doi:10.1021/ci050362w. [22] M. Linkert, C. T. Rueden, C. Allan, J.-M. Burel, W. Moore, A. Patterson, B. Loranger, J. Moore, C. Neves, D. MacDonald, A. Tarkowska, C. Sticco, E. Hill, M. Rossner, K. W. Eliceiri, J. R. Swedlow, Metadata matters: access to image data in the real world, The Journal of Cell Biology 189 (5) (2010) 777–782. doi:10.1083/jcb.201004104. [23] C. Gutteridge, GNU EPrints 2 Overview, in: 11th Panhellenic Academic Libraries Conference, 2002. [24] M. J. Bass, M. Branschofsky, DSpace at MIT: Meeting the Challenges, Joint Conference on Digital Libraries (2001) 468. [25] J. Frey, Curation of laboratory experimental data as part of the overall data lifecycle, International Journal of Digital Curation 3 (1) (2008) 44–62. [26] M. H. Ng, S. Johnston, B. Wu, S. E. Murdock, K. Tai, H. Fangohr, S. J. Cox, J. W. Essex, M. S. P. Sansom, P.
Jeffreys, BioSimGrid: grid-enabled biomolecular simulation data storage and analysis, Future Generation Computer Systems 22 (6) (2006) 657–664. doi:10.1016/j.future.2005.10.005. [27] C. Baru, R. Moore, A. Rajasekar, M. Wan, The SDSC Storage Resource Broker, in: Proceedings of the 1998 conference of the Centre for Advanced Studies on Collaborative research, IBM Press, 1998, p. 5. [28] S. Fiore, A. Negro, G. Aloisio, The data access layer in the GRelC system architecture, Future Generation Computer Systems 27 (3) (2011) 334–340. doi:10.1016/j.future.2010.07.006. [29] I. Foster, C. Kesselman, J. M. Nick, S. Tuecke, The Physiology of the Grid, in: F. Berman, G. Fox, T. Hey (Eds.), Grid Computing: Making the Global Infrastructure a Reality, John Wiley & Sons, Ltd, 2003, pp. 217–249. doi:10.1002/0470867167.ch8. [30] S. L. Pallickara, S. Pallickara, M. Zupanski, Towards efficient data search and subsetting of large-scale atmospheric datasets, Future Generation Computer Systems 28 (1) (2012) 112–118. doi:10.1016/j.future.2011.05.010. [31] R. Sargent, D. Fuhrman, T. Critchlow, T. Di Sera, R. Mecklenburg, G. Lindstrom, P. Cartwright, The design and implementation of a database for human genome research, in: Proceedings of 8th International Conference on Scientific and Statistical Data Base Management, IEEE Comput. Soc. Press, 1996, pp. 220–225. doi:10.1109/SSDM.1996.506064. [32] D. A. Benson, I. Karsch-Mizrachi, D. J. Lipman, J. Ostell, E. W. Sayers, GenBank, Nucleic acids research 38 (Database issue) (2010) D46–51. doi:10.1093/nar/gkp1024. [33] T. Kulikova, P. Aldebert, N. Althorpe, W. Baker, K. Bates, P. Browne, A. van den Broek, G. Cochrane, K. Duggan, R. Eberhardt, N. Faruque, M. Garcia-Pastor, N. Harte, C. Kanz, R. Leinonen, Q. Lin, V. Lombard, R. Lopez, R. Mancuso, M. McHale, F. Nardone, V. Silventoinen, P. Stoehr, G. Stoesser, M. A. Tuli, K. Tzouvara, R. Vaughan, D. Wu, W. Zhu, R.
Apweiler, The EMBL Nucleotide Sequence Database., Nucleic acids research 32 (suppl 1) (2004) D27–30.
18
[55] JSR-000283 Content Repository for Java Technology API 2.0 Final Release, [accessed 19-July-2011]. URL http://www.jcp.org/en/jsr/detail?id=283 [56] M. Rys, Scalable SQL, Communications of the ACM 54 (6) (2011) 48. doi:10.1145/1953122.1953141. [57] E. Meijer, G. Bierman, A co-relational model of data for large shared data banks, Communications of the ACM 54 (4) (2011) 49. doi:10.1145/1924421.1924436. [58] J. Gregorio, B. de hOra, The Atom Publishing Protocol, RFC 5023 (Proposed Standard) (Oct. 2007). URL http://www.ietf.org/rfc/rfc5023.txt [59] Open Archives Initiative Protocol — Object Exchange and Reuse, [accessed 1-July-2011]. URL http://www.openarchives.org/ore/ [60] The Open Archives Initiative Protocol for Metadata Harvesting, [accessed 1-July-2011]. URL http://www.openarchives.org/OAI/ openarchivesprotocol.htm [61] J. Allinson, L. Carr, J. Downing, D. F. Flanders, S. Francois, R. Jones, S. Lewis, M. Morrey, G. Robson, N. Taylor, SWORD AtomPub profile version 1.3, [accessed 6-June-2010] (ca.2010). URL http://purl.org/net/sword/ [62] R. Fielding, J. Gettys, J. Mogul, H. Frystyk, T. Berners-Lee, Hypertext Transfer Protocol — HTTP/1.1, RFC 2068 (Proposed Standard), obsoleted by RFC 2616 (Jan. 1997). URL http://www.ietf.org/rfc/rfc2068.txt [63] M. Nottingham, R. Sayre, Atom Format, RFC 4287 (Dec. 2005). URL http://www.ietf.org/rfc/rfc4287.txt [64] J. Kunze, T. Baker, The Dublin Core Metadata Element Set, RFC 5013 (Aug. 2007). URL http://www.ietf.org/rfc/rfc5013.txt [65] MySQL Documentation Team, MySQL 5.6 Reference Manual, Oracle Corporation, 28103rd Edition, [accessed 22-November2011] (Nov 2011). URL http://dev.mysql.com/doc/refman/5.6/en/ [66] Amazon SimpleDB documentation, [Accessed 31/10/2013]. URL http://aws.amazon.com/documentation/simpledb/ [67] G. DeCandia, D. Hastorun, M. Jampani, G. Kakulapati, A. Lakshman, A. Pilchin, S. Sivasubramanian, P. Vosshall, W. 
Vogels, Dynamo: Amazon’s highly available key-value store, ACM SIGOPS Operating Systems Review 41 (6) (2007) 205– 220. doi:10.1145/1323293.1294281. [68] Windows Azure Table Storage and Windows Azure SQL Database—compared and contrasted, [Accessed 31/10/2013]. URL http://msdn.microsoft.com/en-us/library/ windowsazure/jj553018.aspx [69] F. Chang, J. Dean, S. Ghemawat, W. C. Hsieh, D. A. Wallach, M. Burrows, T. Chandra, A. Fikes, R. E. Gruber, Bigtable: A distributed storage system for structured data, ACM Transactions on Computer Systems 26 (2) (2008) 4:1– 4:26. doi:10.1145/1365815.1365816. [70] Materials Data Centre and Heterogeneous Data Centre CodePlex repository. URL http://hdc.codeplex.com [71] S. Everitt, M. Starink, H. Pang, I. Wilcock, M. Henderson, P. Reed, A comparison of high temperature fatigue crack propagation in various sub-solvus heat treated turbine disc alloys, Materials Science and Technology 23 (2007) 1419–1423. URL http://eprints.soton.ac.uk/45815/ [72] T. Austin, M. Scott, S. Johnston, P. Reed, K. Takeda, EP2DC – an EPrints module for linking publications and data, in: Ensuring long-term preservation and adding value to scientific and technical data, 2011. [73] E. Prud’hommeaux, A. Seaborne, SPARQL query language for RDF, W3C Recommendation (January 2008). URL http://www.w3.org/TR/2008/ REC-rdf-sparql-query-20080115/
doi:10.1093/nar/gkh120. [34] E. Kaminuma, T. Kosuge, Y. Kodama, H. Aono, J. Mashima, T. Gojobori, H. Sugawara, O. Ogasawara, T. Takagi, K. Okubo, Y. Nakamura, DDBJ progress report., Nucleic acids research 39 (September 2010) (2010) 22–27. doi:10.1093/nar/gkq1041. [35] EMBL Nucleotide Sequence Database Release Notes, [accessed 29-June-2011]. URL ftp://ftp.ebi.ac.uk/pub/databases/embl/release/ relnotes.txt [36] S. C. Potter, L. Clarke, V. Curwen, S. Keenan, E. Mongin, S. M. J. Searle, A. Stabenau, R. Storey, M. Clamp, The Ensembl analysis pipeline., Genome research 14 (5) (2004) 934–41. doi:10.1101/gr.1859804. [37] L. Evans, P. Bryant, LHC machine, Journal of Instrumentation 3 (08) (2008) S08001. doi:10.1088/1748-0221/3/08/S08001. [38] J. Toledo, Past, present and future of data acquisition systems in high energy physics experiments, Microprocessors and Microsystems 27 (8) (2003) 353–358. doi:10.1016/S0141-9331(03)00065-6. [39] The ALICE Collaboration, The ALICE experiment at the CERN LHC, Journal Of Instrumentation 3 (08) (2008) S08002– S08002. doi:10.1088/1748-0221/3/08/S08002. [40] R. Divi` a, U. Fuchs, I. Makhlyueva, P. V. Vyvre, V. Altini, F. Carena, W. Carena, S. Chapeland, V. C. Barroso, F. Costa, F. Roukoutakis, K. Schossmaier, C. So` os, B. V. Haller, t. A. Collaboration, The ALICE online data storage system, Journal of Physics: Conference Series 219 (5) (2010) 052002. doi:10.1088/1742-6596/219/5/052002. [41] M. Girone, CERN database services for the LHC computing grid, Journal of Physics: Conference Series 119 (5) (2008) 052017. doi:10.1088/1742-6596/119/5/052017. [42] V. Altini, F. Carena, W. Carena, S. Chapeland, V. C. Barroso, F. Costa, R. Divi` a, U. Fuchs, I. Makhlyueva, F. Roukoutakis, K. Schossmaier, C. So` os, P. V. Vyvre, B. V. Haller, t. A. Collaboration, The ALICE Electronic Logbook, Journal of Physics: Conference Series 219 (2) (2010) 022027. doi:10.1088/1742-6596/219/2/022027. [43] E. F. 
Codd, A relational model of data for large shared data banks, Communications of the ACM 13 (6) (1970) 377 – 387. [44] C. J. Date, An introduction to database systems, Pearson/Addison Wesley, 2004. [45] B. Karwin, SQL Antipatterns: Avoiding the Pitfalls of Database Programming, Pragmatic Bookshelf Series, Pragmatic Bookshelf, 2010. URL http://books.google.co.uk/books?id=Ghr4RAAACAAJ [46] J. C. Anderson, J. Lehnardt, N. Slater, CouchDB: The Definitive Guide, O’Reilly, 2010. [47] D. Crockford, The application/json Media Type for JavaScript Object Notation (JSON), RFC 4627 (Jul. 2006). URL http://www.ietf.org/rfc/rfc4627.txt [48] K. Chodorow, M. Dirolf, MongoDB: The Definitive Guide, O’Reilly Media, 2010. [49] The BSON Specification, [accessed 19-July-2011]. URL http://bsonspec.org/ [50] A. Byron, A. Berry, N. Haug, J. Eaton, J. Walker, J. Robbins, Using Drupal, O’Reilly Media, 2008. [51] D. Dominguez-Sal, P. Urb´ on-Bayes, A. Gim´ enez-Va˜ n´ o, S. G´ omez-Villamor, N. Mart´ınez-Baz´ an, J. Larriba-Pey, Survey of Graph Database Performance on the HPC Scalable Graph Analysis Benchmark, in: H. Shen, J. Pei, M. zsu, L. Zou, J. Lu, T.-W. Ling, G. Yu, Y. Zhuang, J. Shao (Eds.), Web-Age Information Management, Vol. 6185 of Lecture Notes in Computer Science, Springer Berlin / Heidelberg, 2010, pp. 37–48. doi:10.1007/978-3-642-16720-1\_4. [52] The Neo4j Manual v1.4, [accessed 19-July-2011]. URL http://docs.neo4j.org/chunked/1.4/ [53] Midgard Open Source Content Management Framework, [accessed 19-July-2011]. URL http://www.midgard-project.org/ [54] Apache Jackrabbit Web Site, [accessed 19-July-2011]. URL http://jackrabbit.apache.org/
19