Digital Preservation in e-Science scenarios - An Enterprise Architecture Approach
Diogo Filipe Lopes Fernandes
Dissertation submitted to obtain the Master Degree in
Communication Networks Engineering
Jury
Chairman: Prof. Doutor Paulo Jorge Pires Ferreira
Supervisor: Prof. Doutor José Luís Brinquete Borbinha
Members: Prof. Doutor André Ferreira Ferrão Couto e Vasconcelos
First of all, I would like to express my gratitude to everyone who contributed, directly or indirectly, to the completion of this work. On the academic side, I would like to express my thanks and appreciation:
To Professor José Borbinha, for his support, availability and advice throughout the research process;
To Marzieh, for her kindness and willingness to discuss ideas for the direction of my work;
To Gonçalo Antunes, for the occasional conversations and suggested corrections to the writing of this dissertation.
On the professional side, I would like to thank Gonçalo Borges and Jorge Gomes, from LIP, for their availability in hosting this work and making its execution and validation possible. On the family side, I would like to thank my Mother for the motivation she always gave me during the most difficult moments. To my maternal grandparents, Elsa and Manuel, although they cannot be physically present, I am grateful for all the love and affection they gave me, which remain a source of strength and inspiration.
We are in an era of data-centric scientific research, in which scientific discovery is carried out not only through the well-defined, strict process of hypothesis testing, but also by combining the pool of data already available. The scientific data environment is expanding rapidly in both scale and diversity. Large volumes of data, the complex relationships between them, intense and constant interdisciplinary collaborations, and new types of near-real-time publishing are adding pattern and rule discovery to the scientific method. The growing open-data and open-science movements have helped to focus attention on issues related to the lifecycle of scientific research data and related processes, especially their preservation. This has been associated with the emergence of the concept of e-science, based on new scientific experiments that unify experiment, theory, and computing, exploiting advanced computational resources, data collections and scientific instruments. Capturing the important elements of data preservation, collaboration and provenance will require new approaches in the highly distributed, data-intensive research community. Hence, this project aims at the preservation of the data analysis processes taking place in the context of a collaboration, combining the best practices of IT Governance with the requirements of a typical e-science scenario.
Key Words Digital Preservation, e-science, Data Management Plan, Logbook, IT Governance
We are now in an era in which scientific research is data-centric, and in which research is carried out not only through well-defined, rigorous hypothesis-testing processes, but is also generated by combining and executing experimental activities over the available data. The scientific data environment is thus expanding rapidly in both scale and diversity. Large volumes of data, the complex relationships between them, intense and constant interdisciplinary collaborations, and new types of near-real-time publishing are adding new patterns and rules to the execution of the scientific method. The growth of the open-data and open-science movements has helped to focus attention on issues related to the lifecycle of scientific research data and its associated processes, especially their preservation. This has been associated with the emergence of the concept of e-science, based on new scientific experiments that unify experimentation, theory and computing, exploiting advanced computational resources, data collections and scientific instruments. Capturing the important elements of data preservation, collaboration and provenance will demand new approaches in a research community that is highly distributed and data-intensive. This project therefore aims at the preservation of the data analysis processes taking place in the context of a collaboration, combining the best practices of IT Governance with the requirements of a typical e-science scenario.
Key Words (Resumo) Digital Preservation, e-science, Data Management Plan, Logbook, IT Governance
Table of Contents
AGRADECIMENTOS II
ABSTRACT III
KEY WORDS III
RESUMO IV
PALAVRAS CHAVE IV
TABLE OF CONTENTS VI
LIST OF FIGURES VIII
LIST OF TABLES IX
ACRONYMS X
1. INTRODUCTION 1
1.1. Impact of science's changes on organisations 1
1.2. Problem context 2
1.3. Motivation 3
1.4. Research methodology 4
1.5. Thesis outline 6
2. RELATED WORK 7
2.1. Digital Preservation (DP) 7
2.1.1. The concept of preservation 8
2.2. Enterprise Architecture (EA) 9
2.3. Information Technology Infrastructure Library (ITIL) 10
2.3.1. The service lifecycle 11
2.3.2. Service transition – preparing for change 13
2.3.3. Service Asset and Configuration Management (SACM) 13
2.3.4. Components, tools and databases 16
2.3.4.1. Service Knowledge Management System (SKMS) 16
2.3.4.2. Configuration Management System (CMS) 16
2.3.4.3. Configuration Management Database (CMDB) 17
2.4. E-science 18
2.4.1. Scientific workflow 20
2.4.2. Business workflow vs. scientific workflow 22
2.4.3. Collaborative e-science experiments 23
2.4.4. Open data in collaborative e-science experiments 24
2.4.5. Electronic notebook 25
2.5. Data Management Plan (DMP) 27
2.5.1. The importance of DMP for the research community 28
2.5.2. DMP international practices 29
3. PROBLEM ANALYSIS 31
3.1. Research problem and goals 31
3.2. The need for better DM in scientific institutions 31
3.2.1. The preservation of HEP data 32
3.2.1.1. Long-term completion and extension of scientific programs 33
3.2.1.2. Cross-collaboration analyses 33
3.2.1.3. Data reuse 34
3.3. The need for process preservation 34
3.4. Large international scientific collaborations in Particle Physics 35
3.5. Analysis process 37
3.6. The LIP scenario 38
3.6.1. Organisation description 38
3.6.2. Scenario overview 38
3.6.2.1. Infrastructure's view 40
3.6.2.2. Processes' description view 46
3.6.3. Stakeholders 49
3.7. Conclusions 51
4. PROPOSED SOLUTION 53
4.1. Objectives 53
4.2. DMP for scientific research 53
4.3. Logbook for scientific research 54
4.4. A consolidated scenario: using a logbook in alignment with a DMP 54
5. DM PLANNING IN E-SCIENCE 57
5.1. Infrastructure and implementation issues 57
5.1.1. Relating the DMP to other documentation 57
5.1.2. Roles and responsibilities 58
5.1.3. Creation and development of the DMP 58
5.1.4. Review of the DMP 59
5.1.5. Budget 59
5.1.6. Data security 59
5.1.7. Identifying contractual and legal obligations 60
5.2. Key practices and process areas for consideration in the design of a DMP 60
5.3. Recommended contents for DMP 62
5.4. Conclusion 66
6. THE LOGBOOK IN E-SCIENCE 68
6.1. The logbook applied to data analysis process 68
6.2. Recommended contents for logbook 69
6.3. Summary of logbook context 71
6.4. The Record Point (RP) 74
6.4.1. Definition of RP 74
6.4.2. Recommended properties for RP 75
7. CONCLUSIONS 77
7.1. Application of the proposal 77
7.2. Future work 78
7.3. Final remarks 79
REFERENCES 80
APPENDIXES 86
Appendix A CMDB functions 86
Appendix B Comparison between business and scientific workflows' features 87
Appendix C ELNs products currently on the market 88
Appendix D International DMP practices 90
Appendix E LIP workflow 93
Appendix F Recommended DMP 94
List of Figures
Figure 1 – ITIL V2 books and their disciplines 11
Figure 2 – ITIL V3 service lifecycle 12
Figure 3 – The context of ITIL service transition 13
Figure 4 – Components making up the SKMS 16
Figure 5 – The basic scientific method 21
Figure 6 – Scientific workflow lifecycle 22
Figure 7 – WSU's publications with international collaborators 24
Figure 8 – The main concepts of the context and domain of DMP 32
Figure 9 – General experiment dataflow 37
Figure 10 – LIP computing infrastructure (operated data centres) 41
Figure 11 – LIP infrastructure 43
Figure 12 – The grid infrastructure 45
Figure 13 – LIP business processes 46
Figure 14 – Obtain data and software 47
Figure 15 – Analyse data 47
Figure 16 – Produce new data 48
Figure 17 – Infrastructure and implementation issues conceptual map 57
Figure 18 – The components of RP 74
Figure 19 – Local analysis process 93
List of Tables
Table 1 – …
Table 2 – Primary market audience choices 27
Table 3 – Information recorded during a local data analysis 49
Table 4 – IT manager concerns 50
Table 5 – Researcher concerns 51
Table 6 – Process areas for consideration in the design of a DMP 62
Table 7 – Recommended structure for DMP 67
Table 8 – Match between business processes and logbook contents 71
Table 9 – Recommended information that should be recorded during a local analysis (Infrastructure) 72
Table 10 – Recommended information that should be recorded during a local analysis (Processes) 73
Table 11 – Recommended RP properties and respective descriptions 75
Table 12 – CMDB functions 86
Table 13 – Comparison between business and scientific workflows' features 87
Table 14 – ELN companies, solutions and descriptions 88
Table 15 – ELN companies, solutions and descriptions (continuation) 89
Table 16 – International DMP practices 90
Table 17 – Curation policies and support services of the main UK research funders 91
Table 18 – DM policies and research data requirements 92
ANDS – Australian National Data Service
AUGER – The Pierre Auger Cosmic Ray Observatory
CCTA – Central Computer and Telecommunications Agency
CENSA – Collaborative Electronic Notebook Systems Association
CERN – European Organisation for Nuclear Research
CI – Configuration Item
CMDB – Configuration Management Database
CMP – Configuration Management Plan
COBIT – Control Objectives for Information and Related Technologies
DCC – Digital Curation Centre
DOI – Digital Object Identifier
DP – Digital Preservation
DPC – Digital Preservation Coalition
EA – Enterprise Architecture
EGI – European Grid Initiative
ESA – European Space Agency
FCCN – Portuguese Academic Research Network
IEEE – Institute of Electrical and Electronics Engineers
IS – Information System
ISACA – Information Systems Audit and Control Association
ISO/IEC – International Organisation for Standardization / International Electrotechnical Commission
IT – Information Technology
ITGI – IT Governance Institute
ITIL – Information Technology Infrastructure Library
ITSM – Information Technology Service Management
LHC – Large Hadron Collider
LIP – Laboratory of Instrumentation and Experimental Particle Physics
LRMS – Local Resource Management System
MoU – Memorandum of Understanding
NASA – National Aeronautics and Space Administration
NFS – Network File System
OAIS – Open Archival Information System
OECD – Organisation for Economic Co-operation and Development
OGC – Office of Government Commerce
RfC – Request for Change
SACM – Service Asset and Configuration Management
SKMS – Service Knowledge Management System
SNOLAB – Sudbury Neutrino Observatory Laboratory
SSH – Secure Shell
TIMBUS – Timeless Business
WfMC – Workflow Management Coalition
WLCG – Worldwide LHC Computing Grid
WSU – Washington State University
INTRODUCTION

Due to the ease with which people handle digital content, it is presumed that everyone, including businesses, institutions, and governments, invests time and effort into creating and capturing digital information for instantaneous access by anyone. Researchers are thus increasingly engaged in research projects involving intensive data manipulation, in which disciplinary and geographic boundaries are crossed. Research projects now often involve virtual communities of researchers participating in large-scale web-based collaborations, opening their early-stage research to the research community to encourage broader participation and accelerate discoveries. The result of such large-scale collaborations has been the production of ever-increasing amounts of data. In short, we are in the midst of a data deluge, and researchers are striving to make this information available to communities worldwide.

Unfortunately, the continued preservation and accessibility of digital information generated in this context of rapid technological advances cannot be guaranteed. Despite our information technology (IT) investments, there is a critical and cumulative weakness in most organisations' information infrastructure. Long-term preservation of digital information is plagued by short media life, obsolete hardware and software, slow read times of old media, and defunct websites. Indeed, the majority of products and services on the market today did not exist five years ago. More importantly, we lack proven methods to ensure that the information will continue to exist, that we will be able to access this information using the available technology tools, or that any accessible information is authentic and reliable.

Moreover, in a world in constant change and progress, organisations must rely on IT services to address their needs and those of their clients. Stakeholders need faster and improved methodologies to deliver their services and/or products in the most cost-efficient way, since "IT becomes not only a success factor for survival and prosperity, but also an opportunity to differentiate and to achieve competitive advantage". By bringing together the concepts of Digital Preservation (DP) and IT Management, organisations can combine the best of both worlds, bringing benefits to their business.

1.1. Impact of science's changes on organisations
Science is a body of empirical, theoretical, and practical knowledge about the natural world, produced by researchers making use of scientific methods, which emphasise the observation, explanation, and prediction of real-world phenomena by experiment. Science and technology have had a major impact on society, and consequently on organisations, and their impact is growing, since organisations are a reflection of society and its behaviour. The impact of science on organisations is thus mainly due to the development of new technologies, which affect both people and organisations. Technology has always been a central variable in organisational theory, informing research and practice.
However, the way science is done today is quite different from how it was done a few decades ago. We are all aware of the tremendous impact that computers have had on science and engineering in recent years. The fusion of computers and the inherent computing technology has played a key role in the dramatic improvement in the production and productivity of research. One of the major steps responsible for this potential is the convergence of the interests of the scientific community: an identified group of potential consumers of information who should be able to understand a particular set of this information. Usually, this group comprises all the interacting researchers, the so-called "designated community", who work on activities in scientific fields. Thus, it was possible to create an integrated community, concerned with and focused on supporting scientific collaborations.

Furthermore, information is a valuable asset that organisations hold and use to their advantage, with the main objective of improving internal decision processes or creating new findings. Due to the high amount of information to which these organisations are subject, it is necessary to create mechanisms that interact with these assets and select only what is deemed relevant in a particular context. This problem applies both to IT organisations and to scientific research entities, since in both cases it is necessary to adapt the organisation to the new challenges posed by scientific advances, which cause an increasing data deluge with which these entities have to be able to cope. Thus, it is important that organisations as a whole are willing to take the step towards new scientific and technological advances, because only in this way can they foster development and advances in science.

1.2. Problem context
Researchers are facing an imminent data deluge, which imposes several challenges on the way data is managed and analysed. Communities in fields such as biology, medicine, engineering and physics manage large amounts of scientific information. To address this problem, new methods of research have emerged that exploit advanced computational resources, data collections and scientific instruments. These new methods derive from the concept of e-science, which includes a very broad class of activities, as nearly all information gathering is computer-based or uses information technologies for measuring, recording, reporting and analysing. The growing number of computational and data resources, coupled with uniform access mechanisms provided by common infrastructures (such as grid networks), is allowing researchers to perform advanced scientific tasks in collaborative environments. Scientific workflows are the means by which these tasks can be composed. These workflows can generate terabytes of data, mandating rich and descriptive metadata about the data so that it can be reused.

Nowadays, research organisations receive data, process it and produce results; these are the activities that most researchers carry out. Therefore, when a researcher performs a process throughout his or her research activity, it is necessary and common practice to keep a record of the tasks performed (for instance, to keep provenance information on results). In fact, since scientific activities are increasingly supported by the computing environment (processing of numerical data), these processes can be roughly compared to organisational business processes. Similarly to organisations, which invest in process audit and conformance, researchers also need to be able to prove that a process was executed in a way consistent with the produced results. Researchers have therefore come to realise that they need to keep a record of their activities and contextualise them in order to prove that an analysis was performed in a certain way. Moreover, researchers need this record to be obtained in a way that is as unintrusive as possible; in other words, the recording process should be automatic, so that they do not have to do it manually.

In such contexts, DP is applied to e-science scenarios, comprising the data and the resulting documentation of the research. Because all this research is performed in the digital environment, when this issue is placed within the area of concern of e-science, the scientific community has been identifying some relevant concepts that may overcome the problem of preserving the context of the analysis process. It is in this regard that the combination of best practices for IT Governance and e-science scenarios motivates a detailed study and the subsequent presentation of proposals intended to contribute to solving this type of problem.

1.3. Motivation
DP, in its basic definition, is a process that ensures that a digital object can be accessed and understood for a long period of time, guaranteeing its authenticity and integrity. The main motivation of this work relates to the traditional focus of DP on preserving digital objects, but it is additionally concerned with the broader scope of preserving the context in which these objects were produced. The research problem involves an identified concern of business governance in e-science organisations that develop and execute research projects as part of collaborations with other organisations. For this, we consider the real case of an e-science organisation, the Laboratory of Instrumentation and Experimental Particle Physics (LIP). LIP is a scientific and technical association of public utility whose aim is research in the field of experimental high-energy physics (HEP) and associated instrumentation. The main research activities of the laboratory are developed within large collaborations at the European Organisation for Nuclear Research (CERN) and other international organisations and major infrastructure projects within and outside Europe. Data analysis in these collaboration scenarios is a complex process that can take months or years, performed by one or more researchers. The main problem is that later (months or years afterwards) the same researchers, or others, may want to revisit the same analysis, for example to:
Understand how the data analysis was done;
Run the same analysis over new or more complete data; or
Redo the same analysis but with different parameters or applying new techniques.
Considering this, LIP felt the need to create an environment where any data analysis can be reproduced, as part of any collaboration. So, the main goals for the research problem are based on two concepts:
Focus on information and its management, maintenance and documentation of data that is obtained and produced as part of a collaboration;
Record the context information (activities and tasks performed) during a data analysis process and align it with the best practices of IT Governance.
These goals are achieved through the preservation of the execution of a collaboration, combining the best practices of IT Governance and the requirements of a typical e-science scenario. This project is carried out in the wider scope of the Timeless Business (TIMBUS) European project, whose vision is to bring DP into the realm of business continuity management by developing the activities, processes and tools that ensure long-term continued access to business processes and the underlying software and hardware infrastructure.
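The second goal listed above, automatically recording context information (activities and tasks) during a data analysis, could in principle be approached as a thin instrumentation layer over the analysis code. The following is only an illustrative sketch; the names (`logbook`, `record_step`, `select_events`) are hypothetical and not part of any tool used at LIP or in TIMBUS.

```python
import functools
import time

logbook = []  # in-memory record; a real logbook would persist entries

def record_step(func):
    """Decorator that appends an entry to the logbook every time an
    analysis step runs, capturing its name, arguments and result,
    so the researcher does not have to take notes manually."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        result = func(*args, **kwargs)
        logbook.append({
            "step": func.__name__,
            "args": repr(args),
            "kwargs": repr(kwargs),
            "result": repr(result),
            "timestamp": time.time(),
        })
        return result
    return wrapper

@record_step
def select_events(events, threshold):
    """Toy analysis step: keep events above an energy threshold."""
    return [e for e in events if e >= threshold]

selected = select_events([10.0, 250.0, 42.0], threshold=100.0)
print(logbook[0]["step"])  # the step was recorded automatically
```

The point of the sketch is that the record is a by-product of running the analysis, satisfying the "as unintrusive as possible" requirement discussed above.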
A major barrier to the empirical validation of information system (IS) design methods is that it is very difficult to get new approaches, especially those developed in academic environments, accepted and used in practice. So, there have been frequent calls for IS researchers to make their research more relevant to practice, yet it seems IS researchers continue to struggle to make excellent research practically relevant. To address this problem, action research (AR) aims to solve current practical problems while expanding scientific knowledge. AR is an established research method in use in the social and medical sciences since the mid-twentieth century, and it increased in importance for IS toward the end of the 1990s. This method has developed a history within IS that can be explicitly linked to early work by Lewin and the Tavistock Institute. AR provides a method for testing and refining research ideas by applying them in practice. That is, the method produces highly relevant research results, because it is grounded in practical action, aimed at solving an immediate problem situation while carefully informing theory. It is performed "in an organisational context using real practitioners". Likewise, AR allows the method to evolve as a result of experience in practice, as part of an on-going learning and reflection process. It is important to recognise AR as one of a number of different kinds of action inquiry. Action inquiry is a generic term for any process that follows a cycle in which one improves practice by systematically oscillating between taking action in the field of practice and inquiring into it. The AR approach adopted in this work was canonical action research (CAR).
However, there are different developments of the basic action inquiry process. CAR is cyclically carried out in practice; this cyclical, iterative process is one feature that helps distinguish it from other types of AR. CAR is an iterative, rigorous and collaborative research method, comprising five key steps (Table 1) for each research cycle.
Table 1 – Identification and description of the AR cycle's key steps
Diagnosing – Identification of the primary problems that are the underlying causes of the organisation's desire for change.
Action Planning – Specifies organisational actions that should relieve or improve the primary problems (identified in the Diagnosing step).
Action Taking – Implements the planned action through an active intervention into the client organisation.
Evaluating – Determination of whether the theoretical effects of the action were realised, and whether these effects relieved the problems.
Specifying Learning – Usually an on-going process; the knowledge gained is directed to three audiences: the restructuring of organisational norms, foundations for further intervention, and the scientific community.
The choice of CAR is justified by the specificity of the domain in which this work fits, which is distinguished by the following characteristics:
The researcher is actively involved, with expected benefit for both researcher and organisation;
The knowledge obtained can be immediately applied, there is not the sense of the detached observer, but that of an active participant wishing to utilize any new knowledge based on an explicit, clear conceptual framework;
The research is a process linking theory and practice.
Once the problem is determined and the proposal applied to a specific situation, CAR is the method that best applies to this case since, as indicated in the literature, the CAR definition also assumes "that there is a concrete client involved". This client may be an organisation (whether commercial, not-for-profit, governmental or some other form) as a whole, or a subset of an organisation (whether a specific unit, level or individual within it). To sum up, this method was applied during the execution of this work, essentially because:
AR aims at an increased understanding of an immediate social situation, with emphasis on the complex and multivariate nature of this social setting in the IS domain;
AR simultaneously assists in practical problem solving and expands scientific knowledge. This dual goal extends into two important process characteristics: first, highly interpretative assumptions are made about observation; and second, the researcher intervenes in the problem setting;
AR is performed collaboratively and enhances the competencies of the respective actors.
AR is characterised by the support that the theory and the analysis of the problem provide to the action (proposal) performed by the researcher.
This document is divided into 7 main chapters, including the present one. Chapter 2 ("Related work") focuses on the main concepts considered relevant for a proper comprehension of the problem, enabling the contextualisation and better understanding of the information presented in subsequent chapters. This chapter introduces sub-sections where concepts such as DP, Enterprise Architecture (EA), Information Technology Infrastructure Library (ITIL), Configuration Management Database (CMDB), e-science and Data Management Plan (DMP), among others, are presented.
Chapter 3 ("Problem analysis") analyses the problem. That is, a problem was recognised related to business governance in e-science organisations that perform research projects within collaborations with other organisations. In this context, research problems and objectives were identified, according to the problem under consideration. The analysis was based on the two crucial points on which this dissertation relies: the infrastructure view and the processes view.
Chapter 4 ("Proposed solution") presents the proposed solution and therefore comes as a result of the analysis of chapter 3. The solution lies in two concepts, a DMP and a logbook, both of which are described in this chapter at a high level.
Chapter 5 addresses the first concept, the DMP. In this chapter, we describe the general principles of a DMP along with the detailed implementation of this solution, through a combination of a set of best practices with suggested recommendations for the creation of a DMP and its contents.
Chapter 6 describes the implementation of a logbook for e-science disciplines, based on the state of the art of electronic laboratory notebooks (ELN), with proper adjustments in order to record all activities considered important for a researcher during a data analysis process. This chapter also presents a high-level integration of the logbook, describing the contents that should be addressed, as well as a new concept underlying the use of the logbook by the researcher.
Chapter 7 ("Conclusions") regards the evaluation of the proposed approach, justifying that the proposal is correct through its application to the case of LIP. In this chapter, some ideas for future work are described, as well as the final remarks.
2.1. Digital Preservation (DP)
The field of DP has been guided by the Open Archival Information System (OAIS) standard ISO 14721:2003, which provides a high-level reference model. It identifies the participants, describes their roles and responsibilities, and classifies the types of information they exchange. However, because it is only a high-level reference model, almost any system capable of storing and retrieving data can make a plausible case that it satisfies the OAIS conformance requirements.
DP ensures that digital objects can be accessed and understood for a long period of time. It is a set of processes that guarantees that, in spite of media, hardware or software obsolescence, the informational content of the digital object remains available to its users as it was when it was created. Digital information cannot survive and remain accessible if there is no concern for its (active) management and evaluation from the beginning of its lifecycle.
From another point of view, DP can be taken as a solution to tackle issues related to business continuity: retaining and keeping digital information accessible over long periods of time (a decade or more) may be helpful to business performance, particularly in policy governance and compliance. The possibility of being able to see the action taken as a consequence of a particular decision made in the past, as well as being able to analyse and reproduce the entire business process that led to that decision, are fundamental aspects that DP aims to ensure. The benefit of this approach is making better business decisions in the future, ensuring that the organisation can be more rational and aware of the risks that may result. To sum up, we can define the following generic and common requirements of DP:
Integrity – Effective preservation requires that the informational content of objects remains unchanged through its lifetime;
Reliability – A copy (or representation) of any preserved object must survive over its entire lifetime;
Authenticity assurance – A future consumer may require the accessed information to be trustworthy;
Provenance – A future consumer may require information concerning the origins of object;
Dealing with obsolescence – Digital objects should be able to be exploited independently of any technological context;
Scalability – DP systems might be required to face technological evolution through the addition of new components;
Heterogeneity – DP systems might be required to handle digital objects of many different types and formats.
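The integrity and authenticity requirements above are, in practice, commonly supported by fixity information such as cryptographic digests. The following is a minimal sketch of the idea, not a description of any particular repository; the function name and sample content are illustrative.

```python
import hashlib

def fixity(content: bytes) -> str:
    """Compute a SHA-256 digest of an object's content. Recording
    this at ingest and re-computing it later lets a repository
    detect any change to the informational content."""
    return hashlib.sha256(content).hexdigest()

original = b"raw detector counts: 17, 42, 108"
stored_digest = fixity(original)       # recorded at ingest time

# Later verification: the object passes only if the digest matches.
corrupted = original + b"\x00"         # a single silent extra byte
print(fixity(original) == stored_digest)   # unchanged object verifies
print(fixity(corrupted) == stored_digest)  # any alteration is detected
```

Periodic re-computation of such digests is one concrete way a DP system can demonstrate that content "remains unchanged through its lifetime".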
DP is a very complex problem; however, these are the main surveyed requirements that need to be taken into account to achieve the goals of DP.
2.1.1. The concept of preservation
We are living in a digital world. This era and what we are building can have many names, such as cyberspace, global information infrastructure, information age, information (super)highway, interspace, paperless society, etc. They are all supported by networking (e.g., the internet). However, their essence is information. Information is what flows over the networks, what is presented to us by our consumer electronics devices, what is manipulated by our computers and what is stored in our libraries. In this context it is important to distinguish data from information. Data is raw: it simply exists and has no significance beyond its existence (in and of itself). It can exist in any form, usable or not. On the other hand, information is data that has been processed to be useful; it provides answers to "who", "what", "where" and "when" questions. According to the literature, information is data that has been given meaning by way of relational connection. This "meaning" can be useful, but does not have to be. For example, in computer science, a relational database makes information from the data stored within it.
Data curators, archivists, librarians and other information managers are very comfortable with operating in a physical world, where the objects that are the focus of preservation have a tangible reality and existence. In brief, preservation in this world aims to ensure that the physical object remains accessible and usable for as long as necessary or possible. Of course, the physical carrier of the data (e.g., a sheet of paper) can be separated from the data that it carries, but for most purposes the carrier and the data are inextricable. To preserve the data in this context, it is necessary to preserve the physical carrier; preserving the object results in preservation of the data. This type of preservation has long been the responsibility of libraries and archives, which assemble, organise and protect this kind of information, such as documentation of human activity. The ethic of preservation as coordinated and conscious management, however, is a more recent phenomenon.
Digital objects, on the other hand, although often functionally the same as their paper analogues, are inherently different. Perhaps the most obvious difference is that digital objects are mediated by technology: for the objects to be used, a user must have access to the right combination of hardware and software to enable the object to be recreated. Certain nontrivial consequences arise from this unavoidable technological dependence of digital objects.
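The remark that a relational database "makes information from the data stored within it" can be illustrated with a tiny example: two tables of raw values, which only answer "who" and "when" questions once they are joined. The table and column names here are invented for illustration.

```python
import sqlite3

# Two tables of raw data; joining them yields information that
# answers "who" performed a run and "when" it started.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE runs (run_id INTEGER, started TEXT)")
conn.execute("CREATE TABLE operators (run_id INTEGER, name TEXT)")
conn.execute("INSERT INTO runs VALUES (1, '2012-03-01')")
conn.execute("INSERT INTO operators VALUES (1, 'Alice')")

row = conn.execute(
    "SELECT o.name, r.started FROM runs r "
    "JOIN operators o ON o.run_id = r.run_id"
).fetchone()
print(row)  # the relational connection gives the rows their meaning
```

Each table on its own is just data; the relational connection between them is what produces usable information.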
The major consequence is that it is not possible to leave a digital object alone and expect it to "survive". In the past, a "do nothing" approach has often been enough to ensure the "survival" and usability of physical data. However, technology changes so rapidly that there is no guarantee that existing data sources will be accessible and usable on future computing platforms or software versions. Thus, in a digital world, given the rate of technological obsolescence, a "do nothing" approach is risky and will result in the loss and/or destruction of the digital object.
Another important consequence is that it is not enough to merely preserve the carrier medium. There must be (active) intervention to make sure that the digital object can be located, accessed and used over time. In brief, DP covers all actions necessary to keep the digital object accessible and usable over time, as well as to ensure that the data and its informational content have not been compromised by anything that has been done to it in the preservation or access processes. According to the Digital Preservation Coalition (DPC), three types of preservation exist with regard to time:
Long-term preservation – Continued access to digital materials, or at least to the information contained in them, indefinitely;
Medium-term preservation – Continued access to digital materials beyond changes in technology for a defined period of time but not indefinitely;
Short-term preservation – Access to digital materials either for a defined period of time while use is predicted but which does not extend beyond the foreseeable future and/or until it becomes inaccessible because of changes in technology.
In order to standardise DP practice and provide a set of recommendations for preservation programme implementation, the reference model for OAIS was developed.
2.2. Enterprise Architecture (EA)
EA can be described as a formal, highly structured way of defining an enterprise's systems architecture. Giving a more detailed explanation, EA has been defined as "a coherent whole of principles, methods, and models that are used in the design and realization of an enterprise's organizational structure, business processes, information systems, and infrastructure." The role of an EA is to achieve model-driven enterprise design, analysis and operation while, at the same time, allowing the organisation to be flexible in the face of change, since the configuration of a system might have to change at any moment. According to the IEEE 1471-2000 standard, the notion of architecture is about the fundamental organisation of a system, embodied in its components, their relationships to each other and to the environment, and the principles guiding its design and evolution. An architecture provides an integral view of the system being designed or studied. For many years, the systems that were the object of architectural descriptions belonged basically to technical domains, but nowadays the trend flows to a much broader scope, and enterprises are also considered a subject that must be viewed as a whole and purposefully designed in a systematic and controlled way. EA captures the essentials of the business and IT, and their evolution. It benefits the organisation by transmitting a much better understanding of its structure, products, operations and technology, connecting the organisation to its surrounding environment.
EA is, therefore, a way to align IT, human resources and organisational processes with the business strategies. EA defines a model of the current state of the business, IT, personnel and processes, aligned with the organisation's strategy.
Identifying the architecture of the enterprise should therefore be considered a fundamental step for any organisation that wants to be ready to act rather than react, and to be able to understand whether its elements are aligned. The EA results from the continuous process of representing and keeping aligned the elements that are required for the management of the organisation. To support this view of the organisation, one usually defines a series of reference architectures: the "as-is" architecture, related to the current state of the organisation as it is, without including any changes or improvements; and the "to-be" architecture, describing the desired state to which the organisation should migrate, including the improvements and changes that the future architecture should address. The alignment between all these architectures has been the focus of some studies and can be used as a support for the governance of the organisation's transformation.
Taking into account that the ultimate goal of the implementation of a DP system is to be able to offer solutions that address problems in a proper manner, it should be recognised that such solutions must always be a mix of an organisational structure with the related set of activities and services.
2.3. Information Technology Infrastructure Library (ITIL)
The ITIL was developed by the Central Computer and Telecommunications Agency (CCTA), nowadays called the Office of Government Commerce (OGC). Starting as a guide for the United Kingdom (UK) government, the framework has proved to be useful to organisations in all sectors through its adoption by many service management companies as the basis for consultancy, education and software tools support. Although developed in the late 1980s, ITIL was not widely adopted until the mid-1990s. However, today ITIL is known and used worldwide, which has led to a number of standards, including ISO/IEC 20000, an international standard covering the IT Service Management (ITSM) elements of ITIL. ITIL is a collection of guidelines and best practices to improve ITSM. It provides a standardised process model, which defines the goals, workflow, inputs and outputs of each process. It helps to improve and measure the quality of services on a continuous basis. This framework is independent of technologies and vendors, and applicable to companies of all kinds and sizes. ITIL defines a common vocabulary for all IT actors, as well as proposing a standard way of implementing IT services within organisations. The first iteration of the books was published between 1989 and 1996, with a large emphasis on controlling and managing operations using the "Plan-Do-Check-Act" (PDCA) cycle. This initial collection of books became ITIL V1.
Between 1996 and 2000, the original collection grew to include over 20 books, which made adherence to ITIL quite difficult. Thus, in 2001, ITIL V2 was released, in order to make ITIL more accessible to those wishing to explore it. One of the aims of ITIL V2 was to produce a set of books that grouped related process guidelines into the different aspects of IT management, applications and services (Figure 1).
• IT service management (service delivery and service support)
• ICT infrastructure management
• Security management
• Business perspective
• Application management
• Software asset management
Figure 1 – ITIL V2 books and their disciplines
In December 2005, the OGC issued notice of an ITIL refresh, commonly known as ITIL V3, which became available in May 2007. ITIL V3 initially includes five core books: 1. Service strategy; 2. Service design; 3. Service transition; 4. Service operation; 5. Continual service improvement. In this new version of ITIL, all the main processes known from ITIL V2 are still present, with only a few substantial changes. In many instances, however, ITIL V3 offers revised and enhanced process descriptions. The main difference between ITIL V3 and V2 is the new ITIL V3 service lifecycle structure: ITIL V3 is best understood as seeking to implement feedback loops by arranging processes in a circular way.
2.3.1. The service lifecycle
ITIL is organised around a service lifecycle, which includes: service strategy, service design, service transition, service operation and continual service improvement (Figure 2). At the core of the service lifecycle is service strategy. This phase provides guidance on how to view service management not only as an organisational capability but also as a strategic asset. Service strategy includes understanding who the IT customers are, the service offerings that are required to meet those customers' needs, the IT capabilities and resources that are required to develop these offerings, and the requirements for executing successfully. Driven through strategy and throughout the course of delivery and support of the service, IT must always try to ensure that the cost of delivery is consistent with the value delivered to the customer.
For services to provide true value to the business, they must be designed with the business objectives in mind. Service design is the stage in the lifecycle that turns service strategy into the blueprint for delivering the business objectives. Service design provides guidance for the design and development of services and service management practices. However, the scope of service design is not limited to new services; it includes the changes and improvements necessary to increase or maintain value to customers over the lifecycle of services, the continuity of services, the achievement of service levels, and conformance to standards and regulations. Service transition provides guidance for the development and improvement of capabilities for transitioning new and changed services into live service operation. This phase addresses managing changes, controlling the assets and underlying components (e.g., hardware, software, etc.) associated with new and changed systems, service validation and testing, and transition planning, to assure that users, support personnel and the production environment have been prepared for the release to production. Service transition introduces the Service Knowledge Management System (SKMS), which builds upon the current data and information within configuration, capacity, known error, definitive media and asset systems, and broadens the use of service information into a knowledge capability for the decision-making and management of services. Service operation embodies practices in the management of the day-to-day operation of services. It includes managing disruptions to service through rapid restoration after incidents, determining the root cause of problems, detecting trends associated with recurring issues, handling daily routine end-user requests and managing service access.
Figure 2 – ITIL V3 service lifecycle
Strategic objectives are ultimately realised through service operation, therefore making it a critical capability. Enveloping the service lifecycle is continual service improvement, which offers a mechanism for IT to measure and improve the service levels, the technology, and the efficiency and effectiveness of the processes used in the overall management of services. In short, continual service improvement provides instrumental guidance in creating and maintaining value for customers through better design, transition and operation of services. It combines principles, practices and methods from quality management, change management and capability improvement. A closed-loop feedback system, based on the PDCA model, is established, capable of receiving inputs for improvement from any planning perspective.
2.3.2. Service transition – preparing for change
In the IT world, many business innovations are achieved through project initiatives that involve IT. In the end, whether these are minor operational improvements or major transformational events, they all produce change. It is at this point that the knowledge that has been generated will be needed to manage services once in the live environment, and it must be managed and shared across the organisation. The main objective of service transition is the development and improvement of capabilities for transitioning new and changed services into operation. Figure 3 presents the key concepts underlying service transition.
Figure 3 – The context of ITIL service transition
Service Asset and Configuration Management (SACM)
In ITIL V2, service asset management and configuration management were treated separately; the combined process, SACM, was created with ITIL V3.
Configuration management provides information about service asset components and the relationships that exist between the various components. This is essential to effective service management solutions, since this information underpins all of the other processes, particularly incident, problem, availability and change management. SACM is a subset of ITSM and a two-fold process. The first part revolves around the management of a service asset across its lifecycle. The second part, configuration management, provides a logical model to identify, control, maintain, verify and report on the assets and resources comprising an IT infrastructure, as well as their constituent components and relationships. ITIL SACM aims to maintain information about Configuration Items (CIs), components with identifiable and manageable attributes (for example a computer, a process or an employee) required to deliver an IT service, including their relationships. Simply put, the purpose of SACM is to control changes through the creation and maintenance of documentation. This is not the same as change management, which is a process for evaluating and handling change requests in the pursuit of quality-of-service improvement; change management and other ITIL processes use this documentation to make better decisions. Creating and maintaining records of CIs, such as hardware, software and the documentation related to these CIs, is what SACM describes. SACM process activities include: A. Configuration management and planning; B. Configuration identification; C. Configuration control; D. Configuration status accounting; E. Configuration verification and audit. Each of these activities is described below.
Before continuing, it is important to distinguish the terms CI and configuration record (introduced in ITIL V3). ITIL V3 describes a configuration record as "A record containing the details of a CI... configuration records are stored in a CMDB." A CI is defined as "Any component that needs to be managed in order to deliver an IT Service. Information about each CI is recorded in a configuration record..." So, most of the time (both in documentation and in communications), when we talk about CIs in the CMDB we are really talking about configuration records: the CI is the actual item that is managed, not the record in the CMDB. Throughout this document, the term CI will be used without a clear distinction from the configuration record, since for the practical purposes of this dissertation the distinction is not relevant.
A. Configuration management and planning
This activity plans and defines configuration management activities, policies and procedures, with the relevant organisational and technical considerations. During this activity, the schedule and procedures for performing configuration management activities are described. The management team and configuration management decide what level of configuration management is needed and how this level will be achieved. This is documented in a configuration management plan (CMP), a document describing the organisation and procedures for the configuration management of a specific product, activity or service. In this activity, the tools needed to support the function are evaluated, so this is where the CMDB and associated tools should be chosen. Associated tools, such as those that automatically gather information about CIs, are vital in maintaining that information, as otherwise a manual approach would be needed, which can be error-prone and expensive.
B.
Configuration identification
This activity focuses on establishing a CI classification system, which selects and identifies the configuration structures for all the infrastructure's CIs, including their owners, their interrelationships and configuration documentation. It should also include setting up an identification scheme for all items, allocating identifiers and version numbers to CIs, labelling each physical item and recording details in the CMDB. These activities can take place for hardware, software, business systems, physical databases, etc.
C.
Configuration control
This activity ensures that only authorised and identifiable CIs are accepted and recorded in the CMDB, and that these are managed for their entire lifecycle, from receipt to disposal. It ensures that no CI is added, modified, replaced or removed without appropriate controlling documentation. Any time the CMDB is altered, a control occurs, including:
Registration of all new CIs and versions;
Update of CI records and license control;
Updates in connection with request for change (RfC) and change management;
Update the CMDB after periodic checking of physical items.
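The configuration-control rule above, that no CI is added or modified without an authorised request for change, can be sketched as a simple gate in front of the configuration records. This is an illustrative model only, not taken from any specific CMDB product; the class and field names are hypothetical.

```python
class MiniCMDB:
    """Minimal configuration-control sketch: every change to the
    stored configuration records must carry an authorised RfC
    reference, otherwise it is rejected."""

    def __init__(self, authorised_rfcs):
        self.records = {}                       # ci_id -> configuration record
        self.authorised_rfcs = set(authorised_rfcs)

    def register(self, ci_id, attributes, rfc):
        """Add or update a CI's configuration record under an RfC."""
        if rfc not in self.authorised_rfcs:
            raise PermissionError(f"RfC {rfc!r} is not authorised")
        self.records[ci_id] = dict(attributes, rfc=rfc)

cmdb = MiniCMDB(authorised_rfcs={"RFC-17"})
cmdb.register("srv-001", {"type": "hardware", "os": "SL6"}, rfc="RFC-17")
try:
    # An update without controlling documentation must not get through.
    cmdb.register("srv-002", {"type": "hardware"}, rfc="RFC-99")
except PermissionError as err:
    print("rejected:", err)
```

The stored record also keeps the RfC reference, which is what later allows status accounting and audits to trace every change back to its authorisation.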
D. Configuration status accounting
This activity provides reporting capabilities for all current and historical data for each CI throughout its lifecycle. Status reports should be produced on a regular basis, listing, for all CIs under control, their current version and change history. Status accounting reports on the current, previous and planned states of the CIs should include:
Unique identifiers of constituent CI and their current status;
Configuration baselines, releases and their status;
Latest software item versions and their status for a system baseline/application;
The person responsible for status change;
E. Configuration verification and audit
This activity ensures that reviews and audits verifying the physical existence of CIs are conducted at appropriate intervals. These reviews include verification of the existence of CIs, checking that they are correctly recorded in the CMDB, and checking that there is conformity between the documented baselines and the actual environment to which they refer. The efficiency and effectiveness of configuration management need to be assessed on a regular basis. This should include checks to ensure that all changes to the IT infrastructure have been properly authorised, and that details have been recorded correctly and in a timely manner in the CMDB. So, configuration audits should occur at the following times:
Before and after major changes to the IT infrastructure;
Following recovery from disaster;
In response to the detection of an unauthorised CI;
At regular intervals.
2.3.3. Components, tools and databases
The execution of any ITSM process usually requires access to, or storage of, relevant sets of data and information. The service transition phase incorporates the processes of knowledge management and SACM, which are heavily focused on the management of data, information and knowledge, and on the corresponding components, tools and databases (Figure 4).
2.3.3.1. Service Knowledge Management System (SKMS)
The SKMS is the central repository of the data, information and knowledge that the IT organisation needs to manage the lifecycle of its services. The SKMS provides a "set of tools and databases that are used to manage knowledge and information." The SKMS includes the Configuration Management System (CMS) as well as the CMDB and other tools and databases. The SKMS stores, manages, updates and presents all the information that an IT service provider needs to manage the full lifecycle of IT services. Figure 4 illustrates the relationship between the three levels, with data being gathered within the CMDB and feeding through the CMS into the SKMS as information to support informed decision making.
Figure 4 – Components making up the SKMS 
The meshing and synchronisation of these various data sources provides the knowledge required to make service portfolio decisions, review and improve policies, procedures and processes, determine training needs, monitor service level objectives, track and effectively manage assets, and monitor changes. The SKMS is not necessarily a single system – in most cases it will be a federated system based on a variety of data sources. Knowledge management is especially significant within service transition, since relevant and appropriate knowledge is one of the key service elements being transitioned.

Configuration Management System (CMS)

The CMS is, as detailed in , “a set of tools and databases that are used to manage an IT service provider’s configuration data. The CMS also includes information about incidents, problems, known errors, changes and releases; and may contain data about employees, suppliers, locations, business units, customers and users.” The CMS supports SACM by collecting, storing, managing, updating, and presenting data about all CIs and their relationships, and it is used by all ITSM processes. Conversely, SACM requires the use of a supporting system, the CMS. The CMS holds all the information for CIs within the designated scope. Some of these items will have related specifications or files that encompass the contents of the item. Ideally, every identifiable unit of importance can be captured in the CMDB, but realistically, organisations will be driven by their business service needs and bound by priorities, staffing, and technology capabilities.

Configuration Management Database (CMDB)

One of the key processes of ITIL is configuration management, which in ITIL V3 is described as SACM. Briefly, this process is responsible for storing and keeping track of all the resources of the organisation, named CIs. The tool responsible for this is the CMDB, a fundamental component of an ITIL framework . The CMDB has grown in importance through the combination of new technologies, such as application dependency mapping, and the quest for management process improvement. Because the CMDB is a repository of information about IT infrastructure components and applications and the relationships between them, “it represents all of the concepts of holistic service management that so far have proved so difficult to transform into realities” . It helps an organisation to understand the relationships between these components and modify their configuration. At a minimum, a CMDB includes the following elements:
IT infrastructure components – A CMDB stores discovery information about infrastructure components such as servers, storage, desktops, and other workstations, as well as relevant information about their nature and configuration;
Applications and/or services – By the same token, a CMDB contains collected information about the location and configuration of applications, services, and business processes;
Dynamic maps – Maps are the most crucial component; they are representations showing the links between the applications/services and the infrastructure components needed to run them. These maps have to be updated dynamically.
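As a rough sketch of these minimum elements – with invented CI names and attributes, not drawn from any real CMDB product – a CMDB can be modelled as attribute records plus typed relationships, from which the dynamic maps are derived on demand:

```python
# Minimal illustrative CMDB: CI attribute records plus typed relationships.
# The dependency map is computed from the relationships when requested.
class CMDB:
    def __init__(self):
        self.cis = {}        # CI id -> attribute dictionary
        self.relations = []  # (source CI, relation type, target CI)

    def add_ci(self, ci_id, **attributes):
        self.cis[ci_id] = attributes

    def relate(self, source, relation, target):
        self.relations.append((source, relation, target))

    def dependency_map(self, ci_id, relation="runs_on"):
        """Transitively collect every CI that ci_id depends on."""
        seen, stack = set(), [ci_id]
        while stack:
            current = stack.pop()
            for src, rel, tgt in self.relations:
                if src == current and rel == relation and tgt not in seen:
                    seen.add(tgt)
                    stack.append(tgt)
        return seen
```

For example, relating a hypothetical "billing" service to the server it runs on, and that server to its rack, lets the dependency map for the service be computed rather than maintained by hand.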
The CMDB is a fundamental component of SACM and therefore a great enabler of ITSM. Most of the other processes defined in ITIL rely on this repository to retrieve and store the information required to accomplish their own responsibilities . A CMDB is seen as a database, or a set of databases in the case of a federated CMDB , used to store configuration records throughout their lifecycle . The main responsibility of this tool is therefore keeping track of changes in all of the organisation's CIs using some functions (cf. Appendix A). Each CMDB stores attributes of CIs, and relationships between CIs.
Each element in the IT environment is an individual entity requiring accurate capture of its attributes. Attributes are the details that describe the CI, such as its location, serial number or owner. These attributes follow an established pattern but should be defined for each type of CI to best fit the organisation's needs. Relationships describe the dependencies and interfaces that exist between CIs in the infrastructure. These CIs, as seen by ITIL, are components of an infrastructure or items that are (or are to be) under the control of configuration management. CIs may vary widely in complexity, size and type, from an entire system (including all hardware, software and documentation) to a single module or a minor hardware component. CIs should be selected using established selection criteria, grouped, classified and identified in such a way that they are manageable and traceable throughout the service lifecycle. A CI can be physical, that is, real and tangible (e.g., hardware and software code), or it can be a logical abstraction of these (e.g., business processes and distributed applications). The concept of a CMDB has evolved over the years from a collection of isolated data stores  to integrated data stores  to a single, central database , each time becoming more integrated.
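Given such CI records, the verification-and-audit activity described earlier reduces to reconciling the recorded baselines with what a discovery scan actually finds. The following sketch is illustrative only; the function and CI names are invented, not part of any ITIL tooling:

```python
# Illustrative configuration audit: reconcile CIs recorded in a CMDB with
# CIs actually discovered in the environment.
def audit(cmdb_records, discovered):
    """Both arguments map CI identifiers to attribute dictionaries."""
    recorded = set(cmdb_records)
    live = set(discovered)
    return {
        # CIs found in the environment but never authorised/recorded
        "unauthorised": sorted(live - recorded),
        # CIs recorded in the CMDB but no longer physically present
        "missing": sorted(recorded - live),
        # CIs whose documented baseline no longer matches reality
        "drifted": sorted(ci for ci in recorded & live
                          if cmdb_records[ci] != discovered[ci]),
    }
```

A non-empty "unauthorised" list would itself be a trigger for a further audit, matching the audit timing rules listed above.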
The use of computers generates many challenges as it expands the field of the possible in methodical and scientific research, and many of these challenges are common to researchers in diverse areas. The insights achieved in one area may catalyse change and accelerate discovery in many others. The statement that “it is no longer possible to do science without doing computing”  rings true. Originally, there was just experimental science, and then there was theoretical science, with Kepler's laws, Newton's laws of motion, Maxwell's equations, and so on. Later, for many problems, the theoretical models grew too complicated to solve analytically, and people had to start simulating. These simulations have carried us through much of the last half of the last century. At this point, these simulations are generating a whole lot of data, along with a huge increase in data from the experimental sciences. Researchers now do not actually look through their instruments. Instead, they “look” through large-scale, complex instruments, which relay data to datacentres, and only then do they look at the information on their computers . The world of science has changed, and there is no question about this. The new model is for the data to be captured by instruments or generated by simulations before being processed by software, and for the resulting information or knowledge to be stored in computers. Researchers only get to look at their data fairly late in this pipeline. The techniques and technologies for such data-intensive science are so different that it is worth distinguishing data-intensive science from computational science as a new, fourth paradigm for scientific exploration, as stated in .
Computing in the sciences and humanities has been developing at a very fast pace over the past decades. The emergence of new research methods that exploit advanced computational resources, data collections and scientific instruments – in other words, enhanced science, or e-science  – is poised to revolutionise the future scientific discovery process. In the words of John Taylor, who as Director General of the Research Councils coined the term, e-science is “the global collaboration in key areas of science, and the next generation of infrastructure that will enable it”. Taylor also claimed that “e-science will change the dynamic of the way science is undertaken.” Some authors, for example , suggest that the very essence of science is changing, particularly through the employment of electronic networks and high-speed computers – two of the core components of e-science. This transformation is not limited to the natural sciences, where e-science has become, in some countries and disciplines, the modus operandi, but is also penetrating the domains of the social sciences and humanities. E-science includes a very broad class of activities, as nearly all information gathering is computer-based, or uses information technologies for measuring, recording, reporting and analysing. The term e-science is intended to suggest a vision of the future of scientific research based on distributed resources, especially data-gathering instruments, and on international collaborations. As stated in , e-science can be summed up as “IT meets science”.
Scientific progress depends increasingly on sharing knowledge and the results that support it; on sharing ideas and making connections between ideas, people and data; and on interpreting the knowledge generated by others in ways that have not been possible in the past. The term e-science is usually related to the concepts of data-intensive computing and the computational grid, which makes these paradigms inseparable in science nowadays. Alongside these two paradigms, and with the evolution to which technology has been subjected, a very powerful infrastructure is needed to support and sustain e-science. The grid is an architecture designed to bring all these issues together and to make such a vision for e-science a reality. In the field of grid computing, this architecture treats grid technology as a standard and generic integration mechanism assembled from grid services (GS), which are an extension of web services (WS) that comply with additional grid requirements. One of the main goals of the grid is to increase efficiency by sharing the same computers more widely, including computers from different organisations. A grid middleware layer can be employed to facilitate transparent access to distributed data and computational resources. Grid or web services then provide distributed data access, authentication, resource allocation, scheduling services, remote execution, etc. The grid can be a means by which researchers conduct their data analysis. However, for more sophisticated analyses, researchers may need to develop their own specialised programs or scripts to link together various applications. So, in order to fill the obvious need at the top level of the cyberinfrastructure , i.e., where researchers build and run their virtual experiments and analysis pipelines, scientific workflows are emerging as a new modelling and execution paradigm.

2.4.1.

The formal concept of a workflow has existed in the business world for a long time. The Workflow Management Coalition (WfMC) has existed for nearly twenty years and has developed a large set of reference models, documents and standards . According to the WfMC, a workflow is “the automation of a business process, in whole or part, during which documents, information or tasks are passed from one participant to another for action, according to a set of procedural rules”. In other words, it is a precise description of a scientific procedure – a multi-step process that coordinates multiple tasks, acting like a sophisticated script . Each task represents the execution of a computational process, such as running a program, submitting a query to a database, submitting a job to a compute cloud or grid, or invoking a service over the web to use a remote resource. As scientific knowledge, and the number of studies that need to access this knowledge, increases, the complexity of scientific problems is magnified. In order to respond to the challenges imposed every day by science, researchers have started using increasingly complex computational methods. However, the basic scientific method  remains the same (Figure 5). What is changing is that this scientific flow is progressively being transformed by advances in computer science and technology in recent decades. A few of these advances include sensor-based observatories that collect data in real time, supercomputers that run simulations, domain-specific data archives that give access to heterogeneous data, and online interfaces that distribute computational experiments and monitor resources. In particular, researchers and engineers need to spend substantial effort managing data (e.g., scripts that encode computational tasks, raw data, data products, and notes) and recording provenance information . These tasks are not only time-consuming but also error-prone. This push, driven by the complexity of today's problems and the state of the art of computer science and technology, resulted in a group of technologies developed to make the automation of the scientific process more efficient and faster, now called scientific workflows .
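To make the coordination idea concrete, the toy sketch below (a simplified illustration, not any of the workflow systems discussed in this chapter) treats a workflow as a directed acyclic graph of tasks and runs each task once its upstream dependencies have produced their results:

```python
# Toy scientific-workflow runner: tasks form a DAG and are executed in
# dependency order, each receiving the outputs of its upstream tasks.
from graphlib import TopologicalSorter  # standard library, Python 3.9+

def run_workflow(tasks, deps):
    """tasks: name -> callable(upstream: dict); deps: name -> upstream names."""
    results = {}
    for name in TopologicalSorter(deps).static_order():
        upstream = {d: results[d] for d in deps.get(name, [])}
        results[name] = tasks[name](upstream)
    return results

# Hypothetical three-step pipeline: fetch raw data, clean it, analyse it.
tasks = {
    "fetch":   lambda up: [3, 1, 2],          # e.g. query an instrument archive
    "clean":   lambda up: sorted(up["fetch"]), # e.g. normalise the raw data
    "analyse": lambda up: sum(up["clean"]),    # e.g. compute a summary statistic
}
deps = {"clean": ["fetch"], "analyse": ["clean"]}
```

The dataflow orientation discussed below – results flowing from task to task, rather than an explicit control flow – is exactly what the `upstream` dictionary models here.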
A scientific workflow is a particular type of workflow that is directly connected to large-scale, complex e-science applications, such as Astronomy, Chemistry, HEP, or Medical Surgery .
Figure 5 – The basic scientific method
A scientific workflow attempts to capture a series of analytical steps, which describe the design process of computational experiments. Scientific workflow systems provide an environment to aid the scientific discovery process through the combination of scientific data management, analysis, simulation, and visualisation. Scientific work is centred on conducting experiments, so a scientific workflow system should mirror a researcher's working patterns by allowing them to apply their methodology over distributed resources. Figure 6 depicts a high-level view of the scientific workflow lifecycle . A scientific workflow is both a commodity and a source of intellectual capital . The data resulting from workflows, and the workflows themselves, should be usable as a basis for future research, either by the researchers who generated the data or by researchers who intend to develop new research. These workflows may thus be reused, enhanced over time and shared with researchers from the scientific community. For this reason, scientific workflows have to be reproducible, i.e., the information that feeds the workflows must be properly recorded, indicating the origin of the data, how it was handled, and which components and parameters were used. Thus, it is possible for other researchers to replay the experiment, confirming the results or producing new outputs from the previous data. Similar to the business domain, the scientific workflow landscape is crowded with different languages and frameworks allowing researchers to automate tasks as workflows. Some of the most popular workflow systems are: Taverna, Kepler , Triana , Planning for Execution in Grids or Pegasus , GridNexus  and DiscoveryNet .
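The reproducibility requirement above amounts to recording, for every executed step, what went in, which parameters were used, and what came out. A minimal sketch follows; the field names are invented for illustration, not taken from any provenance standard:

```python
# Minimal provenance record for one workflow step: content hashes of the
# inputs and outputs plus the parameters, enabling later verification.
import hashlib
import json

def provenance_record(step, inputs, params, outputs):
    def digest(obj):
        # Canonical JSON so logically equal data always hashes identically.
        return hashlib.sha256(
            json.dumps(obj, sort_keys=True).encode()).hexdigest()
    return {
        "step": step,
        "params": params,
        "inputs_sha256": digest(inputs),
        "outputs_sha256": digest(outputs),
    }
```

Replaying a step on the recorded inputs and comparing the output hashes then confirms (or refutes) the original result, which is the essence of the replay scenario described above.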
Figure 6 – Scientific workflow lifecycle
Business workflow vs. Scientific workflow
Business workflow management and business process modelling are mature research areas, whose roots go back to the early days of office automation systems. Scientific workflow management , on the other hand, is a much more recent phenomenon, triggered by a shift towards data-intensive and computational methods in the natural sciences, and the resulting need for tools that can simplify and automate recurring computational tasks. Scientific and business workflows began from the same common ground, with differences and similarities in many ways. From the point of view of an end user, both scientific and business workflow management systems provide means to (i) model and specify processes with design primitives, (ii) re-engineer developed processes, for example through verification and optimisation, and (iii) automate the execution of processes by scheduling, controlling and monitoring the tasks . Business workflows coordinate activities carried out by human or computer agents in an administrative context. They concern agents, roles, manipulated objects (resources) and, especially, the partial order or coordination among activities. The use of business workflows is prevalent in the insurance, banking, and health industries. When workflows are linked to scientific research rather than the business place, another approach is required, one that includes support for large-scale and complex environments, fault tolerance, and the ability to maintain and meet the demands imposed by scientific processes. Similarly, in order to allow researchers to validate their research and the corresponding hypotheses, the components of the workflow must comply with the background of the specific domain. The process of constructing a workflow that enables this type of validation is therefore built incrementally, unlike the business-oriented approach, where the workflow is first designed and then implemented.
Consequently, and since the validation of scientific hypotheses depends on the experimental data, the scientific workflow tends to have an execution model that is dataflow-oriented, whereas business workflow places an emphasis on control-flow patterns and events. According to , scientific workflow management systems are those that provide comprehensive answers to typical provenance questions, such as: what inputs were used to create the final product? Intuitively speaking, researchers care a great deal about the intermediate steps, data and results of a scientific process. In business workflow, business people want to know which parts of their processes can be optimised based on previous runs in order to reduce maintenance costs. To sum up,  compares the features of scientific workflows and business workflows. Despite the fact that there are few sharp, categorical boundaries, the author states that the comparison in Table 13 (cf. Appendix B) should help in assessing commonalities and typical differences between scientific workflows and business workflows.

2.4.3. Collaborative e-science experiments

As noted in , “collaborations become necessary whenever researchers
wish to take their research programs in new directions” . As a result, innovations and advances that were not possible within one laboratory working in isolation are now emerging from collaborations and research teams that have harnessed techniques, approaches, and perspectives from multiple scientific disciplines. The progress of Internet and grid technologies has positively reinforced scientific exchange. Increasingly, complex experiments require resources that span geographical boundaries, so that the e-infrastructure  becomes a reality. Considering that scientific workflows integrate the logic of experiments, the sharing of resources and the possibility for researchers from around the world to work together on the same experiment are essential to foster knowledge transfer and accelerate the development of scientific experiments. Although global scientific collaboration takes many forms, the various initiatives around the world agree on the need for open access – or at least free access – to the data underlying published research, and to communication tools . These collaborations also increasingly span institutional and national boundaries. For example, the percentage of papers that involve international collaborations increased from 9% in 1983 to 22% in 2001 , with mathematics and physics leading the way. This increase in team science has been driven by a variety of factors, including growing interest in scientific problems that span disciplines (e.g., mapping the human genome or studying global climate change); advances in communication and transportation technologies that make remote collaborations easier to sustain; and government policies that encourage collaboration, especially between universities and organisations .
More recently, studies based on the ISI Web of Knowledge show that 27% of Washington State University's (WSU) peer-reviewed publications in agriculture, chemistry, plant sciences, engineering and other science-related disciplines are the result of international collaborations (Figure 7). WSU's joint publications with international researchers have increased eight-fold, from 112 in 1992 to 805 in 2011. In the past two decades, WSU has published with researchers from over eighty countries.
Figure 7 – WSU's publications with international collaborators
As modern research methods have become more specialised and the true complexity of today's problems has been revealed, collaborations across fields have become essential for exploring and tackling these new problems. Moreover,  states that this specialisation of research methods has made interdependence, joint ownership, and collective responsibility between and among researchers near-requirements.

2.4.4. Open data in collaborative e-science experiments
Science is based on building on, reusing and openly criticising the published body of scientific knowledge. For science to function effectively, and for society to reap the full benefits from scientific endeavours, it is crucial that the output of scientific research be handled as open data. Open data is the principle that certain data should be freely available to everyone to use and republish as they wish, without restrictions from copyright, patents or other mechanisms of control. The idea behind open data has long been established, but the term "open data" itself is quite recent, gaining popularity with the rise of the Internet and, especially, with the launch of open-data government initiatives such as Data.gov. Thus, in this context of e-science, "open data in science" is an emerging term in the process of defining how scientific data may be published and reused without price or permission barriers. Researchers generally see published data as belonging to the scientific community, but many publishers claim copyright over data and will not allow its reuse without permission. Following a request and an intense discussion with data-producing institutions in member states, the Organisation for Economic Co-operation and Development (OECD) published in 2007 the OECD principles and guidelines for access to research data from public funding
as a soft-law recommendation. Other projects are being developed in this area, for instance the Opportunities for Data Exchange (ODE) project. Data sharing can be performed through data referencing or data citation, which is the practice of providing a reference to data in the same way as researchers routinely provide a bibliographic reference to printed resources. The need to cite data is starting to be recognised as one of the key practices underpinning the recognition of data as a primary research output, rather than as a by-product of research . While data has often been shared in the past, it is seldom cited in the way a journal article or other publication might be. This culture is, however, gradually changing. If datasets were cited, they could achieve a validity and significance within the scholarly communications cycle. Data citation could enable recognition of scholarly effort in disciplines and organisations that want to acknowledge and reward data outputs. Many academic libraries and institutional repositories are poised on the cusp of housing datasets, placing them in a position where they will need to engage with the challenge of defining and managing access to datasets . This pressure to archive data, and then to link scholarly publications to it, has been growing. According to , this is based on arguments that the data is essential for establishing validity, reproducibility and replicability. There are many types of persistent identifiers that can be used to uniquely identify data and datasets. A digital object identifier (DOI)  is a robust persistent identifier, which will be available through the Australian National Data Service (ANDS) “Cite My Data” services from the end of 2011. It has been recognised that unique identifiers are essential for information management in any digital environment .
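The syntactic structure behind a DOI is simple: a prefix starting with the directory indicator "10." followed by a registrant code, then a slash, then a registrant-assigned suffix. The helper below is an illustrative sketch of that structure, not part of the DOI system's own software:

```python
# Illustrative DOI parser: splits "<prefix>/<suffix>" and builds the
# conventional resolution URL via the doi.org proxy.
def parse_doi(doi):
    prefix, sep, suffix = doi.partition("/")
    if not sep or not prefix.startswith("10.") or not suffix:
        raise ValueError(f"not a well-formed DOI: {doi!r}")
    return {
        "prefix": prefix,               # directory indicator + registrant code
        "suffix": suffix,               # assigned by the registrant
        "url": f"https://doi.org/{doi}",  # proxy resolution on the web
    }
```

Note that the resolution URL is only one access route; as discussed below, the DOI name itself is independent of any particular network or protocol.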
In order to discover, retrieve, manage, and trade the vast array of creative works that are becoming available in the digital domain, a way to refer to them unambiguously, by means of unique identifiers, is required. Persistence, interoperability and uniqueness are the characteristics that define the DOI. A DOI is sometimes compared to a uniform resource identifier (URI) – the generic term for all types of names and addresses that refer to objects on the Web, of which a URL is one kind – but it is a false comparison. The World Wide Web (W3) is a communication medium, and a highly successful one; it is not an information management system (for example, it has not made databases obsolete). The DOI system, especially as it has evolved, has much more in common with an information management system, inventory system or distributed database than it does with web publishing. The DOI system is not solely designed for use on the W3; the same functionality can be made available through any digital network and protocol .

2.4.5.
Many research organisations are entering a new era and are considering protecting their own inventions and engaging in research with other organisations, both public and private. These new relationships, often based on collaborative research agreements, may require
precise documentation of certain activities and results. Laboratory and research practices will frequently need to be carefully formalised and noted in ways that will allow future intellectual property (IP) auditors to review the authenticity of results and certify the dates of occurrences . Such practices are important for patenting possible discoveries made by these institutions or by their collaborators. Moreover, the ability to continue the unfinished work of a researcher, or to reproduce the steps of a certain experiment (and therefore allow knowledge to pass between researchers), are objectives that can be achieved with the aid of laboratory notebooks. A laboratory notebook is an important tool that goes well beyond research management. In short, a laboratory notebook can be described as a daily record of every experiment that a researcher does or plans to do, in which his/her thoughts about each experiment and its results are documented. Likewise, the laboratory notebook is the basis of every paper and thesis that a researcher writes, and it is the record used by patent offices and, in the case of disputes, by courts of law. However, recent scientific and information technology advances have resulted in the mass production of data, and the subsequent increased use of electronic data storage. While electronic data in general solves storage concerns, there has been no complementary advance in documenting the experimental procedures and the analysis of that stored data. Indeed, the potential risk of misplacing important data files in a mountain of paper or even electronic files increases with the growing volumes of data. Whether in Good Lab Practices (GLP), Good Manufacturing Practices (GMP), Good Clinical Practices (GCP), or Research and Development (R&D) labs, the paper-bound laboratory notebook (PLN) has always been the natural choice for documenting day-to-day lab work.
But with the advent of new technologies, the days of the PLN are coming to a close as the days of the ELN begin to dawn . The ELN is the solution organisations have adopted to help researchers keep pace and to address the question of how to most efficiently input, store and retrieve experimentally relevant information. An ELN provides an attractive solution to the information recording, storage and retrieval issues surrounding PLNs, and to the management of large volumes of data. ELNs allow organisations to efficiently leverage new experimental knowledge for informed decision making, at both the laboratory and management levels. The Collaborative Electronic Notebook Systems Association (CENSA), an international trade association for ELN technology and standards, defined an ELN as “a system to create, store, retrieve, and share fully electronic records in ways that meet all legal, regulatory, technical and scientific requirements” . An ELN is not just a replacement for a PLN; it is a new method of storing and organising data while maintaining the data entry flexibility and legal
recording functions of a PLN. An important point is that ELNs are really a new form of record keeping. The initial drive for the development of ELNs came from the field of Chemistry, perhaps driven by the early adoption by chemists of computer technologies for drawing chemical structures and for storing and searching them graphically within databases. There was, however, a parallel interest in ELNs in Biology, which was initially driven by the volume of data being managed in Bioinformatics.

Table 2 – Primary market audience choices

Category    Description
R&D         Indicates general research note-taking capabilities
Biology     Indicates domain-specific capabilities
…           Lists built-in compliance with regulatory requirements
…           Indicates cross-functional capabilities
In , a review of ELNs available on the market was conducted, identifying thirty-five potential ELNs. In brief, Table 2 depicts the categorisation of the reviewed ELNs. Appendix B contains a list and description of the ELN solutions currently on the market.

2.5. Data Management Plan (DMP)
Research DM (RDM) has become an international topic of concern to researchers and their funders. As in the case of ELNs, where the main concern of researchers and organisations is to store and organise data while maintaining the data entry flexibility and legal recording functions of a PLN, these same bodies have realised that the motivation for better management comes from the recognition that the best return on the investment made in data acquisition can only be realised by using the data in the most effective way and by maximising its sharing and reuse. As such, it is important that researchers organise and archive past, present, and future data for other researchers to build upon, with full acknowledgment of the original efforts of the investigators. Not only does this benefit the research community, but it can also be self-serving, because a clear recollection of a researcher's own data benefits his/her future work. This is the idea underlying a DMP. An effective DMP includes protocols for recording raw data, maintaining data, analysing data, and disseminating data (in the form of publications and the archiving of raw data and metadata).
Most research groups already conduct many of the steps in a typical DMP, although the practices tend be informal and not uniform among all members of the research group. Major funding agencies including National Science Foundation (NSF) 19 , recognize the value of formalizing this process and are now requiring a DMP for all proposals, suggesting that “Beginning January 18, 2011, proposals submitted to NSF must include a supplementary document of no more than two pages labelled DMP. This supplementary document should describe how the proposal will conform to NSF policy on the dissemination and sharing of research results. Proposals that do not include a DMP will not be able to be submitted.” 20
Another example is the Digital Curation Centre (DCC) that views plans submitted in grant proposals as preliminary outlines , which should then be developed into more coherent processes and procedures at the outset of research. In addition, many journals now require that raw data and metadata accompany publications . Having a detailed DMP will facilitate transfer of scientific data and ensure compliance with DM policies set by funding agencies and journals. To sum up, a DMP is a formal document that outlines what researchers will do with data during and after they complete the research. It describes data and metadata that will be gathered in a study. It also includes details about the processes for preserving and sharing the data and indicates which procedures will be needed to access and use it. A DMP helps the researcher map out which data and metadata to collect, and how to organize and store it, based on the ways the data will be used and shared in the future. It is fundamental when developing a DMP for researchers to critically assess what they can do to share their research data, what might limit or prohibit data sharing and whether any steps can be taken to remove such limitations. 2.5.1.
The importance of a DMP for the research community
Concerns about DM have increasingly been perceived, in particular by those who deal with large amounts of data in the new paradigm of e-science. Besides the experiment itself, researchers need to address a number of issues before, during and after a scientific research project, such as identifying the nature of the research products (characteristics of the data, samples, physical collections and other materials to be produced in the course of the research), sharing the data and research products (experimental materials resulting from research that will be available to others), providing access to data (the ways that researchers will be able to obtain the data and research products) and archiving the data (the long-term storage and organisation of the data and research products). DM accomplished through a DMP will not only help the researcher save time, but will also help, throughout the data lifecycle, to: increase the impact of research through data citation; clearly document and provide evidence for research in conjunction with published results; comply with sharing mandates and meet copyright and ethical requirements; preserve data in the long term, safeguarding the researcher's investment from loss; support open access and sharing with others; and further benefit interdisciplinary research. Therefore, international research funders and research groups are increasingly aware of these practices. Whereas each funder specifies particular requirements for the content of the plan, common areas are:
Which data will be generated during research;
Metadata, standards and quality assurance measures;
Plans for sharing data;
Ethical and legal issues or restrictions on data sharing;
Copyright and intellectual property rights (IPR) of data;
Data storage and backup measures; and
DM roles, responsibilities, costing or resources needed.
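The common areas listed above lend themselves to a simple machine-readable checklist. The sketch below (in Python, with field names invented purely for illustration; no funder mandates this structure) shows how a research group might verify that a draft plan covers every area before submission:

```python
# Hypothetical, minimal machine-readable DMP outline.
# Section names are illustrative, not mandated by any funder.
DMP_SECTIONS = [
    "data_generated",          # which data will be generated during research
    "metadata_and_standards",  # metadata, standards and QA measures
    "sharing_plans",           # plans for sharing data
    "ethical_legal_issues",    # ethical/legal restrictions on data sharing
    "copyright_and_ipr",       # copyright and IPR of data
    "storage_and_backup",      # data storage and backup measures
    "roles_and_resources",     # DM roles, responsibilities, costing
]

def missing_sections(plan: dict) -> list:
    """Return the DMP sections that are absent or left empty."""
    return [s for s in DMP_SECTIONS if not plan.get(s)]

draft = {
    "data_generated": "Reconstructed and simulated event datasets",
    "sharing_plans": "Deposit in institutional repository after publication",
}
print(missing_sections(draft))
```

A completeness check of this kind surfaces the areas a funder's review would flag as gaps, before the proposal is submitted.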
Implementing DM measures during the planning and development stages of research will avoid later panic and frustration. Many aspects of DM can be embedded in everyday aspects of research co-ordination and management and in research procedures. Good DM does not end with planning. It is critical that procedures are put into practice in such a way that issues are addressed when needed, before mere inconveniences become insurmountable obstacles. Researchers who have developed DM and sharing plans found it beneficial to have thought about and discussed data issues within the research team. In the first instance, a DMP will be developed to the extent necessary to conform to the funder's requirements and to support the researcher in planning and performing the analysis. Each DMP produced should have two functions. On the one hand, it acts as a guide to researchers on reusing existing data, repurposing their own data and supporting data reuse throughout the research activity. On the other hand, it acts as a record of how the data have been reused and repurposed, where applicable, and how data reuse has been supported. Further, the DMP will provide information about the location, accessibility and ownership of the data associated with a project, as a means of supporting the principal motivation of RDM, which is to promote secondary use.
DMP international practices
This kind of practice is quite usual at the international level, in universities, research institutes, scientific communities, etc. For example, the Australian National University (ANU) created a DMP Manual (cf. Appendix D), which addresses issues related to DM, its methods and benefits, the DM services of ANU, as well as a recommended structure for a DMP. Directly linked to ANU, the ANDS has the objective of helping researchers and research organisations by leading the creation of a cohesive national collection of research resources and a richer data environment that will make better use of Australia's research outputs, enable Australian researchers to easily publish, discover, access and use data, and enable new and more efficient research.
ANDS provides a set of services related to DM, including “Cite My Data”, “Publish My Data”, “Register My Data”, “Identify My Data” and “Controlled Vocabulary”. The ANDS “Cite My Data” service allows research organisations to assign DOIs to research datasets or collections. The DOI system supports the citation of research data in scholarly communications and research data collaborations. Intended for individual researchers, “Publish My Data” lets you easily publish an individual collection of research materials, with basic metadata. ANDS only registers the description of your collection; ANDS does not store the collection itself. “Register My Data” is a more advanced service, allowing full control over metadata. Descriptions of registered collections are published in a number of discovery services and e-research applications. The “Identify My Data” service allows you to persistently identify your data. The service enables you to create a clickable reference to your dataset that will not be broken when the location of the dataset changes. The ANDS “Controlled Vocabulary” prototype service will allow research organisations to create, manage and query "controlled vocabularies" relating to their research datasets. The service is designed for Australian organisations managing research datasets that require the use of controlled vocabulary terms in their data or metadata values. The Joint Information Systems Committee (JISC) 23 conducted a project whose report concerns, among other subjects, the Laser Interferometer Gravitational Wave Observatory (LIGO) 24 DMP, and pulls together whatever information is available on the estimation of DP costs. Another example is a report on RDMP for the Department of Mechanical Engineering at the University of Bath 25 . This report describes the concept of DMP and specifies how projects should use DMPs: where they should be stored, how open they should be, expectations for review and revision, and so on. Further, it provides a template for project DMPs and recommends tools that can be used for DM, along with additional guidance that may be consulted. On the other hand, and being more concrete in terms of the artefacts required to establish a DMP, there are examples such as the National Climate Change and Wildlife Science Centre 26
(NCCWSC) and the University of Western Australia (UWA) 27 . In the first case, a report on the requirements for NCCWSC/CSC DMPs addresses concerns related to the utility of a DMP and provides a description of, and guidance on, the contents that a DMP should include. In the second case, the UWA created a template for researchers to fill in, covering subjects related to the research such as a project overview (e.g., name of the project, fund source, etc.) and documentation and metadata (e.g., collection methods, description of data, etc.), among others. An extensive survey on the practices of DM through DMPs in research funders in the UK and in universities/research organisations all over the world (Australia, UK, US, etc.) has also been performed.
23 http://www.jisc.ac.uk/
24 https://dcc.ligo.org/public/0009/.../LIGO-M1000066-v17.doc
25 http://www.bath.ac.uk/
26 https://nccwsc.usgs.gov/
27 http://www.uwa.edu.au/
Research problem and goals
The need for better management of research data is progressively being recognised, not only by those who collect and generate data for first research use, but also by those who have a research funding or governance role. Particularly in the area of e-science, the need arises for management of the activities undertaken during the research process. This need emerges from the collaborations carried out between various scientific communities, where a plan is required that describes what data will be created, what policies will apply to the data, who will own and have access to the data, what DM practices will be used, what facilities and equipment will be required, and who will be responsible for each of these activities. This is a process that must be carried out before and after research is initiated, in order to keep the state of the research up to date. Moreover, during a research project there is the need to keep an updated record of the activities being developed. This record should be a set of specifications of the infrastructure used and of the steps and activities performed by the researcher during a data analysis. Consequently, the purpose of this work is to manage the infrastructure so as to create an environment in which a researcher can work and develop his/her research analyses, as part of international collaborations. Two approaches were followed to address this problem: the concepts of DMP and logbook. In a nutshell, the goals of this work are:
I. Identify the main concepts to address the problem of preservation of the infrastructure in e-science scenarios;
II. Describe the scenario of a research institution, both at the infrastructure level and in the research process conducted during a data analysis;
III. Propose a set of guidelines and recommendations for the creation of a DMP;
IV. Propose a mechanism that helps researchers to undertake the management of the infrastructure of a data analysis, based on the concept of logbook.
The need for better DM in scientific institutions
Concerns about DM have been increasingly perceived, especially by the scientific community, in particular by those who deal with large amounts of data in this new paradigm of e-science. Besides the experiment itself, researchers need to address a number of issues before, during and after a scientific research project, such as identifying the nature of the research products, sharing the data and research products, providing access to data and archiving the data. Data integrity, validation and security, access, curation and preservation are recognized as essential parameters to ensure the quality and value of research in most disciplines. A clear definition of the responsibilities involved in DM, from inception to preservation, is another element of weight that should be shared among researchers and their research groups, as well as by research funders and research institutions.
In order to recognise and manage these responsibilities, research institutions require a strategy that assists them in planning and developing their DM approach. As a result, in the last few years many international research funders have introduced, within their data policies, a requirement for DM and sharing plans to be part of research grant applications. These research funders also expect evidence of formal DM planning as early as the initial bid for funds, with detailed plans being required early in the management of a funded project. Figure 8 depicts a conceptual map with the context and domain of DMP.
Figure 8 – The main concepts of the context and domain of DMP
As mentioned in section 2.5, there is a concern not only by the funding bodies but also by research institutions in possessing a DMP. This DMP may assume various structures, according to the research area and the type of data.
The preservation of HEP data
The large datasets generated over several years by data acquisition systems (e.g., particle detectors) are the inheritance of HEP. These datasets provide a unique opportunity to develop future scientific studies. However, it is necessary to preserve these datasets and create a plan to fit them into a specific context. The result is that HEP data analysis becomes quite complicated, since it depends on a set of factors based on both the infrastructure and the activities developed by researchers. This is because interpretation of these data is heavily dependent on software, the use of which requires detailed experimental knowledge, which it may be infeasible to preserve. HEP data is not expected to be generally intelligible for very long: two or three-decade-old data might potentially be useful or intelligible, but much beyond that would count as archaeology. In most cases, the data generated in HEP is considered unique. This consideration is made due to the enormous effort involved in preparing an experiment involving several scientific communities. So, from the moment that the data is lost or unusable,
it is quite complicated to recover it. The cost of maintaining this inheritance through collaboration oriented to data preservation would not be very high when compared with the cost of producing or reproducing the large international projects from which the datasets are obtained. The preservation of HEP data and the planning of research may be a secure way to allow researchers to complete their analyses and seize new opportunities for future scientific research. The HEP community (as well as other scientific communities) would benefit from this type of preservation since, for example, it would be a possible way to avoid duplication of data by combining preservation with the integrated management of data analysis. Due to all the factors that make data recovery almost impossible when there is no concern for preserving the experiment, there is a need to plan and manage the experiments according to the type of research being performed. Long-term preservation of HEP data is crucial to preserve the ability to address a wide range of scientific challenges and questions long after the completion of the experiments that collected the data. In many cases, these data are, and will continue to be, unique in their energy range, process dynamics and experimental techniques. New, improved and refined scientific questions may require a re-analysis of such datasets. The main scientific reasons for preserving not only the data itself but also the ability to analyse and manage it are the long-term completion and extension of scientific programs, cross-collaboration analyses and data reuse.
Long-term completion and extension of scientific programs
Long-term completion and extension of scientific programs involves the natural continuation of the physics program of the individual experiments, although at a slower pace, to ensure a full exploitation of the physics potential of the data, at a time when the strength of the collaboration has diminished. It is estimated that the scientific output gained by the possibility of maintaining long-term analysis capabilities represents roughly 5% to 10% of the total scientific production of the collaborations. More important than the sheer number of publications is the nature of these additional analyses. Typically, these analyses are the most sophisticated and benefit from the entire statistical power of the data, as well as the most precise data reprocessing and control of systematic effects.
Cross-collaboration analyses
The overall analysis of various experiments as a whole makes it possible to discover new scientific opportunities, to reduce the uncertainties of individual experiments and to develop entirely new experiments, which would otherwise be almost impossible. Indeed, innovative combinations of experimental results were performed, for example, at the Large Electron Positron Collider (LEP), the Hadron-Elektron-Ring-Anlage (HERA) or the Tevatron. These experiments were conducted through collaborations, providing new insight into the results obtained in previous experiments and opening up new sights for future scientific discoveries. Through the preservation of datasets, it is possible to increase the potential of the
experiments carried out by scientific communities. Likewise, documentation and DM, including issues related to organisation, maintenance and sharing under the charge of researchers, expand opportunities for combinations between experiments that would otherwise be hindered by technical and scientific problems. For this, there needs to be a concerted effort on the part of funding agencies and programmes, national policies, technology trendsetters and the researchers themselves towards the standardization and documentation of a plan covering the datasets produced, in order to preserve them for a long time. Consequently, the return would be experienced by these entities, facilitating data sharing and the proper management of data analysis for present and future collaborations.
Data reuse
Several scientific breakthroughs can be achieved by reusing data from previous experiments. For example, new theoretical developments may lead to new predictions of scientific phenomena that were not considered when the data were analysed. The discovery of new phenomena may also require re-analysis of the data, without it being necessary to rebuild the whole experiment. Likewise, new experimental knowledge (for example, new analysis techniques and Monte Carlo simulation models may create the opportunity to reprocess existing data and obtain higher-precision results) may allow a preserved data analysis to be improved, with much higher potential than it had when it was published. Similarly, results obtained in future experiments may have to be re-analysed due to inconsistencies in physical data analysis or due to ideas that arise in contexts that are only available in former experimental data.
The need for process preservation
One of the most challenging problems faced by researchers today is the DP problem, i.e., how to ensure that digital data being stored today can be read and interpreted many years later. There are several examples of applications and activities that require long-term preservation, partially driven by compliance and regulation. These include: medical retention regulations in the healthcare industry; pharmaceutical companies that need to preserve their research, development and filing application records for decades; aircraft design records in the aerospace industry; and many more. One of these examples is Particle Physics, where research is increasingly carried out in collaborative environments among different scientific communities. The data and results produced by an experiment are valid and retain their value as information and knowledge even after the end of the collaborations. Thus, there are many cases where data may still be useful after a collaboration has ended, namely:
Evolutions in theories can lead to new predictions of physics phenomena that were not considered when the data was analysed. Experimental evidence for them can be searched for inside this data without the costs associated with building a new experiment;
(Monte Carlo experiments are a class of computational algorithms that rely on repeated random sampling to compute their results; they are often used in computer simulations of physical and mathematical systems.)
The discovery of new phenomena may demand a reanalysis of existing data in search for things not known at the time of collection, thus enabling cross checking the results;
New ideas for studies may appear in contexts only available in older experimental data;
Combined analysis joining data from several experiments offers the possibility to reduce statistical and/or systematic uncertainties, or even to perform completely new analyses.
So, these cases may require access to past analysis processes. Particularly in Particle Physics experiments, besides the analysis of research data, the analysis related to research and development of the experimental apparatus is also very important. These procedures must be preserved because they document the knowledge behind many strategic and technical decisions that influence the construction and expected capabilities of each experiment. The analysis processes performed on the data are valuable assets of knowledge, in which parts of the scientific process that must be preserved are captured. Likewise, these processes provide the context for the creation of a research analysis. Possible reuses of these processes include: analyses in the same research group and in the same research project, where it is sometimes necessary to change certain parameters or input data; analyses by other researchers, in other institutions, but with similar research processes; or analyses conducted by researchers who do not have a direct relationship to the analysis process, but may benefit from the results obtained during the research analysis. In addition to reuse, the preservation of analysis processes is of utmost importance to ensure the authenticity and provenance of the results obtained during scientific analysis. Provenance permits the tracking of the history and ownership of the experiment, and authenticity is a major factor in the validation of scientific experiments.
Large international scientific collaborations in Particle Physics
Particle Physics studies the basic constituents of matter and their interactions. Particle Physics theoretical models, like the standard model, attempt to explain nature mathematically and, by doing so, predict certain physical processes and behaviours. The goal of experimental Particle Physics is to search for, observe and measure real physical processes, which may validate the current theories or raise new questions leading to new theories. The core of Particle Physics research takes place deep inside matter, where researchers look at its most basic constituents and study how they interact. There are several approaches for doing this, which allow studying different aspects of these processes. In any case, Particle Physics experiments require the construction and management of very complex instruments, whose development can take many years (e.g., ATLAS and CMS). These instruments (such as detectors), jointly with acquisition systems, process and analyse structures of data as part of the design of the experiment. Due to the high effort required for creating an environment conducive to the development of an experiment, most of
these experiments require a collaborative effort at the international level, in the context of large scientific collaborations composed of several research groups. Due to all the special features that characterize an experiment, it is unlikely that the data obtained in a past experiment can be fully reproduced in the near future. Particle Physics is a discipline in which the complexity of the experiments carried out is reflected in the data, software and analysis processes developed. This causes a problem related to the difficulty of recovering past scientific work. Concerning data, it is necessary to distinguish, at the outset, the data types that are used by researchers to conduct their work: the data acquired from the experimental apparatus, also called raw data, and the data produced by Monte Carlo simulations according to the expected theoretical models. The raw data must be reprocessed to produce data with physical meaning. The process of reconstructing the data consists of converting the electronic signals obtained from the experimental apparatus into measurements of the physical quantities of the particles used in the analysis. From this reprocessing, several types of datasets are produced for analysis, including products generated by the central experimental collaboration. Afterwards, the products are made available to researchers to perform their analyses. Figure 9 describes the general flow of the data, from its production in world centres for scientific research using experimental apparatus, to its distribution among different experimental laboratories, where researchers develop local analyses. Along the data processing flow, several different types of information are produced and required for the next step, for instance calibration data, detector conditions, geometrical alignment of the detector, particle beam characteristics and many others.
Another variable that makes the recovery of the analysis process more complicated is the use of grid computing. While Biomedicine and Geoscience use grids to bring together many different sub-disciplines, Particle Physics uses grid computing to increase computing power and storage resources, and to access and analyse vast amounts of data. For example, the Large Hadron Collider (LHC) will generate 50 to 100 PB of data each year, with about 20 PB of that data stored and processed on a worldwide federation of national grids linking 100,000 CPUs. Additionally, Particle Physics experiments must simulate and interpret these gigantic amounts of data. It is important to understand the boundary between the central data production by the experiments and the local data analysis processes. In the first case, the output data is beyond the scope of researchers, since it is considered a black box operated by major research institutions (e.g., CERN). In the second case, the analysis is carried out by researchers in their local research environment. This type of analysis is described as local analysis, to stress the fact that the analysis workflow is performed within the scientific research institution.
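The scale of these figures can be made concrete with a back-of-envelope calculation, using only the rough values quoted above:

```python
# Rough, illustrative arithmetic based on the figures quoted in the text:
# ~20 PB/year stored on a federation of grids linking ~100,000 CPUs.
PB = 10**15  # petabyte, in bytes (SI)

stored_per_year_bytes = 20 * PB
cpus = 100_000

# Average stored data volume associated with each CPU per year.
bytes_per_cpu = stored_per_year_bytes / cpus
gb_per_cpu = bytes_per_cpu / 10**9

print(f"~{gb_per_cpu:.0f} GB of stored data per CPU per year")
```

Even averaged over a hundred thousand CPUs, each one is associated with hundreds of gigabytes of stored data per year, which illustrates why the storage side of the grid is as critical as the computing side.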
[Figure 9 diagram: Particle Detectors and Data Acquisition Systems feed Experiment Collaboration Data Production at World Centres for Scientific Research; the Central Production Controlled by Experiment stores Reconstructed Data, Detector Info Data, Simulated Data and Software in Large Data Centers, from which Individual Laboratories and Research Teams draw.]
Figure 9 – General experiment dataflow
During the entire process that runs from data generation (raw data) through the experimental apparatus to the data processing within local analysis, a number of products are generated that are very useful to researchers during their analyses. However, one key aspect of the creation of scientific knowledge lies in the analysis process. This is a process developed at the local level (by students and/or researchers), in the context of theses, using data generated by international collaborations. The analysis process consists of a set of iterative activities performed by a researcher, applying a set of analytical methods to the data. Due to this iterative nature, each analysis is refined and improved over the different steps carried out, each of which depends on the results of the previous step. An analysis process can take several years, which raises preservation issues: the environment of the analysis evolves over time, and the reproduction of certain activities may become difficult due to changes in the environment (software, hardware, versions, libraries, etc.). Added to this, Particle Physics is considered a uniquely creative and intellectual discipline. For this reason, there is no systematic and automated way to perform data analysis, which further increases the need to preserve all activities and steps of the analysis deemed relevant. The iterative nature that characterizes Particle Physics is common to many other scientific fields that have similar patterns and, as such, suffer from the same problems regarding the preservation of data analysis. Therefore, the analysis that will be performed at an e-science institution, in the field of Particle Physics, can be extrapolated to other similar areas, as can the proposals presented further on.
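The preservation issues described above motivate the logbook concept introduced earlier among the goals of this work. As a minimal sketch (the entry fields below are assumptions chosen for illustration, not the mechanism this dissertation proposes), each analysis step could be recorded together with its parameters and a snapshot of the software environment in which it ran:

```python
import json
import platform
import sys
from datetime import datetime, timezone

def logbook_entry(step: str, parameters: dict) -> dict:
    """Record one analysis step with a snapshot of the execution environment."""
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "step": step,
        "parameters": parameters,
        "environment": {
            # Captured automatically, so later reproduction attempts know
            # which interpreter and platform produced each result.
            "python": sys.version.split()[0],
            "platform": platform.platform(),
        },
    }

# Example: record a (hypothetical) selection step of a local analysis.
entry = logbook_entry("apply kinematic cuts", {"pt_min_gev": 20.0, "eta_max": 2.5})
print(json.dumps(entry, indent=2))
```

Because each entry is timestamped and environment-stamped at the moment the step runs, a sequence of such entries documents both what was done and under which conditions, which is precisely the information that is hardest to recover years later.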
The LIP scenario
This section presents an analysis of LIP, the e-science institution that will be considered as part of the validation of this work and where the requirements elicitation was performed, together with its concerns relating to DP. It also identifies the relevant problems at this institution that can be resolved based on the analysis performed in the state of the art.
Organisation description
LIP is a scientific and technical association of public utility whose goal is research in the fields of experimental HEP and associated instrumentation. LIP's research domains have grown to encompass experimental HEP and astroparticles, radiation detection instrumentation, data acquisition and data processing, advanced computing and applications to other fields, in particular Medical Physics. The main research activities of the laboratory are developed in the framework of large collaborations at CERN and at other international organisations and large facilities in Europe and elsewhere, such as ESA, SNOLAB, NASA and AUGER. In its three laboratories in Coimbra, Lisboa and Minho, LIP has about 170 people, including 70 PhDs, many of whom are professors at the local universities. LIP was created in May 1986, simultaneously in Lisboa and Coimbra. The birth of LIP merged and boosted the efforts of an embryonic community of experimental particle physicists. CERN was the first international scientific organisation that Portugal joined. The history of LIP is thus an unavoidable element of the history of scientific research in Portugal. In particular, LIP appears with high relevance in the chapters devoted to internationalization and to the great development of advanced training in the last decades. In 2001, LIP became an associate laboratory of the ministry of science, technology and higher education. Through LIP, Portugal has been in the first row of the great Particle Physics projects of the last decades. Its research domains today include experimental Particle and Astroparticle Physics, detector development and the associated instrumentation, applications to medical physics and advanced computing.
LIP is a technical and scientific association which aims to conduct research in the field of experimental HEP and associated instrumentation. Consequently, it can be considered an institution belonging to the e-science domain. Thus, as is common in such institutions and at LIP specifically, the interest in long-term preservation and recovery of the Particle Physics analyses carried out by its researchers is a priority. Apart from that, it is necessary to preserve all the steps and activities of an analysis in order to facilitate its repetition, if and when the researcher needs it. On the other hand, LIP would benefit from implementing a DMP, since it would cover all
32 http://www.lip.pt/
33 http://www.esa.int/esaCP/index.html
34 http://www.snolab.ca/
35 http://www.nasa.gov/
36 http://www.auger.org/
activities associated with the data, such as data organisation, archiving data for long-term preservation, data sharing or publishing, and ensuring the security of confidential data. The data analyses carried out in institutes such as LIP are computationally intensive tasks that take advantage of diverse infrastructures, both local and distributed across different sites (e.g., grid environments). An analysis process actually begins when a particular collaboration delegates a certain task (pertaining to an experiment) to a specific research group belonging to a laboratory. From this point, the local analysis is a process under the responsibility of the specific institution where the researcher leads the analysis (by selecting the relevant data products) and will produce the data in accordance with the goals of the data analysis. A data analysis is a very complex process that can take months or years. The researcher starts the data analysis with a set of reconstructed data obtained from the experiment, which is then reprocessed to build distributions of physical variables like particle energy, charge, momentum, position or mass. Beyond the data received from the experiment, the researcher needs to produce simulated data to determine the error introduced in the measured experimental data (generated by the experimental apparatus) and make the necessary corrections. Only in this way can the researcher correct the experimental data and perform an assessment of the results in a correct and reliable fashion. Throughout a local analysis, the researcher performs refinements on the data (applying experimental and kinematic cuts, studying correlations) and produces as a result graphics (e.g., histograms) via specialized data analysis tools. This is an iterative work, which includes advances and setbacks, and which generates significant amounts of data that support the decisions taken by the researcher during the analysis process.
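The refinement loop described above (apply a selection cut, inspect the resulting distribution, adjust, repeat) can be illustrated schematically. Real analyses rely on specialised data analysis tools such as ROOT; the sketch below, with invented event records and cut values, only conveys the shape of a single iteration:

```python
# Schematic illustration of a selection cut followed by histogramming.
# Event records and cut values are invented for the example.
events = [
    {"energy_gev": 12.0}, {"energy_gev": 47.5}, {"energy_gev": 88.1},
    {"energy_gev": 33.3}, {"energy_gev": 95.0}, {"energy_gev": 5.2},
]

def select(events, energy_min_gev):
    """Keep only events passing the energy cut (background rejection)."""
    return [e for e in events if e["energy_gev"] >= energy_min_gev]

def histogram(values, bin_width):
    """Count values per fixed-width bin, keyed by the bin's lower edge."""
    counts = {}
    for v in values:
        edge = int(v // bin_width) * bin_width
        counts[edge] = counts.get(edge, 0) + 1
    return counts

passed = select(events, energy_min_gev=20.0)
hist = histogram([e["energy_gev"] for e in passed], bin_width=25)
print(hist)
```

Each pass of the loop would change the cut values or the binning and regenerate the distribution, which is why recording the parameters of every iteration matters for later reproduction.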
A data analysis in Particle Physics has no strict boundary, i.e., there is no clearly defined end at which an analysis can be considered completed. In practice, an analysis can be considered terminated when, within the scientific community, it is considered unbiased from any external effect; that is, when all studies and questions relevant to that kind of data analysis have been explored and properly justified. At that point, the results can be published in a scientific paper or thesis. That said, after an analysis is completed, the need to repeat it is very common in this type of domain. For example, a scientific paper or a thesis is published announcing certain results and, years later, a researcher decides to improve these results with new methods and techniques developed in the meantime. Or, during the analysis process itself, the researcher needs to repeat a step of the analysis, for example, applying a different data analysis tool. So, for the recovery of a data analysis to succeed, multiple conditions must be met:
The experimental data must exist, as well as a detailed description of its formats;
The simulation software and libraries must exist, as well as the required input data;
The analysis software and libraries must exist, as well as the required input data;
The environment where all the programs were executed must be known, and systems to run those environments must exist.
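These recovery conditions can be expressed as a machine-checkable preservation manifest. The following is a hypothetical sketch: all field names and paths are illustrative, not part of any LIP or TIMBUS specification.

```python
# Hypothetical sketch: the recovery conditions above as a checkable manifest.
# Field names and example values are illustrative only.
REQUIRED_FIELDS = [
    "experimental_data",       # the data itself
    "data_format_description", # detailed description of its formats
    "simulation_software",     # simulation software and libraries
    "simulation_inputs",       # required input data for simulation
    "analysis_software",       # analysis software and libraries
    "analysis_inputs",         # required input data for analysis
    "execution_environment",   # environment the programs were executed in
]

def missing_conditions(manifest: dict) -> list:
    """Return the recovery conditions not satisfied by this manifest."""
    return [f for f in REQUIRED_FIELDS if not manifest.get(f)]

manifest = {
    "experimental_data": "reco_run2012.root",
    "data_format_description": "ROOT TTree, schema v4",
    "simulation_software": "geant4-9.5",
    "simulation_inputs": "geometry.gdml",
    "analysis_software": "root-5.34",
    "analysis_inputs": "cuts.cfg",
    "execution_environment": "",  # unknown -> analysis is not recoverable
}
print(missing_conditions(manifest))  # -> ['execution_environment']
```

A single empty field is enough to make the analysis unrecoverable, which is the point the list above makes: all conditions must hold jointly.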
The data itself is typically created by international experiments and is owned by the collaboration, i.e., the set of research institutes that work under a Memorandum of Understanding (MoU) agreement. The experimental data is replicated worldwide. When the experiment finishes, one institute assumes the responsibility for the custody of the data. However, there are no clear rules, frameworks or standards for the long-term responsibilities related to data storage and/or data access. The availability of the data requires archiving it safely for an indefinite term, together with enough information about its format and the tools or libraries necessary to access it. This is one of the problems identified at LIP, since there are no policies defined for data storage or access. Likewise, there is no policy about what information should be stored, in what format, or which tools or libraries are required to reuse the data. To address this, we recommend the concept of a DMP, which is a plan for the effective creation, management and sharing of data, enabling researchers to get the most out of their research. Considering the proposed scenario, the analysis of LIP will be made from two perspectives. The first includes the infrastructure made available to and used by researchers in a local analysis. The second comprises the description of the analysis process performed by a researcher during a local analysis.

Infrastructure's view

The view of the infrastructure provides a perspective on the different environments that a researcher is likely to encounter during a local analysis.

A. Scientific computing infrastructure

The LIP scientific computing infrastructure comprises three data centres (Figure 10). Two
of them are co-located at the LIP research centres in Lisboa and Coimbra. A third data centre in Lisboa (NGC) is operated by LIP as a national grid computing service, and its capacity is shared with the scientific community. The three data centres are interconnected by a high-speed 10 Gbps private network provided by the Portuguese Academic Research Network (FCCN), and are fully integrated in the European Grid Initiative (EGI) infrastructure. As such, their computing and storage resources can be accessed both locally and remotely using grid middleware. These data centres are part of the Worldwide LHC Computing Grid (WLCG), the CERN grid infrastructure that supports the production chain for LHC experiment data processing and simulation.
Figure 10 – LIP computing infrastructure (operated data centres)
In this context, LIP is a resource provider to the ATLAS and CMS experiments' production services. For physics analysis, and depending on the experiment, LIP users can perform the data processing at the LIP data centres or, in certain cases, at grid centres worldwide (including LIP). However, in most experiments and at most laboratories, the facilities used to support the analysis are local computing farms.

B. Data centre's architecture
LIP has 4 main areas of activity in its infrastructure (Figure 11):
Farm computing (computing nodes);
High performance file system (LUSTRE);
UI machines (external login machines, interactive machines, desktops, and laptops);
Backup system (AMANDA).
All of these components are connected through a central switch. A router connected to the switch enables communication to and from the outside. From outside, researchers can connect to the LIP network through the external login machines (via an ssh connection). However, the operations that users can perform through these machines are limited to, for example, editing or listing files and running small jobs. Before starting any type of analysis at LIP, researchers need to connect to the network. This connection can be made through:
The wired network, which involves providing the media access control (MAC) address;
The LIP wireless network;
Desktops.

After being connected to the LIP network, researchers can begin their analysis
after they connect, via ssh, to an interactive machine. To do this, they must be authenticated and authorised to perform the data analysis they want. During a local analysis, researchers create a set of scripts containing the various tasks that they intend to run on the computing nodes (belonging to the computing farm). For this, there is a system that acts as the interface between the user machines and the computing nodes: a local resource management system (LRMS). This system provides a fair-share policy, in which the allocation of resources is balanced against the incoming load, so as not to allow the monopolisation of resources by a particular user. Thus, it is ensured that a user employing the computing farm intensively does not adversely affect users who were less active over a certain period of time. However, other share policies can be implemented, and it is the responsibility of the IT managers to handle this type of situation. After researchers send their scripts to the LRMS, it manages the requests by putting them in a queue. Next, the system processes the elements in the queue (scripts indicating which tasks should be performed, for example, which data/software should be used) and copies them to the computing farm nodes, for example, according to the central processing unit (CPU) usage or load of the machines. When the applications are copied to the computing nodes (subject to their availability and the characteristics of the jobs) and the jobs are launched, the LRMS returns a job identifier (jobid) to the user, so that it is possible to check the status of the respective job. When the job finishes, its log is copied to the user area so that the researcher can check its status. Each computing server (in the computing farm) has a storage limit of 750 GB. The /NFS and /LUSTRE file systems are mounted on these computing nodes. The network file system (NFS) contains the software and the home directories, which comprise /home, /soft, /hometmp and /data.
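The submission flow described above (script, queue, jobid, log copied back to the user area) can be sketched schematically. This is purely an illustration of the lifecycle, not LIP's actual batch system; the class and method names are invented.

```python
# Schematic of the LRMS job lifecycle described in the text:
# submit script -> queue -> dispatch to a computing node -> jobid/status -> log.
import itertools

class Lrms:
    def __init__(self):
        self._ids = itertools.count(1)
        self.queue = []    # scripts waiting for a free computing node
        self.status = {}   # jobid -> "queued" | "running" | "done"
        self.logs = {}     # jobid -> log copied back to the user area

    def submit(self, script: str) -> int:
        jobid = next(self._ids)
        self.queue.append((jobid, script))
        self.status[jobid] = "queued"
        return jobid       # the user polls job status with this jobid

    def dispatch(self):
        """Copy the next queued script to a computing node and run it."""
        jobid, script = self.queue.pop(0)
        self.status[jobid] = "running"
        # ...the real system would select a node by CPU usage or load here...
        self.logs[jobid] = f"job {jobid} ran script {script!r}"
        self.status[jobid] = "done"

lrms = Lrms()
jid = lrms.submit("analysis_step1.sh")
print(lrms.status[jid])   # queued
lrms.dispatch()
print(lrms.status[jid])   # done
```

A real LRMS would additionally implement the fair-share policy mentioned above, ordering the queue by each user's recent consumption rather than first-in, first-out.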
In the second one (/soft), the researcher has permission to write and, most of the time, downloads there the software needed. In the /data area the researcher usually puts the small data files needed to perform a particular analysis. The NFS system has its storage space allocated by quotas for each researcher or group of researchers. The IT manager installs the operating systems on the computers, as well as the generic software; when more specific software is needed, the researchers themselves carry out the installation. Another important area is the high performance file system
(LUSTRE). This system consists of storage servers, each with three expansion boxes in a redundant array of independent disks (RAID) 5 configuration. Each box has 15 disks of 1 TB. However, since the system uses RAID 5, the actual usable capacity corresponds to 14 TB per box (the equivalent of 1 TB is used for parity data). The large amounts of data are stored in this system. As in NFS, there is control over the allocation of space in the LUSTRE system. However, in this case, space allocation is performed by volume (for example, by paying for the desired space), i.e., in each RAID, areas are partitioned for each group of researchers (an operation completely transparent to those using the space).
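The capacity figure quoted above follows from the RAID 5 layout: with n disks, the equivalent of one disk holds parity, leaving n − 1 disks' worth of usable space.

```python
# RAID 5 usable capacity: one disk's worth of each stripe holds parity.
def raid5_usable_tb(n_disks: int, disk_tb: float) -> float:
    """Usable capacity of a RAID 5 array in TB."""
    return (n_disks - 1) * disk_tb

# Each LUSTRE expansion box: 15 disks of 1 TB -> 14 TB usable per box.
print(raid5_usable_tb(15, 1))  # 14
```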
Figure 11 – LIP infrastructure
All write operations are performed on the disks of the LUSTRE system. To achieve a high performance level, the RAID controllers of the LUSTRE system have their own memory, which is used as a cache when a write to disk is carried out. The controllers use a "write-back" mechanism: when a write is performed, the controller immediately informs the operating system that the write has been carried out, although the data is in fact written to disk later. While this yields a very high performance, it has the disadvantage that errors occurring before the write reaches the disk (which can lead to inconsistent volumes), for example due to a power failure, may cause data loss. To address this, the controllers have a battery that lasts about 72 hours, so that possible errors and failures (power failures, problems with disks, etc.) can be overcome. Another main component of the LIP infrastructure is the data backup system (AMANDA).
This system provides data backup for the NFS areas, such as /home, /data and /soft, and for some data saved in the LUSTRE system. The latter, however, only happens if researchers request it from the IT manager, and the request should be negotiated between those parties. The backup is performed by user area (/home, /data, /soft) and not by machine, since the operating system of a machine can easily be changed, unlike the content in the various areas of the mounted NFS system.
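The per-area backup and integrity checking described here can be illustrated with a small sketch. This is a hypothetical illustration, not AMANDA's actual mechanism: the area names follow the text, but the checksum-manifest approach is our own assumption.

```python
# Hypothetical sketch of per-area backup integrity checking: record a
# checksum manifest for a user area, then verify it later. AMANDA's real
# verification mechanism (error codes) is not detailed in the text.
import hashlib
import os
import tempfile

def checksum_area(root: str) -> dict:
    """Map each file under `root` (e.g., /home, /data, /soft) to its SHA-256."""
    sums = {}
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            with open(path, "rb") as fh:
                digest = hashlib.sha256(fh.read()).hexdigest()
            sums[os.path.relpath(path, root)] = digest
    return sums

def verify(root: str, manifest: dict) -> list:
    """Return the files whose current checksum differs from the manifest."""
    current = checksum_area(root)
    return [f for f, h in manifest.items() if current.get(f) != h]

# Demo on a throwaway directory standing in for a user area
with tempfile.TemporaryDirectory() as area:
    with open(os.path.join(area, "notes.txt"), "w") as fh:
        fh.write("cut: pT > 20 GeV")
    manifest = checksum_area(area)
    assert verify(area, manifest) == []              # data intact
    with open(os.path.join(area, "notes.txt"), "w") as fh:
        fh.write("corrupted")
    assert verify(area, manifest) == ["notes.txt"]   # problem detected
```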
This data backup system performs nightly copies of non-transient data. The backups also allow verifying that the data is intact, indicating whether there was a problem in a given data volume (through error codes that are later interpreted by the IT manager).

C. Monitoring systems

To help IT managers supervise the hardware and software, there are several monitoring systems. For example, Nagios is a network monitoring system that checks for problems across the entire network infrastructure and issues notifications about them. It is used to check disks, memory and hardware status, gather information from machines, etc. Cacti is another monitoring tool and a complete network graphing solution, which gathers and displays information about the network status through graphics. It tracks the performance of a machine, when a failure took place, etc. However, this tool monitors machines individually. If it is necessary to determine the overall performance of the computing machines, Ganglia provides this information. Ganglia is a scalable distributed monitoring system for high-performance computing systems, such as clusters and grids. For example, if the IT manager wants to check how much random-access memory (RAM) is being used across a cluster, Cacti cannot provide this information, since it only shows the RAM on each machine individually; Ganglia, on the other hand, can integrate and correlate information from multiple sources. Another management system is Quattor. It ensures that the software installed on a machine is consistent with a predefined configuration, i.e., if a user, as root, tries to install software on a machine managed by Quattor, the system automatically uninstalls it, since it was not part of the predefined settings for that machine. These are the monitoring tools used at LIP to make it possible to supervise the infrastructure employed in this research environment.

D.
The grid system

The grid system (site) provided by LIP, along with other international institutions, is presented in Figure 12. Grid computing, as stated by , is a "hardware and software infrastructure that provides computational capabilities." The grid system aims to integrate the different sites through an extra layer of services that allow the integration and use of the sites in a grid environment. The Computing Element (CE) allows researchers from outside the site (with access to a UI) to
submit jobs to a site without having to connect to it via ssh, since the process is carried out through the CE of that site. Likewise, scripts are used to indicate where the job will be run.

Figure 12 – The grid infrastructure
The Resource Broker (RB) is a grid service that helps researchers find suitable sites. Through the UI, researchers submit the script to the RB and indicate certain requirements (e.g., the user wants to run the script on a site with a certain processing power and storage capacity). The RB discovers machines matching the requirements and sends the script to the CE, so that the jobs are dispatched to the respective machines. The Storage Resource Manager (SRM) is an interface to a service through which researchers can send, delete and list data. The SRM also manages the storage system, defining what information can be obtained/accessed through the grid. The Berkeley Database Information Index (BDII) holds information about the various sites, obtained from multiple sources, which can be accessed via the RB. For example, if a researcher wants to submit jobs to sites with a certain processing power, the RB queries the BDII, and the BDII returns to the RB a list of sites that meet this requirement. The LCG File Catalogue (LFC) is a catalogue containing logical-to-physical file mappings. In the LFC, a Grid Unique Identifier (GUID) represents a given file. A file replicated at different sites is considered the same file thanks to this GUID, and appears as a single logical entry in the LFC catalogue. Thus, the LFC has information about the files that exist
in SRM, i.e., it maintains a list of where the instances of a given file are located. This catalogue therefore allows data to be replicated so that a particular job can run at more than one site. Data from the grid goes to the LUSTRE system of LIP, which is mounted on the farm computing machines. The user perceives this data in the same way as if it had been generated locally. The main difference is the ownership of the data: data from the grid has its own ownership and is under the control of the SRM (so that it can be manipulated via the grid), whereas data generated locally belongs to the local user. Consequently, for local users to manipulate grid data, Access Control Lists (ACLs) are necessary.

Processes' description view

The processes' description view is extremely important to get a perspective on how and when the changes made during the analysis process should be recorded, so that researchers can repeat data analyses later. The workflow can be divided into three business processes: obtain data, analyse data and produce final data. Figure 13 depicts an overview of these processes.
Figure 13 – LIP business processes
Below, the sequence of activities of each business process identified in Figure 13 is described.

A. Obtain data

A local data analysis is initiated from the moment a collaboration delegates part of an experiment to a researcher or group of researchers belonging to a laboratory. From there, the researcher can obtain the data and software needed from previous work of other researchers or from the experiment itself. After obtaining all the necessary products (data and software products), the researcher can begin his/her data analysis.
Note: Figure 14, Figure 15 and Figure 16 are retrieved from technical deliverable D9.1 of the TIMBUS project, which has no public access.
Figure 14 – Obtain data and software
Note that the data, obtained either from the local storage belonging to international institutions or downloaded directly from the experiment, can be reconstructed data, simulated data, calibrations, etc. Specific software to perform the data analysis may be obtained together with the data. Figure 14 depicts this process.

B. Analyse data

Once all the products required for the data analysis are available to the researcher, the so-called analysis process is initiated. This is an iterative process that requires the execution of analysis tools comprising several programs (some obtained, a priori, through the experiment, or enhanced through patches developed by researchers), such as simulation or other computationally intensive tools. The result of these simulations is the production of physical observables. The researcher then observes and analyses the products obtained (e.g., graphs, histograms, etc.) and decides whether to make a further iteration over the data, using the output of the previous analysis as input. To do this, it may be necessary to use different types of software or to develop new patches. Figure 15 depicts the described process.
Figure 15 – Analyse data
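The iterative analyse-data process can be sketched schematically: run the software, produce new data, inspect the physical observables, and decide whether a further iteration is needed. This is purely illustrative; the predicates below stand in for the researcher's judgement, and the toy values are ours.

```python
# Schematic of the iterative analyse-data loop: each pass runs the analysis
# software on the data and checks whether more analysis is needed.
def analyse(data, run_software, needs_more_analysis, max_iterations=10):
    """Iterate until the researcher judges no further analysis is needed."""
    for _ in range(max_iterations):
        data = run_software(data)          # produce new data (histograms, ...)
        if not needs_more_analysis(data):  # inspect the physical observables
            return data                    # experimental data [not final]
    raise RuntimeError("iteration limit reached")

# Toy stand-ins: each pass "refines" the data until a quality threshold is met
result = analyse(
    data=0,
    run_software=lambda d: d + 1,
    needs_more_analysis=lambda d: d < 3,
)
print(result)  # 3
```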
C. Produce final data

After the researcher finishes his/her analysis and considers the new data consistent with the initial objectives, the decision-making process is initiated, in which the results are analysed by other collaborators together with the researcher. After the decision-making process is completed, the analysis process may have to be repeated (if the results were not accepted), initiating one or more further iterations of the analysis process. If the results are accepted, the final data are produced and may be materialised in articles or theses. Figure 16 depicts this process.
Figure 16 – Produce new data
D. LIP workflow

Figure 19 in Appendix E describes the local analysis process. The workflow usually starts with a student or senior researcher who wants to perform a new analysis or to retake previous work of other researchers from LIP or from a different international institution. The first step is to obtain the experimental data, i.e., to retrieve the items initially needed for the analysis, such as datasets, tools and programs, libraries, etc. This data could be taken from a published paper or thesis, or from a researcher who left the process in the middle of the work and did not conclude it. For instance, a scientific paper or thesis is published announcing certain results and, years later, someone needs to refine these results with new methods and techniques, or to perform the same study with different inputs because the knowledge has evolved. Likewise, data may be obtained from international scientific institutions, such as CERN. The next step is the analysis process, which means performing the activities needed to retrieve physical results from the initial data. It includes running many programs and applications, such as data analysis tools, simulation tools, etc., in order to produce new data. The new data is studied through the analysis of the physical observables therein; this study includes producing histograms and applying statistical methods to the distributions of physical observables, a step that may require using the applications again to produce new data in order to obtain the results. Sometimes, during this process, the researcher realises that he/she has to repeat the analysis (because of a bad decision in the past, or simply because the new data is an intermediate item needed as input in the next step of the analysis). After analysing the data, the researcher, together with a collaborator, carries out the so-called decision-making process. This is a decision-making activity that, depending on given criteria, can lead to an iterative process where associate collaborators are consulted to decide whether further analysis steps are needed. Finally, when the collaborators consider the analysis unbiased from any external effect, the researcher produces the final results. At this point, the results are eligible to appear in a scientific publication, like a paper or a thesis, or to be presented at a public conference.

Currently, researchers at LIP only keep a record of the physical processes of their analysis, i.e., they do not record any information about the data processing, nor about the infrastructure that supports the local analysis. Moreover, recording the information deemed relevant is not a policy shared by all researchers at LIP. That is, presently each researcher has a PLN, which is updated according to his/her own perception and assessment of the data analysis being performed. For example, two researchers carrying out the same data analysis may, at the end of the process, produce two completely different PLNs regarding the information about the data analysis. Nevertheless, there is information that is registered by most researchers during a data analysis, which can be categorised into infrastructure and process information (Table 3).

Table 3 – Information recorded during a local data analysis by the researcher ("as-is")

Software versions: recorded by researchers because of the high number of simulations and iterations performed, in which different software versions are used, yielding different results.
Compiler versions: like software versions, these are highly variable due to the quantity of simulations performed.
Library versions: the libraries used during the simulations vary, and different results are produced according to the combination of software, compiler and libraries used.
Decisions: researchers register decisions related only to the analysis process, i.e., they record the reason for applying certain software to a dataset.
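A structured logbook entry capturing the "as-is" items in Table 3 might look as follows. The field names are ours for illustration, not an existing LIP or PLN format.

```python
# Sketch of a structured logbook (PLN) entry holding the items in Table 3:
# software, compiler and library versions, plus the analysis decisions taken.
from dataclasses import dataclass, field
from typing import List

@dataclass
class LogbookEntry:
    step: str            # which analysis activity this entry records
    software: dict       # tool name -> version used in this step
    compiler: str        # compiler version (highly variable across runs)
    libraries: dict      # library name -> version
    decisions: List[str] = field(default_factory=list)  # why this software/cut

entry = LogbookEntry(
    step="selection cuts, iteration 3",
    software={"root": "5.34"},
    compiler="gcc 4.4",
    libraries={"geant4": "9.5"},
    decisions=["switched to root 5.34: 5.32 binned the histograms differently"],
)
print(entry.software["root"])  # 5.34
```

Recording entries like this for every iteration would make two researchers' PLNs comparable, which the text notes is not the case today.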
A stakeholder in an organisation is (by definition) “any group or individual who can affect or is affected by the achievement of the organisation’s objectives” . Therefore, in this section we describe the stakeholders who have responsibilities in conducting a local data analysis. The stakeholders identified in a local analysis process are the IT manager and the researcher. International institutions (such as CERN) could also be considered stakeholders, since they provide the data products for analysis. However, because their role is not central to the scenario discussed, they will not be considered stakeholders here. Consequently, the first stakeholder to be considered is the IT manager, who is responsible for managing, operating and maintaining the infrastructure that supports local analysis. Among other activities, the IT manager must manage the computer hardware and software, and monitor the activity and resource allocation of the research groups. The researcher, in turn, is responsible for carrying out the analysis process, which includes obtaining data, analysing data and producing the final data. Both stakeholders have concerns about DP, in particular regarding the organisational and technological problems. In practice, the IT manager, being responsible for managing the infrastructure necessary to conduct data analyses, needs to be concerned with minimising the effort required from researchers to recover a local analysis. For this, it is necessary to create a mechanism, in the same spirit as SACM (cf. 2.3.3), that provides a logical model to identify, control and keep a record of the assets and resources that make up the IT infrastructure; such a mechanism would record information about the activities performed in a data analysis, as well as maintain information about the different types of services used by the researcher. This includes preserving and managing meta-information and copies of the operating system, compilers, libraries, tools and analysis software developed within the research community. Likewise, the digital objects produced should be stored by the IT infrastructure. On the other hand, the IT manager, being responsible for the technological infrastructure that supports the analysis process, has knowledge that is important to help the researcher characterise the environment in which the local analysis takes place. Table 4 describes the concerns of this stakeholder.
Table 4 – IT manager concerns
C1. Have a system (tool) to track and monitor all information that the researcher considers relevant for all activities of the data analysis process. This information may include the steps taken by the researcher, services and infrastructure used, etc. This tool must be able to record any changes that are considered irreversible. C2. Have a plan to assist the preservation of the data so that it can be properly secured, stored and shared. This stakeholder should help in the description of this plan, with regard to the technical infrastructure.
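Concern C1 (a tool to track analysis activities and the infrastructure they use) can be illustrated with a minimal CMDB-style registry in the spirit of SACM. This is a hypothetical sketch; the class, the configuration-item identifiers and the activity names are all invented for illustration.

```python
# Hypothetical sketch of concern C1: a minimal CMDB-style registry linking
# each recorded analysis activity to the configuration items (assets and
# services) it used, in the spirit of SACM.
class Cmdb:
    def __init__(self):
        self.items = {}        # ci_id -> description of the asset/service
        self.activities = []   # (activity, [ci_ids]) history of the analysis

    def register_ci(self, ci_id, description):
        self.items[ci_id] = description

    def record_activity(self, activity, ci_ids):
        unknown = [c for c in ci_ids if c not in self.items]
        if unknown:
            raise ValueError(f"unregistered configuration items: {unknown}")
        self.activities.append((activity, list(ci_ids)))

    def used_by(self, ci_id):
        """Which recorded activities depend on this configuration item?"""
        return [a for a, cis in self.activities if ci_id in cis]

cmdb = Cmdb()
cmdb.register_ci("lustre", "high performance file system")
cmdb.register_ci("root-5.34", "data analysis tool")
cmdb.record_activity("produce histograms (iter 2)", ["lustre", "root-5.34"])
print(cmdb.used_by("root-5.34"))  # ['produce histograms (iter 2)']
```

Refusing to record an activity against an unregistered item is the "control" part of SACM: the registry stays authoritative about which assets the analysis actually touched.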
The researcher is responsible for performing the analysis process or for resuming an unfinished process left by another researcher. One major concern of this stakeholder is the preservation of the data analysis, which includes preserving information about the workflow of activities performed, preserving the environment in which these activities were performed, and preserving the digital objects produced. This will bring added value to researchers, as it will be possible to go back over a data analysis, recovering or redoing a particular activity, without having to resort to high-cost and time-consuming processes. Likewise, recovering and re-executing a data analysis depends heavily on the type of information recorded and on how the researchers store it. This is also an advantage when we assume that a researcher can leave a data analysis unfinished: it becomes possible for another researcher to continue the data analysis with minimal effort. Consequently, the scientific results, and the research itself, may improve in quality and rigour, by allowing relevant information about the analysis process to be connected and shared. Another concern of the researcher is to define, a priori, a set of best practices and recommendations for how the data will be managed before and after the data analysis. This information can be translated into a plan that identifies the data that will be created, stored, shared and preserved. This plan will be an advantage for researchers, as it will describe how the DM will be carried out, clarifying the roles and responsibilities of each player in the analysis process, as well as identifying any restrictions on data access.

Table 5 – Researcher concerns
C1. Preserve the local analysis process.
C2. Replay the local analysis.
C3. Have a plan to assist the preservation and management of the data so that it can be properly secured, stored and shared.
The problem of DP is, more and more, a subject studied and analysed by scientific communities. DP strategies are well documented and considered for the preservation of digital objects . This area includes several kinds of entities concerned only with the preservation of objects as digital objects kept indefinitely in a database or repository, such as libraries, archives, and others. Likewise, we can see that, in general, DP issues have been addressed from this archival perspective. Connected with this concern, the need arises to preserve the software, as well as the tools, necessary for the use of the digital objects in the future. However, if we consider an IS, in its most general sense, as a system designed to handle different types of information (data input, processes and data output), where DP is not a main functional concern, the issue of DP will have to include another type of approach. If we consider scenarios where information needs to be always available and up to date, even though it is susceptible to constant changes due to the execution of complex processes, the concept of DP, in its broadest sense of information archiving, will not be enough to meet the requirements of these domains (e.g., e-science disciplines). In this sense, there is a need to provide new knowledge in the DP domain, introducing new requirements that enable research organisations to take advantage of DP systems targeted at their needs in complex studies. In this context, the field of Particle Physics is a candidate to meet the requirements of this new paradigm of DP. Due to the large amounts of manipulated data and the constant acquisition of digital products from multiple sources, the proper management of this type of information is a core activity for the correct development of these research areas. During the
analysis process, there is also a complex use of data along with data analysis tools. The concept of collaboration is more and more present across different scientific communities, which leads to the need to organise and manipulate the data collected and derived from experiments, in order to be able to share and disseminate the analyses as well as the data itself. As stated by , in the scientific context, the information to be archived should be designed and documented in such a way as to support future scientific analysis. LIP fits in this context, where studies are conducted in the areas of experimental HEP and associated instrumentation. Researchers conduct data analyses under international experimental collaborations. Data analyses are conducted locally at LIP, with the support of the existing infrastructure. During an analysis process, the researcher performs a series of activities ranging from data acquisition, to its analysis, to the production of the final data. During the execution of these activities, the researcher often uses the data generated in a previous step of the analysis as input data for the next cycle (the analysis process is thus iterative). However, there may be a need to go back to some point of the analysis process, or to retrieve the process in its entirety. Thus, there must be a mechanism that makes it possible to recover the context of the analysis; that is, to retrieve information about the activities carried out by the researcher and about the infrastructure used to produce certain results, as well as to preserve the data used and produced during the local data analysis.
PROPOSED SOLUTION

After the study carried out on the main themes considered relevant to solving the problem (cf. 2), and as a consequence of the analysis performed (cf. 3), the following sections present the proposal, based on two key concepts: the DMP and the Logbook. 4.1.
For scenarios such as LIP, it is possible to join the two concepts of DMP and logbook and address the problem as a case of business governance. The described business processes of LIP are supported by a technological infrastructure that is rearranged at very high frequency. One aspect that differentiates this type of technological architecture from the architecture present in commercial organisations (e.g., insurance agencies, banks, etc.) is the fact that its components change frequently. According to the best practices of IT Governance, recommendations for addressing cases of this kind are provided through ITIL/CMDB (cf. 2.3.3). Therefore, taking into account the analysis described in section 3, there is, on one hand, the need for such organisations to possess good practices to manage data. On the other hand, it is intended that the analysis processes, which involve issues related to the technological infrastructure, be preserved and subsequently recovered. Thus, we intend to introduce the best practices of IT Governance in an area where this concern is still not much discussed and where there is a need to solve the problem of recording information. Consequently, the main objective of the proposal is to make the connection between IT Governance best practices and the concepts of DP and e-science, applying the result to an actual case, like LIP. 4.2.
DMP for scientific research
The DMP is directly connected to the DM, maintenance and documentation of the information handled during a data analysis. As research progresses, the amount and diversity of the data generated grow, which makes it necessary to undertake proper and suitable information management. Particularly in the area of e-science, where access to data-intensive computing is permanent, DM has proved even more valuable to the scientific community. DM is a practice that must be taken into account before, during and after a research project is initiated. For this, there must be prior planning by the research team regarding the information that will be handled during a data analysis. Thus, within the scientific community, the need for a DMP is widely recognized by research organisations, i.e., the research institutions require the adoption of a strategy that helps researchers to plan and develop a DM approach. Following the analysis carried out in section 3, supported by the study conducted in section 2, we identified a problem of business governance in e-science organisations that develop and execute projects within international collaborations. One of the problems identified in such scenarios was deficient information management, i.e., there was no detailed planning previously established by the research bodies. However, several research funders and institutions have issued guidelines/recommendations in order to provide DMPs to researchers, so that they can demonstrate their concerns regarding the information that is handled during a research project. In this sense, there are a number of organisations, identified in section 2.5.2, which require a DMP to be submitted together with the research proposal, identifying the data that will be created, stored, shared and preserved. It should state how this will be done, clarifying roles and responsibilities and any requirements to restrict access to data. Although this is not an innovative concept, being quite common in the area of e-science, it was analysed in detail and synthesized with the purpose of delivering innovative and useful results for this area. 4.3.
Logbook for scientific research
Due to advances in science and technology, which together have enabled scientific progress on a large scale, data production and research processes have become increasingly complex. As already referred to (cf. 2.4.5), the registration of procedures performed during an experiment is addressed mainly at two levels: through the ELN, or by paper recording through the PLN. While the first case is properly identified in the literature and widely publicized by the vendors of such products, the point of view that we intend to address (cf. 3) is not as linear and well-defined as the domains where ELNs are used. ELNs are used in disciplines where research processes are, a priori, well-defined and are part of the laboratory certification. That is, for a laboratory to perform a particular experiment (e.g., in Biology or Chemistry), a series of steps must be considered in order for it to be entitled to produce that experiment. However, in the scenario described in section 3, this type of process is not conducted in this fashion. In this way, the concept of logbook proposed here is an innovative concept, since it is aligned with the best practices of IT Governance, specifically with the concept of the CMDB in ITIL. Therefore, there are concepts in ITIL that can be contextualized in e-science scenarios, where it has been recognized that there is a need to preserve the execution of a collaboration. That is, there is a need to record all activities/modifications considered relevant for a collaborative process. It is in this context that the use of the concepts presented in ITIL can be beneficial, such as configuration management, the CMDB, CIs, among others, which will be contextualized and duly exploited in the e-science domain. The concept of logbook is supported by the study carried out in section 2.4, and by the analysis performed in section 3. 4.4.
A consolidated scenario: using a logbook in alignment with a DMP
The growth of the new paradigm of e-science, which involves the use of high-speed computing and networking, providing the creation of virtual laboratories, collaboratories and computational methods to enable scientific discovery, brings a new concept of doing science to the scientific community. Although the scientific method has not changed significantly (cf. 2.4.1), researchers and research organisations are increasingly demanding as regards the implementation of the scientific analysis process. This arises not only from the scientific and technological advances observed in recent years (e.g., the increased number of collaborations, the use of grid technologies, the use of intensive data-gathering and processing systems, etc.), but also from the concern, increasingly present in the scientific community, regarding the sharing and dissemination of the results produced in an experiment. Therefore, researchers cannot be concerned only with the execution of the experiment itself, but also with the entire context that involves the scientific analysis process, which encompasses both the management of the data processed and the set of tasks performed during the experiment. For this, there is a need to define and implement good RDM practices through the development of an appropriate DMP. As has been stated, "good RDM is part of the research process and therefore required wherever 'research' takes place". The purpose of this plan is to "improve practice on the ground" through more effective and appropriate DM, based on recommendations for the management of research data. A large number of research funders already require the development and implementation of a DMP for each project that is submitted by organisations seeking funding. This DMP needs to be appropriate and proportionate to the nature and type of research being conducted. Articulating RDM requirements can be difficult for researchers, either because of the nature and stage of the research project or because of their lack of awareness and appreciation of the scope of RDM. The DMP, however, is very useful in helping them to understand and articulate their RDM needs. In any case, there is a set of elements that each DMP must identify, for example, the data that will be created, stored, shared and preserved.
Likewise, it should be described how this process will be carried out, clarifying the roles and responsibilities of each participant in the research process, as well as the requirements that restrict access to the data. Although this is a common and well-known concept in the field of e-science, it only solves the problem of effective DM, leaving overlooked the record of the activities performed during an analysis process, which are essential for a correct and proper understanding of the data analysis (for example, for a researcher who wants to repeat the same process in the future). The concept of logbook aims to overcome this problem of business governance, identified in e-science organisations that develop and execute research projects as part of collaborations with other organisations. The logbook, despite being an innovative concept (at least, according to the presented approach), is commonly used by researchers, but in paper format, where they register some notes considered important. However, from the point of view of the preservation of the execution of a collaboration, what is currently registered in PLNs is not sufficient to preserve that execution, since only the software and hardware versions used, and individual decisions that the researcher may consider relevant, are recorded, not covering the whole context of a collaboration so that it can later be re-executed.
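A context record of this kind, pairing each researcher activity with the configuration items (in the ITIL/CMDB sense) it involved, could be sketched as follows. This is only an illustrative sketch: all class names, fields and values are assumptions for the example, not part of ITIL or of any existing LIP tooling.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List

@dataclass
class ConfigurationItem:
    """One CI in the CMDB sense: any managed part of the analysis context."""
    ci_id: str    # unique identifier within the (hypothetical) CMDB
    ci_type: str  # e.g. "software", "hardware", "dataset"
    name: str
    version: str

@dataclass
class LogbookEntry:
    """One logbook record: who did what, when, and with which CIs."""
    timestamp: datetime
    researcher: str
    activity: str
    items: List[ConfigurationItem] = field(default_factory=list)

    def summary(self) -> str:
        cis = ", ".join(f"{ci.name} {ci.version}" for ci in self.items)
        return f"[{self.timestamp:%Y-%m-%d}] {self.researcher}: {self.activity} (CIs: {cis})"

# Example entry: an analysis step recorded with its software and dataset versions.
entry = LogbookEntry(
    timestamp=datetime(2013, 5, 10, tzinfo=timezone.utc),
    researcher="analyst01",
    activity="Ran event selection on run-2012 dataset",
    items=[ConfigurationItem("ci-042", "software", "ROOT", "5.34"),
           ConfigurationItem("ci-117", "dataset", "run-2012", "v3")],
)
print(entry.summary())
```

Recording the CI versions alongside the activity is what would let a future researcher recover the context of the analysis, which a PLN note typically does not allow.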
In this way, combining the concepts of DMP and logbook, it is possible, on the one hand, to have a mechanism targeted at information, with its focus on the management, maintenance and documentation of the data that are obtained and produced as part of a collaboration; and, on the other hand, to have a concept, aligned with the best practices of IT Governance already developed in the concept of the CMDB in ITIL, which concerns the recording of the activities/changes considered relevant during a collaborative process. In short, these two concepts encompass two of the major problems identified by the e-science community: the effective management of the data produced as part of a collaboration, and the preservation of the context of that same collaboration, so as to preserve all the activities that allow a particular experiment to be retrieved in the future. These proposals are presented in section 5 and section 6, respectively: the first focuses on a set of recommendations presented in the form of a DMP, so that appropriate DM becomes possible; the second consists of a high-level presentation of what a logbook may be, as a means of recording the activities/changes considered important by the researcher during a data analysis process.
DM PLANNING IN E-SCIENCE
At first glance, the development of a DMP within a large-scale scientific project seems to be a promising and important engineering process as regards DM. However, there is still only a small number of entities and organisations that do so. As noted in section 2.5, most large-scale experimental projects have various resources for DM, including data management systems, simply because the experimental apparatus would be unusable without them. Thus, it might be thought that part of the problem of DM is thereby resolved, as the literature illustrates. However, when we narrow the scope and address the problem of local laboratories, where the researcher develops local analysis processes within the context of an international collaboration, the requirements and concerns of the problem take on additional contours. This is the case that will be addressed in the proposal presented in this chapter. In this context, a DMP becomes a matter of formalizing and recording the management performed on the data handled during a local analysis, so that these projects do their duty to society and to their research funders. 5.1.
Infrastructure and implementation issues
The creation and implementation of a DMP depends both on the research being developed by the researcher, particularly in relation to the data produced, and on the infrastructure provided by the research laboratory to support the analysis process. That is, a DMP is basically supported by a set of mechanisms, specifically the management of the infrastructure and the actual execution of the data analysis itself (where data are generated and feed the creation of the DMP). Figure 17 describes, through a conceptual map, these concepts and the relationships between them.
Figure 17 – Infrastructure and implementation issues conceptual map
Relating the DMP to other documentation
Documentation specific to a research project or research activity should be made available from a single unrestricted-access location. Redirection may also be provided from there to documentation located elsewhere, including documentation that has access restrictions and for which access will be available only to those who have the appropriate access authority at that location. It should be noted, however, that the following key documentation (at least) must be available without access restriction:
The DMP (final version);
Confidentiality agreements (where such agreements are themselves not confidential);
IPR statements and other documents that affect how the research data may be used (e.g., MoU).
However, care should be taken to ensure that sensitive and confidential information is protected in an appropriate manner. In addition, the locations should be given of any other management documents relating to the research activity, or of protocols, regulations or procedures for carrying out the research activity. These might include requirements and guidance from a receiving repository, if any, in relation to DM, ethics forms, etc. The location of electronic records should in general be identified using, where possible, an embedded uniform resource locator (URL). For physical records, a description of the physical location should be given, together with the name and contact details of the owner of the records. 5.1.2.
Roles and responsibilities
The responsibilities for writing, implementing and reviewing the DMP are shared between a number of roles within the institution. On the one hand, the researcher or research group is responsible for writing the DMP, because they are the main actors during a local analysis in relation to the creation and management of data. On the other hand, the IT manager, who is responsible for the administration of the infrastructure, should be considered part of the DMP proposal, since he/she has the essential knowledge about the technological infrastructure. 5.1.3.
Creation and development of the DMP
The need for the creation and development of a DMP is widely recognized by research organisations and funding bodies. However, the DMP should be adjusted to the contents required by the research area in which it will be developed. For the initial creation, and subsequent through-project development, of the DMP, the use of a template related to the scientific domain is strongly recommended. This template should contain questions that reflect the local needs of the institution. When the template is used, it is recommended that it be versioned. The outputs of the template, that is, the initial DMP and any subsequent versions, should be stored in a place where they can be accessed and revised as necessary. It is also strongly advised that users of the service export copies of these outputs and store them locally. It is recommended that the file names and versioning of these DMP exports conform to the project documentation guidelines.
Review of the DMP
The DMP acts as guidance and as a record of activity. Provision of the DMP will fulfil the governance requirements of the research funder (if necessary) and will provide the potential for good DM. Implementing and conforming to the DMP will promote good DM practice and result in better managed data, making its use and reuse more effective. To ensure conformance with and accuracy of the DMP, and to guarantee that the DM arrangements best support the research data as the research activity unfolds, periodic reviews of the DMP (included in regular project meetings) will be required during the project. These reviews should be recorded in the DMP itself. The DMP is a living document and should be reviewed and updated regularly (during and at the end of the project), in order to reflect what actually happened. It is therefore important that the versioning and revision history of the DMP be created and maintained. 5.1.5.
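Such a versioning and revision history can be as simple as encoding the version number and review date in the name of each exported DMP copy. The naming pattern below is a hypothetical convention invented for illustration, not one mandated by any funder or institution.

```python
import re
from datetime import date

def dmp_filename(project: str, version: int, when: date) -> str:
    """Build a name like 'LIP-DMP_v02_2013-06-01.pdf' (pattern hypothetical)."""
    safe = re.sub(r"[^A-Za-z0-9]+", "-", project).strip("-")
    return f"{safe}-DMP_v{version:02d}_{when.isoformat()}.pdf"

def next_version(existing: list[str]) -> int:
    """Derive the next version number from previously exported file names."""
    versions = [int(m.group(1)) for name in existing
                if (m := re.search(r"_v(\d+)_", name))]
    return max(versions, default=0) + 1

# Example: two earlier exports already exist; the next review produces v03.
files = ["LIP-DMP_v01_2013-03-15.pdf", "LIP-DMP_v02_2013-04-20.pdf"]
print(dmp_filename("LIP", next_version(files), date(2013, 6, 1)))
# → LIP-DMP_v03_2013-06-01.pdf
```

Keeping the version and date in the file name makes the revision history recoverable even from locally stored copies, outside any versioned repository.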
It is the expectation of funding bodies that funding requests will include details of the DM requirements and costs for the project. For digital data, storage costs may arise from the purchase of storage space, for example in the form of hardware (e.g., local hard drives). Thus, the cost of implementing the DMP and of future preservation activity should be estimated and factored into the project budget. It follows that a DMP of sufficient detail to identify the DM budgetary requirements will be a required part of any funding submission. 5.1.6.
Security of digital information is important over the whole data lifecycle. Data may include direct identifiers, or links to direct identifiers, and should be well protected during collection, cleaning and editing. Processed data may or may not carry disclosure risk, and should be secured in keeping with the level of disclosure risk inherent in the data. Secure work and storage environments may include access restrictions (e.g., passwords), encryption, power supply backup, and virus and intruder protection. Thus, in the DM process it is important that steps are taken to ensure that research data are not lost, and are made accessible only to those who are entitled to see them. Likewise, two approaches can be defined: A) planning to maintain confidentiality and B) planning for long-term preservation.

A. Planning to maintain confidentiality
The need to keep data confidential arises from such issues as a desire to protect IP, or because of commercial or state sensitivity. The DMP should identify any areas of sensitivity and make provision for data use where access is constrained, or made available for sharing with appropriate limitations. In order to disseminate data, archives need a clear statement from the data producer of who owns the IPR for the data that the experiment generates.
Where external collaboration will be carried out, it is likely that a collaboration agreement will already be mandated to formalize such procedures as the agreement of duties, responsibilities and IPR (e.g., an MoU). Where appropriate, consideration should be given to a form of words clarifying the data access, sharing and security requirements agreed by the research partners.

B. Planning for long-term preservation
Digital data needs to be actively managed over time to ensure that it will always be available and usable. This is important in order to preserve and protect our shared scientific heritage as technologies change. Preservation of digital information is widely considered to require more constant and on-going attention than preservation of other media. Depositing data resources with a trusted digital archive can ensure that they are curated and handled according to good practices in digital preservation. Data should be organized and contextualized in an appropriate way, by ensuring that the information provided is as comprehensive as possible. In particular, advantage should be taken of appropriate metadata standards for describing research data and its organisation and location. Particular consideration should be given to the file formats used in connection with the data. Standard formats, and formats with widespread software support, are likely to remain understandable for longer than closed, software-specific formats. In cases where the latter formats are used, copies in open formats should be made and kept alongside them. 5.1.7.
Identifying contractual and legal obligations
There will be legal and, in most cases, contractual obligations with respect to DM which will have to be met as a result of funded or collaborative research. Obligations to funders might include the requirement to develop and submit DMPs, to keep research data for a specific length of time, to submit a research data set to national or discipline-related repositories, and so on. Obligations to collaborators might include the security and confidentiality of the data that they are providing, constraints on reuse and publication, and end-of-project disposal. Any such obligations must be identified and recorded in the DMP. 5.2.
Key practices and process areas for consideration in the design of a DMP
The study carried out on a set of institutions and research funders allowed us to establish and synthesize the set of recommendations and best practices proposed in section 5, which should be followed by these entities with regard to DM. Through the analysis made, the DM budget is one of the scenarios not addressed by most research organisations, because most of them do not describe the proportion of the proposal budget allocated to DM activities for the new data collected. This cost includes DM services or purchasing equipment, such as fileservers, backup media or software required for DM activity. In this way, it is not possible to have effective supervision by the research funders of the impact/total cost that DM will have on the research activity. Another scenario that is only slightly addressed by the research bodies is the fact that they do not prepare the data resulting from research for future use by other research entities or by the scientific community. Often, no further documentation exists to assist in the process of reusing data, so that the user entities can make use of the data easily and properly. Scientific data processing typically involves a great deal of computation over a specific amount of data. This type of computation organizes and manipulates data, usually large amounts of it (e.g., numeric data). So, to process this data it is necessary to specify the data processing steps or scientific workflows that will be adopted to manipulate and generate new data, as appropriate. However, very few entities stipulate and define the type of processing that they will perform on the data. This can happen due to the type of discipline, because there are areas in which the researcher does not really know, a priori, which steps he/she will follow. That is, although the researcher understands what objective he/she wants to achieve, as the experiment advances the researcher will take actions in accordance with the immediate results obtained. Regarding the technical requirements, it was clear that most research institutions, as well as the NSF, do not propose any management rules or guidelines on the existing infrastructure. That is, the proposals and guidelines for the creation of a DMP do not include a description of the current and future infrastructure. Likewise, in most of these institutions, there is no care about the interoperability between strands of research conducted through collaborations. This issue is addressed in an implicit fashion, i.e., there is no explicit explanation of why the description of the infrastructure supporting experiments is not taken into account. Table 6 describes the key process areas and related practices in relation to the management of scientific data. After the analysis carried out in the state of the art (cf. 2.5), we identified a set of key practices, which were grouped into four process areas based on the high-level objectives that each practice helps to achieve. For each key process area, a high-level goal is identified and a brief description given. The key process areas and practices described in Table 6 can be mapped onto a scientific workflow lifecycle describing an analysis process, from data collection to its preservation and subsequent dissemination to the scientific community. Hence, a set of guidelines can be created for the DMP during the analysis process, and the impact that DM has at the end of the process subsequently assessed. For example, the DCC Curation Lifecycle Model defines digital data in a comprehensive manner to include both physical objects and digital objects, where steps such as create, access and use, appraise and select, preservation action, etc., are included.
Most of these practices can be mapped onto the key process areas in Table 6.
Table 6 – Process areas for consideration in the design of a DMP

Key process area: Data acquisition, processing and quality assurance
High-level goal: Capture and describe the data processing in such a way that helps its preservation and reuse.
Key practices: Prepare and process data for data analysis; Assure data quality.
Description: Data analysis is always initiated with the capture of the data, which will then be prepared and processed by the researcher. The researcher has to certify the source and state of the data, as well as establish the research context.

Key process area: Data description and representation
High-level goal: Develop and describe data/metadata to allow the contextualization of the data, allowing its preservation and future reuse.
Key practices: Develop and describe metadata specifications; Contextualize, describe and document data; Describe data structure and formats.
Description: Due to the large amounts of data produced during the research, it is necessary to ensure that its management is carried out properly, in order to have a description of both the data and the metadata. Consequently, this information will assist in subsequent activities (access, sharing, etc.), and will provide funding bodies with an overview of the data.

Key process area: Data access and dissemination
High-level goal: Provide and describe interfaces for users to access and obtain data in the future.
Key practices: Encourage data reuse; Store, backup and secure data.
Description: The researchers responsible for data access, and how sharing of the resultant products takes place, should be clearly identified. With respect to shared datasets, there should be a concern with encouraging and supporting their reuse.

Key process area: Repository services/preservation and IPR
High-level goal: Preserve collected data and describe the protections specific for long-term use.
Key practices: Provide information about data preservation; Identify the data privacy restrictions; Comply with data preservation policies.
Description: There is a need to describe long-term plans for storing research data, through an explanation of the approach taken to storing the information associated with the research project. The protections that will be put in place to prevent unapproved disclosure of data should be described.
Recommended contents for DMP
At this point, we are able to provide a set of guidelines and a corresponding structure for a DMP. Based on the analysis in section 2.5, and using LIP as a case study, the structure of the DMP and the recommendations are proposed so as to create a plan in which all aspects related to DM (from data creation/acquisition to dissemination to the scientific community) are respected, and in which issues related to the infrastructure that supports the analysis process can also be included during the creation of the DMP.
Research funders expect basic information about the research being conducted by institutions. The DMP should indicate the research discipline and briefly outline how the research will be conducted. It should include the initial planning and decisions for DM. The research context should provide a brief summary of the high-level project documentation relating to the research activity, including confidentiality agreements, IPR statements and other documents that affect how the research data may be used. 5.3.2.
The data generated, or obtained from external sources, are often manipulated to produce useful results for the on-going research. During this process, several analyses are performed on the data, for example through specific software in which the researcher defines the steps that must be followed to obtain the final result. It is this information that must be described by the researcher, which consists of describing any data processing steps, or providing the scientific workflow planned to be used to manipulate the data. Researchers should give a detailed description of how the data will be generated and manipulated, including the methods, technology, conventions, standards, etc., that will be used. In some studies in certain areas, the level of detail of data generation and manipulation cannot be very high, because the data undergo multiple iterations at the moment of analysis. In these cases, the level of detail should be increased as the plans are implemented. 5.3.3.
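Data processing steps of this kind can be documented in a simple, machine-readable form alongside the prose description. The sketch below is purely illustrative: the step names and operations are invented for the example and do not correspond to any actual LIP workflow.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Step:
    """One documented processing step of a (hypothetical) analysis workflow."""
    name: str
    description: str              # the text that would go into the DMP
    run: Callable[[list], list]   # the operation applied to the data

# Example workflow: filter out malformed values, then apply a calibration.
workflow: List[Step] = [
    Step("filter", "Discard malformed (negative) readings",
         lambda d: [x for x in d if x >= 0]),
    Step("calibrate", "Apply an (assumed) calibration offset",
         lambda d: [x + 0.5 for x in d]),
]

data = [3, -1, 7]
for step in workflow:
    data = step.run(data)   # each step's description documents this operation
print(data)  # → [3.5, 7.5]
```

Even a minimal structure like this makes the sequence of manipulations explicit, which is what the DMP asks the researcher to describe.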
Data quality is an important issue in the science and engineering fields, where large amounts of data are involved. Maintaining data quality requires attention throughout the whole data lifecycle. Quality assurance procedures may include the calibration of instruments, the collection of duplicate samples, data entry methods, data entry validation techniques, methods of transcription, etc. So, researchers should describe the quality assurance procedures and standards that will be used, as well as document the appropriate metadata. 5.3.4.
Depending on the specificity of the domain, the data created and manipulated are derived from different sources, so their structure and format are diverse. Thus, it is necessary to specify the information, tools or resources that would be needed to manipulate or render the data, along with any relevant instructions that the researcher considers important for this scenario. If there is a need to use a format that is unusual within the on-going research, this should be clearly explained. 5.3.5.
During a data analysis, there are funds provided by research funders in order to conduct the research process. In the DMP, these funds should be itemized in a proposal budget, specifying the portion allocated to DM activities for the data output. This should include the DM methods and responsibilities established. However, the time involved in documenting, writing metadata and archiving is often underestimated. Any costs associated with
using DM services or purchasing equipment (such as fileservers, backup media, software, etc.) used for DM should also be documented. 5.3.6.
Access and sharing
Access and sharing are two mechanisms that are expected to be triggered after the completion of a data analysis. During a research project, the need to change the owner of a particular analysis might arise (e.g., if a researcher leaves a piece of research unfinished). In these cases, it is necessary to define who can have access to the research and its datasets beyond the principal researcher (if there is only one researcher responsible for the analysis). On the other hand, at the end of each research project, the results should be made public in order to foster scientific progress in the scientific community (there is a general expectation that data will be made available with as few restrictions as possible). So, it is necessary to list who will have access to the research data and how, including websites maintained by the research group or other publicly available sources. It should also be described when and how the data will be available. Whenever possible, access should be provided through a DOI (cf. 2.4.4) for electronic records. In particular, the experimental HEP community does not have a tradition of sharing its data publicly: the data from an experiment are usually owned by the collaboration that created and ran that experiment, which comprises the set of researchers who are members of the experiment. 5.3.7.
After completing a research project, the results generated are often exploited for further research analyses later. Therefore, researchers should take into account possible uses of their data, a posteriori, by listing any bodies/groups that might be interested in the data, and the foreseeable current or future uses to which they might put the data. Likewise, special provisions can be made during the research that increase the compatibility of the data with such future use (the principle of reusability). Thus, the procedures that will be taken to prepare the data for these bodies/groups/uses should be stated (e.g., forms of data organisation, choice of standard formats, ontologies, conventions, etc., for the data and metadata). 5.3.8.
Archival and preservation
The archiving and preservation of digital objects has become an increasingly common practice in scientific communities, especially those that work through international collaborations among different organisations and research areas. So, there is a need to describe long-term plans for storing research data, through an explanation of the approach taken to storing the information associated with the research project. Information should be provided about where the data will be stored, how they will be saved, and what additional metadata is saved with them. Because in some institutions data are stored in different formats and in different locations for compliance with international practices, this should be explained by the researchers. The organisation that will be responsible for the long-term archiving and preservation of the collected data should also be identified.
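The archival information described above could be captured in a small record kept with each dataset: where the data are stored, how they are saved, what metadata travels with them, and a fixity checksum for long-term integrity checks. All field names and values below are assumptions for the example, not a prescribed archival standard.

```python
import hashlib
import json

def archival_record(name: str, data: bytes) -> dict:
    """Build a (hypothetical) archival record for one dataset."""
    return {
        "dataset": name,
        "storage_location": "institutional archive",       # where data are stored
        "storage_format": "open, well-documented format",  # how data are saved
        "sha256": hashlib.sha256(data).hexdigest(),        # fixity information
        "metadata": {                                      # metadata kept with the data
            "provenance": "derived from collaboration data",
            "responsible_organisation": "host laboratory", # long-term custodian
        },
    }

record = archival_record("local-analysis-results", b"col_a,col_b\n1,2\n")
print(json.dumps(record, indent=2))
```

Recomputing the SHA-256 digest at later dates and comparing it to the stored value is a common way to detect silent corruption during long-term storage.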
Protection and IPR
At this phase, researchers should state who owns the copyright of the datasets that will be collected/produced. In some cases, data contains personal information to which privacy restrictions apply, i.e., there is no permission to disclose the data (e.g., medical data records). In these situations, the protections that will be put in place to prevent unapproved disclosure (e.g., encryption, password protection) should be described. Likewise, if needed, it should be explained how data will be anonymised, due to any issues related to private, sensitive or secret data. Researchers should mention whether different data products will have distinctive protection (e.g., raw data, observational data, processed data). If applicable, a description should be provided of the policies for the protection of proprietary data, privacy and confidentiality, and IP.
5.3.10. Current infrastructure
To support a data analysis, it is necessary to have an infrastructure in place. Researchers should describe the infrastructure that they currently have (e.g., servers, software, storage, etc.) in their institutions. This is an important aspect to consider in the DMP, given that for the production and proper DM, researchers must have a view of the infrastructure that supports the entire analysis process. Likewise, for future data reuse, researchers who will use the shared data in future scientific processes need to have an overview of the analysis process, including information on the process of DM, but also a description of the infrastructure used.
5.3.11. Future infrastructure
When a researcher is planning a research analysis, he/she needs to plan ahead, i.e., it is necessary to foresee how the experiment will be assembled, even if not in much detail. It should be stated what is needed to review the infrastructure and what IT assets should be purchased, and an IT budget including these assets (hardware, software, hosting, etc.) should be calculated.
Increasingly, research undertaken in the fields of Science and Engineering makes use of grid computing. In this way, the infrastructure is no longer a static domain, acquiring a dynamic status due to the phenomenon of grid computing. Researchers can then carry out their analysis in grid environments, and no longer have the restrictions imposed by the local infrastructure. However, the infrastructure (grid or not) that they intend to use during the research should still be described.
5.3.12. Interoperability
Collaboration between scientific communities is becoming increasingly common in the areas of scientific research. Therefore, the concern about interoperability arises. When there are people or communities with which researchers need to interoperate, this subject is very important to address. The way researchers test for interoperability or compliance, which includes standards, schemas, vocabularies, community data conventions, data and metadata formats, etc., should be described.
After considering the sources mentioned above, a generic model with a recommended structure of contents for a DMP was proposed, which is depicted in summarized form in Table 7. A recommendation for a DMP was therefore created (cf. Appendix F) based on references and best practices explored in the state of the art (cf. 2.5), through which it will be possible to perform proper DM before, during and after a local analysis process. The content of the DMP was developed based on similar plans developed by universities and institutions that are already aware of the need to create and put into practice a plan of this kind. There is a particular difference when creating DMPs, since these may be proposed by research funders or by the department or research group as part of a project within a university or a collaboration. In the first case (research funders), the survey was carried out with the intention of analysing the relation that these entities have with the organisations that seek funding ( , §5). However, in comparison with similar entities such as the NSF, it appears that the level of detail of the plan and the specificity of its content is more evident in the first case. Funding bodies do not define DMPs for themselves, but define what should be considered in these plans by those who want their research to be funded, whether by the NSF or by the UK funding bodies (AHRC, BBSRC, etc.). Therefore, it is not relevant to compare the NSF with the institutions funded by it; comparing the NSF with organisations at the same level (institutions that do not define DMPs for themselves, but set guidelines for others under their area of influence) makes more sense. Regarding the DMPs proposed by the NSF for some areas of research, we can conclude that these plans should be developed based on the specificity of the information that the area involved produces. That is, there must be an adjustment of the contents in focus in a DMP of an area such as Mathematical and Physical Sciences (MPS) in comparison to that of an area such as Education and Human Resources (EHR). In the first case, especially in Particle Physics and Astrophysics, the amounts of data produced are very high when compared with EHR. Likewise, the DMP for the Geospatial science discipline ( , §4.1.5) has a specific concern when compared with the other sciences outlined: the volume of data. This issue is especially relevant in disciplines like Geospatial science or Physics, due to the large amounts of data involved in the process of scientific analysis. Therefore, in this case the DMP should have as its main concern showing the detail of the volumes of data produced (specifying the quantities produced, assessing their quality as relevant information, specifying the format and structure of the data, etc.). In these cases, where data gathering is a constant and ongoing process, there must be additional care in the documentation of such content. On the other hand, the areas of EHR ( , §4.1.3) and Social, Behavioural, and Economic Sciences (SBES) ( , §4.1.8) are committed to timely and rapid data distribution, regarding the period of data retention. As this is a very important subject due to the principle of timely access, researchers should address how it will be met in their DMP.
Regarding information sharing, it is a phenomenon that has been increasing over the years, although it is still not widely adopted in some scientific communities. There is considerable disparity between disciplines in terms of sharing research data. Some disciplines, such as Environmental science, have a long practice of sharing data, while others are much less motivated to make their data accessible beyond the immediate research team. However, there are many benefits to be gained from making data as accessible as possible. As the years and the technology itself advanced, areas such as Physics, Astrophysics, Engineering or the Geospatial Sciences increased the sharing and publication of data, and some of the research carried out resulted in important scientific discoveries. Still, several scientific communities are not yet aware of the benefits of sharing information between research communities. Such areas are more likely to resort to the sharing of data, while areas such as Bioengineering have a more sensitive level of sharing. In terms of data privacy, all scientific areas, in both Science and Engineering, must have specific policies. Researchers are responsible for the ethical and proper treatment of data. Research data which includes confidential or private information must be managed in agreement with any contractual or funding agreements. Research data, particularly in health-related disciplines, may contain personal information about identified individuals.
Table 7 – Recommended structure for DMP
High-level DMP sections (content):
- Research context
- Research data and protocols
- Data processing
- Data quality
- Data structure/format
- DM budget
- Access and sharing
- Future reuse
- Archival and preservation
- Protection and IPR
- Current infrastructure
- Future infrastructure
- Interoperability
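The recommended structure can also be kept as a simple machine-readable template. A minimal sketch follows; the section names come from Table 7, while the helper function and dictionary layout are illustrative assumptions, not part of the proposal:

```python
# Sketch of a DMP skeleton following the recommended structure of Table 7.
# The section names are from the table; everything else is illustrative.

DMP_SECTIONS = [
    "Research context",
    "Research data and protocols",
    "Data processing",
    "Data quality",
    "Data structure/format",
    "DM budget",
    "Access and sharing",
    "Future reuse",
    "Archival and preservation",
    "Protection and IPR",
    "Current infrastructure",
    "Future infrastructure",
    "Interoperability",
]

def new_dmp(project: str) -> dict:
    """Create an empty DMP document with one entry per recommended section."""
    return {"project": project, "sections": {name: "" for name in DMP_SECTIONS}}

plan = new_dmp("Example collaboration analysis")
plan["sections"]["Access and sharing"] = "Results published with a DOI after review."
```

A template like this makes it easy to check that no recommended section was left unaddressed before the plan is submitted.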
Consequently, confidentiality should be a major concern in areas such as Medicine, but not so much in Veterinary science, and even less in fields like Agriculture or Rural Economy and Land. This fact is more or less obvious and perceptible, since Medicine works with data of high sensitivity, as we are talking about research and trials conducted on human beings. Similarly, quality assurance and quality control of the processed data must be taken into account, particularly in cases such as Neuropsychiatry ( , §4.2.7), where experiments are developed that lead to the improvement of the understanding of brain and mind disorders, or of the quality of care for patients with complex mental health problems and disorders. Table 7 summarizes the recommended structure of contents described in section 5.3.
6. THE LOGBOOK IN E-SCIENCE
The proposal to create a logbook for e-science disciplines, particularly in the case of LIP, follows the study carried out in the state of the art (cf. 2.4.5), in which a common practice was identified among researchers belonging to this research field, characterized by recording all activities within an analysis process. Hence, the proposed logbook should be aligned with the best practices of IT Governance, combining the concept of the CMDB from ITIL (cf. 2.3.3). Supporting the CMDB there is configuration management, which provides information about service asset components and the relationships between the various components of the infrastructure. So, according to these concepts, the logbook will have the objective of recording the activities performed during a data analysis process, in which an infrastructure is involved in the execution of a collaboration. Consequently, the objective is to provide a logical model that identifies, controls, maintains, verifies and reports on the assets and resources comprising an IT infrastructure, as well as their constituent components and relationships. By comparison with the CMDB, the key elements of these records are the activities/tasks performed by the researcher in the context of a collaboration. 6.1.
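This CMDB-inspired view of the logbook — configuration items for infrastructure assets, typed relationships between them, and activity records that reference the assets used — can be sketched as a small data model. All class, field and relation names below are illustrative assumptions:

```python
# Sketch of a logbook backed by CMDB-like records: configuration items (CIs)
# for infrastructure assets, relationships between them, and activity entries
# that reference the CIs they used. Names are illustrative assumptions.
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class ConfigurationItem:
    ci_id: str       # unique identifier of the asset
    ci_type: str     # e.g. "software", "hardware", "library"
    version: str = ""

@dataclass
class Logbook:
    items: List[ConfigurationItem] = field(default_factory=list)
    # (from_ci, relation, to_ci), e.g. ("root", "runs_on", "node01")
    relations: List[Tuple[str, str, str]] = field(default_factory=list)
    activities: List[dict] = field(default_factory=list)

    def record_activity(self, description: str, used_cis: List[str]) -> None:
        """Record one researcher activity and the CIs it relied on."""
        self.activities.append({"description": description, "cis": used_cis})

log = Logbook()
log.items.append(ConfigurationItem("root", "software", "5.34"))
log.items.append(ConfigurationItem("node01", "hardware"))
log.relations.append(("root", "runs_on", "node01"))
log.record_activity("Ran event selection over dataset D1", ["root", "node01"])
```

The point of the design is that each activity entry points back at the exact infrastructure components it used, mirroring how a CMDB relates services to their underlying assets.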
The logbook applied to data analysis process
Science and Engineering are increasingly digital and data-intensive disciplines. Digital data is not only the output of research; it can also be used to provide input to new hypotheses, enabling new scientific insights and driving innovation. Therein lies one of the major challenges of this scientific generation: how to develop the new methods, management structures and technologies to manage the diversity, size, and complexity of current and future datasets and data streams. For example, at the local level, researchers use personal laptops, desktops and/or computing farms. The choice of computing resources to use depends on the exact actions to be performed. They can be executed interactively (on laptops, desktops or dedicated servers) in sessions that take a couple of hours, or in batch mode by launching arrays of highly intensive production tasks. So, when a researcher is conducting a local analysis, he/she is supported by an infrastructure where he/she will analyse a set of data, applying a set of software, in order to achieve a particular result. Therefore, it is necessary to record all of the relevant information in order to facilitate the replay of the analysis if and when the need arises. Consequently, the concept of the logbook emerged. A logbook is a primary record of research (cf. 2.4.5). Researchers use a laboratory logbook to document their hypotheses, experiments and initial analysis or interpretation of these experiments. The laboratory logbook serves as an organizational tool, a memory aid, and can also have a role in protecting any IP that comes from the research.
Likewise, a laboratory logbook allows researchers to store information in a structured way in order to be able to repeat some step of the analysis, whether after a day or a year. To perform the preservation of the local analysis, the laboratory logbook must comply with a set of rules and have a specific content, so that the preservation of the local analysis can be made as simple as possible and as little intrusive to users as possible. Thus, in accordance with the description of the infrastructure and processes (cf. 3.6.2), it is proposed that the recording of the context of a local analysis be conducted through a logbook. During the entire analysis process, the researcher goes through a set of activities, where he/she has to obtain the data, analyse it and produce new results, according to the requirements of the experiment. Specifically, during the data analysis, the researcher has to describe the steps that led up to that point in the analysis, and what software and versions have been used. Basically, the context (the environment in which a task operates) of the local analysis should be recorded in the logbook. And it is precisely at this point of the process (data analysis) that the researcher has to interpret and make decisions on the results, according to histograms, plots, charts, etc. Another important issue is the number of iterations that a researcher performs during an analysis. This number can range from one (very unlikely) to n. In each iteration, the researcher has to deal with different data, to which he/she will apply a particular set of applications to obtain the desired result. Therefore, it is necessary that the logbook follow a well-defined structure, so that the recording of activities during a local analysis process follows a standard that can be understood by all members of a collaboration, as well as by other researchers who want to repeat the analysis process years later. 6.2.
Recommended contents for logbook
Each job or project can usually be broken down into four major phases: background, planning, execution and study of results. In a typical laboratory or pilot-scale experiment that might last from a few hours to a few months or years, contents such as the purpose and the experimental plan should be recorded. However, according to the evaluation made of the analysis process of LIP, the purpose of the logbook is to be as little intrusive and time-consuming as possible for researchers. Hence, assuming each collaboration as something that has been analysed, planned and contracted previously, it is only necessary to make reference to it. For example, if a research project was the subject of a contract with an entity like the European Commission, CERN, etc., there will already be at least one document with this contract, containing the objectives, work plan, etc., so it should already be referenced in the organisation. Moreover, the proposed DMP addresses some issues that are relevant to the logbook; there is even a point at which these two concepts intersect, where some existing information in the DMP can simply be referenced in the logbook, making it unnecessary for the researcher to collect all the information by value during a local data analysis.
Moreover, in the case of Particle Physics (and so in LIP), researchers do not follow a well-defined sequence of steps, so the logbook cannot have a rigid structure. For this, we propose a categorization of the records that the researcher registers during his/her local analysis.
6.2.1. Introduction/Purpose
The purpose section of a logbook is where the researcher explains the reason for doing the data analysis in the first place. This section focuses on the specific short-term goal of the work. The purpose of the work should be referenced through the information that was described in the DMP.
6.2.2. Problem/Experimental Plan
The problem/experimental plan section provides a description of the problem and the planned experiment. This section should not be omitted, because readers will otherwise likely have trouble understanding what a researcher was doing. A clear statement of the problem means that the researcher understands what he/she wants to do and is focused on a particular approach to solving the problem. In order to clarify the analysis procedure, the researcher should draw a flowchart, outline, or numbered list of experimental steps. The experimental plan should be a look forward, describing the specific work to be done.
6.2.3. Infrastructure
The hardware and software used should be listed in a complete manner. The versions of the software, compilers and libraries used should be recorded. If the researcher developed new patches (to the software provided by the collaboration) to perform his/her analysis, and if they prove to be important to carrying out the analysis, they should be included in the logbook. The information provided in the DMP about the infrastructure should be combined with the information recorded in the logbook. In this case, the concept underlying the registration of the infrastructure is based on the concept of the CMDB, which is composed of components of the IT infrastructure and applications and/or services.
6.2.4. Procedures/Activities (Method)
This section might be considered the heart of the logbook, because in it the researcher actually records the observations that he/she makes during the course of a local analysis. This is a very clear, step-by-step list of actions that the researcher plans on doing during the data analysis. Each step should be short and very succinct in order to facilitate reading by others. All the decisions made during the local data analysis, according to the results obtained from intermediate analysis steps, should be recorded and duly justified.
6.2.5. Results
This section is where the researcher specifies all of his/her analysis data. It should consist of quantitative (numerical) data arranged in charts, plots and histograms, as well as qualitative (non-numerical) data. The results section will probably be long, because the researcher should describe all the results he/she gets, whether considered relevant to the analysis or not.
6.2.6. Analysis
This is where the researcher explains the meaning of the results. If the researcher needs to generate a graph or a chart, he/she must use the data obtained in the results section. If the researcher needs to explain why something happened, he/she needs to describe it here. If calculations are required, they belong here. The analysis section is the part of the analysis where he/she explains why his/her hypothesis is right or wrong, based on the data he/she has taken. This step of the registration process of the logbook can be performed in parallel with the execution of methods and simulations. This is particularly relevant in Particle Physics because it has iterative processes, where the results of a particular simulation may be used for the next iteration of the analysis. This section is where the researcher decides whether he/she needs to perform more steps in his/her analysis, using the results obtained up to this step as input for the next analysis. 6.2.7.
At the end of the data analysis, the researcher will ask for the assistance of other researchers to conduct a review of the results obtained. After the review is completed, two findings may occur: the revision was positive, i.e., the results were considered relevant and can be used for publications or for writing papers, theses, etc.; or the revision was negative, i.e., the results were not deemed suitable according to the expectations of the collaboration.
Table 8 – Match between business processes and logbook contents
The table matches the business processes of the local analysis process (e.g., Produce Final Data) against the records categories: Introduction/Purpose; Problem/Experimental Plan; Infrastructure; Procedures/Activities (Method).
Thus, the analysis process will be subject to a new iteration through the processing of new simulations. All the decisions and conclusions that led to the acceptance or rejection of the results should be described, justified and recorded as the conclusion of the analysis process.
A match between the categorization of records made by researchers (during a local data analysis) and the local analysis process can be observed in Table 8. 6.3.
Summary of logbook context
After presenting the proposal for the creation of a logbook, and according to the evaluation carried out on LIP's local data analysis, Table 9 and Table 10 summarize the information that must be recorded in the logbook during a local analysis. This information can be divided into two parts: information relating to the infrastructure (Table 9) and to the processes (Table 10), respectively.
Table 9 – Recommended information that should be recorded during a local analysis (Infrastructure)
- Software version: Software version control is often used for keeping track of incrementally different versions of electronic information. So, the researcher needs to know which software version was used on certain types of data.
- Compiler version: A compiler checks the legality of the statements in source code, imports from libraries, calls to functions, the management of variables of different scopes, etc. So, as with the software version, it is very important that the researcher record the version of the compiler used during the research.
- Libraries version: As with the software and compiler versions, libraries are an important asset in the development of code (applications and patches). It is important to record information about these three resources, since researchers often need to develop patches that, in combination with the software provided by an international collaboration, allow new results and consequently new scientific discoveries to be obtained.
- Hardware components: The record of hardware components (desktops, compute nodes, etc.) and their features during an analysis is important so that further re-executions of the data analysis have access to a complete record of all the infrastructure used, and not just the software or compiler version. Thus, future research has the possibility of modifying the analysis while having, a priori, knowledge of the basis of the analysis that generated a particular kind of result.
- Sketches of the equipment: The visual/graphical representation of the analysis environment can elucidate, better and faster, important details for a researcher who wishes to perform an analysis based on a logbook. Therefore, the description of the method should be made, where possible, through drawings.
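Part of the infrastructure information recommended in Table 9 can be captured automatically rather than typed in by hand. A minimal Python sketch follows; the particular set of captured fields is an illustrative assumption, and a real logbook would extend it with the collaboration's own software and library versions:

```python
# Sketch: automatically capture part of the "Infrastructure" record for the
# logbook (software version, compiler, hardware architecture). The chosen
# fields are illustrative; a real analysis would also record the versions of
# the collaboration's own software, patches and libraries.
import platform
import sys

def capture_infrastructure() -> dict:
    """Return a snapshot of the local software/hardware environment."""
    return {
        "python_version": platform.python_version(),  # software version
        "compiler": platform.python_compiler(),       # compiler that built it
        "machine": platform.machine(),                # hardware architecture
        "system": platform.system(),                  # operating system
        "executable": sys.executable,                 # exact interpreter path
    }

record = capture_infrastructure()
```

Capturing this automatically at the start of each analysis session keeps the logbook's infrastructure section complete without extra effort from the researcher.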
Table 10 – Recommended information that should be recorded during a local analysis (Processes)
- Description of the activity: The purpose/objective of the analysis is defined in order to find out new information and ideas, to answer a question raised previously, etc.
- Start date: Specify the start date of the analysis.
- Problem and planned analysis: Provide a description of the problem and the planned analysis.
- Iteration of analysis steps: During an experiment, a researcher would normally have to repeat the data analysis more than once (the output of one step of the analysis can be the input of the next iteration). Thus, it is necessary to keep the number of iterations that were performed during an analysis.
- Observation events of a specific process: Observational analysis events involve watching and monitoring the environment of the analysis. All the relevant steps of an analysis should be recorded and defined.
- Data products: During the various iterations performed during an experiment, the researcher has access to multiple pieces of information (data products). Much of this information is represented through graphs, plots and histograms that the researcher has to interpret in order to make decisions about the next step of the analysis. This type of information also helps the researcher to analyse the data and the patterns of the data analysis.
- Decisions, comments and descriptions (facts of decisions): After performing a set of iterations of an analysis, the researcher must make certain decisions regarding the data obtained. Likewise, the researcher has to make decisions related to the resources and infrastructure used. For this, it is necessary to take notes and make comments in the logbook in order to justify decisions made during the analysis research process.
- Conclusions about the experiment: Summarize the goal of the analysis, what was done, and what the researcher found. Usually, at the end of the analysis, the researcher obtains the answer to the question raised at the beginning of the analysis through the hypothesis considered.
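The process information recommended in Table 10 — activity, start date, iterations with their observations, decisions and conclusions — lends itself to incremental recording. A minimal sketch follows; the class and field names are illustrative assumptions, and the example values are hypothetical:

```python
# Sketch of the per-process records of Table 10: each iteration of the
# analysis keeps its inputs, the observations made, and the decision that
# justified the next step. Names and example values are illustrative.
from dataclasses import dataclass, field
from datetime import date
from typing import List

@dataclass
class Iteration:
    number: int
    inputs: List[str]
    observations: List[str] = field(default_factory=list)
    decision: str = ""  # justification for the next step

@dataclass
class ProcessRecord:
    activity: str
    start_date: date
    problem: str
    iterations: List[Iteration] = field(default_factory=list)
    conclusion: str = ""

    def next_iteration(self, inputs: List[str]) -> Iteration:
        """Open a new, automatically numbered iteration of the analysis."""
        it = Iteration(number=len(self.iterations) + 1, inputs=inputs)
        self.iterations.append(it)
        return it

rec = ProcessRecord("Event selection", date(2012, 5, 1), "Estimate background rate")
first = rec.next_iteration(["dataset_D1"])
first.observations.append("Histogram shows an excess in the control region")
first.decision = "Re-run with tighter selection cuts"
# The output of one iteration becomes the input of the next, as in Table 10.
second = rec.next_iteration(first.inputs + ["tighter_cuts"])
```

Keeping the iteration count and the decision text together is what later allows a reader to follow why each step of the analysis was taken.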
6.4. The Record Point (RP)
During the assessment of the scenario, it was found that the main requirement in such activities (collaborations in e-science scenarios) is that the business be endowed with a certain property, such as DP. Therefore, it is necessary to record the information manipulated during a collaboration, specifically in the case of a local analysis, through the definition of a set of information structures that assist in this process of preserving the context (both at the level of the infrastructure and of the activities carried out by researchers). In the particular scenario of local physics analysis, there is no information or centralised system keeping track of the data and steps performed along the process. The main objective, related to the preservation of the context of the data analysis, is replay-ability – the researcher should be able, at any time during a local analysis, to go back and repeat a specific task or change a particular step in the analysis. Likewise, being able, at the end of the local analysis, to repeat some step or the entire analysis with different parameters is another of the desired requirements for this type of collaboration. The way this information is recorded in the logbook and updated by researchers is a critical point during the entire process. So, as in SACM, where the configuration control activity controls the recording of CIs in the CMDB (i.e., ensures that the record is done with appropriate controlling documentation), associated with the concept of the logbook there arose a notion that comprises the specific points at which the researcher might consider it necessary to record some information in the logbook (due to a change that alters the state of the process). This notion was named the Record Point (RP). 6.4.1.
Definition of RP
A RP refers to the systematic recording of information related to a local analysis. It is related to the information that will be recorded in the logbook, so it needs to be accurate, complete and understandable to researchers. The RP has as its underlying notion the concept of a snapshot, known from Distributed Systems, where it is used for recording a consistent global state of an asynchronous system.
Figure 18 – The components of RP
Recording the global state of a distributed system is an important paradigm since, for example, the global state might help in failure recovery, where a global state of the distributed system (called a checkpoint) is periodically saved and recovery from a processor failure is done by restoring the system to the last saved global state. In our case, the global state of the system should be examined and recorded for certain properties; for example, to preserve local analysis data, preserve local analysis tasks, replay a local analysis, or rerun a local analysis. RPs should be well-defined so that the information described in each one can be useful and understandable to those who will use it. Each RP must be defined through a set of information gathered during the local analysis, such as the data, the infrastructure and the set of actions that led the researcher up to that point in the analysis (Figure 18). This set of information defines the context of the analysis (the global state, in distributed systems terms). 6.4.2.
Recommended properties for RP
The purpose of a RP is to behave like a snapshot that collects specific points during an analysis, in order to enable researchers to have a perspective and a control as accurate as possible over the activities performed. So, if it is necessary to go back and repeat some steps of the analysis, this will be possible. Since the RP is an essential mechanism for the proper recording of the analysis in the logbook, it should be defined without ambiguity and should take into account the properties of DP.
Table 11 – Recommended RP properties and respective descriptions
- Integrity: Integrity is a concept of consistency of actions, values, methods, measures, principles, expectations, and outcomes. Assuming that a RP is intact, we can conclude that the information recorded on it has not been tampered with, i.e., there has been no unauthorized modification of its content. In general, integrity leads us to two other important properties: correctness and consistency.
- Correctness: Correctness is the ability to perform exact tasks, as defined by a specification. Correctness is verified when the record is correct with respect to a specification. In this case, a record is considered correct when it is according to what the researcher registered.
- Persistence: Persistence refers to the characteristic of state that outlives the process that created it. A RP is considered persistent if the previous versions of it (if any) are preserved when it is modified. The records performed by a researcher should be kept throughout the entire process of local analysis. So, to be able to repeat steps of the analysis or to preserve all the activities performed during an analysis, it is necessary that these records remain stable; that is, once the records acquire a certain value, this value should remain unchanged (it can be changed exclusively by the researcher).
- Reliability: Reliability is an important issue in most systems concerned with preservation. Reliability is often related to security, trust or confidentiality. It is important that a RP cannot be changed by unauthorized persons, which would put the system in an incoherent state.
- Authenticity: Authenticity is related to security systems, which involve proof of identity. A RP should be authentic, and researchers should be persuaded to accept the changed object as authentic through establishing its identity and integrity. This does not mean that a record must be precisely the same as it was when first created. A record is considered to be essentially complete and uncorrupted if the content that it is meant to communicate in order to achieve its purpose is unaltered.
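The integrity property of Table 11 can be supported in practice with a content digest: store a hash when the RP is written, and any later tampering is detected because the recomputed hash no longer matches. A minimal sketch using SHA-256 follows; the serialisation format and field names are illustrative assumptions:

```python
# Sketch: support RP integrity with a content digest. When the RP is
# written, its SHA-256 is stored; later tampering is detected because the
# recomputed digest no longer matches. Field names are illustrative.
import hashlib
import json

def rp_digest(rp: dict) -> str:
    # Canonical serialisation so the same content always hashes identically.
    payload = json.dumps(rp, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

rp = {"data": ["dataset_D1"], "actions": ["run simulation"], "infra": ["sim-1.4"]}
stored = rp_digest(rp)

assert rp_digest(rp) == stored      # an unchanged record verifies
rp["actions"].append("undocumented edit")
assert rp_digest(rp) != stored      # tampering is detected
```

A digest alone does not prove who made a change (authenticity would additionally require signatures or access control), but it gives a cheap, unambiguous test that the recorded content is unaltered.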
According to the analysis of the state of the art (cf. 2.2.1) surrounding the properties of the DP environment, it is possible to propose a set of properties that define a RP (Table 11). The RP is a mechanism that enables researchers to establish synchronization points during the local analysis, so that they can recover activities at some point of the analysis or even recover the analysis in its entirety. Therefore, a question may arise related to the right moment to register a RP. This is a question that must take into account not only the current activity that the researcher is performing, but also all the steps that were executed previously. A RP should be saved whenever the researcher performs actions that create a break with the following activities. That is, when an activity is being performed that takes the process to a new state, different from the previous one, it is necessary to save a RP. For example, if a researcher is developing a patch for a particular piece of software, a RP should be saved between version changes. Likewise, RPs should be stored between iterations during a data analysis if the researcher uses different datasets, different simulations, etc.
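The RP as a snapshot of the analysis context — the data in use, the infrastructure, and the actions performed so far — can be sketched as a small immutable structure. The class and field names below are illustrative assumptions, mirroring the three components described above:

```python
# Sketch of a Record Point (RP) as a frozen snapshot of the analysis
# context: data, infrastructure and the actions performed up to that
# point. Names and example values are illustrative assumptions.
from dataclasses import dataclass, field
from typing import List

@dataclass(frozen=True)
class RecordPoint:
    data: tuple            # datasets referenced at this point
    infrastructure: tuple  # software/hardware identifiers in use
    actions: tuple         # steps performed up to this point

@dataclass
class Analysis:
    actions: List[str] = field(default_factory=list)
    record_points: List[RecordPoint] = field(default_factory=list)

    def save_rp(self, data: List[str], infra: List[str]) -> RecordPoint:
        """Freeze the current context as a new RP (a snapshot)."""
        rp = RecordPoint(tuple(data), tuple(infra), tuple(self.actions))
        self.record_points.append(rp)
        return rp

a = Analysis()
a.actions.append("apply patch v2 to simulation code")
rp1 = a.save_rp(["dataset_D1"], ["sim-1.4+patch2"])
a.actions.append("run simulation iteration 2")
rp2 = a.save_rp(["dataset_D2"], ["sim-1.4+patch2"])
# rp1 is frozen: it still describes the state before iteration 2,
# which is what makes going back to an earlier step possible.
```

Freezing each snapshot is a direct way of obtaining the persistence property recommended for RPs: earlier record points survive unchanged as the analysis moves on.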
7. CONCLUSIONS
At this point, the evaluation of the proposals exposed in sections 5 and 6 will be carried out. However, according to the research methodology followed (cf. 1.4), this work is not based on the technical-scientific process of analysis, hypothesis, solution, implementation and validation. In these areas of IS, where there are essentially problems within organisations, the final results are mostly techniques or methods and not exactly measurable artefacts. The AR method has the particularity of proposing that research validation should be based on the study carried out during the analysis process. So, according to this method, the proposed solution (cf. 4) and the proposals (cf. 5 and 6) were based on a study of the state of the art (cf. 2) and the problem analysis (cf. 3). Therefore, the evaluation will be supported by the study conducted and the findings presented in the state of the art and proposed solution, respectively. Thus, this section explains in detail the decisions and solutions proposed in sections 5 and 6. 7.1.
Application of the proposal
LIP is a scientific and technical association of public utility whose goal is research in the fields of experimental HEP and associated instrumentation. This association is interested in the long-term preservation and recovery of the Particle Physics data analyses performed by its researchers. The goal is to enable the systematic archival of all the analysis steps in such a way as to facilitate their replay if and when the need arises. The motivation to revisit past analysis processes may arise many years after the initial work was performed, and may originate from researchers who are unaware of the original analysis details. LIP was used as an institution in the e-science domain to perform requirements elicitation regarding the problem identified in this type of research organisation. In order to address the requirements identified at LIP, a study was made of the main concepts and current best practices in e-science communities. Consequently, two concepts were considered: the DMP and the logbook. The main idea consists in the identification of a problem of business governance in e-science organisations that develop and execute research projects in the context of collaborations with other organisations. For this problem we developed a point of view of preserving the execution of a collaboration, proposing a solution that combines the best practices of IT Governance with the requirements of a typical e-science scenario. The target institution to validate these proposals was LIP, since it deals with external, real-world dynamic data and research-generated project data, and has a range of data requirements. This highlighted many practical issues (e.g. sharing/transfer, ownership and IPR, confidentiality, anonymisation, multiple versions of both raw data and derived data requiring robust version control/tracking, required/permitted access rights, geographic boundaries), which are common to other disciplines and contexts. During the problem analysis (cf. 3), we found that there is a low level of knowledge about RDM amongst staff whose research discipline is not information management/computing.
Specifically in the case of LIP, the researchers (mostly physicists) are concerned mainly with the physics analysis itself and not with the process involving the activities and tasks performed during the data analysis. Thus, it was necessary to base the solution to the problem on two concepts derived from the analysis of the state of the art. The former, formalised in section 5, focuses on the information itself, based on the management, maintenance and documentation of the data obtained and produced as part of a collaboration. LIP, as an e-science institution where data analysis is performed through computing-intensive tasks that take advantage of different infrastructures (either local or distributed across different sites) and which handles large quantities of data from different sources, needs a DMP. The DMP, although not an innovative concept (it is fairly common practice in international e-science institutions), was deeply analysed and synthesised with the goal of producing a set of best practices and concrete recommendations for LIP. The second concept presented was the logbook. The analysis and proposal of this solution were based on the concept of the ELN (cf. 2.4.5). However, we propose an innovative concept, aligned with the best practices of IT Governance developed around the concept of the CMDB in ITIL. At LIP, by recording the changes triggered by the activities performed by the researcher during a data analysis, it is possible to create a set of RPs (a logbook) whose point of view is the preservation of the execution of a collaboration. In conclusion, the combination of these two concepts (DMP and logbook) allows institutions with a business similar to LIP's to take advantage of best practices and mechanisms that ensure appropriate information management, including the management, documentation and maintenance of data, as well as the ability to preserve the context of a collaboration.
7.2.
After carrying out this work, the high dimension of the problem of DP when interconnected with the good practices of IT Governance in e-science scenarios became evident. Especially in this area, organisations are increasingly sensitive to this kind of problem with regard to the organisation, management and maintenance of the data they handle. However, increased awareness by both organisations and researchers is still necessary in order to approach and overcome the problem of DP in e-science domains. The TIMBUS project, to which this work contributes through its business continuity component, demonstrates the work still to be done regarding the development of activities, processes and tools that ensure continued long-term access to business processes and their underlying infrastructure. Therefore, as future work and a possible continuation of this thesis, we can mention the implementation of a technological solution that monitors the infrastructure supporting the execution of a local analysis. Likewise, this technological solution should give the researcher the possibility to record all the steps that he/she considers relevant during a local analysis.
In this way, future researchers are provided with a well-defined process containing the information necessary to re-analyse a given local data analysis. Some of the concepts and the structure that can be adopted are proposed in this dissertation.
7.3.
This dissertation focused on two topics considered very important for the current scientific communities of e-science: information management and the preservation of the activities/tasks performed during a collaboration. The problem identified by studies conducted in the e-science area is directly bound to a problem of business governance in organisations in this field. These organisations develop and execute research projects in the context of collaborations with other organisations and therefore need to explore a set of new practices in order to be able to compete in this new era of science. This era is marked by several features that make the way of doing science completely different from what was done in the past. The concepts of data preservation, sharing, dissemination, reuse and management, among others, have changed the way we do science today, requiring it to be endowed with special features. During the research work, it became clear that DM is an essential process to ensure that diverse datasets can be efficiently collected, integrated/processed, labelled/stored, and then easily retrieved through time by the people who want to use them. On the other hand, it was vital to establish requirements for the long-term preservation of the workflows involved in large-scale mathematical simulations and data analysis. In this context, the aim of this dissertation was to develop a point of view of preserving the execution of a collaboration, proposing a solution that combines the best practices of IT Governance with the requirements of typical e-science scenarios. For that, two concepts were proposed: the DMP (cf. 5) and the logbook (cf. 6). The main objective of the DMP is to provide researchers with a mechanism for organising and archiving past, present and future data during a collaboration.
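The idea of the DMP as a mechanism for organising and archiving collaboration data can be illustrated with a minimal machine-readable record. The field names below are hypothetical, loosely inspired by the common DMP checklist themes surveyed in the state of the art; they do not follow any standard schema.

```python
# A minimal, hypothetical sketch of a machine-readable DMP record for a
# collaboration. The sections and field names are illustrative only.
dmp = {
    "project": "HEP data analysis (example)",
    "data": {
        "description": "Raw and derived datasets produced during the collaboration",
        "formats": ["ROOT", "CSV"],
        "versioning": True,
    },
    "storage": {"location": "institutional repository", "backup": "nightly"},
    "access": {"rights": "collaboration members", "embargo_months": 12},
    "preservation": {"retention_years": 10, "identifier_scheme": "DOI"},
}

def missing_sections(plan, required=("data", "storage", "access", "preservation")):
    """Report which top-level DMP sections are still empty or unanswered."""
    return [s for s in required if s not in plan or not plan[s]]

print(missing_sections(dmp))  # → []
```

Even such a simple completeness check illustrates why a structured DMP is preferable to free-text documentation: gaps can be detected mechanically before they become preservation problems.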
The logbook, as proposed in this dissertation, is an innovative concept, aligned with the best practices of IT Governance surrounding the concept of the CMDB in ITIL. Although an analogy has been made with the ELN (cf. 2.4.5), the two concepts (ELN and logbook) should not be confused, since the former consists of tools designed for areas where the analysis comprises well-defined processes, whereas the logbook is intended to record all the activities/changes considered relevant during a collaborative process. This dissertation had as validation support an institution making use of e-science, LIP. Consequently, the research method used was AR (cf. 1.4), since it suggests that the researcher should try a theory with individuals in real environments. In this context, a series of meetings with members of LIP took place, in order to understand their concerns within a collaboration and to validate the proposals. However, it is recognised that these validations were carried out in a light fashion, with no metrics or consistently defined validation process.
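The CMDB-style logbook described above — configuration items, the relationships between them, and the changes a researcher records against them — can be sketched as follows. This is an illustrative sketch only; the class and method names (`Logbook`, `register_ci`, `relate`, `update_ci`) are hypothetical and not part of ITIL or of any tool discussed in this dissertation.

```python
from datetime import datetime, timezone

class Logbook:
    """Hypothetical CMDB-style logbook: CIs, relationships, change history."""

    def __init__(self):
        self.cis = {}        # ci_id -> current attributes
        self.relations = []  # (source_ci, relation, target_ci)
        self.changes = []    # ordered change records (the logbook proper)

    def register_ci(self, ci_id, **attrs):
        self.cis[ci_id] = dict(attrs)
        self._log(ci_id, "registered", attrs)

    def relate(self, source, relation, target):
        self.relations.append((source, relation, target))

    def update_ci(self, ci_id, **attrs):
        # A change triggered by a researcher's activity during the analysis.
        self.cis[ci_id].update(attrs)
        self._log(ci_id, "updated", attrs)

    def _log(self, ci_id, action, detail):
        self.changes.append({"ci": ci_id, "action": action, "detail": dict(detail),
                             "when": datetime.now(timezone.utc).isoformat()})

book = Logbook()
book.register_ci("dataset-2012", kind="dataset", source="detector run 2012")
book.register_ci("analysis-sw", kind="software", version="1.0")
book.relate("analysis-sw", "consumes", "dataset-2012")
book.update_ci("analysis-sw", version="1.1")  # e.g. a patch applied mid-analysis
```

The ordered `changes` list is what distinguishes this from an ELN: it captures the evolution of the analysis environment itself, not only the scientific observations.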
Appendixes
Appendix A
The real business value of the CMDB, as a standalone implementation focusing only on the requirements of ITIL, is limited. Vendors tried to fight that by adding some extra functionalities that transformed the CMDB into something much more powerful (Table 12).

Table 12 – CMDB functions

Infrastructure modelling: Provides a comprehensive model of the IT infrastructure for complete support of both ITSM and IT Asset Management (ITAM), modelling not only the CIs in the infrastructure but also the relationships between them, thus providing a service-oriented view of the infrastructure. Aggregate systems, IT services, business services and any other logical structures can be modelled, with complete flexibility over the type of relationships represented in the CMDB.

Federation: Draws data from multiple sources together into a single location to promote universal visibility of all components of the infrastructure and of the activities carried out over them. In a federated CMDB, the information of some CIs is not stored in the CMDB itself but in an external one. This enables better management of all the CIs in an organisation, as it allows the construction of domain-specific CMDBs with more flexibility than a monolithic database model.

Visualisation and mapping: A service-oriented view of the infrastructure is essential to accurately assess the business impact of changes. This function provides graphical representations of the infrastructure, linking CIs to the dependent systems and the services they provide. It acts as a powerful tool to support risk-free change management, since an accurate assessment of business impact can be made; different stakeholders should be able to access the views that are meaningful to them.

Reconciliation: With feeds from multiple sources, data must be consolidated to ensure that records are complete (data is reconciled before being imported into the CMDB) without introducing duplicates. This function is responsible for the data quality of the CMDB. There are always situations where only people can decide whether an introduced CI already exists in the CMDB or not.

Discovery and integration: Provides automated discovery tools and a number of integration points that enable external tools and applications to update the CMDB on an as-it-happens basis, so that the CMDB can be updated with the changes under study.

Branching and merging: A branching structure for the data stored in the CMDB, with merging capabilities that enable the creation of alternate branches to test changes and then commit them to the baseline. Once an inappropriate change is detected, a notification to the change management workflow should be triggered, alerting the person responsible for the IT domain where the change will take place so that the situation can be remediated.

Tracking and reporting: Provides a single, comprehensive, and easily accessible source of tracking information for reporting purposes, eliminating, as much as possible, the need for manual data gathering and consolidation.

Configuration control and verification: It is critical to ensure that all changes are carefully controlled through best-practice change management processes. This function provides a list of all the people authorised to approve changes and the types of changes each person is authorised to approve; a list of all the people authorised to implement changes and the types of changes each person is authorised to implement; and a list of authorised configurations for all IT technology assets.
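The reconciliation function described above can be illustrated with a small sketch: CI records arriving from multiple discovery feeds are consolidated by a unique key, and records whose identity cannot be decided automatically are set aside for a person to review. The `reconcile` function and the feed structure are hypothetical illustrations, not part of any CMDB product.

```python
# Hypothetical sketch of CMDB reconciliation: consolidate CI records from
# several discovery feeds, avoiding duplicates, and defer ambiguous records
# to manual review (only people can decide whether those CIs already exist).
def reconcile(feeds):
    cmdb, review = {}, []
    for feed in feeds:
        for ci in feed:
            key = ci.get("serial")      # unique identifier, when present
            if key is None:
                review.append(ci)       # identity undecidable automatically
            elif key in cmdb:
                cmdb[key].update(ci)    # same CI seen twice: consolidate
            else:
                cmdb[key] = dict(ci)
    return cmdb, review

feed_a = [{"serial": "srv-01", "os": "linux"}, {"name": "printer"}]
feed_b = [{"serial": "srv-01", "ram_gb": 64}]
cmdb, review = reconcile([feed_a, feed_b])
# One consolidated "srv-01" record; the printer awaits manual review.
```

Consolidating before import, as in this sketch, is what keeps the CMDB free of the duplicate records that multiple independent feeds would otherwise introduce.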
Comparison between business and scientific workflows’ features
Table 13 – Comparison between business and scientific workflows’ features 
Features Implementation Vs. Modelling
Experimental Vs. Business driven goals
Multiple workflow instances
Users and roles
Control-flow focus Vs. Dataflow
Dataflow computations Vs. Service invocations
Business workflow Develop a common understanding of the process that involves different persons and various information systems. The outcome of a business workflow is known before the workflow starts. The goal is to execute the workflow in a heterogeneous technical and organizational environment and, thereby, to contribute to the business goals of the company. Handle large numbers of cases and independent workflow instances at any given time. For example, each instance of an order workflow makes sure that the particular customer receives the ordered goods, and that billing is taken care of. Usually involve numerous people in different roles (in particular human interaction workflows). A business workflow system is responsible for distributing work to the human actors in the workflow. A A→B workflow typically means B can only start after A has finished, i.e., the edge represents control-flow. Dataflow is often implicit or modelled separately in business workflows. There are usually no data streams. An activity gets its input, performs some action, and produces output. An order arrives, it is checked, and given to the next activity in the process. In typical enterprise scenarios, each activity invokes a service that in turn uses functionality provided by some underlying enterprise information system.
Scientific workflow Developed with executability in mind, i.e., workflow designs can be viewed as executable specifications. A typical scientific workflow can be seen as a computational experiment, whose outcomes may confirm or invalidate a scientific hypothesis, or serve some similar experimental goals.
In scientific workflows, truly independent instances are not as common. Instead, large numbers of related and interdependent instances may be invoked, e.g., in the context of parameter studies. Largely automated, with intermediate steps rarely requiring human intervention. The nature of these interactions is usually different, i.e., no work is assigned, but runtime decisions occasionally require user input. In a scientific workflow, an edge A→B typically represents dataflow, i.e., actor A produces data that B consumes. In dataflow-oriented models of computation, execution control flows implicitly with the data, i.e., the computation is data-driven. Data is often streamed through independent processes. These processes run continuously, getting input and producing output while they run. The input-output relationships of the activities are the dataflow, and ultimately, the scientific workflow. As a result, a sequence of actors A→B→C achieves pipeline concurrency, since they work on different data items at the same time.
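The pipeline concurrency described for the scientific-workflow row can be illustrated with a small dataflow sketch. In the example below, three actors A, B and C are modelled as threads connected by queues, so execution is data-driven and the actors can work on different data items at the same time; the stage functions and queue wiring are assumptions made purely for illustration.

```python
import threading
import queue

SENTINEL = object()  # marks the end of the data stream

def stage(fn, inbox, outbox):
    # Each actor runs continuously: consume an input item, produce an output.
    while True:
        item = inbox.get()
        if item is SENTINEL:
            outbox.put(SENTINEL)  # propagate end-of-stream downstream
            break
        outbox.put(fn(item))

# Pipeline A -> B -> C: the edges are dataflow, execution is data-driven.
q_in, q_ab, q_bc, q_out = (queue.Queue() for _ in range(4))
actors = [
    threading.Thread(target=stage, args=(lambda x: x + 1, q_in, q_ab)),   # A
    threading.Thread(target=stage, args=(lambda x: x * 2, q_ab, q_bc)),   # B
    threading.Thread(target=stage, args=(lambda x: x - 3, q_bc, q_out)),  # C
]
for t in actors:
    t.start()

for item in [1, 2, 3]:        # stream data items through the pipeline
    q_in.put(item)
q_in.put(SENTINEL)

results = []
while (item := q_out.get()) is not SENTINEL:
    results.append(item)
for t in actors:
    t.join()
# results holds (x + 1) * 2 - 3 for each input x, i.e., [1, 3, 5]
```

While item 1 is being processed by C, items 2 and 3 can already be inside B and A, which is the pipeline concurrency the table refers to.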
ELN products currently on the market
Currently, ELNs are gradually replacing PLNs, since the former have several advantages, providing the experimental knowledge needed for informed decision making at both the laboratory and management levels. Table 14 and Table 15 depict a survey of ELN products currently on the market and complement the subject discussed in section 2.4.5. Table 14 – ELN companies, solutions and descriptions
Solutions and companies | What they manage
Symyx Notebook (Accelrys)44 — Manages the flow of information, tasks and materials among researchers, software and instruments, within and between labs; refines processes for better efficiency and regulatory compliance.
Spectrus (ACD/Labs)45 — Offers support for multiple analytical chemistry techniques and all major instrument vendor formats; enables research organisations to extract, retain and leverage chemical and analytical knowledge in ways not previously possible; retains the progression of human interpretation from analytical data to chemical knowledge and provides a collaborative project environment that can capture experimental information today and foster projects and decisions in the future. With original interpretations and annotations preserved to offer chemical context, all analytical data can later be reviewed, reprocessed and reinterpreted in the same environment at the researcher's desktop.
LogiLab ELN (Agaram)46 — Designed for quality control and analytical development labs; captures complete test-procedure-related data as well as the original raw data from the instruments; performs complex calculations; data captured with the ELN is automatically version controlled, and at the same time ELN templates can be released in a controlled manner.
LabCollector ELN (AgileBio)47 — Stores, organises, finds and shares laboratory work; customisable (custom fields, specific page templates, workflow design, mixed solutions); includes an electronic signature validation system that allows data integrity checks; includes a diagram designer tool to design both simple and complex chemical reactions and experimental diagrams; offers unlimited users, books, experiments and pages, spreadsheets, workflow design and a search engine.
SeaHorse Scientific Workbench (BSSN Software)48 — Captures, analyses and shares analytical data; consolidates raw and result data from multiple experimental techniques in a single tool (based on the ASTM49 AnIML50 standard); captures each step of such a workflow and presents it in its entirety; its navigation model allows researchers to explore experiments and samples; includes visualisation, annotation and reporting features; raw data can be captured directly from instrument software and stored together with interpreted results, images and annotations.
44 http://www.accelrys.com/eln
45 http://www.acdlabs.com
46 http://www.agaram.co.in
47 http://www.labcollector.com
48 http://www.bssn-software.com
49 American Society for Testing and Materials (ASTM) is a globally recognized leader in the development and delivery of international voluntary consensus standards.
50 AnIML (Analytical Information Markup Language) is an ASTM XML standard for storing and sharing analytical chemistry data.
Table 15 – ELN companies, solutions and descriptions (continuation)
Solutions and Companies Core ELN (Core Informatics)51
Studies Notebook 53 (Dotmatics)
What they manage
eStudy (iAdvantage Software)54
Labware ELN (LabWare)55
Waters Vision Publisher ELN (Waters)58
Captures, analyses, manages and shares data; 100% web-based, enabling researchers to share data and results organisation-wide in a fully compliant, 21CFR1152-validated environment; supports flexible data entry (data may be entered from Word, Excel, PDF, PowerPoint, etc.). Provides user and record authentication at every level; 100% web-based, supporting chemistry, biology and ad-hoc research; offers security and convenience to ensure efficient mining, reporting and visualisation of information; can be accessed on the cloud or locally, without requiring any installation on the researcher's workstation. Designs, captures, queries and reports study data; 100% web-based; users can manage, conduct and report multiple study types (preclinical/in vitro, toxicology, ecotoxicology and Ag biotech) with one software platform; users can design studies in eStudy manager: enter protocol information, identify experiments, select test articles, define and assign treatments and observation events to specific experiments, randomise test subjects, define sample events and automatically assign sample numbers; a report can then be generated: eStudy pulls notebook data and protocol information together to generate MS Word study report documents. Can operate in experiment-driven research environments and also provides a method execution mode suitable for QA/QC; provides instrument integration and flexible management of images and raw data files, and all operations are comprehensively audited. Focuses on ease of use, performance, and flexibility for research and IP protection workflows; CERF 4.5 extends real-time cross-platform collaboration, adds freehand sketching, semantic data forms, and visualisation of any data directly on the notebook page; annotation and social tagging were extended to help find information quickly.
Manages and shares research data and experiment results; gives customers a scalable, integrated research environment for managing and sharing data within a robust framework that protects valuable IP, streamlines lab processes and supports 21CFR11 compliance; users can design studies, conduct analysis and generate visualisations, and the environment offers customisable project types, states and workflows. ELN for analytical labs; supports non-specific, flexible development workflows, but also the procedure-driven, GMP59-relevant workflows of quality control labs; includes predefined worksheet templates with electronic interfaces to instrument software, instrument and consumable inventories, and advanced sample management capabilities.
51 http://www.coreinformatics.com
52 Title 21 CFR Part 11 of the Code of Federal Regulations deals with the Food and Drug Administration (FDA) guidelines on electronic records and electronic signatures in the United States. Part 11, as it is commonly called, defines the criteria under which electronic records and electronic signatures are considered to be trustworthy, reliable and equivalent to paper records.
53 http://www.dotmatics.com
54 http://www.iadvantagesoftware.com
55 http://www.labware.com
56 http://www.rescentris.com
57 http://www.ruro.com
58 http://www.waters.com
59 GMP (Good Manufacturing Practice) is the production and testing practice that helps to ensure a quality product.
International DMP practices
In the last few years, many international research funders have introduced into their data policies a requirement for data management and sharing plans to be part of research grant applications. Table 16, Table 17 and Table 18 describe some of the international DMP practices, complementing the description carried out in section 2.5.2. Table 16 – International DMP practices
Understand what research data is and why it needs to be managed; Appreciate legal, institutional and funding issues related to data; Learn how various DM methods can help you work more effectively with data; Develop an awareness of the DM services at ANU.
DMP Planning for Large Projects – Managing Research Data Infrastructures for Big Science
Addressed to people who have, or who have been landed with, responsibility for devising a DMP for a "big science" multi-institutional or multinational project with a need for a bespoke plan.
A Research DMP for the Department of Mechanical Engineering
Support principal researchers; project, data and research managers; and others, such as service providers, in performing data management at the project level.
Guidance and Requirements for NCCWSC/CSC DMP
Establish a data sharing policy requiring a DMP for all proposals and funded projects. This document provides guidance on what should be included in a data management plan and presents it in two stages: the Proposal Data Management Plan (PDMP) and the Research Data Management Plan (RDMP).
Research Data Management Plan
Template for researchers to fill in according to the research context in which they are involved. It specifies contents such as Documentation and Metadata, Storage and Backup, etc.
University of Bath
Table 17 – Curation policies and support services of the main UK research funders
Research funders covered: AHRC, BBSRC, CRUK, EPSRC, ESRC, MRC, NERC, STFC, Wellcome Trust.
Policy coverage columns: Published outputs; Data. Policy requirements columns: Data plan; Access/sharing; Long-term curation. Support provided columns: Repository; Data centre. Legend: full coverage; partial coverage; no coverage.
Table 17 terminology clarifications:
- Published outputs: a policy on published outputs, e.g., journal articles and conference papers;
- Data: a datasets policy or statement on access to and maintenance of electronic resources;
- Time limits: set timeframes for making content accessible or preserving research outputs;
- Data plan: requirement to consider data creation, management or sharing in the grant application;
- Access/sharing: promotion of Open Access (OA) journals, deposit in repositories, data sharing or reuse;
- Long-term curation: specifications on long-term maintenance and preservation of research outputs;
- Monitoring: whether compliance is monitored or action taken, such as withholding funds;
- Guidance: provision of FAQs, best practice guides, toolkits, and support staff;
- Repository: provision of a repository to make published research outputs accessible;
- Data centre: provision of a data centre to curate unpublished electronic resources or data;
- Costs: a willingness to meet publication fees and data management/sharing costs.
Retrieved from http://www.dcc.ac.uk/resources/policy-and-legal/overview-funders-data-policies
Table 18 – DM policies and research data requirements 
Research bodies: NCCWSC, Bath University, Monash University, OAK Law Project, QUT, Melbourne University, Melbourne Neuropsychiatry Centre, RELU Project, NASA.
Research funders: NSF – PHY, NSF – ENG, NSF – EHR, NSF – BIO, NSF – AGS, NSF – AST, NSF – CHE, NSF – SBES.
Research data and protocols (columns): Data structure/format; DM policy; DM budget; Future reuse 1; Archival and preservation; Technical requirements; Protection and IPR.
Key terminology: fully addressed; partially addressed; not addressed (implicitly); intentionally not addressed (cf. ).
NB: The numbering present in the cells of Table 18 is explained in .
This is a complete view of the LIP workflow, described in section 22.214.171.124-D.
[BPMN diagram omitted. It depicts the local analysis process at LIP: a LIP student/senior researcher retrieves internal experimental data and, if an external dataset is needed, requests and receives external data; the researcher then performs the analysis (analyzing physical observables from the data, running applications on the data, producing new data), looping while more analysis is needed; the results are then submitted for revision and feedback is received, leading either to a request for revision of the results or to production of the final result. Data objects: Experimental Data [External], [Internal], [Under Analysis], [Not Final], [Final].]
Figure 19 – Local analysis process
1. INTRODUCTION AND RESEARCH CONTEXT
1.1. Project information
- Type of project
- Name of researcher(s)
- Title of the project
- Name of organisation(s) funding the project
- Collaboration organisations
- Official duration of the project
- Brief description of the project's objectives
- Location of the project documents that provide the above type of information
1.2. DMP information
1.2.1. Requirements for DMP
- Are there funding body requirements to produce a DMP? If yes, give details about requirement documents/information
- Are there university requirements to produce a DMP? If yes, give details about requirement documents/information
- Are there requirements from any other body to produce a DMP? If yes, give details about requirement documents/information
1.2.2. Roles and responsibilities (Role and responsibility | Project team member)
- Completing the DMP
- Writing RDM protocol documents
- Setting up the access and sharing practices
- Other (specify)
These tables were originally built in an Excel file, so the green colour corresponds to text input by the user and the grey colour corresponds to predefined options that can be chosen by the user.
1.2.3. DMP version tracking (# | Version)
1.2.4. DM budget
Proposal budget portion allocated for DM activities:
- New data collected
- Data output
- Other (specify)
2. DATA COLLECTION AND PROCESSING
2.1. Using existing data
- Are you going to look for existing data in repositories, publications, etc.? If yes, give the name/location of the data source(s)
- Are you using existing data supplied by other researchers/organisations? If yes, give the name/location of the data source(s)
- Are you using your own data or data of another research team member? If yes, give the names of the datasets
- If you are using existing data, what are the conditions that underlie the data usage? (Dataset | Conditions of use)
2.2. Creating and capturing data
List briefly: the data creation and data capture methods you will use; what data processing will be required; what equipment, hardware and software you will use; what file types/sizes you will be using. (Creation method | Capture method | Data processing | Equipment, hardware, software | File type/size)
2.3. Data analysis
List briefly: the data analysis methods you will use; what equipment, hardware and software you will need; what file types/sizes you will be using. (Analysis method | Equipment, hardware, software | File type/size)
2.4. Data quality
2.4.1. Data structure/format (Data stage | Other (specify))
2.4.2. Contextual information
List briefly: what contextual information will be needed to make your data meaningful; how you will produce/capture this information; where it will be located. (Contextual information | How produced/captured | Location | Other (specify))
3. ACCESS AND SHARING
3.1. Controlling access
- Can data/outputs/publications be accessed by the scientific community? If yes, give the location of documents/information
- Can data/outputs/publications be accessed by members of the research group? If yes, give the location of documents/information
- Can data/outputs/publications be accessed publicly? If yes, give the location of documents/information
3.2. Data sharing
- Do you plan to share the data?
- Are there any requirements for you to share your data? If yes, give the location of requirement documents/information
- List any other reasons why you want to share the data
- List any reasons why you will NOT share the data
3.3. Future reuse
- Do you plan to reuse the data? If yes, give the actions required to reuse the data
- Do you plan for others to reuse your data? If yes, give the actions required to reuse the data
4. ARCHIVAL AND PRESERVATION
4.1. Data storage
- Specify how much data/associated documents in electronic form you anticipate you will collect
- Will you have enough computing resources to accommodate this?
- If you do not have enough physical/computing resources, how will you deal with this?
- Where will you store the data/documents? (Storage location | Data/documents)
4.2. Data security
- How will you ensure the security of the data/documents? (Activity | Security action required | Other (specify))
- How will you ensure the security of personal/sensitive data? (Activity | Security action required | Other (specify))
4.3. Data preservation
- Person responsible for identifying the need to preserve the data
5. PROTECTION AND IPR
5.1. Ethical and legal risk factors
Does your research involve the following activities?
- Human participants (including data and records)
- Commercial sensitivities
- Environmental issues
- HEP collaboration
- Other risk factor (specify)
5.2. Ethical issues related to research involving human participants
List the RDM issues associated with the ethical considerations of the project that are applicable and briefly describe how you will deal with them. (RDM issue | Issue applicable | Actions to address issue | Other (specify))
- Is university ethics approval required? If yes, has approval been given? If yes, give the location of relevant documents
- Is another organisation's ethics approval required? If yes, specify the name of the organisation(s); has approval been given? If yes, give the location of relevant documents
5.3. Other legal issues
List the RDM issues associated with the legal considerations of the project that are applicable and briefly describe how you will deal with them. (RDM issue | Actions to address issue)
- Environmental Information Regulations
- Memorandum of Understanding
- Other (specify)
5.4. IPR
- List the people/organisations who have IPRs to the data (Name of organisation/person | Name of datasets)
- Do you have an agreement on how IPR is to be handled? If yes, give the location of relevant documents
5.5. Other agreements
- List other agreements that you have established about your research data (Name of organisation | Type of agreement | Location of documents)
6. TECHNICAL REQUIREMENTS
6.1. Infrastructure and requirements
- Do you have enough resources to manage your data? If NO, what additional resources do you need? (Resource | Plan to obtain resources | Other (specify))
- Describe the current infrastructure of your data analysis
6.2. Interoperability
- Are there any requirements for interoperability? If YES, describe the organisation(s) and the requirement (Organisation)