A Digital Archiving System and Distributed Server-side Processing of Large Datasets

Julien Jomier (a), Stephen R. Aylward (a), Charles Marion (a), Joohwi Lee (b) and Martin Styner (b)

(a) Kitware Inc., 28 Corporate Drive, Clifton Park, NY, USA
(b) Neuro-Imaging Research Laboratory, University of North Carolina, Chapel Hill, NC, USA

ABSTRACT

In this paper, we present MIDAS, a web-based digital archiving system that processes large collections of data. Medical imaging research often involves interdisciplinary teams, each performing a separate task, from acquiring datasets to analyzing processing results. Moreover, the number and size of the datasets continue to increase every year due to recent advances in acquisition technology. As a result, many research laboratories centralize their data and rely on distributed computing power. We created a web-based digital archiving repository based on open standards. The MIDAS repository is specifically tuned for medical and scientific datasets and provides a flexible data management facility, a search engine, and an online image viewer. MIDAS enables users to run a set of extensible image processing algorithms from the web on selected datasets and to add new algorithms to the MIDAS system, facilitating the dissemination of users' work to different research partners. The MIDAS system is currently running in several research laboratories and has demonstrated its ability to streamline the full image processing workflow from data acquisition to image analysis and reporting.

Keywords: Grid computing, Image processing, Database, Remote processing.

1. INTRODUCTION

Medical imaging research often involves interdisciplinary teams, each performing a separate task, from acquiring datasets to analyzing processing results. Moreover, the number and size of the datasets continue to increase every year due to advances in acquisition technology. In order to streamline the management of images coming from digital sources such as MR and CT machines, hospitals have been relying on PACS systems. However, such systems are rarely made available to research teams because of security and confidentiality issues. Moreover, PACS systems have become increasingly complex and often do not fit into the scientific research pipeline.

We present an alternative to PACS systems called MIDAS. The MIDAS system targets research and development units and aims at managing digital content. It provides most of the functionality of a PACS system but is not limited to DICOM images: the system can store any digital media, such as Word documents, PDFs, and 3D surfaces. Once archived in the system, digital media can easily be made accessible to other research units, with several methods of transfer available, either via the Internet or via file sharing. Moreover, the stored digital content can be searched, visualized, and downloaded directly from the system.

Furthermore, one unique feature of the MIDAS system lies in its integration with grid computing environments. In order to facilitate its integration into different research groups, we have created a universal interface to several distributed computing environments. This interface is provided through BatchMake[1], an open-source tool for batch processing.

Next we present the main components of the MIDAS system. First, we present the digital archiving and data management aspects of the system. Second, we demonstrate how distributed computing is integrated to process datasets archived in the database. Finally, we present some current applications of our system.

Medical Imaging 2009: Advanced PACS-based Imaging Informatics and Therapeutic Applications, edited by Khan M. Siddiqui, Brent J. Liu, Proceedings of SPIE Vol. 7264, 726413 · © 2009 SPIE · doi: 10.1117/12.812227

2. DIGITAL ARCHIVING

We present, in this section, the core functionality of the MIDAS system: the digital online archiving system. We then describe a typical use of the system and, finally, demonstrate the online visualization tool, which allows browsing datasets from the convenience of a web browser.

2.1 Core Functionalities

The MIDAS system is built on top of the widely used open-source platform DSpace[2]. DSpace was initiated in 2002 by Hewlett-Packard and MIT, with funding from the Library of Congress, to provide a central repository for digital archives. DSpace is intended as a platform for digital preservation activities. The system is currently installed at over 250 institutions around the globe, from large universities to research centers. It is most commonly used by research libraries as an institutional repository; however, many organizations use the software to manage digital data for a project, subject repository, web archive, or dataset repository. On top of DSpace, the Handle System[3] provides a unique identifier to repositories all over the world and acts as a digital object identifier (DOI). A handle is comparable to a URL and allows the digital repository to be moved transparently to a different location. Finally, the DSpace system supports the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH). This protocol allows search engines, such as Google and Yahoo, to query DSpace for specific digital documents. As a result, most DSpace repositories worldwide are referenced by Google Scholar[4].

DSpace was primarily designed to archive text documents such as PDF and Word files; therefore, we have extended it to support medical and scientific metadata. The MIDAS system uses the XCEDE standard[5] for storing metadata related to medical datasets. XCEDE is an XML schema that provides an extensive metadata hierarchy for describing and documenting research and clinical studies. The schema organizes information into five general hierarchical levels: a complete project; studies within a project; subjects involved in the studies; visits for each of the subjects; and the full description of the subject's participation during each visit (a schematic example is sketched below). XCEDE was originally designed in the context of neuro-imaging studies and complements the Biomedical Informatics Research Network (BIRN) Human Imaging Database, an extensible database and intuitive web-based user interface for the management, discovery, retrieval, and analysis of clinical and brain imaging data. The close coupling between XCEDE and MIDAS allows for an interchangeable source-sink relationship between the database and the XML files, which can be used for the import/export of data to/from the database, the standardized transport and interchange of experimental data, the local storage of experimental information within data collections, and human- and machine-readable description of the actual data. The XCEDE format is also used by the Extensible NeuroImaging Archive Toolkit (XNAT)[6].

The MIDAS system offers a complete solution for data management. Users registered in the system can be organized into groups, and different user policies can be set for each collection of data, or even on an individual dataset basis. Policies include the right to upload a new dataset to a collection, the right to download datasets, the right to visualize a dataset, and more.
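To make the XCEDE hierarchy concrete, the fragment below sketches how a single acquisition might be organized across the five levels. This is an illustration only: the element and attribute names are invented for this sketch and are not taken from the actual XCEDE schema.

    <!-- Illustrative only: these are NOT the real XCEDE schema element names. -->
    <project id="brain-atlas-2008">
      <study id="baseline-mri">
        <subject id="subj001">
          <visit id="visit01" date="2008-05-19">
            <!-- Full description of the subject's participation during the visit -->
            <episode id="t1-acquisition" protocol="3D-T1">
              <dataset uri="hdl:0000/example-handle" format="DICOM"/>
            </episode>
          </visit>
        </subject>
      </study>
    </project>

Because the levels nest strictly, an export of any node (for example, a single subject) carries enough context to be re-imported into another installation, which is what enables the source-sink relationship between the database and the XML files described above.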
Once registered in the system, one can upload datasets via the web interface, via FTP, or using the MIDASClient, a standalone application providing a direct interface to MIDAS. Users can search for different datasets and download them in bulk. MIDAS also provides "electronic carts" to which users can add datasets to be downloaded or processed. Carts are containers for digital media and allow users to quickly group datasets. From a technological point of view, MIDAS uses PostgreSQL[7] as its main database and is written in the well-known scripting language PHP. Add-ons and other modules are written in C++. Next we describe a typical use of the system.

2.2 Typical Use

In this section, we describe a typical scenario to demonstrate how the MIDAS system can improve data management. Imagine three people working on a clinical study: Alice, who is responsible for acquiring images for the study; Martin, who manages an image processing laboratory responsible for analyzing the images acquired by Alice; and Steve, a statistician located at a different institution.

First, Alice receives volumetric images from her clinical collaborators; she logs into the MIDAS system and creates the proper collections of datasets. She uses the web interface to upload the datasets into the system. The metadata are automatically extracted from the datasets (DICOM or other well-known scientific file formats). She then adds more information about each dataset, such as demographic and clinical information, and changes the collection's policies to make it available to Martin. Martin is instantly notified that new datasets are available in the system and are ready to be processed.


Martin logs in and starts visualizing the datasets online. He visualizes each dataset as slices and also uses more complex rendering techniques to assess the quality of the acquisition. As he browses, Martin selects a subset of datasets of interest and puts them in his electronic cart. At the end of the session, he downloads the datasets in his cart in bulk and gives them to his software engineers to train the different algorithms. As soon as the algorithms are validated on the training datasets, Martin uploads the algorithms to the MIDAS system, selects the remaining testing datasets, and applies the processing pipeline to the full collection of datasets. The pipeline is automatically distributed to all the available machines in the laboratory, decreasing the computation time by several orders of magnitude. The datasets and reports generated by the processing pipeline are automatically uploaded back into the system. During this time, Martin can monitor the overall progress of the processing via his web browser. When the processing is done, Martin gives access to Steve in order to validate the results statistically. Even though he is located at another institution, Steve can access and visualize the results, make comments, and upload his statistical analysis into the system.

We have seen how the MIDAS system provides a central location for storing, managing, and analyzing digital media. Next we present one unique feature of the system: the online visualization of datasets.

2.3 Online Visualization

As mentioned in the previous section, one of the features of the MIDAS system is the online visualization of datasets. No download is required, and visualization can be performed almost instantly from the convenience of a web browser. Currently, the MIDAS system allows online visualization of slices of volumetric datasets. In the backend, a server generates and sends slices as requested by the clients; pre-buffering and caching of the datasets are used to improve interactivity. There are two main advantages to online visualization. First, the datasets do not have to be completely downloaded to start the rendering: only the part of the dataset needed for rendering is sent to the client, so visualization time is significantly decreased. Second, complex visualization algorithms can process the data remotely and send only the final rendered image to the client. This has proven to be extremely important when the client machine is not able to perform the rendering itself due to memory, video, or CPU limitations.
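The client-server protocol behind the viewer is not specified in this paper; as a purely hypothetical sketch, a browser-side viewer could request individual slices along the following lines, where the endpoint name, parameters, and item identifier are all invented for illustration:

    # Hypothetical exchange; the real MIDAS endpoint and parameters may differ.
    GET /midas/item/1024/slice?orientation=axial&index=42 HTTP/1.1
    Host: midas.example.org

    HTTP/1.1 200 OK
    Content-Type: image/jpeg
    Cache-Control: max-age=3600

    <JPEG-encoded pixels of axial slice 42>

Because each response carries a single rendered slice rather than the whole volume, interaction can begin as soon as the first request completes, and server-side caching makes repeated browsing of the same slices nearly instantaneous.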

Fig. 1. Left: Main page of the MIDAS system displaying a digital microscopy dataset. Right: Interactive image gallery of datasets stored in the public MIDAS installation hosted by Kitware.

We are currently improving the online visualization to include 3-dimensional interaction with large datasets, such as surface and volume rendering. Next we describe how the MIDAS system handles the processing of datasets in a distributed manner.


3. SERVER-SIDE PROCESSING

The MIDAS system combines the digital archiving system previously described with the power of distributed computing. Digital datasets are increasing in size as the resolution of imaging devices keeps improving; as a result, more processing units are needed to achieve a decent level of computational throughput. We have therefore integrated an open-source tool for distributed computing into MIDAS.

3.1 Grid Processing

Grids offer a way of optimally using the information technology resources inside an organization. Grid computing has been publicized by projects like SETI@home for extraterrestrial research and Folding@home for protein folding research; the latter reached more than 3 million computers at its peak. A grid can be composed of heterogeneous machines (Linux, Windows, and Mac) and can be used to manage workload on a dedicated cluster of computers and/or to farm out work to idle desktop computers, so-called cycle scavenging. Several open-source and commercial grid engines are available, and we have been particularly interested in the Condor High-Throughput Computing System[8] and the Sun Grid Engine[9].

The Sun Grid Engine (SGE) is an open-source batch-queuing system, developed and supported by Sun Microsystems. SGE is typically used on a computer farm or high-performance computing cluster and is responsible for accepting, scheduling, dispatching, and managing the remote and distributed execution of large numbers of standalone, parallel, or interactive user jobs. It also manages and schedules the allocation of distributed resources such as processors, memory, and disk space.

Condor is a high-throughput computing software framework for coarse-grained distributed parallelization of computationally intensive tasks. Condor can seamlessly integrate both dedicated resources and non-dedicated desktop machines into one computing environment. Condor was initiated and is currently developed by the University of Wisconsin-Madison and follows an open-source philosophy; it is currently licensed under the Apache License 2.0. Condor has been used by government agencies such as the NASA Advanced Supercomputing facility, which connects approximately 350 workstations. Each workstation runs a daemon that watches user I/O and CPU load; when a workstation has been idle for a certain amount of time, a job from the batch queue is assigned to it and runs until the daemon detects a keystroke, mouse motion, or high non-Condor CPU usage, at which point the job is removed from the workstation and placed back on the batch queue. The main Condor features include a mechanism to describe job dependencies and the ability to use Condor as a front-end to submit jobs to other distributed computing systems. Condor is one of the job schedulers supported by GRAM (Grid Resource Allocation Manager), a component of the Globus Toolkit[10]. Furthermore, Condor-G allows Condor jobs to be forwarded to foreign job schedulers, and support for Sun Grid Engine is currently under development. The MIDAS system has been successfully integrated with Condor, and we are currently investigating support for SGE.
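To give a flavor of how an individual job is handed to Condor, the following is a minimal submit description file. The keywords (universe, executable, requirements, queue) are standard Condor syntax, but the executable, arguments, and file names are hypothetical:

    # Minimal Condor submit description file; job and file names are hypothetical.
    universe     = vanilla
    executable   = brain_strip
    arguments    = -t transform.txt input.mha output.mha
    output       = job_$(Process).out
    error        = job_$(Process).err
    log          = job.log
    # Only match machines of a given architecture and operating system.
    requirements = (Arch == "INTEL") && (OpSys == "LINUX")
    queue 1

Submitting this file with condor_submit places the job in the batch queue, where it waits until a machine matching the requirements expression becomes idle.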
Next we describe the BatchMake tool and show how it interfaces with the grid computing environment.

3.2 BatchMake

BatchMake is an open-source, cross-platform tool for batch processing large amounts of data, either locally or on distributed systems. BatchMake uses a scripting language with specific semantics to define loops and conditions. One of the strengths of BatchMake is its ability to provide a scripting interface to any command-line application. The arguments of a command-line executable can be wrapped automatically if the executable is able to produce a specific XML description of its command-line arguments. Manual wrapping can also be performed using the "Application Wrapper" interface provided by BatchMake. Currently, BatchMake can wrap command-line applications that use the MetaIO library[11] or any Slicer3[12] execution model that uses the GenerateCLP library. Once the executables have been wrapped, they can easily be called within a BatchMake script, where each command-line argument is named explicitly. For instance, if an executable expects the command-line argument '-t [transformfile]' to specify a transformation, it can be invoked within BatchMake by the command SetAppOption('MyExecutable','transform','transformfile.txt'). The BatchMake scripting language allows scripts to be written and debugged easily. Once a script is written, BatchMake can run it locally or distribute the jobs using the Condor system described previously.
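Putting these pieces together, a script that loops over subject directories and runs a wrapped brain-stripping executable might look roughly as follows. This is a sketch only: the application name, option names, and paths are hypothetical, and the exact syntax may vary between BatchMake versions.

    # Hypothetical BatchMake script: run a wrapped 'BrainStripping' tool on each subject.
    Set(baseDirIn  '/data/atlas/input/')
    Set(baseDirOut '/data/atlas/output/')

    SetApp(brainStrip @BrainStripping)                 # previously wrapped executable
    SetAppOption(brainStrip.fractional_threshold.value 0.5)

    ListDirInDir(subjects ${baseDirIn})                # one iteration per subject directory
    ForEach(subject ${subjects})
      SetAppOption(brainStrip.infile.filename  ${baseDirIn}${subject}/t1.mha)
      SetAppOption(brainStrip.outfile.filename ${baseDirOut}${subject}_t1_brain.mha)
      Run(output ${brainStrip})                        # each iteration can become a grid job
    EndForEach(subject)

When such a script is distributed, each ForEach iteration is unrolled into an independent job, as described next.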



Fig. 2. Left: BatchMake graphical user interface. Right: Condor Watcher application to monitor the grid processing computation in real time.

A key feature of Condor is its support for job dependencies, which are expressed as a directed acyclic graph (DAG). When distributing jobs, BatchMake first creates the appropriate DAG based on the input script. By default, BatchMake performs loop unrolling and considers each scope within a loop as a different job, which is ultimately distributed to a different node. This allows independent jobs to be distributed automatically, as long as each iteration is in fact independent; BatchMake also provides a way for the user to specify that a loop should be executed sequentially instead of in parallel. Before launching the jobs on the grid, the user can visualize the DAG to make sure it is correct (an illustrative DAG file is sketched below). Then, while the jobs are running, the user can monitor the progress of each job using the CondorWatcher executable distributed with BatchMake.

We have integrated BatchMake into MIDAS. First, we added web support to BatchMake, which allows researchers to upload their BatchMake scripts with the associated parameter descriptions directly into MIDAS. When a user decides to process a collection of data using a given BatchMake script, MIDAS automatically parses the parameter description and generates an HTML page. The HTML page provides an easy way to enter tuning parameters for the given algorithm; once the user has specified the parameters, a BatchMake configuration file is generated and the script is run on the selected data collection(s). This flexibility makes it easy to share processing pipelines between organizations, since the BatchMake script describes the pipeline.

MIDAS also provides online monitoring of the grid processing in real time, using Ajax technology to deliver real-time updates of the distributed computations. Furthermore, the results of the processing at each node of the grid are instantly transferred back to the MIDAS system and can be used to generate what we call a "BatchBoard". A BatchBoard is a dashboard for batch processing: it allows users to quickly check the validity of the processing by comparing the results with known baselines. Each row of the dashboard corresponds to a particular processing stage and is reported in red if the result does not meet the validation criterion.
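As a rough illustration of the loop unrolling described above, a three-subject ForEach loop followed by a summary step could translate into a Condor DAG file like the one below. The JOB and PARENT/CHILD keywords are standard Condor DAGMan syntax; the job and file names are invented:

    # Hypothetical DAG produced by unrolling a BatchMake ForEach loop.
    # The three subject jobs are independent and may run in parallel;
    # the report job starts only after all of them have succeeded.
    JOB subj01 subj01.submit
    JOB subj02 subj02.submit
    JOB subj03 subj03.submit
    JOB report report.submit
    PARENT subj01 subj02 subj03 CHILD report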



Fig. 3. Left: The “BatchBoard” showing a nightly regression test. Right: Online graph generation from results stored in the MIDAS system.

4. PUBLIC INSTANCES

The MIDAS system is currently used in production at several institutions, described next.

4.1 Optical Society of America

In October 2008, the Optical Society of America (OSA) launched the first interactive scientific publishing (ISP) system[13]. This initiative allows scientists to expand upon traditional research results by providing software for interactively viewing underlying source data and for objectively comparing the performance of different technologies. The ISP system enhances standard scientific publishing by adding interactive visualization: authors can now create 3-dimensional visualizations of their datasets, add annotations and measurements in 3D, and make the datasets available to reviewers and readers. The OSA and the National Library of Medicine have collaborated with Kitware Inc. to provide this service, and the datasets available in the MIDAS system are made freely available to the scientific community.

The system is composed of two main components. Data storage is provided by a customized MIDAS system, which delivers low-resolution datasets for pre-visualization and serves the full-resolution datasets in the background. The second component is the ISP visualization software, which interacts directly with the MIDAS system to retrieve specific datasets. The ISP software allows authors of scientific papers to add annotations, comments, and visualization parameters to their datasets. Once these are uploaded to the MIDAS system, readers of an ISP-enabled manuscript can click on the data to interact with it automatically.

4.2 University of North Carolina

The Neuro-Image Research and Analysis Laboratory (NIRAL) at the University of North Carolina has been running the MIDAS system for internal image management and processing. The NIRAL distributed computing environment is currently composed of 56 dedicated cores (168 GB of memory) provided by 5 compute servers and 24 cores (24 GB of memory) provided by users' desktops. NIRAL's areas of research include shape analysis of the brain and diffusion tensor imaging analysis in the context of clinical research.

NIRAL has been developing computational modules integrated with BatchMake scripts. For instance, the computational module for regional brain morphometry computes a probabilistic brain tissue segmentation, followed by a delineation of the mouse brain into 20 regions/brain structures, ranging from larger regions such as the hippocampus to smaller ones such as the fimbria or internal capsule. Then, depending on the availability of Diffusion Tensor Imaging (DTI) data, the module also computes regional histograms and summary statistics of DTI properties such as Fractional Anisotropy (FA) or Mean Diffusivity (MD). This module has been made open-source.

4.3 Kitware Inc.

Kitware Inc. is the original developer of the MIDAS system and has been running a public digital repository[14] for more than three years. The current system is available through several websites.


The first website is an original MIDAS installation which hosts more than 100 healthy brain MR datasets provided by Dr. Bullitt at the University of North Carolina. These datasets are freely available for everyone to use. The MIDAS installation also hosts T1 and T2 MR datasets from the Retrospective Image Registration Evaluation (RIRE) Project[15]; the RIRE project opened access to its datasets last year.

The second website using MIDAS is the open-science journal initiated by Kitware and the National Library of Medicine: the Insight Journal[16]. The Insight Journal is an open-access online publication covering the domain of medical image processing and visualization. Its unique characteristics include: (a) open access to articles, data, code, and reviews; (b) open peer review that invites discussion between reviewers and authors; (c) emphasis on reproducible science via automated code compilation and testing; and (d) support for continuous revision of articles, code, and reviews. Scientists and researchers can freely submit source code and datasets to the Insight Journal. Today, the Insight Journal and its companion journal, the Midas Journal, count more than 1,000 subscribers and more than 200 submitted open-science articles.

Kitware also runs several internal instances of MIDAS to manage its own digital documents and image processing pipelines. Other institutions are also using the system on a regular basis to distribute and automate their digital content management.


Fig. 4. Screenshot of the ISP system developed by the Optical Society of America. The ISP visualization software communicates with the MIDAS system to stream datasets stored in the system.

5. CONCLUSION

We have presented a novel digital system that integrates an archiving and retrieval module with a grid processing engine. By combining data management and processing, the processing pipeline can easily be streamlined across different organizations. The system is still under development, and we continue to add new features and improvements. For more information about MIDAS, visit http://www.kitware.com/products/midas.html.


REFERENCES

[1] BatchMake – Batch Processing for Large Datasets: http://www.batchmake.org
[2] DSpace, an open-source solution for accessing, managing and preserving scholarly works: http://www.dspace.org
[3] Sun, S., Lannom, L. and Boesch, B., "Handle System Overview," 2006.
[4] Google Scholar: http://scholar.google.com
[5] Keator, D. B., Gadde, S., Grethe, J. S., Taylor, D. V. and Potkin, S. G., "A general XML schema and SPM toolbox for storage of neuro-imaging results and anatomical labels," Neuroinformatics 4(2), 199-212 (2006).
[6] Marcus, D. S., Olsen, T. R., Ramaratnam, M. and Buckner, R. L., "The Extensible Neuroimaging Archive Toolkit: an informatics platform for managing, exploring, and sharing neuroimaging data," Neuroinformatics 5(1), 11-34 (2007).
[7] PostgreSQL: The world's most advanced open source database: http://www.postgresql.org
[8] Litzkow, M., Livny, M. and Mutka, M., "Condor – A Hunter of Idle Workstations," Proceedings of the 8th International Conference on Distributed Computing Systems, 104-111 (1988).
[9] Gentzsch, W., "Sun Grid Engine: Towards Creating a Compute Power Grid," Proceedings of the First IEEE/ACM International Symposium on Cluster Computing and the Grid, 35-36 (2001).
[10] The Globus Toolkit: http://www.globus.org/toolkit
[11] The MetaImage file format and library: http://www.vtk.org/Wiki/MetaIO/Documentation
[12] 3D Slicer: http://www.slicer.org
[13] OSA's Interactive Scientific Publication system: http://www.opticsinfobase.org/isp.cfm
[14] The Midas Journal: http://www.midas-journal.org
[15] The Retrospective Image Registration Evaluation Project: http://www.insight-journal.org/rire
[16] Midas at Kitware Inc.: http://www.insight-journal.org/midas

