Leverage Hadoop Framework for Large Scale Clinical Informatics Applications

Xiao Dong PhD1, Neil Bahroos MS1, Eugene Sadhu MD1, Tommie Jackson MS1, Morris Chukhman MS1, Robert Johnson BS1, Andrew Boyd MD1, Denise Hynes PhD, MPH, RN1
1 University of Illinois at Chicago, Chicago, IL

Abstract

In this manuscript, we present our experiences using the Apache Hadoop framework for high-data-volume and computationally intensive applications, and discuss some best-practice guidelines in a clinical informatics setting. There are three main aspects to our approach: (a) process and integrate diverse, heterogeneous data sources using standard Hadoop programming tools and customized MapReduce programs; (b) once fine-grained aggregate results are obtained, perform data analysis using the Mahout data mining library; (c) leverage the column-oriented features of HBase for patient-centric modeling and complex temporal reasoning. This framework provides a scalable solution to meet the rapidly increasing “Big Data” needs of clinical and translational research. The intrinsic fault tolerance, high availability, and scalability of the Hadoop platform make these applications readily deployable in an enterprise-level cluster environment.

Introduction

Clinical and translational informatics research typically involves aggregating information from a variety of sources, such as demographic and observational data from medical administration systems, billing data from financial systems, and tissue and genetic data from biorepository systems. The challenges are twofold: first, the computational demands incurred by the sheer volume of data; second, data heterogeneity. With specific implementation strategies in place, Hadoop offers solutions to both challenges: it facilitates distributed data processing with MapReduce algorithms on a cluster of commodity computers, and it pushes code to the compute nodes where the relevant data reside, thereby alleviating the need to move and merge heterogeneous data sources.

Method

On top of the MapReduce paradigm and the Hadoop Distributed File System (HDFS), Hadoop offers two high-level languages, Pig Latin for imperative programming and HiveQL for declarative programming, both of which are automatically compiled into MapReduce jobs. Such utilities can significantly expedite the information extraction and transformation process on a large cluster infrastructure. We are deploying these tools internally to aggregate diagnosis and procedure information from two separate billing systems of the UIC Hospital and Health Science System (facility and professional billing, respectively), and we are also integrating Cerner EMR data for our cohort identification and research analytics applications. A sketch of this style of customized MapReduce aggregation appears below.

In the post-extraction and transformation phase, once processed aggregate outcomes are obtained, data mining tasks are normally carried out to distill valuable business insight. On that front, Mahout, a data mining library built on Hadoop, offers a suite of widely used algorithms (clustering, classification, text mining, etc.) that further streamline data analysis inside the Hadoop cluster; see the second sketch below. We also discuss some best-practice rules we have learned about organizing data during the cleansing, processing, and mining stages to best take advantage of the data locality principles of the Hadoop cluster.
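As a concrete illustration of the kind of customized MapReduce program mentioned above, the following minimal Java sketch counts diagnosis-code frequencies across combined billing extracts. The pipe-delimited record layout, the field position of the diagnosis code, and the input/output paths are hypothetical assumptions for illustration, not the actual UIC billing formats.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class DiagnosisFrequency {

  // Assumes pipe-delimited billing extracts where field 3 holds a diagnosis code.
  public static class DxMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text code = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String[] fields = value.toString().split("\\|");
      if (fields.length > 3) {
        code.set(fields[3]);      // e.g., an ICD diagnosis code
        context.write(code, ONE); // emit (code, 1) for distributed counting
      }
    }
  }

  public static class SumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) sum += v.get();
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "diagnosis frequency");
    job.setJarByClass(DiagnosisFrequency.class);
    job.setMapperClass(DxMapper.class);
    job.setCombinerClass(SumReducer.class); // local pre-aggregation on each node
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // combined facility + professional extracts
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

The same aggregation can be expressed as a one-line HiveQL GROUP BY over a billing table, which Hive compiles into an equivalent MapReduce job; the hand-written form is useful when record parsing or merging logic exceeds what the high-level languages express cleanly.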
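To indicate how Mahout fits at this stage, here is a minimal sketch of distributed k-means clustering over patient feature vectors already materialized in HDFS. The HDFS paths, cluster count, and convergence parameters are hypothetical, and the KMeansDriver.run signature shown follows Mahout 0.9; earlier releases differ slightly.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.mahout.clustering.kmeans.KMeansDriver;
import org.apache.mahout.clustering.kmeans.RandomSeedGenerator;
import org.apache.mahout.common.distance.EuclideanDistanceMeasure;

public class CohortClustering {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Input: a SequenceFile of VectorWritable patient feature vectors (hypothetical path).
    Path vectors = new Path("/etl/patient-feature-vectors");
    Path seeds = new Path("/mining/kmeans-seeds");
    Path output = new Path("/mining/kmeans-output");

    // Pick k random patient vectors as initial centroids.
    Path seedClusters = RandomSeedGenerator.buildRandom(
        conf, vectors, seeds, 10, new EuclideanDistanceMeasure());

    // Run k-means as a series of MapReduce jobs across the cluster.
    KMeansDriver.run(conf, vectors, seedClusters, output,
        0.001,   // convergence delta
        20,      // maximum iterations
        true,    // assign points to final clusters after convergence
        0.0,     // outlier classification threshold (0 = keep all points)
        false);  // false = distributed MapReduce execution, not sequential
  }
}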
HBase is a distributed, column-oriented database based on Google’s BigTable model. Its column-family storage allows new attributes to be added to the schema flexibly at any time, scaling to millions of columns. Building on this concept, we developed a patient-centric model in which major information dimensions, such as procedures, diagnoses, labs, and medications, are grouped into column families. In such a model, extensibility along the column dimension is practically unlimited; in a typical use case, when a terminology is updated, one can simply broaden the affected column family with the contents of the update. In addition, each HBase cell contains not only the data content but also a timestamp indexing that content. Because many clinical data requests are temporal in nature, this feature provides a highly efficient way to construct temporal queries, compared with a traditional RDBMS, where such queries typically involve joins across multiple tables.
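To make the pattern concrete, the following is a minimal sketch using the HBase 1.x Java client. The table name, row-key scheme, column family, LOINC qualifier, and timestamps are illustrative assumptions, not our production schema.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class PatientTemporalQuery {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table patients = conn.getTable(TableName.valueOf("patient"))) {

      // One column family per information dimension:
      // procedures, diagnoses, labs, medications. Shown here: labs.
      byte[] LABS = Bytes.toBytes("labs");

      // Write a lab result, letting the cell timestamp carry the event time
      // (epoch milliseconds; here 2013-01-01 08:00 UTC, hypothetical).
      long eventTime = 1357027200000L;
      Put put = new Put(Bytes.toBytes("patient-00042"));
      put.addColumn(LABS, Bytes.toBytes("LOINC:2345-7"), eventTime,
          Bytes.toBytes("98 mg/dL"));
      patients.put(put);

      // Temporal query: all lab cells for this patient between
      // 2013-01-01 and 2013-02-01 UTC, filtered on cell timestamps.
      Get get = new Get(Bytes.toBytes("patient-00042"));
      get.addFamily(LABS);
      get.setTimeRange(1356998400000L, 1359676800000L);
      Result result = patients.get(get);
      System.out.println(result);
    }
  }
}

Because the time range is evaluated against cell timestamps within a single patient row, the request is served without joining event tables, which is the efficiency gain described above.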
Result and Discussion

In this presentation we summarize our experiences leveraging the Hadoop framework to process large volumes of clinical data and offer perspectives on how to apply this suite of cutting-edge technologies in next-generation informatics systems to meet the pressing “Big Data” needs. We also highlight the major advantages and potential pitfalls of migrating from a traditional clinical informatics system to a Hadoop-based system; this content is particularly relevant to audiences who are early adopters of the Hadoop technology stack.