Medinfo 2007 Submission details - Immune Tolerance Network

Knowledge-Level Querying of Temporal Patterns in Clinical Research Systems Martin J. O’Connor1, Ravi D. Shankar1, David B. Parrish2, Amar K. Das1 1

Stanford Medical Informatics, Stanford University, USA {martin.oconnor, ravi.shankar, amar.das}@stanford.edu 2 The Immune Tolerance Network, Pittsburgh, PA [email protected]

Abstract Managing time-stamped data is essential to clinical research activities and often requires the use of considerable domain knowledge, which is difficult to support within database systems. As a result, there is a need for principled methods to overcome the disconnect between the database representation of time-oriented research data and corresponding knowledge of domain-relevant concepts. In this paper, we present a set of methodologies for undertaking knowledge-level querying of temporal patterns, and discuss its application to the verification of temporal constraints in clinical-trial applications. Our approach allows knowledge generated from query results to be tied to the data and, if necessary, used for further inference. We show how the Semantic Web ontology and rule languages, OWL and SWRL, respectively, can support the temporal knowledge model needed to integrate low-level representations of relational data with high-level domain concepts used in research data management. We present a scalable bridge-based software architecture that uses this knowledge model to enable dynamic querying of time-oriented research data.

Keywords: Clinical trials, temporal querying, knowledge-based systems, Semantic Web, ontology.

1. Time in Clinical Research Databases

Relational databases have become an essential part of biomedical research projects needing to maintain, integrate and share data. In biomedical research projects ranging from clinical studies to genomics research, relational databases are typically used to store data and custom queries are written to extract subsets of the data into specialized tools to support study management and focused analyses. A serious shortcoming of this approach is that the data-processing steps are often customized to a particular analysis and database and thus do not generalize to other research projects. By its nature, however, the standard relational model does not adequately support important biomedical domain concepts, such as hierarchies and time; thus, the link between domain knowledge and data representation used in database querying is often implicit.

As a result, there is a critical need to provide querying methods that can operate at the domain knowledge level rather than the database schema level. The lack of support for temporal information on research study design (such as longitudinal patient observations or time-course experiments) at the database level can limit the investigation of causal phenomena that are central to biomedical research. To address this problem, we have developed end-to-end methodologies and software architecture that permit design-time encoding and execution of temporal patterns needed for clinical research management. Our approach consists of three knowledge-based components: a temporal ontology, a temporal pattern specification language, and a database mapping model. The design of these components is driven by the needs of the Immune Tolerance Network [1], a collaborative clinical research organization focused on developing new therapeutics in immune-mediated disorders. Our methodology bridges the gap between clinical-trial specification and clinical-trial implementation, which enhances compliance monitoring and data analysis within this research environment.

2. Knowledge and Database Disconnect

Many clinical research systems have significant requirements for the querying and management of temporal data. Trial design and compliance monitoring tasks, for example, typically revolve around evaluating temporal patterns among data. Example patterns (found as free text in a study design document) include: “Visit 3 for a participant must occur within three weeks of visit 2,” “clinical assessments are required twice a week until day 28 or discharge from hospital,” and “test is scheduled on weeks 4, 6, and 8 during treatment.” When encoding such patterns, developers may face two types of disconnect between the initial specification of these patterns and their execution.

First, there can be a knowledge specification disconnect. Constraints are typically expressed as unstructured free text throughout a study design document. Their interpretation is heavily dependent on the context of the research protocol being encoded. Even core terms, such as the definition of a participant visit, can be poorly specified. Is a visit a single encounter between a participant and a provider, or can it span a variable number of encounters? If it can span a variable number of encounters, what is the exact definition of the visit end? Producing a precise definition for a constraint can thus be difficult. In addition, the unstructured constraint specification process can result in gaps in the final specifications.

Second, there can be a database specification disconnect. Constraints are usually encoded in terms of data that are collected during a study’s execution. These data are often stored in relational databases. The schema design of the databases often reflects the operational requirements of the study managers, whose activities were not defined within the research protocol document. Constraints can thus be encoded at a level that is at least one step removed from their initial specification, with a consequent loss of precision. This difficulty is compounded by the fact that developers may not have direct access to the protocol authors. A related problem is that constraints may only be checked after data have been entered into a database and not when the data are collected, allowing noncompliant data to enter the system.

As a result of these disconnects, the encoding and implementation of protocol-specified constraints as temporal patterns may not reflect the intentions of the designers. The quality of the trial data can thus become seriously compromised, which may not be noticed until the final stage of analysis.

3. Methods

To overcome these types of disconnect, we apply existing knowledge representation and temporal relational methods.

3.1. Knowledge Representation Language

Our approach relies on using the standard knowledge specification methods for the Semantic Web. The Semantic Web is a shared research plan that aims to provide explicit semantic meaning to data and knowledge on the World Wide Web [2]. The Web Ontology Language (OWL) [3] has been designed as the language of the Semantic Web. OWL can be used to build ontologies that provide high-level descriptions of Web content. These ontologies are created by building hierarchies of classes describing concepts in a domain and relating the classes to each other using properties. OWL can also represent data as instances of OWL classes (referred to as individuals), and it provides mechanisms for reasoning with the data and manipulating it. OWL provides limited deductive reasoning capabilities, however, so recent work has concentrated on adding rules to it. The Semantic Web Rule Language (SWRL) allows users to write Horn-like rules that can be expressed in terms of OWL concepts and that can reason about OWL individuals. SWRL provides deductive reasoning capabilities that can infer new knowledge from an existing OWL ontology.

We recently developed the first SWRL editor [4]. It was written as an extension to Protégé-OWL [3], an open source framework that provides a suite of tools for constructing OWL domain models and knowledge-based applications. Our editor permits interactive creation, editing, reading, and writing of SWRL rules. We have also developed one of the first systems supporting inference with SWRL rules using the Jess rule engine.

3.2. Temporal Relational Model

Most modern clinical research systems store data within relational databases, which provide a well-defined data model and query language. However, the relational model provides poor support for storing complex temporal information. For example, if a database row contains some temporal information, there is no indication as to the relationship between the timestamp and the non-temporal data in the row. Does the timestamp refer to the point at which the information was recorded, or to the point at which it was known? Other shortcomings include no standard way to indicate a timestamp’s granularity, no support for automatic coalescing or merging of temporally overlapping data, and no standard means of writing queries with relative times or that refer to the current time [5].

Several proposed extensions to the relational model address these shortcomings. Most have focused on valid-time databases, in which temporal information is attached to all rows in a temporal table [6]. This structure adds a third dimension to two-dimensional relational tables. In these tables, every tuple holds temporal information denoting the information’s valid time. We have developed a temporal query system called Chronus II (http://chronus.stanford.edu) [7], which extends SQL and the standard relational model to support valid-time temporal queries. Chronus II adopts a valid-time temporal model and provides an expressive temporal query language. Chronus II is implemented in Java and operates as a layer above existing relational databases. It interacts with the database through a JDBC interface and is not tied to any particular database implementation.
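To make the coalescing shortcoming mentioned in Section 3.2 concrete, the following is a minimal Python sketch of merging value-equivalent rows whose valid-time periods overlap or meet. It is an illustration only and is not Chronus II code; the function name `coalesce`, the `(value, start, end)` row layout, and the sample drug names are assumptions made for this example.

```python
from datetime import date
from itertools import groupby


def coalesce(rows):
    """Merge rows with the same value whose valid-time periods overlap or meet.

    `rows` is an iterable of (value, start, end) tuples (hypothetical layout).
    Returns a list of (value, start, end) tuples with overlaps merged.
    """
    merged = []
    # Sorting orders rows by value, then start, then end, so groupby sees
    # each value as one contiguous run ordered by start time.
    for value, group in groupby(sorted(rows), key=lambda row: row[0]):
        current_start = current_end = None
        for _, start, end in group:
            if current_start is None:
                current_start, current_end = start, end
            elif start <= current_end:
                # Overlapping or adjacent period: extend the current one.
                current_end = max(current_end, end)
            else:
                merged.append((value, current_start, current_end))
                current_start, current_end = start, end
        merged.append((value, current_start, current_end))
    return merged


regimen = [
    ("drugA", date(2007, 1, 1), date(2007, 1, 10)),
    ("drugA", date(2007, 1, 5), date(2007, 1, 20)),
    ("drugA", date(2007, 2, 1), date(2007, 2, 5)),
    ("drugB", date(2007, 1, 1), date(2007, 1, 3)),
]
coalesced = coalesce(regimen)
```

In plain SQL this merge has to be hand-written for each table, which is the kind of repeated effort that a valid-time layer is meant to remove.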

4. Results

We used OWL and SWRL to develop a valid-time temporal model (based on Chronus II) to support methods for temporal constraint specification in a protocol tracking system. We also developed a set of mapping tools to allow the use of this model with existing relational data. A knowledge-driven architecture was then implemented to support the efficient deployment and execution of system components.

4.1. Temporal Ontology

We have encoded a temporal model in OWL [8] based on the valid-time temporal model [5-7, 9]. In this model, all facts have temporal extent and are associated with instants or intervals denoting the times that they are held to be true. The core class modeling this association in the OWL ontology is called ExtendedProposition. This class models information that extends over time. It has a property called hasValidTimes that holds the time(s) during which the associated information is held to be true. The values of this property are modeled by an abstract class called ValidTime, which has subclasses ValidInstant and ValidPeriod. ValidInstant has the property hasTime, and ValidPeriod has the properties hasBeginning and hasFinish. These classes represent instants and intervals, respectively. Valid times also have granularities associated with them. Named points in time, often called anchor points, can be modeled as subclasses of the ValidInstant class. Temporal durations are modeled using a ValidDuration class that holds a count and a granularity.

There are two types of extended propositions in the model: (1) extended primitive propositions, which represent data derived directly from secondary storage; and (2) extended abstract propositions, which are abstracted from other propositions. They are represented by ExtendedPrimitiveProposition and ExtendedAbstractProposition, respectively, in the temporal ontology. The extended primitive and abstract proposition classes can also hold a value in addition to their valid times. This value is denoted by the hasValue property and can be of any XML Schema data type, such as a string or an integer.

These extended propositions can be used to consistently represent temporal information in ontologies. For example, a set of visits in a protocol tracking application can be represented by defining a class called Visit that subclasses the extended proposition class. It inherits the hasValidTimes property from that class, which holds its visit times. Similarly, an extended primitive proposition can be used to represent a drug regimen, with a value of type string to hold the drug name and a set of periods in the valid time property to hold drug delivery times. These extended propositions can then be associated with a class using OWL properties.

Once all temporal information is represented consistently using the temporal ontology, SWRL rules can be written in terms of this ontology. However, the core SWRL language has limited temporal reasoning capabilities. A few temporal predicates called built-ins are included in the set of standard predicates, but they have limited expressive power. Fortunately, SWRL provides an extension mechanism to add user-defined predicates.
We used this mechanism to define a set of temporal predicates to operate on temporal values. These predicates support the standard Allen temporal operators [10] to provide the equivalent of the operators supported within the Chronus II query language. Using these built-in operators in conjunction with the temporal ontology permits expression of complex temporal rules. For example, in modeling visits in a protocol as extended propositions and the start of treatment of a participant as an anchor point, a new SWRL rule can indicate that a second visit in a particular protocol must occur within two weeks of the start of treatment anchor, as follows:

Participant(?p) ^ hasVisit(?p, ?v) ^ V2(?v) ^
temporal:hasStart(?v, ?startV2) ^
hasAnchor(?p, ?a) ^ StartOfTreatment(?a) ^ temporal:hasTime(?a, ?sot) ^
durationLessThan(?sot, ?startV2, 2, weeks)
-> ConformingPatient(?p)
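As a rough illustration of the temporal model and the two-week rule above, the following Python sketch mirrors the ontology classes as plain data classes and evaluates the rule body directly. It is not the OWL ontology or the SWRL built-in library: the snake_case attribute names, the helper `start_of`, the function `conforms`, and the exact semantics given to `duration_less_than` (absolute difference, strict comparison, granularities limited to units that `timedelta` accepts) are all assumptions made for this example.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import List, Union


@dataclass(frozen=True)
class ValidInstant:
    has_time: datetime


@dataclass(frozen=True)
class ValidPeriod:
    has_beginning: datetime
    has_finish: datetime


ValidTime = Union[ValidInstant, ValidPeriod]


@dataclass
class ExtendedProposition:
    # Times during which the proposition is held to be true.
    has_valid_times: List[ValidTime] = field(default_factory=list)
    has_value: object = None


def start_of(valid_time: ValidTime) -> datetime:
    """Start of an instant (its time) or of a period (its beginning)."""
    if isinstance(valid_time, ValidInstant):
        return valid_time.has_time
    return valid_time.has_beginning


def duration_less_than(t1: datetime, t2: datetime, count: int, granularity: str) -> bool:
    """Assumed semantics: |t2 - t1| is strictly less than count units.

    `granularity` must be a keyword accepted by timedelta, e.g. "days" or
    "weeks"; calendar months are deliberately not handled in this sketch.
    """
    return abs(t2 - t1) < timedelta(**{granularity: count})


def conforms(start_of_treatment: ValidInstant, visit2: ExtendedProposition) -> bool:
    """True if some valid time of visit 2 starts within two weeks of the anchor."""
    return any(
        duration_less_than(start_of_treatment.has_time, start_of(vt), 2, "weeks")
        for vt in visit2.has_valid_times
    )


sot = ValidInstant(datetime(2007, 1, 1))
on_time = ExtendedProposition([ValidPeriod(datetime(2007, 1, 10, 9), datetime(2007, 1, 10, 11))])
late = ExtendedProposition([ValidPeriod(datetime(2007, 2, 1, 9), datetime(2007, 2, 1, 11))])
```

Here `conforms(sot, on_time)` holds because the visit starts nine days after the anchor, while `conforms(sot, late)` does not.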

4.2. Temporal Pattern Specification

Our efforts to model clinical trials are driven by the needs of the Immune Tolerance Network (ITN; [1]), which develops new therapeutics for immune-mediated disorders. In collaboration with ITN, we have created a knowledge-based architecture (called Epoch [11]) to support the management of multisite clinical trial protocols and the discovery of common tolerance mechanisms across multiple trials. We have focused our efforts on developing participant and sample tracking models [11], both of which must specify complex temporal constraints at the knowledge level. To meet this need, we used the temporal ontology to model the temporal dimension of core components in the model and then analyzed a range of ITN’s protocols to determine the types of temporal constraints required by protocols and whether our model could represent them.

In principle, SWRL rules could be used to express all constraints within the protocol tracking application. However, while relatively concise, SWRL rules are not suitable for nonspecialists. As a result, we decided to define a high-level user-friendly constraint language to allow ontology developers to encode constraints at the domain level. These constraints are then mapped automatically to SWRL rules at run time.

The constraint language allows times to be specified as absolute or relative times. For example, an indication that something must start within two weeks of a start of treatment anchor is SOT + WEEKS(2), where SOT is the name of the start of treatment anchor, to which an offset of two weeks is added. Offsets can be positive or negative and can be combined at different granularities. Offsets correspond to durations in the temporal model and may also be referred to as such. For example, an offset of one month and two days can be specified as MONTH(1) + DAYS(2). In addition to named anchors, the constraint specification language can work directly with temporal propositions. So, for example, if visit number two is modeled as an extended proposition subclass V2, a constraint can refer to its start time as V2.hasBeginning.
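The anchor-plus-offset expressions described above can be sketched as a small evaluator. This is a hedged illustration in Python, not the constraint language implementation (which maps constraints to SWRL rules): the function `evaluate`, the `anchors` dictionary, and the restriction to DAYS and WEEKS are assumptions. Month offsets and proposition references such as V2.hasBeginning are left out because they need calendar arithmetic and access to the knowledge base.

```python
import re
from datetime import datetime, timedelta

# Fixed-length granularities only (an assumption of this sketch).
_UNITS = {"DAYS": "days", "WEEKS": "weeks"}

# One signed term: either a bare anchor name or UNIT(count).
_TERM = re.compile(r"\s*([+-])\s*([A-Za-z_]\w*)(?:\((\d+)\))?")


def evaluate(expression, anchors):
    """Evaluate e.g. 'SOT + WEEKS(2)' to a datetime.

    `anchors` maps anchor names to datetimes (hypothetical structure).
    """
    text = "+ " + expression.strip()  # give the first term an explicit sign
    pos, when, offset = 0, None, timedelta()
    while pos < len(text):
        match = _TERM.match(text, pos)
        if match is None:
            raise ValueError(f"cannot parse constraint {expression!r}")
        sign, name, count = match.groups()
        pos = match.end()
        if count is None:
            # A bare name is treated as the single anchor of the expression.
            if when is not None or sign == "-" or name not in anchors:
                raise ValueError(f"bad anchor {name!r} in {expression!r}")
            when = anchors[name]
        else:
            unit = _UNITS.get(name.upper())
            if unit is None:
                raise ValueError(f"unsupported granularity {name!r}")
            delta = timedelta(**{unit: int(count)})
            offset += delta if sign == "+" else -delta
    if when is None:
        raise ValueError(f"no anchor in {expression!r}")
    return when + offset


anchors = {"SOT": datetime(2007, 3, 1)}
latest_start = evaluate("SOT + WEEKS(2)", anchors)
mixed = evaluate("SOT - DAYS(3) + WEEKS(1)", anchors)
```

With the start of treatment on 1 March 2007, the first expression evaluates to 15 March 2007 and the second to 5 March 2007.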
This syntax could, for example, express the expected start time of a third visit as a two-month offset from the end of visit two as V2.finish + months(2).

Temporally constrained entities in the protocol model are modeled using a plan that can hold temporal constraints specified in the constraint language. The plan is modeled using an OWL Plan class and has a number of properties that can specify the expected temporal behavior of the associated protocol entities:

expectedStart: The time the protocol entity is expected to start. This is specified using the constraint language.
expectedStartVariance: The temporal uncertainty of the start time. This is expressed using constraint language offset clauses, e.g., WEEKS(2).
expectedFinish, expectedFinishVariance: End time specifications for an entity.
expectedDuration: How long this entity is expected to last. Specified using duration clauses, e.g., DAYS(2).
expectedCycles: Used for cyclical specifications to indicate the number of times the entity is expected to repeat and the intervals between repetitions.

A plan class also has two properties that hold run-time time values for the protocol entity. These properties, called actualStart and actualFinish, can be compared against expected values when validating an entity for compliance. Using plan specifications in conjunction with the constraint language allows us to express a large range of constraints for protocol entities, which can be mapped to SWRL rules for execution.

4.3. Database Mapping

In principle, developers could take biomedical data in an existing relational database, develop an ontology to describe those data, and then convert the data into a knowledge-based form for all future processing. Apart from the significant development effort involved, this solution does not scale well. Current ontology-specification tools, such as Jena or Protégé-OWL, do not support high data throughput. For large data sets, an alternate mapping solution is needed.

We have developed a customized mapping tool, called Synchronus [12, 13], that supports both a direct relational-to-OWL mapping and also a lightweight mapping mechanism for large data sets. This tool maps relational data described in terms of the temporal ontology to OWL individuals. Essentially, it creates extended propositions from time-stamped relational data. It also supports the reverse mapping of extended propositions to relational data.

Two OWL ontologies are used to drive Synchronus: (1) a schema ontology, which is a knowledge-level description of a relational or Chronus II temporal-relational schema; and (2) a mapping ontology, which describes how relational data are mapped to extended propositions. The schema ontology describes the structure of one or more databases that will be mapped. It contains descriptions of the tables in the database, such as the names and types of columns in those tables. The mapping ontology uses this schema ontology to describe the relational or temporal-relational tables to be mapped. Every extended proposition in the temporal model has an optional input and output storage descriptor.
The descriptor uses the schema ontology to point to data that are stored in a database. Synchronus uses this descriptor to perform run-time transformations of the data between rows in a relational database and OWL individuals. The direct relational-to-OWL data-mapping method has two main modes of operation: batch mode, where an OWL knowledge base is fully populated with relevant data from a database, or a database is populated from an OWL knowledge base; and dynamic mode, where propositions are mapped on demand. The latter mapping mechanism reads and writes objects represented by extended propositions without creating OWL individuals.

4.4. Bridge Architecture

We have developed a bridge architecture to support the integration of relational databases and reasoning methods into a knowledge-driven system. The Figure shows a schematic of the architecture and its five main components: (1) a knowledge base; (2) a relational database; (3) Synchronus; (4) a method; and (5) the bridge itself. A bridge is a customized method that performs a specific computational task through the integration of one or more existing knowledge sources (such as an OWL knowledge base); data sources (such as a relational database); and data-processing mechanisms (such as a rule engine). The bridge resolves low-level differences in how these software components interact with each other through the communication of data and knowledge. A deployed bridge may work with several databases, methods, and, potentially, knowledge bases. Each bridge is driven from its associated knowledge base. The knowledge base contains a number of ontologies that are used in deploying the bridge: (1) a method ontology, which describes at a high level the analytic method or methods being used; (2) a mapping ontology, which is used by Synchronus to map relational data; and (3) a domain ontology that describes the underlying application domain.

[Figure: schematic with components labeled API/User Interface, OWL KB, Bridge, Engine, and Synchronus; legend entries: Data, Knowledge.]

Figure. Bridge Architecture: Schematic showing how a bridge is deployed to isolate an existing analytic method (rule engine) from details of both an OWL knowledge base and an existing relational database, accessed through Synchronus.

To support the validation of temporal constraints for clinical-trial management at ITN, we have developed a bridge architecture that provides the infrastructure necessary to incorporate rule engines into Protégé-OWL and execute SWRL rules specified in the knowledge base. The bridge provides a mapping layer that generates representations of all rules and relevant OWL classes, individuals and properties as input to the rule engine. A target rule engine implementation takes in these representations and implements them in the rule engine’s native format. Scalability is a primary goal of a bridge architecture; thus designing efficient data and knowledge access techniques is a central aspect of our bridge design and implementation. For example, when translating OWL knowledge into an intermediate form, instead of transferring all knowledge in a knowledge base, only potentially relevant knowledge is represented. The bridge examines each SWRL rule and only represents OWL classes, properties and individuals that are referenced by those rules. Such references can be indirect, so the bridge must traverse the interrelationships between all OWL concepts mentioned in SWRL rules to ensure completeness. This step significantly reduces the amount of knowledge that needs to be represented by a rule engine. Since the performance of most rule engines is a direct function of the size of their fact base, this pruning can yield significant performance benefits.
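The relevance pruning described above amounts to a reachability computation over the relationships between OWL entities. The following Python sketch shows one way to do this with a breadth-first traversal; it is an assumption-laden illustration rather than the bridge's code. The function `relevant_entities`, the `rules` mapping from rule names to directly referenced entity names, and the `related` adjacency mapping are hypothetical structures introduced for this example.

```python
from collections import deque


def relevant_entities(rules, related):
    """Return every entity reachable from the entities that rules mention.

    rules:   dict of rule name -> iterable of entity names used in the rule
    related: dict of entity name -> iterable of entity names it depends on
             (e.g. superclasses, property domains and ranges)
    """
    seen = set()
    queue = deque(name for refs in rules.values() for name in refs)
    while queue:
        name = queue.popleft()
        if name in seen:
            continue
        seen.add(name)
        # Follow indirect references so the exported knowledge is complete.
        queue.extend(related.get(name, ()))
    return seen


rules = {"v2_within_two_weeks": ["Participant", "hasVisit", "V2"]}
related = {
    "V2": ["Visit"],
    "Visit": ["ExtendedProposition"],
    "hasVisit": ["Participant", "Visit"],
    "Sample": ["Assay"],  # never mentioned by a rule, so never exported
}
exported = relevant_entities(rules, related)
```

Only entities in `exported` would then be translated into the rule engine's fact base; `Sample` and `Assay` are left out because no rule reaches them.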

Another optimization technique relates to data access. Extended propositions used in rules may be held in databases and accessed through Synchronus. SWRL rules can operate on these propositions using temporal built-ins. There is a fairly direct mapping from SWRL rules with temporal propositions to valid-time queries. This parallel structure can be exploited by the bridge to optimize its data access. The bridge examines each SWRL rule with temporal operators and looks for operators that temporally restrict the range of propositions. For example, if a temporal operator restricts the range of a proposition to dates after a particular time point, only data after that time will be requested from Synchronus. Because SWRL rules do not have disjunctions, this optimization process is not elaborate. A more exhaustive optimization process could be facilitated by directly mapping temporal SWRL rules to Synchronus.

In our clinical trial management architecture for ITN, we currently use an implementation that invokes the Jess rule engine, which is available as part of the standard Protégé-OWL distribution [3]. Our Jess implementation for a rule engine bridge employs the temporal built-in library for SWRL that we have presented and automatically undertakes the mapping of data and knowledge for knowledge-level querying of specified temporal patterns in a relational database.
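The data-access optimization described above, in which a temporal restriction found in a rule limits what is requested from the database, can be sketched with Python's built-in sqlite3 module. This is only an illustration of pushing a time bound into the query and turning rows into proposition-like records; it does not use Synchronus or Chronus II, and the `visit` table, its columns, the ISO-8601 text timestamps, and the functions `demo_connection` and `load_visits` are all assumptions.

```python
import sqlite3


def demo_connection():
    """Create an in-memory database with a hypothetical visit table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE visit (participant TEXT, visit_start TEXT, visit_finish TEXT)"
    )
    conn.executemany(
        "INSERT INTO visit VALUES (?, ?, ?)",
        [
            ("p1", "2007-01-10", "2007-01-10"),
            ("p1", "2007-02-14", "2007-02-15"),
            ("p2", "2007-03-01", "2007-03-01"),
        ],
    )
    return conn


def load_visits(conn, after=None):
    """Load visits as proposition-like dicts, optionally only those starting after a bound.

    ISO-8601 date strings compare correctly as text, so the bound can be
    applied inside the SQL query instead of after loading every row.
    """
    sql = "SELECT participant, visit_start, visit_finish FROM visit"
    params = ()
    if after is not None:
        sql += " WHERE visit_start > ?"
        params = (after,)
    sql += " ORDER BY visit_start"
    return [
        {"participant": p, "hasBeginning": start, "hasFinish": finish}
        for p, start, finish in conn.execute(sql, params)
    ]


conn = demo_connection()
all_visits = load_visits(conn)
recent_visits = load_visits(conn, after="2007-02-01")
```

With the bound applied, only the two visits starting after 1 February 2007 are read and handed on for rule evaluation.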

5. Discussion

The gap between the specification of a study protocol and the management of resulting data can often be quite significant in clinical research systems, such as clinical trial management applications. To help close this gap, we have developed a set of end-to-end general methodologies for specifying and executing temporal patterns at the knowledge level rather than the database level. Our approach demonstrates that proposed Semantic Web standards for ontology and rule representation, OWL and SWRL, respectively, can support the knowledge model needed to integrate temporal representations of relational data with the domain-specific semantics needed to reason with them for biomedical and healthcare applications. In contrast to previous work on constraint specification in clinical trials [14-16], our set of methodologies addresses the knowledge and database disconnects that exist in clinical research systems. Our approach requires that all relevant temporal knowledge on a study protocol and its corresponding data representation be encoded within an OWL ontology, which allows the uniform specification of temporal patterns in knowledge-level querying. Our bridge architecture supports robust optimization techniques to ensure that encoded constraints are automatically translated into an executable form at run time and are efficiently validated against study data held in an existing relational database.

Acknowledgments

The authors thank David Parrish, Executive Director at the Immune Tolerance Network, for support of this work. This research was supported in part by the Immune Tolerance Network, funded by Grant NO1-AI-15416 from the National Institutes of Health (USA), and by a Pharmaceutical Research and Manufacturers Association Foundation Research Starter Grant. The authors thank Valerie Natale for her editorial comments.

References

[1] Rotrosen D, Matthews JB, and Bluestone JA. The immune tolerance network: a new paradigm for developing tolerance-inducing therapies. J Allergy Clin Immunol 2002: 110 (1): 17-23.
[2] Berners-Lee T, Hendler J, and Lassila O. The Semantic Web. Scientific American 2001: 35 (May): 43-52.
[3] Knublauch H, Fergerson RW, Noy NF, and Musen MA. The Protégé OWL Plugin: an open development environment for semantic web applications. In: Third International Semantic Web Conference (ISWC 2004), Hiroshima, Japan, 2004; pp. 229-243.
[4] O’Connor MJ, Knublauch H, Tu SW, Grosof B, Dean M, Grosso WE, and Musen MA. Supporting rule system interoperability on the Semantic Web with SWRL. In: Fourth International Semantic Web Conference (ISWC 2005), Galway, Ireland, 2005; pp. 974-986.
[5] Snodgrass RT. On the semantics of ‘now’ in databases. ACM Trans Database Systems 1997: 22 (2): 171-214.
[6] Snodgrass RT (Ed). The TSQL2 temporal query language. Kluwer Academic Publishers: Boston, 1995.
[7] O’Connor MJ, Tu SW, and Musen MA. The Chronus II temporal database mediator. In: AMIA Annual Symposium, San Antonio, TX, 2002; pp. 567-571.
[8] Shoham Y. Temporal logics in AI: semantical and ontological considerations. Artif Intell 1987: 33 (1): 89-104.
[9] O’Connor MJ, Shankar RD, and Das AK. An ontology-driven mediator for querying time-oriented biomedical data. In: 19th IEEE Symposium on Computer-Based Medical Systems (CBMS 2006), Salt Lake City, Utah, 2006; pp. 264-269.
[10] Allen JF. Maintaining knowledge about temporal intervals. Comm ACM 1983: 26 (11): 832-843.
[11] Shankar R, Martins SB, O’Connor MJ, Parrish D, and Das AK. Towards semantic interoperability in a clinical trials management system. In: Fifth International Semantic Web Conference (ISWC 2006), Athens, Georgia, 2006; pp. 901-912.
[12] Das AK and Musen MA. Synchronus: a reusable software module for temporal integration. In: AMIA Annual Symposium, San Antonio, Texas, 2002; pp. 195-199.
[13] Narayanan PS, O’Connor MJ, and Das AK. Ontology-driven mapping of temporal data in biomedical databases. In: AMIA Annual Symposium, Washington, DC, 2006; p. 1045.
[14] Weng C, Kahn M, and Gennari J. Temporal knowledge representation for scheduling tasks in clinical trial protocols. In: AMIA Annual Symposium, San Antonio, Texas, November, 2002; pp. 879-83.
[15] Deshpande AM, Brandt C, and Nadkarni PM. Temporal query of attribute-value patient data: utilizing the constraints of clinical studies. Int J Med Inform 2003: 70 (1): 59-77.
[16] Terenziani P, Montani S, Torchio M, Molino G, and Anselma L. Temporal consistency checking in clinical guidelines acquisition and execution: the GLARE's approach. In: AMIA Annual Symposium, Washington, DC, 2003; pp. 659-63.

Address for correspondence
Martin J. O’Connor
Stanford Medical Informatics
Stanford University
251 Campus Drive, MSOB X275
Stanford, CA 94305 USA