e-Health – For Continuity of Care C. Lovis et al. (Eds.) © 2014 European Federation for Medical Informatics and IOS Press. This article is published online with Open Access by IOS Press and distributed under the terms of the Creative Commons Attribution Non-Commercial License. doi:10.3233/978-1-61499-432-9-1125
Addressing the Challenge of Encoding Causal Epidemiological Knowledge in Formal Ontologies: A Practical Perspective
Anya OKHMATOVSKAIA, Arash SHABAN-NEJAD, Maxime LAVIGNE, and David L. BUCKERIDGE
McGill University Clinical and Health Informatics Research Group
Abstract. The paper presents an overview of approaches to encoding uncertain causal knowledge in formal ontologies and demonstrates how these approaches can be used in a semantic-driven application for public health using the Population Health Record (PopHR) platform as an example.
Keywords. formal ontologies, causality, uncertainty, public health
Introduction

The Population Health Record (PopHR) [1] is a web-based software infrastructure that automates the retrieval and integration of heterogeneous data from multiple sources, and supports intelligent analysis of these data to create a coherent portrait of population health. Focused on facilitating the planning and evaluation of population health interventions, the PopHR addresses common limitations of existing web portals for population health information, such as outdated indicators computed at low spatial resolution, limited transparency of algorithms and data sources, unintuitive graphical interfaces, and presentation of indicators that ignores the relationships between them.

To address these limitations, PopHR relies on a suite of OWL ontologies that encode relevant domain knowledge and constitute a semantic framework for organizing indicators supported by the system, understanding user queries, and providing query-specific analytical tools. The ontologies also incorporate a body of epidemiological knowledge about causal links between health issues, their determinants and outcomes. The PopHR uses this knowledge to present data for multiple population health indicators in the context of known relationships between these indicators, thus enabling a knowledge-based analysis of the information derived from multiple data sources.

Encoding causal epidemiological knowledge in formal logic is essential in many medical informatics applications, but it presents a challenge in ontology engineering because of the way epidemiology considers causality. In particular, causal relationships are non-deterministic; there are multiple alternative causes for the same phenomenon; and causal links established from the analysis of populations cannot be assumed to hold for all individuals. Ontology languages with rigid semantic structure, such as OWL, cannot accommodate these features.
The problem has been recognized in the literature, and a number of solutions of varying quality and applicability have been proposed [2, 3]. In this paper, we provide a practical perspective on the problem, drawn from our experience in encoding causal epidemiological knowledge in the PopHR ontologies.
A. Okhmatovskaia et al. / Challenge of Encoding Causal Epidemiological Knowledge
1. Methods

A vast amount of knowledge in the biomedical domain, particularly causal knowledge in epidemiology, comes in the form of uncertain statements, i.e. statements that are neither implied nor expected to be true in all cases, and for which the conditions under which they are true or false cannot be fully specified. Examples of such statements are assertions of the link between a risk factor and a disease, or between a disease and a symptom. In the PopHR, a user monitoring the prevalence of type II diabetes in a certain region may need to retrieve data about health determinants that affect the risk of this disease in the same region. To achieve this, a query to the ontology should return the risk factors of diabetes, including obesity. Note that the link between obesity and diabetes is not always observed in individuals, but the association at the population level is undisputed.

In general, it is important to be able to capture the notion of possible, typical or default in medical ontologies, but the languages and tools for ontology development cannot represent this notion naturally. In OWL, a standard ontology language, as in any formalism based on description logics (DLs), any statement must be either true or false, and any statement about a class must hold for all instances of that class. So if obesity and diabetes are classes, and we want to make a general statement about a causal relationship between their instances, this statement would have to be true either for all instances of obesity or for all instances of diabetes, neither of which is the case.

Several research groups have been developing extensions to DLs for probabilistic modelling, incorporating formalisms like fuzzy logic [4] or Bayesian probability [5] to overcome the limited expressivity of DLs. Unfortunately, most of the proposed formalisms are impractical due to their computational complexity.
Efforts to improve reasoning in probabilistic DLs constitute an area of active research, but there are still no tools available for practical use by ontology developers. Luckily, in many medical applications, including PopHR, the need to formalize probability is not critical. Our focus is on encoding the optional nature of causation in a way that is logically and ontologically sound and supports querying and logical inference.

In the absence of a standard semantics for uncertainty in OWL, ontology developers have come up with a number of “workarounds” [2]. For example, instead of using an incorrect statement involving an existential quantifier, “A causes some B”, one would use a universal restriction of the form “A causes only B”, or invert the property: “B is_caused_by some A”. Neither option correctly captures our previous example of obesity and diabetes. Another solution is to use a minimum cardinality restriction of the form “A causes min 0 B”. Although formally correct, it has little practical value, since reasoners effectively ignore such statements. To similar effect, uncertain causal links can be encoded in annotation properties, where they can be accessed by human readers or retrieved through an API, but do not add knowledge for logical inference. This approach may be justified for some applications, where ontologies are used for the mere purpose of knowledge representation. However, in these cases a less formal conceptualization (e.g. SKOS) may be more appropriate [3].

Another popular solution is to express uncertainty using a predicate that sounds less deterministic, e.g. may_cause, possibly_causes, etc. While seemingly correct, the statement “obesity may_cause some diabetes” would be refuted by a case of obesity without diabetes in the same way as “obesity causes some diabetes” would be, because it affirms that for every instance of obesity there must exist an instance of diabetes which that instance of obesity may have caused.
For a logical reasoner, “may cause” still denotes an association, which, as we know, is not always present between
instances of obesity and diabetes. What is always present is a possibility of association, so to correctly encode the relationship between obesity and diabetes we need a concept denoting possibility. Schultz [3] has suggested using the concept of a disposition, defined as a property that may or may not be realized in some observed process (e.g. salt bears a disposition to dissolve in water). The relationship between a cause and an effect is split into two parts, resulting in the statement: “A causes some (Disposition and has_realization only B)”. This dispositional clause is obviously more complex than a relation expressed by a single predicate, so it may be difficult to define an inverse relation or to specify the transitive nature of causal links. It is, however, the only solution within the expressivity of OWL-DL that is ontologically correct and suitable for logical reasoning.
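The contrast between the encodings discussed above can be illustrated with a small model check. The following Python sketch is not part of PopHR and is not an OWL reasoner: it evaluates class expressions against one fixed, finite interpretation, which is enough to show which axioms that interpretation satisfies. The individuals (obesity_1, risk_2, etc.) and all helper names (Model, some, only, min_zero, subclass_of) are assumptions introduced for illustration only.

```python
from dataclasses import dataclass, field


@dataclass
class Model:
    """A finite interpretation: class extensions plus property edges."""
    classes: dict = field(default_factory=dict)  # class name -> set of individuals
    edges: dict = field(default_factory=dict)    # property -> set of (subject, object)

    def members(self, cls):
        return self.classes.get(cls, set())

    def successors(self, prop, x):
        return {o for (s, o) in self.edges.get(prop, set()) if s == x}


# Class expressions are modelled as predicates over (model, individual).
def named(cls):
    return lambda m, x: x in m.members(cls)


def some(prop, filler):
    # Existential restriction: at least one successor satisfies the filler.
    return lambda m, x: any(filler(m, y) for y in m.successors(prop, x))


def only(prop, filler):
    # Universal restriction: every successor (possibly none) satisfies the filler.
    return lambda m, x: all(filler(m, y) for y in m.successors(prop, x))


def min_zero(prop, filler):
    # "min 0" is satisfied by every individual, so it constrains nothing.
    return lambda m, x: True


def both(left, right):
    return lambda m, x: left(m, x) and right(m, x)


def subclass_of(m, cls, expr):
    """True if every member of cls satisfies expr in interpretation m."""
    return all(expr(m, x) for x in m.members(cls))


# Hypothetical interpretation: obesity_1 led to diabetes, obesity_2 did not,
# but both instances of obesity bear a (possibly unrealized) disposition.
model = Model(
    classes={
        "Obesity": {"obesity_1", "obesity_2"},
        "Diabetes": {"diabetes_1"},
        "Disposition": {"risk_1", "risk_2"},
    },
    edges={
        "causes": {("obesity_1", "diabetes_1"),
                   ("obesity_1", "risk_1"),
                   ("obesity_2", "risk_2")},
        "has_realization": {("risk_1", "diabetes_1")},
    },
)

naive = some("causes", named("Diabetes"))
vacuous = min_zero("causes", named("Diabetes"))
dispositional = some(
    "causes",
    both(named("Disposition"), only("has_realization", named("Diabetes"))),
)

naive_holds = subclass_of(model, "Obesity", naive)
vacuous_holds = subclass_of(model, "Obesity", vacuous)
dispositional_holds = subclass_of(model, "Obesity", dispositional)
print(naive_holds, vacuous_holds, dispositional_holds)  # False True True
```

In this toy interpretation the existential axiom fails because of obesity_2, the "min 0" axiom holds regardless of the data, and the dispositional axiom holds because the universal restriction on has_realization is satisfied even when the disposition has no realization at all.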
2. Results

2.1. Overview of the PopHR Ontology Suite

The PopHR semantic framework includes an application ontology and three domain ontologies (one developed externally), briefly described below. The Public Health Indicators Ontology (PHIO) is an application ontology designed to support the specific needs of the PopHR software, such as presenting health indicators to the user in an organized way, computing and visualizing the results, etc. [6] PHIO includes a taxonomy of health indicators by the Canadian Institute for Health Information (e.g. disease indicators vs. health system performance indicators), and properties allowing classification of health indicators according to other criteria (e.g. rates vs. proportions). PHIO also encodes concepts necessary for data specification and manipulation, and a limited set of relevant statistical methods and temporal concepts.

To contextualize health indicators by public health knowledge, we have developed a Public Health Ontology (PHOnt) incorporating more general concepts and relations, which can be reused by other applications and ontologies. PHOnt is based on the Australian Classification of Public Health Activity [7], and covers a broad domain of issues and methods relevant to public health. PHOnt defines many associative relations, in particular causal pathways from upstream health determinants to diseases. PHOnt imports an existing Disease Ontology [8] and a simple Geography Ontology that we developed to cover a small part of the geospatial domain essential for PopHR.

2.2. Characteristics of the Encoded Causal Relations

Causal links among diseases and health determinants used by PopHR to contextualize health indicators are quite diverse: in some cases a change in the level of a factor can hasten the onset of a disease, in other cases one event triggers another, and in yet other cases a factor has an effect on the severity of a disease rather than on its probability of occurrence.
There are, however, several common properties of all causal links captured in PHOnt: a) they are probabilistic; b) they cannot be quantitatively compared or summed up; c) the beneficial and adverse effects of the same determinant on a health issue are not mutually exclusive; and d) they represent causality at the individual level, which may or may not be observed as a statistical association at the population level (due to ecological bias). These characteristics influenced our formal encoding of causal relations, and they should be taken into account when interpreting the encoded links.
2.3. Encoding Examples

Let us encode our example of the causal link from obesity to diabetes using a dispositional clause. We define the risk of a disease as a disposition to develop the disease, which may or may not be realized in a particular individual, and a property has_effect_on with two sub-properties for positive and negative effects:

    Obesity has_positive_effect_on some RiskOfDiabetes
    RiskOfDiabetes ≡ Disposition and is_realized_in only OnsetOfDiabetes
    OnsetOfDiabetes ≡ Process and results_in some DiabetesMellitus

Here, the risk of diabetes is a personal characteristic (disposition), and it is asserted that for any instance of Obesity there is an instance of RiskOfDiabetes that is affected. In the second line, we assert that if the RiskOfDiabetes is realized, then it is realized in the OnsetOfDiabetes and nothing else; however, it is not mandatory for the risk to be realized. The intermediate concepts RiskOfDiabetes and OnsetOfDiabetes do not have to be encoded in the ontology as named classes. In PHOnt, we combine the three expressions, expanding intermediate concepts with their definitions, and thereby avoid a proliferation of classes.

In the example above, the consequent is an event. To represent links in which the consequent is a change in some measurable property (e.g. amount of physical activity, disease severity), we introduce a class ChangeInLevel (a process that realizes a matching disposition) with subclasses IncreaseInLevel and DecreaseInLevel. The next example shows how PHOnt encodes the link between caloric intake and BMI:

    CaloricIntake has_positive_effect_on some (Disposition and is_realized_in only (IncreaseInLevel and is_change_in_level_of some BMI))

The properties has_positive_effect_on and has_negative_effect_on can only be applied to monotonically increasing or decreasing relationships, but not to more complex ones.
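The expansion of intermediate concepts described above can be sketched as simple string composition. The Python snippet below is an illustrative sketch only, not the tooling used to build PHOnt: the function names (onset_of, change_in_level, dispositional_link) are hypothetical, and the output is plain text in the notation of the examples above rather than a parsed OWL axiom.

```python
def onset_of(disease):
    """Event consequent: a process that results in the given disease."""
    return f"Process and results_in some {disease}"


def change_in_level(direction, quantity):
    """Change consequent: an increase or decrease in a measurable property."""
    if direction not in ("Increase", "Decrease"):
        raise ValueError("direction must be 'Increase' or 'Decrease'")
    return f"{direction}InLevel and is_change_in_level_of some {quantity}"


def dispositional_link(cause, effect_property, realization):
    """Inline the disposition and its realization into a single expression,
    so that no intermediate named classes are needed."""
    return (f"{cause} {effect_property} some "
            f"(Disposition and is_realized_in only ({realization}))")


obesity_axiom = dispositional_link(
    "Obesity", "has_positive_effect_on", onset_of("DiabetesMellitus"))
intake_axiom = dispositional_link(
    "CaloricIntake", "has_positive_effect_on", change_in_level("Increase", "BMI"))

print(obesity_axiom)
print(intake_axiom)
```

The second printed expression reproduces the caloric intake example given above; the first is the obesity example with RiskOfDiabetes and OnsetOfDiabetes replaced by their definitions.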
Although we do not intend to encode the exact functional form of causal links in PHOnt, it is valuable to be able to deal with those cases where the sign of an effect changes for different value ranges of a cause. Consider for example a U-shaped relation between BMI and mortality. To encode a non-linear effect like this, we partition the value range of a cause into two intervals, over which the relationship is monotonically increasing or decreasing. These intervals can be represented as subclasses of the cause: BMI_