Architecture for querying heterogeneous data sources for online decision-support in primary care

Martin Walther ‡, Ken Nguyen ‡, Hugh Garsden ‡, Enrico Coiera ‡ and Nigel H. Lovell, Senior Member ‡†Ψ

‡ Centre for Health Informatics University of New South Wales, Sydney, Australia, 2052. † Graduate School of Biomedical Engineering, University of New South Wales, Sydney, Australia, 2052.

Ψ Author to whom correspondence should be addressed. e-mail: [email protected] phone: +61 (2) 9385 3922 fax: +61 (2) 9663 2108 Submitted on the 1 Sep 2002 as a journal paper to the IEEE Transactions on Knowledge and Data Engineering (Special issue on Mining and Searching the Web). Keywords: heterogeneous databases, information retrieval, decision-support, data source, wrapper, meta search engine

Page 1 of 29

ABSTRACT

An online clinical decision-support system (Quick Clinical) for health professionals is described. The software architecture for Quick Clinical is discussed, along with preliminary technical results from a trial of the system in a primary care medical setting. Quick Clinical is a system that provides the clinician with a tool to facilitate the process of searching for health-related information on the Web. When a search is submitted, Quick Clinical gathers the clinical context and the specific question from the clinician. Using a context greatly simplifies the usage of this search tool, and increases the relevancy of the retrieved result set. The software architecture is based on a meta-search engine approach and comprises a number of components that connect to heterogeneous Web-based and local data sources. A Unified Query Language is proposed that addresses the issue of varying capabilities of diverse data sources. Using twelve data sources in a four-week technical feasibility trial of clinicians (N=21), the average response time of a search was 13.1 seconds. This search speed was perceived as good or excellent by 75% of the clinicians involved in the trial. The majority of clinicians expressed the opinion that the system has the potential to improve patient care (67%).


I. INTRODUCTION

Those involved in the provision of health care need to remain current with respect to the latest knowledge in their field. The problem for these professionals is that the amount of available literature stored in electronic form is increasing exponentially [1, 2]. With Web-based information retrieval systems being used by professionals to support routine work, the amount of information being accessed via the Internet is also increasing. For example, health care professionals in Australia regularly use on-line resources to support clinical care [3], and online access to legal case notes has transformed access to the literature on legal precedent. On-line journals, books and guidelines provide easier access to information, but not necessarily relevant or pertinent information. In order to return the most appropriate clinical information, skilful Web search strategies are required, but the number and variety of Web sites available, and the ways of searching them, can be overwhelming.

Present generation Web search technologies suffer several technical deficits that challenge the sustainability of current approaches. When searches are conducted across multiple and heterogeneous knowledge sources, search results usually produce an excessive number of ‘hits’, many of them failing reasonable tests of relevance to the original user query. Given that professional use of information mostly occurs in time-poor settings, and the consequences of ill-informed decision making are substantial, there is a genuine need to deliver fast and accurate information targeted to the specific needs of information-dependent professionals. Further, as individual knowledge sources become more complex, straightforward indexing of Web pages needs to be replaced by more complex modelling of the capabilities and content of individual knowledge sources on the Web [4].

There is also a growing pressure on Primary Care Physicians (PCPs) to use knowledge that is evidence-based at the point of care. Using quality information sources to obtain knowledge is one way to ensure the validity of the knowledge. However, the current mode of primary care has made it difficult for time-poor PCPs to work effectively with information support [5].

One generic approach has been the use of Meta-Search Engines (MSEs). Instead of searching the Web themselves, MSEs exploit existing search engines to retrieve relevant information, combining the results of the connected search engines and presenting them in a uniform way [6]. This relieves users from having to contact multiple search engines manually and from knowing their native query languages. While this approach has the advantage of integrating multiple Web search engines, their user interfaces use a "Least-Common-Denominator" approach and discard much of the rich functionality of the different knowledge sources they integrate [7]. This weakness is especially obvious when users want to search for scientific publications or specialized information that may require quite specific and complex strategies.

Schmitt et al. [8] describe an MSE that adds certain capabilities to the Least-Common-Denominator idea through the use of filters (e.g. Language=German), sets (e.g. Union to mimic a Boolean OR operation) and expansion operators (to follow up URL links). However, although this MSE provides a list of available data sources, the user still needs to specify which data sources to search in.

Based on our study of the needs of the PCP [1, 5] and the lack of reported systems in the health domain that combine MSE approaches with customizable search features, we have designed and implemented the “Quick Clinical” system to provide just-in-time information for decision support. Quick Clinical (QC) is a general purpose, but highly customizable online search engine based on the MSE approach. This customization process is transparent to the user and allows the system to address the needs of a specific group of users. The current version of QC is tailored for the information needs of PCPs during patient consultations.

Traditional information retrieval systems are based on the library model where one can search for a specific title or document and then browse through the chapters or sections to find the required information. However, in a clinical just-in-time environment this model is inadequate on two counts. Firstly, one needs to know where to search, i.e. which of the various content sources to use. Searching all available sources is generally not an option because it returns far too many results, and many of the results will be irrelevant. Secondly, one needs to know how to search the selected source or sources. In the health domain these parameters constitute the clinical context of the search.

In QC the problem of context is addressed by introducing “search profiles”. A profile is used to describe the context of the clinical question. In a primary care setting such contexts would be diagnosis, treatment, prescribing, etc. A Diagnosis profile, for instance, ‘knows’ the most suitable sources to find information on diagnosis. Additionally, the Diagnosis profile has an understanding of how to best search these data sources when looking for diagnosis information (as opposed to treatment, for example). This is particularly powerful for structured data sources, such as an online medical textbook. Having knowledge of the structure of such a source enables the system to restrict the search to the appropriate areas and therefore improve the relevance of the results. Naturally, this is only feasible for sources whose structure is known (and reasonably stable). However, trustworthy medical sources tend to be mature and do not change frequently. Profiles also hide the complexity of the context from the user, thus simplifying the search definition from the user’s viewpoint.
QC provides pre-defined user profiles that we have developed based on analyses of real-life library searches and user input.

We describe in detail the QC system architecture and propose a Unified Query Language (UQL) and Unified Response Language (UReL) to facilitate the searching of the heterogeneous data sources. Results from a trial of the QC technology in the clinical setting are also presented.

II. THE QUICK CLINICAL SYSTEM

System Overview

QC is a Web application running on a Web server. It is used through a Web browser. PCPs submit information via links and forms, and view information returned to the browser.


Figure 1. Simple search interface for QC where PCPs select a profile and enter keywords.

A search is typically a two-stage process. Initially the user chooses, from a list, the context of the search; in QC this is a pre-defined profile. Being customized for a clinical setting, profiles include topics such as Diagnosis, Prescription, Review and Treatment. In Figure 1 we show the search interface in which the Treatment profile has been selected, which searches documents related to treatment. This profile will allow the system to formulate more effective queries and choose the most appropriate content sources to search. In a second step the user can enter the search keyword. An advanced search mode is available to support the formulation of more complex (Boolean) queries.

Based on the information given, QC will select the most appropriate information sources to find documents that best answer the question. It then submits search queries to the sources and collects the results. Finally, the best results are combined and presented to the user as a list of documents (Figure 2).


Figure 2. Composite list of documents retrieved from multiple data sources for the example search.

QC uses an array of sources that include other Web sites [9, 10, 11, 12, 13, 14, 15, 16, 17, 18] and locally stored and indexed content [19, 20].
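The way a profile turns a keyword search into a set of source-specific sub-queries can be pictured with a small sketch. This is an illustration only, not QC's implementation: the class names, the source identifiers ("textbook", "guidelines", "literature") and the section names are hypothetical, since the paper does not list which sources each profile uses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SourceHint:
    """How one data source should be searched under a given profile."""
    source: str               # hypothetical source identifier
    restrict_to: tuple = ()   # hypothetical section names to search within


@dataclass(frozen=True)
class Profile:
    """A search profile: a clinical context plus its preferred sources."""
    name: str
    hints: tuple = ()


# Illustrative profiles only; the real source lists are not given in the text.
PROFILES = {
    "Diagnosis": Profile("Diagnosis", (
        SourceHint("textbook", ("diagnosis", "clinical features")),
        SourceHint("literature"),
    )),
    "Treatment": Profile("Treatment", (
        SourceHint("guidelines"),
        SourceHint("textbook", ("treatment",)),
    )),
}


def plan_search(profile_name, keywords):
    """Expand (profile, keywords) into one sub-query per suitable source."""
    profile = PROFILES[profile_name]
    return [
        {
            "source": hint.source,
            "keywords": keywords,
            "restrict_to": list(hint.restrict_to),
        }
        for hint in profile.hints
    ]


if __name__ == "__main__":
    for sub_query in plan_search("Treatment", "asthma"):
        print(sub_query)
```

The user supplies only the profile name and the keywords; the choice of sources and the restriction to particular sections stay hidden inside the profile, which is the simplification the profiles are meant to provide.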

System Architecture

When designing the architecture of the system we had to consider not only the functional requirements needed by the PCPs, but also a range of quality of service (QoS) requirements. These non-functional requirements include performance, extensibility, scalability, reliability and availability. A modular and redundant design with low connectivity (loose coupling between components) satisfies most QoS requirements. Low connectivity is achieved by employing XML (eXtensible Markup Language) for inter-component communication. Being an extensible language, XML allows us to define a data structure most suitable for the system. Performance is addressed by introducing parallel processing for slow components (Wrappers and Capability Managers, see below). Figure 3 depicts the architecture overview of Quick Clinical.
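Running the slow components in parallel can be sketched as follows. This is a minimal illustration under stated assumptions, not the QC code: the stub wrappers, their names and their delays are invented, and a thread pool is only one of several ways to achieve the parallelism the text mentions.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def make_stub_wrapper(name, delay, hits):
    """Build a stand-in wrapper that sleeps to imitate a slow data source."""
    def search(keywords):
        time.sleep(delay)
        return [f"{name}: {keywords} ({i})" for i in range(hits)]
    return search


def search_all(wrappers, keywords):
    """Query every wrapper concurrently; collect results and failed sources."""
    results, failed = {}, []
    with ThreadPoolExecutor(max_workers=max(1, len(wrappers))) as pool:
        futures = {pool.submit(fn, keywords): name for name, fn in wrappers.items()}
        for future in as_completed(futures):
            name = futures[future]
            try:
                results[name] = future.result()
            except Exception:
                # One unavailable source should not abort the whole search.
                failed.append(name)
    return results, failed


if __name__ == "__main__":
    stubs = {
        "source_a": make_stub_wrapper("source_a", 0.2, 2),
        "source_b": make_stub_wrapper("source_b", 0.2, 1),
    }
    found, unavailable = search_all(stubs, "asthma")
    print(found, unavailable)
```

With this arrangement the total waiting time is governed by the slowest source rather than by the sum of all sources, and a failing wrapper is reported separately instead of stopping the search.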


[Figure 3 diagram labels: User Interface (Web page); Mediator; Capability Manager; Wrapper Harrison; Wrapper PubMed; Wrapper TGL; XML and HTML links; Internet; local copy of Therapeutic Guidelines; PubMed Web site; Harrison’s Online Web site.]

Figure 3. Quick Clinical system components overview: A search is initiated from the User Interface, which forwards a query (in XML) to the Mediator. The Mediator splits the query into several sub-queries and sends these to the appropriate Wrapper (via a Capability Manager if required). Finally the Wrapper translates the query into the native query language of the data source (e.g. in HTML for Web data sources). Similarly the result from the data source gets translated back into the system’s XML representation and sent back to the User Interface.

i. System Software Components

Data Sources

The information used by the system comprises Web sites, on-line texts and databases. We introduce the generic term “data source” to describe any of these. They are not part of QC but the system must connect to them. The system faces the problem that every data source uses its own unique query language for running searches. Additionally, because of the heterogeneous nature of the data sources, the system has to deal with a number of individual and proprietary search interfaces. Depending on which data sources are used in a search, QC must translate the search query accordingly.

To simplify the search process, wrappers are employed that encapsulate the data sources and provide a unified search interface for use by the other components of QC. In other words, wrappers receive user queries in a unified query language (UQL) and translate them into a source-specific language. Once a query is available in UQL it can be sent to any number of data sources for which a wrapper is available.

Once the search result from the data source is available it has to be translated into a standard result format: the Unified Response Language (UReL). Having such a language is necessary so that results from several wrappers can be combined into a single result set and presented to the user. Additionally it allows other components in the system to modify the result without knowing the source that produced it, e.g. to remove inappropriate documents (see section on Capability Manager).
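The benefit of a common response format can be illustrated with a short sketch that merges results from two sources and filters them without any source-specific logic. The XML element and attribute names used here (results, document, title, url, source) are hypothetical; the actual UReL schema is not shown in this section.

```python
import xml.etree.ElementTree as ET


def to_urel(source, documents):
    """Encode one wrapper's hits as a hypothetical UReL-style XML string."""
    root = ET.Element("results", {"source": source})
    for doc in documents:
        node = ET.SubElement(root, "document")
        ET.SubElement(node, "title").text = doc["title"]
        ET.SubElement(node, "url").text = doc["url"]
    return ET.tostring(root, encoding="unicode")


def merge_urel(urel_strings, blocked_urls=()):
    """Combine per-source results into one element, dropping blocked URLs."""
    merged = ET.Element("results")
    for text in urel_strings:
        part = ET.fromstring(text)
        for doc in part.findall("document"):
            if doc.findtext("url") in blocked_urls:
                continue  # filtered without knowing which source produced it
            doc.set("source", part.get("source", ""))
            merged.append(doc)
    return merged


if __name__ == "__main__":
    first = to_urel("source_a", [{"title": "T1", "url": "http://example.org/1"}])
    second = to_urel("source_b", [{"title": "T2", "url": "http://example.org/2"}])
    combined = merge_urel([first, second])
    print(ET.tostring(combined, encoding="unicode"))
```

Because every wrapper emits the same structure, the merging and filtering code never needs to inspect the original HTML of any source.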

Wrapper

A key aspect of QC is the ability to communicate with numerous heterogeneous information sources. Many of these sources, particularly the ones from the Internet, have their own search engine. In our current search system our MSE is connected to the search engines of knowledge sources by means of wrappers, which are software modules that take care of the source-specific aspects of the MSE [21]. For every search engine connected to the MSE, there is a specific wrapper that translates a UQL query into the native query language and format of the search engine. The wrapper also extracts the relevant information from the HTML result pages returned by the search engine. The wrappers achieve a technical homogenization of each heterogeneous data source.

Figure 4 shows the basic architecture of the wrappers in our current system. Each wrapper is responsible for receiving the UQL, translating it to a query in the data source’s native language, submitting the query, receiving the results as raw (usually HTML) data, and then extracting the data fields to return the processed data. Therefore a wrapper consists of three main components: a Feeder, Extraction Rules, and a Sieve (see Figure 4). The Feeder converts the user query into the native query language of the data source. The data source responds to the query and returns raw HTML data. The Feeder passes the raw data to the Sieve, which converts it to the UReL in XML format by using the extraction rules for the data source. The UReL is then sent back via other components to the user interface, which can interpret the XML and display the results.
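The division of labour between Feeder, Extraction Rules and Sieve can be sketched as a single class. This is a simplified illustration, not the QC wrapper: the unified query is represented as a plain dict rather than UQL, the extraction rules are regular expressions, the result is a list of dicts rather than UReL XML, and the endpoint URL and parameter name are hypothetical.

```python
import re
from urllib.parse import urlencode


class Wrapper:
    """Feeder + extraction rules + Sieve for one hypothetical data source."""

    def __init__(self, base_url, query_param, rules, fetch):
        self.base_url = base_url        # assumed search endpoint
        self.query_param = query_param  # assumed name of the keyword parameter
        self.rules = rules              # field name -> regex with one capture group
        self.fetch = fetch              # callable: URL -> raw HTML text

    def feed(self, query):
        """Feeder: translate the unified query into a native search URL."""
        joiner = f" {query.get('operator', 'AND')} "
        terms = joiner.join(query["keywords"])
        return f"{self.base_url}?{urlencode({self.query_param: terms})}"

    def sieve(self, raw_html):
        """Sieve: apply each extraction rule and zip the fields into records."""
        names = list(self.rules)
        columns = [re.findall(self.rules[name], raw_html) for name in names]
        return [dict(zip(names, row)) for row in zip(*columns)]

    def search(self, query):
        """Run the full cycle: feed, fetch raw data, then sieve."""
        return self.sieve(self.fetch(self.feed(query)))


if __name__ == "__main__":
    def canned_fetch(url):
        # Stands in for an HTTP request so the sketch runs offline.
        return '<li><a href="/d/1">Asthma in adults</a></li>'

    rules = {"url": r'href="([^"]+)"', "title": r'">([^<]+)</a>'}
    wrapper = Wrapper("http://example.org/search", "q", rules, canned_fetch)
    print(wrapper.search({"keywords": ["asthma", "treatment"]}))
```

Passing the fetch function in as a parameter keeps the source-specific parts (URL format and extraction rules) separate from the transport, so a wrapper for a new source mainly needs new rules.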

[Figure 4 diagram labels: UQL; Wrapper; Feeder; Native Query Language; Data Source; raw data (HTML); Sieve; rules; Extraction Rules; UReL.]

Figure 4. Wrapper components overview. The Feeder receives the UQL and converts it to the Native Query Language of the data source. It then sends the native query to the data source, which returns raw data in HTML format. The Feeder passes the raw data to the Sieve, which converts it to the UReL in XML format by using the extraction rules for the data source.

Extraction of Search Result Text

The Sieve component produces XML from HTML based on rules defined within QC. The content of an HTML page can be described as being made up of labels [22]. A Left-Right (LR) delimiter surrounds each label. LR delimiters are simply the text that appears to the immediate left or right of the label. For example, the LR delimiters “href="” and “">” could be used to extract Web links. Therefore, the labels

“
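The idea of extracting labels by their left and right delimiters can be shown with a few lines of code. This is a minimal sketch of delimiter-based extraction in general, not QC's Sieve or its rule format; the function name and the sample HTML are invented for illustration.

```python
def extract_labels(page, left, right):
    """Return every substring of `page` found between `left` and `right`."""
    labels, pos = [], 0
    while True:
        start = page.find(left, pos)
        if start == -1:
            break
        start += len(left)
        end = page.find(right, start)
        if end == -1:
            break
        labels.append(page[start:end])
        pos = end + len(right)  # continue scanning after this label
    return labels


if __name__ == "__main__":
    html = (
        '<a href="http://example.org/a">First</a> '
        '<a href="http://example.org/b">Second</a>'
    )
    # The delimiters from the text: left is href=" and right is ">
    print(extract_labels(html, 'href="', '">'))
```

Each field of a result (link, title, and so on) would use its own delimiter pair, which is why such rules only remain valid while the layout of the source's result pages stays stable.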
