GEOCODING CORONIAL DATA: TOOLS AND TECHNIQUES TO IMPROVE DATA QUALITY
WORD COUNT: 5550 (152 = Abstract; 5398 = Introduction, Methods, Results, Discussion and Conclusion)
KEY WORDS: data quality, data collection, geographic information systems, classification, public health, prevention
ABSTRACT Clinical, administrative and demographic health information is fundamental to understanding the nature of health and evaluating the effectiveness of efforts to reduce morbidity and mortality of the population. The demographic data item ‘location’ is an integral part of any injury surveillance tool or injury prevention strategy. The true value of location data can only be realised once it has been appropriately classified and quality assured. Geocoding as a means of classifying location is increasingly used in various health fields to enable spatial analysis of data. This article reports on research carried out at the National Coroners Information System to identify and measure the factors that may affect the quality of geocoded data. A systematic analysis of the geocoding process identified source documentation, data cleaning, and software settings as key factors. Understanding and application of these processes can improve data quality and therefore inform the analysis and interpretation of this data by researchers.
ACKNOWLEDGEMENTS We would like to acknowledge the contribution of the staff at the National Coroners Information System in supporting this research. Without their support and guidance, the success of this project would not have been possible.
CORRESPONDING AUTHOR Darren Freestone BHIM (Hons) Associate Lecturer Department of Health Information Management School of Public Health & Human Biosciences Faculty of Health Sciences La Trobe University Bundoora VIC 3086 AUSTRALIA Tel: +61 03 9479 6627 Fax: +61 03 9479 1783 Mobile: 0421 280 978 email:
[email protected]
CO-AUTHORS Dianne Williamson BAppSc (MRA), GDipErg Senior Lecturer Department of Health Information Management School of Public Health & Human Biosciences Faculty of Health Sciences La Trobe University Bundoora VIC 3086 AUSTRALIA Dr Dennis Wollersheim BASc (Lethbridge), BSW (La Trobe), PhD (La Trobe) Lecturer Department of Health Information Management School of Public Health & Human Biosciences Faculty of Health Sciences La Trobe University Bundoora VIC 3086 AUSTRALIA Leanne Daking BBus (Info Tech), BHIM Quality Assurance Manager National Coroners Information System (NCIS) Victorian Institute of Forensic Medicine (VIFM) Southbank VIC 3006 Jessica Pearse BIM Manager National Coroners Information System (NCIS) Victorian Institute of Forensic Medicine (VIFM) Southbank VIC 3006
INTRODUCTION The Australian Institute of Health and Welfare (AIHW) describes health information as “… a fundamental component of the evidence base for developing and evaluating health policies and programs.” (Australian Institute of Health and Welfare, 2010: 12). To describe health status, health influences, health interventions, and the healthcare system in Australia, the AIHW uses a conceptual framework consisting of four components: the measures of Health and Wellbeing; the Determinants of health; the Interventions in health; and the available Resources (Australian Institute of Health and Welfare, 2010).
Mortality data is considered a vital measure of health and wellbeing (Abdelhak et al, 2011: 368). Analysis of the patterns of circumstances and causes of death helps to inform policy makers by explaining health status and enabling them to determine and evaluate preventative strategies (Australian Institute of Health and Welfare, 2010). Interventions such as education, immunisation and safety standards are based on the collection and analysis of health information (Australian Institute of Health and Welfare, 2010); therefore valid and effective decisions about the appropriateness of these policies and programs are reliant on the quality of information.
Health information is created when data is made meaningful by placing it in context, for example, by linking the cause of death, activity and location of event for an individual person, or aggregating data on location of event for many patients to identify meaningful patterns (Roberts et al, 2002). Bowker and Star (2000) suggest that classification is the mechanism by which meaning can be achieved. Classification segments data or information, enabling manipulation to produce knowledge (Bowker & Star, 2000). In addition to identifying and classifying clinical information such as injury and causes of death, it is important to classify contextual data such as contributory factors, demographic details of the person, and the place of occurrence. Analysis of Australian Bureau of Statistics (ABS) mortality data has emphasised the importance of detailed documentation of injury events in hospital medical records and death certificates to support research and the development of preventive strategies (McKenzie et al, 2009; Soo et al, 2009).
Location is a vital component of administrative and demographic health information (Roberts et al, 2002; World Health Organisation, 2003). Geographic location or address has the potential to play a pivotal role in the planning, monitoring and assessment of healthcare policy and services. The ‘place effects of health’ is well recognised in the literature and has been proposed as a planning tool for community health services in identifying access and equity issues and causal links (Baum et al, 2010; Han et al, 2010).
Location and address data linked to disease incidence have proven to be fundamental in many epidemiological or public health studies to better understand the distribution or incidence of disease (Bonita et al, 2006; Krieger, 2003; Lake, 2002; McCloskey, 2007). The link has been demonstrated by the ‘fathers’ of medicine, epidemiology and demography, Hippocrates, Snow and Graunt respectively (Bonita et al, 2006; Rothman, 1996). In his treatise ‘Airs, Waters, and Places’, Hippocrates made the link between health and location (Hippocrates, c. 400 BC). The ‘Bills of Mortality’, recording the number and causes of deaths by parish, were first published in Britain in 1592 in response to the plague and were later analysed by Graunt (Rothman, 1996). In 1854, British surgeon Snow identified the link between a recent outbreak of cholera and the faecally contaminated water of a London well by plotting on a map all the locations where the patients resided and the locations of all the city wells (Dominguez, 2002; Khan, 2003; McCloskey, 2007; Snow, 1855).
Injury is a major cause of mortality and valuable linkages can be made by analysing location of deaths resulting from injury. The National Injury Prevention and Safety Promotion Plan: 2004-2014 describes injuries as preventable events and identifies access to quality injury data as a key basis of injury prevention strategy (National Public Health Partnership, 2005). In 2010, road deaths, a leading cause of injury related deaths, totalled 1366. Approximately 46% of these deaths were accounted for in the 17-39 age groups (Department of Infrastructure and Transport, 2009; National Public Health Partnership, 2005).
Geocoding Place may represent a location type (e.g. sporting field) or geographical location (e.g. a specific street address), and this location can be classified in several ways. The Australian Standard AS4590-2006, Geographic Information Systems - Data dictionary for transfer of street addressing information, describes the approved Australian address format (Standards Australia, 2006). An address (i.e. Number and Name of Street, Place Name, State or Territory, and Postcode) is one means by which location may be classified. The Australian Standard Geographical Classification (ASGC) is another system (Australian Bureau of Statistics, 2010). Produced by the ABS, the ASGC classifies location “…for the collection and dissemination of geographically classified statistics…for understanding and interpreting the geographical context of statistics… [and] to improve the comparability and usefulness of statistics generally” (Australian Bureau of Statistics, 2010: v). The ASGC is a hierarchical structure that assigns Australian localities to statistical structures such as Statistical Regions and Districts, and Local Government Areas. Geocoding is another method of classifying geographic location. Geocoding refers to the process of assigning x and y coordinates based on the actual latitude and longitude of the earth’s surface where the location is sited (Summerhayes et al, 2006). Geocoding technology often resides within the more complex Geographical Information Systems (GIS) that manage location-based and related data.
Many applications of GIS and geocoding are identified in the literature. The potential value of geocoding is underscored by various surveys which suggest that as much as “… 80% of all data stored by an organisation has a relationship to a geographic location” (Dominguez, 2002: 1). The technology has been employed in a variety of areas including criminology, politics, computer sciences, police work, archaeology, demography, finance, banking, sales and insurance (Dominguez, 2002; Donnelly et al, 2006; Goldberg et al, 2006; Khan, 2003; Krieger, 2003; McCloskey, 2007).
The biggest development in recent years has been the application of geocoding in the health sector (Goldberg et al, 2006; Khan, 2003; Krieger, 2003; McCloskey, 2007; Vine et al, 1997). This work has predominantly been in the public health and acute disease domains involving environmental health issues, infectious disease incidence and prevalence, sociodemographics, service planning, resource allocation and prevention and awareness programs (Ainsworth & Van Gaans, 2003; Khan, 2003; Kruger et al, 2008; Lake, 2002; McCloskey, 2007; National Coroners Information System, 2007a; Summerhayes et al, 2006; Vine et al, 1997). For example, in the U.S. state of Michigan, geocoding has been used to develop a community prevention program aimed at identifying residents at risk of developing type-2 diabetes (Kruger et al, 2008). In Western Australia, geocoding was used to identify and cross-reference the location and distribution density of dental practices throughout the state in relation to the socioeconomic status of each census district (Kruger et al, 2011). In New South Wales, a Bureau of Crime Statistics and Research report analysed location of liquor outlets and alcohol-related problems using geocoding (Donnelly et al, 2006: 1), while in Victoria, the Police Department uses geocoding to study the residential and intercept locations of drink drivers, to improve direct awareness campaigns within the community. Emergency Ambulance Services throughout Australia have utilised GIS technology to develop despatch systems (McCloskey, 2007). The Department of Health, South Australia, uses GIS technology in their service planning activities (Ainsworth & Van Gaans, 2003) while the Department of Human Services in Victoria uses it for their Cooling Tower Register (Dominguez, 2002). 
In 2006, the National Coroners Information System began geocoding place of incident and place of death for all cases within its database, to improve the research value of the data (National Coroners Information System, 2007a).
The geocoding process consists of three distinct phases: Parsing, Matching and Locating (Davis et al, n.d.; Goldberg et al, 2006) which are demonstrated in Figure 1. ‘Parsing’ is the technique of transforming input data, such as an address or location, into a standardised sequence of values that is recognised by the system. The first part of Parsing, or ‘data cleaning’, is the manual process of correcting spelling or adding missing data. The process where parsed data is compared against a reference address database is referred to as ‘Matching’. Complex algorithms are applied to identify the closest possible match between input data and the reference file (Davis et al, n.d.). In Australia, most geocoding systems use either the Geocoded National Address File (G-NAF) or a proprietary Street-Centreline file as the reference database. The G-NAF contains approximately nine million addresses each with a unique geographic coordinate and is considered the authoritative index of Australian addresses (Blanchfield, 2003; Christen et al, n.d.; Geometry Propriety Limited, n.d.; Mapdata Sciences, 2008). Street-Centreline files use interpolation algorithms to find approximate address matches (Goldberg et al, 2006). The third, two-part phase is ‘Locating’, which involves the assignment of a co-ordinate, usually longitude/latitude, to the matched reference file address, and the tagging or assignment of a geographic boundary to the address. In Australia, the ASGC is used, with the Statistical Local Area (SLA) and Local Government Area (LGA) the most commonly assigned geographic boundaries (Australian Bureau of Statistics, 2010; Blanchfield, 2003; Christen et al, n.d.; Davis et al, n.d.; Goldberg et al, 2006; Mapdata Sciences, 2008; Yu, 1996).
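The three phases can be illustrated with a minimal sketch. It is not the NCIS process or any vendor's algorithm: the two-row reference table, the address pattern, the SLA labels and all function names below are hypothetical, and matching is reduced to an exact lookup where a real geocoder would apply far more tolerant algorithms against G-NAF or a street-centreline file.

```python
import re
from typing import NamedTuple, Optional

class ParsedAddress(NamedTuple):
    number: str
    street: str
    street_type: str
    locality: str
    state: str

# Hypothetical abbreviation table used during parsing.
STREET_TYPES = {"ST": "STREET", "RD": "ROAD", "AVE": "AVENUE", "CT": "COURT"}

# Hypothetical reference file: invented addresses, coordinates and SLA labels.
REFERENCE = {
    ParsedAddress("12", "EXAMPLE", "STREET", "SAMPLETOWN", "VIC"): (-37.80, 144.96, "SLA-A"),
    ParsedAddress("40", "DEMO", "ROAD", "TESTVILLE", "NSW"): (-33.87, 151.21, "SLA-B"),
}

ADDRESS_PATTERN = re.compile(
    r"^\s*(\d+)\s+([A-Za-z ]+?)\s+([A-Za-z]+)\s*,\s*([A-Za-z ]+?)\s*,\s*([A-Za-z]+)\s*$"
)

def parse(raw: str) -> Optional[ParsedAddress]:
    """Parsing: turn free text into a standardised sequence of values."""
    found = ADDRESS_PATTERN.match(raw)
    if found is None:
        return None
    number, street, street_type, locality, state = (g.strip().upper() for g in found.groups())
    street_type = STREET_TYPES.get(street_type, street_type)
    return ParsedAddress(number, street, street_type, locality, state)

def match(parsed: ParsedAddress) -> Optional[ParsedAddress]:
    """Matching: compare the parsed address with the reference file (exact match only)."""
    return parsed if parsed in REFERENCE else None

def locate(matched: ParsedAddress) -> dict:
    """Locating: assign the coordinate and tag the geographic boundary."""
    lat, lon, sla = REFERENCE[matched]
    return {"lat": lat, "lon": lon, "sla": sla}

def geocode(raw: str) -> Optional[dict]:
    parsed = parse(raw)
    matched = match(parsed) if parsed is not None else None
    return locate(matched) if matched is not None else None
```

For instance, `geocode("12 Example St, Sampletown, VIC")` returns the coordinate and SLA label of the first reference row, while free text such as `"opposite the town hall"` fails at the parsing phase and returns `None`.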
[Insert Figure 1]
The biggest issue with the application of geocoding is accuracy. Two types of accuracy are discussed in the literature, positional accuracy and attributable accuracy. Positional accuracy, which refers to how close the assigned coordinate is to the actual location (Yu, 1996), is particularly important during the ‘Locating’ phase. If the assigned coordinate and the ‘real’ location are too far apart, an incorrect geographic boundary may be tagged, causing the case to be incorrectly classified. Attributable accuracy refers to the actual address, in text form, used to generate the coordinate (Yu, 1996). If data is incorrect at the point of collection or data entry, a mismatch in the geocoding process can result. Figure 2 demonstrates positional and attributable accuracy. Figure 3 outlines some other common issues associated with poor geocoding accuracy.
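Positional accuracy can be quantified as the distance between the assigned coordinate and the actual location. The sketch below uses the haversine great-circle formula on a spherical Earth, which is one common approximation and not necessarily the measure used in the cited literature; the function names and the tolerance parameter are illustrative assumptions.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres (spherical approximation)

def haversine_m(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in metres between two latitude/longitude points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = math.sin(d_phi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

def within_tolerance(assigned, actual, tolerance_m: float) -> bool:
    """True if the assigned (lat, lon) lies within tolerance_m of the actual (lat, lon)."""
    return haversine_m(*assigned, *actual) <= tolerance_m
```

A positional error that is small in metres can still matter if the address lies close to an SLA or LGA boundary, so a distance check of this kind complements, rather than replaces, a check of the tagged boundary.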
[Insert Figure 2]
[Insert Figure 3]
The National Coroners Information System The National Coroners Information System (NCIS), one of the key injury surveillance tools in Australia (Driscoll et al, 2003), is a database managed by the Victorian Institute of Forensic Medicine (VIFM) which collects information about all deaths reported to a Coroner in Australia. The NCIS contains a variety of information relating to each death, including demographic details, cause of death, intent, and place of incident and death. A review of sources of injury mortality data in Australia identified the NCIS as the ‘…richest source of information about deaths and the circumstances that surrounded them…’ emphasising the importance of the NCIS as the basis of injury research (Kreisfeld & Harrison, 2007: 50). Kreisfeld and Harrison (2007) acknowledge the importance of the NCIS database because of the detailed data which is received from multiple sources and reviewed by a coroner. For each recorded fatality on the NCIS, three addresses are mandatorily recorded: residential; incident; and death.
The true worth of location data in the NCIS can only be fully realised if it is of a high quality, with integrity in the value of the actual data and the mechanism(s) by which it was captured. In 2006 work began on geocoding the addresses on the NCIS database (National Coroners Information System, 2007a). A research project conducted in 2008 investigated how geocoding improves the quality of location or address data for deaths resulting from transport accidents. Factors that may affect the quality of geocoding were also investigated. The project focused on the geocoding of Motor Vehicle Accidents (MVA) due to the large number of such cases in the database, and because the NCIS identified this as a priority area.
The research project drew on the concepts of Quality Management Theory and, more specifically, the framework of the American Health Information Management Association’s (AHIMA) Data Quality Management Model (DQMM) (Cassidy et al, 1998). The DQMM as shown in Figure 4 presents a model of how the functions and characteristics of data quality are related. These functions include data: application; collection; warehousing; and analysis. The DQMM also identifies ten data quality characteristics: accessibility; consistency; currency; granularity; precision; accuracy; comprehensiveness; definition; relevancy; and timeliness (Cassidy et al, 1998).
[Insert Figure 4]
METHODS An investigation of the nature of geocoded NCIS mortality data and the factors impacting on its quality was conducted in 2008. The primary focus of this research was on the DQMM functions of collection and analysis, and the data characteristics: accuracy, ensuring data validity and the values of the data; consistency, the characteristic of data values being reliable and remaining the same across applications; and granularity, the attributes and values of data being defined at the correct level of detail (Braden et al, 2007; Cassidy et al, 1998). These were selected due to their relevance to address or location data. The collection and analysis functions are of most interest here as they can have a significant effect on how the data is captured, classified and translated into information. Table 1 demonstrates the relationship of the DQMM elements to location data.
[Insert Table 1]
The aims of this research were to: 1. examine trends in the use of geographic location data extracted from the NCIS for research purposes; 2. establish the factors that impact on the quality of geocoded location data in terms of accuracy, consistency and granularity, and the extent of their impact; and 3. determine the extent to which these factors can be controlled or manipulated.
Due to the highly sensitive nature of this NCIS data (Coroners Act, 1985), and to meet legislative requirements, measures to ensure the privacy of the deceased were implemented (Information Privacy Act, 2000). Ethics approval for the project was granted by the Department of Justice Human Ethics Committee and the La Trobe University Faculty of Health Sciences Human Ethics Committee. The research was conducted under the guidance and full approval of the NCIS Unit. Privacy protection measures included the de-identification of all data at the NCIS Unit.
Several activities were undertaken to achieve the aims of the project.
Identifying the need for location data To establish the perceived value of geocoded location data, an analysis of requests from researchers for location-specific data from the NCIS database was completed, focusing on requests for mortality information by geographic location. The NCIS search request database was reviewed for the six-year period from 1 January 2002 to 31 December 2007. The type of location data requested, such as state or territory, locality, suburb, post code, Statistical Local Area (SLA), Local Government Area (LGA), rural, urban or coastal area, was then analysed.
Identifying factors that impact on geocoding quality To identify the factors or variables that have the greatest impact on the quality of geocoded location data, the current processes of the NCIS were reviewed to establish how address or location data is identified and then coded or geocoded. Relevant NCIS procedure documentation and source documentation such as coronial findings and police reports were reviewed. NCIS staff were consulted to establish jurisdictional coder activities which identify address data from source documentation and entry into the NCIS database. This data extraction and entry process was observed to identify activities including data cleaning, which may influence the way data are captured and then geocoded. Once all activities were identified, processes and data sources were assessed to identify any potential cause of error. These were then grouped thematically in the categories of source documentation, data extraction and transposition, data cleaning and geocoding.
Identifying the impact of the key factors on geocoding quality To determine the extent to which each of the factors identified in the previous stage of the research affect geocoding quality, a study of incident location data within the NCIS was carried out. Cases were sampled from the 2004 and 2006 calendar years. The NCIS unit had already completed geocoding all 2005 and 2006 cases, so the years 2004 and 2006 were selected to examine whether the commencement of geocoding had an influence on the manner in which the procedures for dealing with the data were applied, and therefore whether the quality of the geocoding was affected. Due to the size of the data and time constraints, it was determined that this part of the research would focus on just one type of death, that is, transport injury events or Motor Vehicle Accidents (MVAs). This focus was selected because of the consistent nature of this particular data set, and because MVAs are a major cause of injury deaths in Australia (Driscoll et al, 2003).
To ensure all relevant cases were identified, the following query protocol was followed. First, all cases from all Australian states and territories reported in 2004 and 2006, where the primary or secondary mechanism of death was ‘Blunt Force’ and / or ‘Transport Injury Event’ respectively, were extracted. Then, using a systematic sampling method, every tenth case from each of the two years was selected until two sample sets were fully populated with 100 cases each. Both the ‘Incident Address’ and ‘Death Address’ for each case were extracted. In most cases, the location or address given was the same, so a decision was made to analyse the data using only the incident address field.
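The systematic sampling step described above can be sketched as follows. The function name and parameters are illustrative, and the sketch assumes the extracted cases are held in a fixed order and that at least 1,000 cases are available per year; the text does not say what was done if a year yielded fewer.

```python
def systematic_sample(cases, step=10, size=100):
    """Select every `step`-th case, in order, until `size` cases are chosen."""
    every_nth = cases[step - 1::step]  # the 10th, 20th, 30th ... case
    return every_nth[:size]
```

With 1,500 ordered cases, for example, the sample consists of the 10th, 20th, and so on up to the 1,000th case, giving 100 cases.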
Incident addresses of the sample cases were geocoded following the established NCIS processes and using the geocoding software package licensed by the NCIS Unit, QuickLocate 3 Desktop Geocoder (Mapdata Sciences, 2008). The geocoding software produces a geocode or an x / y geographic coordinate after matching an address on the input dataset against the address on the software reference file. Each geocode has a match result attached to it, and if requested, a geographic boundary such as an SLA.
The software produces a numerical six-digit code to describe how well the geocoder was able to match the input address data against that in the reference file. Each of the digits represents a component of an address and the degree to which the corresponding component of the input address could be matched. The assigned SLA represents the geographic boundary in which the reference file address is located.
The quality or extent of the match can be affected by the data quality characteristics of the input data and by how the matching algorithms of the software have been configured. In most software packages these settings, or delimiters, are referred to as geocoding options. Geocoding options include checking for misspellings, searching in surrounding localities, ignoring street type, building, unit number and locality suffixes or prefixes (e.g. Ballarat versus Ballarat North), accepting nearby building numbers up to a defined limit (e.g. up to 20 numbers away), and offsetting from the street centreline up to a defined limit (e.g. up to 10 metres). The geocoding options allow for variations in the matching process should a perfect match be unattainable.
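To show how such options relax the matching step, the sketch below implements two of them: a misspelling check and a tolerance on building numbers. It is an illustration only and does not reproduce the behaviour of QuickLocate or any other package; the option names, the tiny reference table and the similarity cutoff of 0.8 are all assumptions.

```python
from dataclasses import dataclass
from difflib import get_close_matches

@dataclass
class GeocodingOptions:
    check_misspellings: bool = False  # try a fuzzy match on the street name
    max_number_offset: int = 0        # accept building numbers up to this far away

# Hypothetical reference file: street name -> building numbers on record.
REFERENCE = {"EXAMPLE STREET": [2, 10, 30], "DEMO ROAD": [5, 45]}

def match_address(number: int, street: str, options: GeocodingOptions):
    """Return (building number, street) from the reference file, or None if no match."""
    street = street.upper()
    if street not in REFERENCE and options.check_misspellings:
        candidates = get_close_matches(street, list(REFERENCE), n=1, cutoff=0.8)
        if candidates:
            street = candidates[0]
    if street not in REFERENCE:
        return None
    nearest = min(REFERENCE[street], key=lambda ref_no: abs(ref_no - number))
    if abs(nearest - number) <= options.max_number_offset:
        return (nearest, street)
    return None
```

With every option switched off, `match_address(12, "Exampel Street", GeocodingOptions())` finds no match. With `GeocodingOptions(check_misspellings=True, max_number_offset=20)` the same input resolves to number 10 on EXAMPLE STREET, which shows how relaxed options trade match rate against positional certainty.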
Both samples of 100 cases were geocoded four times, each time with a slight variation in the settings of the geocoding options. At each step the geocoding, match results and tagged SLAs for each case were recorded.
1. Cases were geocoded without being cleaned and without any ‘delimiters’ activated. 2. Cases were geocoded again; still unclean, but with all the delimiters activated to match normal NCIS processes.
The cases were then cleaned. Cleaning is the manual process of correcting spelling or adding missing data to each case. This is a labour intensive and time-consuming activity, and several tools such as street directories, postcode books and online satellite mapping systems were used to check the accuracy of the data. Data cleaning also included:
• Ensuring that data displayed on an Excel spreadsheet is located in the correct cells;
• Removing all extraneous text which the geocoder cannot process, e.g. “… 100m from the intersection of…”, “… opposite the town hall…”;
• Reviewing and correcting all street types such as ‘road’, ‘street’, ‘avenue’, or ‘court’.
3. The cleaned data was geocoded for the third time, once more with all delimiters deactivated. 4. The cleaned data was geocoded a final time with all delimiters activated.
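Cleaning in this study was a manual process, but two of the steps listed above (removing extraneous text and correcting street types) lend themselves to a partial automated sketch. The patterns and abbreviation table below are hypothetical and cover only the quoted examples; a production rule set would be much larger, and an abbreviation such as ‘St’ can also mean ‘Saint’, so output of this kind would still need manual review.

```python
import re

# Hypothetical abbreviation table for street types.
STREET_TYPES = {"RD": "ROAD", "ST": "STREET", "AVE": "AVENUE", "CT": "COURT"}

# Hypothetical patterns for free text that a geocoder cannot process.
EXTRANEOUS = [
    r"\b\d+\s*k?m\s+from\s+the\s+intersection\s+of\b",
    r"\bopposite\s+the\s+town\s+hall\b",
]

def _expand_street_type(found):
    word = found.group(0)
    return STREET_TYPES.get(word.upper(), word)

def clean_address(raw: str) -> str:
    """Strip extraneous phrases, tidy spacing and expand street-type abbreviations."""
    text = raw.replace(".", "")
    for pattern in EXTRANEOUS:
        text = re.sub(pattern, " ", text, flags=re.IGNORECASE)
    text = re.sub(r"\s+", " ", text)   # collapse runs of whitespace
    text = re.sub(r"\s+,", ",", text)  # remove any space left before a comma
    text = text.strip(" ,")
    text = re.sub(r"[A-Za-z]+", _expand_street_type, text)
    return text.upper()
```

For example, `clean_address("100m from the intersection of Example Rd, Sampletown")` yields `"EXAMPLE ROAD, SAMPLETOWN"`.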
All Match Results and SLAs were then analysed to identify any differences in the data from the two sample years to see how the cleaning and setting of delimiters affected the quality of geocoding. Match Results from all eight geocoding episodes (2004 and 2006) were analysed to determine the distribution and percentage of each Match Result category for:
• All eight geocoding episodes combined
• Each of the sample years (i.e. before and after NCIS geocoding was introduced)
• Clean versus unclean data
• Delimiters activated versus deactivated
SLAs were analysed to determine when they were more likely to change, that is, when the data was:
• Unclean and delimiters were either activated or not;
• Clean and delimiters were either activated or not; and
• Clean versus unclean regardless of the activation of delimiters.
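The tabulation of Match Results by condition can be sketched as below. The record layout (keys `year`, `clean`, `delimiters` and `match`) and the four example records are invented for illustration and are not NCIS data.

```python
from collections import Counter

def match_distribution(records, **conditions):
    """Percentage of each Match Result category among records meeting the conditions."""
    selected = [r for r in records if all(r[key] == value for key, value in conditions.items())]
    counts = Counter(r["match"] for r in selected)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {category: round(100 * n / total, 1) for category, n in counts.items()}

# Invented example records, one per case per geocoding episode.
records = [
    {"year": 2004, "clean": False, "delimiters": False, "match": "No Match"},
    {"year": 2004, "clean": True, "delimiters": False, "match": "Perfect Match"},
    {"year": 2006, "clean": True, "delimiters": True, "match": "Perfect Match"},
    {"year": 2006, "clean": True, "delimiters": True, "match": "Poor Match"},
]
```

Calling `match_distribution(records)` covers all episodes combined, while keyword filters such as `year=2006`, `clean=True` or `delimiters=False` give the other breakdowns listed above.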
RESULTS The results of the research are presented under each of the three subheadings described in the methods section.
Identifying the need for location data An analysis of requests for reports of data from the NCIS database for the study period 2002 – 2007 identified an increase from 18 requests in 2002 to 173 requests in 2007. The proportion of requests specifying a requirement for location data ranged from 16.67% in 2002 (n=3) to 38.73% in 2007 (n=67). Since 2002, there has been an increase in requests for all location types. As the basis of analysis, the researcher categorised the location types of the identified requests. These categories were ‘State or Territory’, ‘Locality’, which included SLAs, suburbs, shires or postcodes and ‘Other’, which included Urban, Rural or Coastal areas. For the 2007 location requests, 74.63% required State or Territory (n=50), 20.90% required Locality (n=14) and 4.48% required Other (n=3) as the location type. See Figure 5. [Insert Figure 5]
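The reported proportions follow directly from the counts given above, as this short check shows. The helper name is illustrative.

```python
def percentage(part: int, whole: int) -> float:
    """Share of `whole` represented by `part`, as a percentage to two decimal places."""
    return round(100 * part / whole, 2)

location_share_2002 = percentage(3, 18)    # requests specifying location data, 2002
location_share_2007 = percentage(67, 173)  # requests specifying location data, 2007
state_share_2007 = percentage(50, 67)      # 'State or Territory' among 2007 location requests
locality_share_2007 = percentage(14, 67)   # 'Locality' among 2007 location requests
other_share_2007 = percentage(3, 67)       # 'Other' among 2007 location requests
```

The three 2007 location types (50, 14 and 3 requests) sum to the 67 location requests recorded for that year.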
Identifying factors that impact on geocoding quality The review of the data collection and recording processes of the NCIS identified many factors that could positively or negatively affect geocoding quality. The most critical of these factors were:
Source Documentation - This includes police narratives or summary reports, autopsy and toxicology reports, and coronial findings. Coroners and a variety of external sources including police and pathologists prepare the documents. They are attached to the record of the corresponding case on the NCIS database as they become available. This may occur during the initial creation of the record, or later if the documentation is not yet complete or ready for submission.
Data Extraction and Transposition - In addition to attaching whole source documents, individual state and territory coronial offices extract data from the source documents and add case details to their local case management systems (Driscoll et al, 2003). Data is uploaded on a nightly basis from these local systems to the NCIS database. Quality assurance activities including audits are carried out on a regular basis to ensure the quality and completeness of the data (National Coroners Information System, 2007a, b). Other measures in place to manage data quality include Coder Training workshops, an NCIS helpdesk, a ‘Local Case Management System’ data entry manual and an NCIS Data Dictionary.
Data Cleaning and Geocoding - The process of data cleaning and geocoding performed by the NCIS is quite extensive. The process is repeated on multiple occasions to maximise the accuracy of the geocodes. Examples of these processes are shown in Table 2. [Insert Table 2]
Identifying the impact of the key factors on geocoding quality As described in the methods section, cases sampled from the NCIS database were geocoded under a variety of conditions to establish how geocoding quality is impacted by software option settings. After geocoding the sample cases, the result codes and tagged or assigned SLAs were analysed to determine under which of the test conditions these were more likely to change. The purpose of this was to identify any differences in the data from the two sample years and to determine how these factors had affected the quality of the geocoding. A summary of the results is presented below.
Result Codes - 27 of a possible 2880 result codes were produced when the sample cases were geocoded according to the methods previously outlined. A description of these codes is provided in Table 3. [Insert Table 3] Figure 6 illustrates the Result Codes produced when the sample cases were geocoded under the four conditions: clean, unclean, with delimiters and without delimiters. This process demonstrated the impact of data cleaning and geocoding system delimiters on the quality of the data output. Figure 7 shows the Result Codes produced according to whether the input data was cleaned or uncleaned, regardless of whether delimiters were used. [Insert Figure 6] [Insert Figure 7] When the delimiters were not activated, the number of ‘Perfect Match’ and ‘Near Perfect Match’ codes increased once the input data was cleaned, from n=7 to n=85 and from n=14 to n=113 respectively. In contrast, the number of ‘Poor Match’ and ‘No Match’ codes decreased from n=73 to n=32 and from n=58 to n=15 respectively. Similar results were produced when the delimiters were activated.
When the results were analysed according to clean versus unclean input data alone regardless of the delimiters, similar trends were found. When the input data was not cleaned, the ratio of ‘Perfect Match’ and ‘Near Perfect Match’ codes against that of ‘Poor Match’ and ‘No Match’ codes was 1.6 : 1 (n=243 : n=157). When the input data was cleaned, the ratio of ‘Perfect Match’ and ‘Near Perfect Match’ codes against that of ‘Poor Match’ and ‘No Match’ codes was substantially greater at 2.5 : 1 (n=286 : n=114).
SLA Assignment - The SLAs assigned were analysed to determine which conditions were present when changes in SLAs occurred. The results are shown in Figure 8. The SLA assigned did not change for 85.5% of the sample cases (n=342). When the input data had not been cleaned, activating the delimiters caused the assigned SLA to change in 25% of the cases (n=50). When the input data had been cleaned, activating the delimiters caused the assigned SLA to change in 4% of the cases (n=8). [Insert Figure 8]
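A comparison of this kind amounts to counting, case by case, how often the SLA tagged with delimiters activated differs from the SLA tagged without them. The sketch below uses a hypothetical function name and invented SLA labels. The reported percentages are consistent with 200 cases per cleaning state (the two samples of 100) and 400 comparisons overall; that reading is an inference from the figures rather than something stated in the text.

```python
def sla_change_rate(without_delimiters, with_delimiters):
    """Count and percentage of cases whose assigned SLA differs between two runs."""
    changed = sum(a != b for a, b in zip(without_delimiters, with_delimiters))
    return changed, round(100 * changed / len(without_delimiters), 1)

# Invented labels for four cases: one SLA changes when delimiters are activated.
example = sla_change_rate(["A", "B", "C", "D"], ["A", "X", "C", "D"])

# Arithmetic check of the reported figures under the inferred denominators.
unclean_changed_pct = 100 * 50 / 200  # SLA changed, unclean input
clean_changed_pct = 100 * 8 / 200     # SLA changed, cleaned input
unchanged_pct = 100 * 342 / 400       # SLA unchanged, all comparisons
```

Under that reading the figures reconcile: 400 comparisons less the 50 and 8 changes leaves the 342 unchanged assignments reported.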
DISCUSSION AND CONCLUSION The examination of the NCIS search request database identified a trend of increasing usage of the NCIS since 2002, as the database has become more substantial and better known. This trend is expected to continue as more work goes into improving and maintaining the quality of the data within the database.
Of the requests for location-based searches, requests by State were the most common. This is not surprising, as state is a common parameter for the analysis and comparison of mortality data. Until the introduction of geocoding of NCIS data in 2007, analysis of location data within the database, other than by state or postcode, was extremely difficult. Geocoding of location data will enable researchers to study NCIS data by location more easily, and at more discrete levels of granularity. This will further improve the value of the database. A greater awareness of NCIS data quality is expected to result in an increase in requests for access to the data, further enhancing the value of the data as an injury surveillance tool (Driscoll et al, 2003; Kreisfeld & Harrison, 2007).
The second aim of the research was to explore the data collection and recording processes of the NCIS to identify the factors that can impact on geocoding quality. These processes include the extraction of geographic location from the source documentation, transposition of this data into the database, cleaning of the data and coding the data by geocoding.
Both the literature and the results of the research indicate that source documentation can affect geocoding quality for the following reasons (Goldberg et al, 2006):
• It may not be received by the NCIS.
• It may contain inaccurate information.
• The information in the source documentation may be incomplete or ambiguous.
• The information may be misinterpreted due to the use of abbreviations.
Clerks in each coronial office enter information about a death into the relevant ‘local case management system’ when it is first reported. Specially trained NCIS personnel then use a variety of mechanisms and quality tools, including coder training, manuals, data dictionaries, audits, validation checks and system edits, to ensure the quality of this process (Driscoll et al, 2003; National Coroners Information System, 2007a, b). Human error in extracting and transposing the data is to be expected and could therefore affect geocoding quality. The quality assurance measures in place in each of these systems, however, are expected to identify most of these errors (National Coroners Information System, 2007a, b).
The NCIS cleans all address data before it is entered into the geocoder for processing. The results of the research indicate that data cleaning, if not performed, will have a very significant impact on the quality of geocoding, especially if the data is quite ‘dirty’; that is, if it has not been extracted and transposed accurately or if the original source documentation was poor. It should be noted that the NCIS does perform regular audits on the quality of its data, although the focus is more on clinical rather than demographic data items.
The review of the processes found that each of the key elements of the geocoding process is a factor that may have a positive or negative impact on the quality of geocoded location data. In terms of the data quality characteristics outlined earlier, source documentation, data extraction and transposition, data cleaning and geocoding all influence the accuracy and consistency of the data. While the same can be said of granularity, this data quality characteristic is affected mostly by the quality of the source documentation and the data cleaning process.
The final aim of the research was to investigate the extent to which data cleaning and the use of geocoding system settings or delimiters play a role in data quality. A large part of the research entailed extracting a sample set of cases from the NCIS and then geocoding it under varying conditions to see how cleaning and geocoder system settings impact on the quality of geocoding. An examination of the match results for cases when the data was clean versus when it was unclean, irrespective of whether system delimiters had been set or not, revealed that data cleaning had a significant positive impact on geocoding quality. Clean data had a greater proportion of cases with ‘Near Perfect Match’ codes than unclean data. Clean data also had more ‘Perfect Match’ codes, while the unclean data had more ‘No Match’ codes. Data cleaning improved the ability for a location or address to be matched, and with a greater degree of granularity.
Match results for cases where the data did or did not have delimiters set, irrespective of the cleanliness of the data, revealed that these system settings also have a significant impact on geocoding quality. Data with delimiters set had a similar proportion of ‘Perfect Match’ codes to data without delimiters (6.00% versus 5.25% in Table 3), but a greater proportion of ‘Near Perfect Match’ codes (70.75% versus 50.25%). Data without delimiters set had a significantly greater proportion of ‘Poor Match’ and ‘No Match’ codes. These results demonstrated that, without the setting of system delimiters, locations or addresses are difficult to match unless the input data is near perfect.
These findings were confirmed when the SLAs assigned were analysed. When unclean data was geocoded, first with delimiters deactivated and then with them activated, the SLA assigned changed in 25% of the cases. When the data was then cleaned, only 4% of the cases had a change in SLA assigned between the deactivation and activation of delimiters.
The results clearly demonstrated that of all the factors that affect the quality of geocoding, data cleaning has the greatest impact. Data cleaning, however, is an extremely labour intensive and time consuming activity to perform. Many hours were spent during this research cleaning the data from the original data sets. It is quite daunting to consider undertaking data cleaning of the tens of thousands of cases within the NCIS yet to be geocoded. Yet, as the results have demonstrated, to achieve a high level of geocoding quality it is imperative to perform this task. Data cleaning can be made a little easier if address data is well documented in the source documents and is accurately extracted and transposed into the database by coronial clerks. Well developed and clearly documented procedures for data cleaning can also reduce the time spent on this activity. Ultimately, the time spent data cleaning needs to be balanced against the level of quality needed, assessing the level of quality that is reasonably attainable and the benefits of clean data to researchers. For instance, knowing that NCIS data is routinely cleaned as part of the geocoding process will give future researchers confidence in the data and therefore in the findings of their research.
The research also found that system settings or delimiters, to a lesser extent, play a role in the quality of geocoded location data. Source documentation and the process of extracting data from these documents can also have a big impact on quality, although controlling this is difficult. Well developed Continuous Quality Improvement protocols and practices can be used effectively to minimise this impact on data quality (Simpson et al, 2005).
The reference file type used by the software is also a factor which can impact on geocoding quality (Davis et al, n.d.). The NCIS uses a street-centreline or address range reference file. If another reference file type, such as G-NAF, were used, different coordinates may be assigned to each address. These differences could lead to different geographic boundaries being assigned, especially if the coordinates are far enough apart. This was highlighted in the literature, but not considered in this project. The impact of different reference files on geocoding is a potential area for future research. This is especially important when the same original address may be geocoded by different researchers using different software.
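To illustrate why the reference file matters: an address range geocoder typically estimates a position by interpolating along a street segment, whereas a point-based file such as G-NAF holds a coordinate for each address. The sketch below shows simple linear interpolation only. The segment endpoints, the address range and the function name are invented for illustration and do not describe the software used by the NCIS.

```python
def interpolate_address(number, range_start, range_end, start_xy, end_xy):
    """Estimate a coordinate for a house number on a street segment.

    Assumes house numbers are spread evenly between the two ends of the
    segment, which is the simplifying assumption behind address range
    geocoding.
    """
    low, high = sorted((range_start, range_end))
    if not low <= number <= high:
        raise ValueError(f"{number} is outside the range {low}-{high}")
    if range_start == range_end:
        return start_xy
    fraction = (number - range_start) / (range_end - range_start)
    x = start_xy[0] + fraction * (end_xy[0] - start_xy[0])
    y = start_xy[1] + fraction * (end_xy[1] - start_xy[1])
    return (x, y)


# Hypothetical segment covering numbers 601 to 699 on one side of a street,
# 980 metres long in a local planar coordinate system.
estimate = interpolate_address(641, 601, 699, (0.0, 0.0), (980.0, 0.0))
print(estimate)
```

If a building actually sits elsewhere along the segment, the interpolated point and a surveyed address point will differ, and close to a boundary the two may fall in different SLAs.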
As noted earlier, the NCIS Unit geocodes all external cause deaths (including MVAs) using software with a street-centreline reference file. The Fatal Road Crash Database (FRCD), maintained by the Australian Transport Safety Bureau (ATSB), geocodes cases relating to the exact same incidents with software that uses the G-NAF reference file (Australian Transport Safety Bureau, 2008). Until there is further research we will not know how the output of these two systems compares, or the level of concordance between them. If differences do exist, this may impact on the accuracy of conclusions reached by comparing reports from the respective databases.
Researchers and other users of these databases should be made aware of these differences in software and system settings so that they can consider them when analysing and interpreting the results of their research. Further investigation of these questions is required to better understand the quality of geocoded location data, especially if it is to be part of any injury surveillance mechanism or to inform any policy development.
FIGURES AND TABLES
Figure 1: The 3 Phases of Geocoding.
PARSING: Input Address → Data Cleaning → Geocoder Parsing
MATCHING: Search Reference Database (G-NAF) or (Street Centre-line) → Find Match → Match Found? (No: Relax Search Parameters and search again; Yes: continue)
LOCATING: Assign Coordinate → Assign Geographic Boundary → Output Result
The 3 Phases of Geocoding. (Adapted from Goldberg et al, 2006)
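The three phases in Figure 1 can be sketched in a few lines of code. This is a toy illustration under stated assumptions: the reference entries, coordinates and function names are invented, the only relaxation step is a fall-back from building level to suburb and postcode level, and the assignment of a geographic boundary is omitted. It is not the algorithm used by any particular geocoder.

```python
import re

# Toy reference file: (number, street, suburb, postcode) -> coordinate.
# All entries are invented for illustration.
REFERENCE = {
    ("641", "MAIN STREET", "SMITHVILLE", "3333"): (144.90, -37.80),
}
SUBURB_CENTROIDS = {("SMITHVILLE", "3333"): (144.91, -37.81)}


def parse(address):
    """PARSING: normalise the text and split it into components."""
    cleaned = re.sub(r"[^\w\s,]", "", address).upper()
    parts = [p.strip() for p in cleaned.split(",")]
    if len(parts) != 4:
        return None
    street_line, suburb, _state, postcode = parts
    number, _, street = street_line.partition(" ")
    return number, street, suburb, postcode


def match(parsed):
    """MATCHING: try an exact match, then relax to suburb and postcode level."""
    if parsed in REFERENCE:
        return "building", REFERENCE[parsed]
    relaxed_key = parsed[2:]
    if relaxed_key in SUBURB_CENTROIDS:
        return "suburb", SUBURB_CENTROIDS[relaxed_key]
    return "none", None


def geocode(address):
    """LOCATING: return the match level and coordinate for an input address."""
    parsed = parse(address)
    if parsed is None:
        return {"level": "none", "coordinate": None}
    level, coordinate = match(parsed)
    return {"level": level, "coordinate": coordinate}


print(geocode("641 Main Street, Smithville, Victoria, 3333"))
print(geocode("640 High St, Smithville, Victoria, 3333"))
```

The second call shows the effect of relaxing the search: the street cannot be matched, so the result drops to a coarser level of granularity rather than failing outright.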
Figure 2: Positional and Attributable Accuracy.
[Diagram: a section of MAIN STREET with street numbers 618 to 621 and 659 to 662, marking (♦) the location of the assigned coordinate and the actual / real location of the address.]
Actual Address: 641 Main Street Smithville Victoria 3333
Possible Input Errors (Attributable Accuracy):
640 Main St, Smithton, Victoria, 3332
641 Main Road, Smithville, Tasmania, 3333
641 High Street, Smithville, Victoria, 3333
641 Mayne Road, Smithville, Victoria, 3000
Positional Accuracy [Adapted from: Goldberg]; Attributable Accuracy [Adapted from: Yu]
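The attributable errors in Figure 2 can be made explicit by comparing each recorded address with the actual address, field by field. The sketch below does this for the four example inputs. It assumes comma-separated addresses in the order number and street, suburb, state, postcode (commas have been added to the actual address for this purpose), and the function names are hypothetical.

```python
FIELDS = ("number", "street", "suburb", "state", "postcode")


def split_address(address):
    """Split 'number street, suburb, state, postcode' into named fields."""
    street_line, suburb, state, postcode = [p.strip() for p in address.split(",")]
    number, _, street = street_line.partition(" ")
    return dict(zip(FIELDS, (number, street, suburb, state, postcode)))


def differing_fields(actual, recorded):
    """Return the fields in which the recorded address departs from the actual one."""
    a, r = split_address(actual), split_address(recorded)
    return [f for f in FIELDS if a[f].lower() != r[f].lower()]


actual = "641 Main Street, Smithville, Victoria, 3333"
for recorded in (
    "640 Main St, Smithton, Victoria, 3332",
    "641 Main Road, Smithville, Tasmania, 3333",
    "641 High Street, Smithville, Victoria, 3333",
    "641 Mayne Road, Smithville, Victoria, 3000",
):
    print(recorded, "->", differing_fields(actual, recorded))
```

A plain string comparison treats ‘St’ and ‘Street’ as different, which is one reason abbreviations need to be standardised during data cleaning before matching is attempted.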
Figure 3: Typical issues associated with poor geocoding accuracy.
1. (Bonner et al, 2003; Dramowicz, 2004; Goldberg et al, 2006; Mazumdar et al, 2008; Skelly et al, 2002).
2. (Goldberg et al, 2006; Hurley et al, 2003).
3. (Davis et al, n.d.; Krieger et al, 2001; Ratcliffe, 2001).
4. (Grubesic & Matisziw, 2006).
5. (Davis et al, n.d.; Goldberg et al, 2006; Ratcliffe, 2001).
6. (Bonner et al, 2003; Christen et al, n.d.; Goldberg et al, 2006; Ratcliffe, 2001; Summerhayes et al, 2006).
7. (Davis et al, n.d.; Hurley et al, 2003; Krieger, 2003; Mazumdar et al, 2008; Ratcliffe, 2001; Summerhayes et al, 2006; Whitsel et al, 2006; Yu, 1996).
Figure 4: The Data Quality Management Model.
Data Quality Management Model (Adapted from Cassidy et al, 1998).
Table 1: Application of selected DQMM data characteristics to elements of the geocoding process (Adapted from Cassidy et al, 1998)

DATA CHARACTERISTIC: Accuracy
DEFINITION (with location example): Data are the correct values and are valid (Cassidy et al, 1998: 2 & 3). For example, a location or an address should actually exist and should correctly identify an exact geographic location to which it is associated. The geocoding process should accurately reflect this.
APPLICATION TO GEOCODING PROCESS:
Source Documents: Need to contain correct and valid address or location data; that is, the address or location should actually exist and be recorded correctly. Handwriting and abbreviations should either be avoided whenever possible or standardised to avoid ambiguity.
Data Extraction / Transposition: The data in the source documents should be extracted accurately so that a true representation of the data is transposed into the database. Quality assurance measures are used to monitor and maintain data accuracy.
Data Cleaning: This process impacts on attributable accuracy. Data cleaning provides the opportunity to optimise the accuracy of the address data in order to get a ‘perfect’ or ‘near perfect’ match in the geocoding process.
Geocoding: The resulting coordinate and boundary tagged by the geocoder needs to be an accurate representation of the original data from the source documentation. The software must utilise accurate and up-to-date reference files and all system settings should be used appropriately for optimal performance.

DATA CHARACTERISTIC: Consistency
DEFINITION (with location example): The value of the data should be reliable and the same across applications (Cassidy et al, 1998: 4 & 5). For example, an address or a location should be recorded in a standard format on every occasion and from any data source.
APPLICATION TO GEOCODING PROCESS:
Source Documents: To maintain consistency in the database, source documents need to record address data correctly, in a similar format on every occasion. This may require standardised source documents and standards by which data is recorded or presented. This may include the development and use of Data Dictionaries, standard abbreviations and standard or agreed data sources. If consistency is not maintained at this stage of the process then it is difficult to maintain elsewhere.
Data Extraction / Transposition: The procedures and systems used to extract and transpose the data need to be well conceived, clearly documented and rigorously applied to maintain consistency in the data.
Data Cleaning: Documented procedures and systems for data cleaning will ensure adherence to agreed parameters and standards.
Geocoding: The procedures for the geocoding phase should also be well developed and applied.

DATA CHARACTERISTIC: Granularity
DEFINITION (with location example): The attributes and values of data should be defined at the correct level of detail (Cassidy et al, 1998: 6 & 7). For example, an address or location should be in as much detail, or be as specific as possible. An address should consist of the appropriate elements - Number, Street, Suburb and Postcode. The resulting coordinate should link to all levels of recognised geographic boundary tags.
APPLICATION TO GEOCODING PROCESS:
Source Documents: Should provide for recording as much detail as possible regarding address or location data. This will promote accurate geocoding and will enable future analysis at finer levels of granularity. Therefore addresses should retain as much specificity as possible within the standard address format.
Data Extraction / Transposition: The detailed address or location data must be reliably extracted and transposed to maintain the value of the detail.
Data Cleaning: The cleaning process should not decrease the level of granularity within the address.
Geocoding: A greater level of granularity in the geographic boundary tags can be achieved by a high level of address detail to ensure the value of the data.
Table 2: Examples of NCIS Data Cleaning Procedures (Adapted from National Coroners Information System, 2007a)
• Street Name – remove any information that is not relevant. For example ‘dam’, ‘ocean’, ‘30 km due east’ or ‘building / place name’.
• Postcode – ensure that Northern Territory postcodes have a ‘0’ in front, as the geocoder cannot match 3 digit postcodes. Ensure all other postcodes contain 4 numerals and no other data.
• State – ensure that spelling is correct and that each state corresponds correctly to the first number of the postcode. For instance Victorian cases should have a postcode beginning with a ‘3’.
• If an address is not provided and the name of a place or building is listed instead, then the correct address needs to be determined if possible. This may require a search using the internet, local telephone directories and street directories. These places may include hospitals, nursing homes, yacht clubs or golf clubs.
• Remove all punctuation and spaces.
• Ensure that the address conforms to the approved Australian address format with all elements being applied in a consistent manner, i.e. Number and Name of Street, Place Name, State or Territory, and Postcode (Standards Australia, 1994, 2006).
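Several of the procedures in Table 2 lend themselves to simple automated checks. The sketch below is an illustration under stated assumptions, not the NCIS procedure: the function names are hypothetical, the state-to-digit lookup is a simplified rule of thumb (only the Victorian ‘3’ and the Northern Territory ‘0’ are taken from the table, and real postcode allocations have exceptions), and ‘remove all punctuation and spaces’ is interpreted here as removing punctuation and collapsing surplus spaces.

```python
import re

# Simplified first-digit lookup; an assumption for illustration only.
EXPECTED_FIRST_DIGIT = {
    "NT": "0", "NSW": "2", "VIC": "3", "QLD": "4", "SA": "5", "WA": "6", "TAS": "7",
}


def clean_postcode(raw):
    """Keep digits only and left-pad three-digit (NT) postcodes with '0'."""
    digits = re.sub(r"\D", "", raw)
    if len(digits) == 3:
        digits = "0" + digits
    return digits if len(digits) == 4 else None


def state_matches_postcode(state, postcode):
    """True when the postcode's first digit is the one expected for the state."""
    expected = EXPECTED_FIRST_DIGIT.get(state.upper())
    return expected is not None and postcode is not None and postcode.startswith(expected)


def clean_street(raw):
    """Drop punctuation and collapse repeated whitespace in a street line."""
    no_punct = re.sub(r"[^\w\s]", "", raw)
    return re.sub(r"\s+", " ", no_punct).strip()


print(clean_postcode("800"))
print(state_matches_postcode("Vic", clean_postcode(" 3086 ")))
print(clean_street("12  St. Kilda Rd.,"))
```

Checks of this kind can flag records for manual review, but steps such as finding the street address of a named hospital or golf club still require a person.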
Table 3: ‘Match Results’ achieved under each test condition

TEST CONDITION        LEVEL OF MATCH        %         No.
All Conditions        Perfect Match         5.63%     n=45
                      Near Perfect Match    60.50%    n=484
                      Poor Match            25.63%    n=205
                      No Match              8.25%     n=66
Clean Data            Perfect Match         7.75%     n=31
                      Near Perfect Match    63.75%    n=255
                      Poor Match            21.75%    n=87
                      No Match              6.75%     n=27
Unclean Data          Perfect Match         3.50%     n=14
                      Near Perfect Match    57.25%    n=229
                      Poor Match            29.50%    n=118
                      No Match              9.75%     n=39
Delimiters Set        Perfect Match         6.00%     n=24
                      Near Perfect Match    70.75%    n=283
                      Poor Match            18.50%    n=74
                      No Match              4.75%     n=19
Delimiters Not Set    Perfect Match         5.25%     n=21
                      Near Perfect Match    50.25%    n=201
                      Poor Match            32.75%    n=131
                      No Match              11.75%    n=47
‘Match Results’ refer to the ‘Result Codes’ produced by the MapData Sciences QuickLocate Geocoder. These codes describe how well the software was able to match the input data to the address reference file.
‘Perfect Match’: Refers to all codes that describe a “perfect building-level match”
‘Near Perfect Match’: Refers to all codes that describe either a “building-level match with a minor issue such as spelling, street type or building number” or a “street-level match with an issue such as spelling, street type, suburb or postcode”
‘Poor Match’: Refers to all codes that describe a “suburb or postcode-level match with major issues relating to the street name, suburb or postcode”
‘No Match’: Refers to all codes that describe a “failure to match due to fatal issues with the data such as street name, suburb or postcode either not being supplied or not being found”
Figure 5: Requests for ‘location’ based searches of the NCIS database
Other: Refers to search requests that included a request by location at other levels of granularity including Urban, Rural, Coastal etc
Locality: Refers to search requests that included a request by location at the granularity of SLA, Suburb, Shire or Postcode
State: Refers to search requests that included a request by location at the granularity of State and/or Territory
Figure 6: The effect on geocoding output from Cleaned versus Uncleaned Input data and the use of software ‘Delimiters’.
Figure 7: The effect on geocoding output from Cleaned versus Uncleaned Input data regardless of whether ‘Delimiters’ were used.
Figure 8: The effect on SLA assigned from Cleaned versus Uncleaned Input data regardless of whether ‘Delimiters’ were used.
REFERENCES
Abdelhak, M., Grostick, S., Hanken, M. A. and Jacobs, E. (2011). Health information: Management of a strategic resource (4th ed). Philadelphia, Saunders.
Ainsworth, S. and Van Gaans, G. (2003). Spatially Enabled Business Intelligence for the Health Industry. Melbourne, DHS (SA) & ESRI Australia.
Coroners Act, 1985 (Vic).
Information Privacy Act, 2000 (Vic).
Australian Bureau of Statistics (2010). Australian Standard Geographical Classification (ASGC). Canberra, Australian Bureau of Statistics.
Australian Institute of Health and Welfare (2010). Australia's Health 2010. Canberra, Australian Institute of Health and Welfare.
Australian Transport Safety Bureau (2008). Fatal Road Crash Database. Australian Transport Safety Bureau. Accessed 17th June 2008,
Baum, S., Kendall, E., Muenchberger, H., Gudes, O. and Yigitcanlar, T. (2010). Geographical information systems: An effective planning and decision-making platform for community health coalitions in Australia. Health Information Management Journal 39 (3): 28-33.
Blanchfield, F. (2003). Geocoding - Mesh Blocks, the Base Unit. ABS - Statistical Update (Queensland Office) 15: 2-3.
Bonita, R., Beaglehole, R. and Kjellstrom, T. (2006). Basic Epidemiology (2nd ed). Geneva, World Health Organisation.
Bonner, M. R., Han, D., Nie, J., Rogerson, P., Vena, J. E. and Freudenheim, J. L. (2003). Positional Accuracy of Geocoded Addresses in Epidemiologic Research. Epidemiology 14 (4): 408-412.
Bowker, G. C. and Star, S. L. (2000). Sorting things out: Classification and its consequences. Massachusetts, MIT Press.
Braden, J. H., Demster, B., Grant, K. G., Just, B. H., White, M. and Wisdom, T. (2007). HIM Principles in Health Information Exchange (AHIMA practice brief): Appendix - Data Quality Attributes Grid. Chicago, American Health Information Management Association.
Cassidy, B., Fenton, S., Fletcher, D. M., Koch, D., Stewart, M., Watzlaf, V. and Willner, S. (1998). AHIMA Practice brief: Data Quality Management Model. Chicago, American Health Information Management Association.
Christen, P., Churches, T. and Willmore, A. (n.d.). A Probabilistic Geocoding System based on a National Address File. North Sydney, Centre for Epidemiology and Research, New South Wales Department of Health.
Davis, C. A., Fonseca, F. T. and Borges, K. A. D. V. (n.d.). A Flexible Addressing System for Approximate Geocoding. Belo Horizonte, Brazil, Prodabel.
Department of Infrastructure and Transport (2009). Road Deaths in Australia: 2008 Statistical summary. Canberra.
Dominguez, G. (2002). Effective Enterprise Information Systems for the Health Industry: Geography is the key to unlocking the data. Melbourne, ESRI Australia.
Donnelly, N., Poynton, S., Weatherburn, D., Bamford, E. and Nottage, J. (2006). Liquor outlet concentrations and alcohol-related neighbourhood problems. Sydney, NSW Bureau of Crime Statistics and Research.
Dramowicz, E. (2004). Three Standard Geocoding Methods. Directions Magazine. Accessed 22nd May 2008,
Driscoll, T., Henley, G. and Harrison, J. (2003). The National Coroners Information System as an information tool for injury surveillance. Canberra, Australian Institute of Health and Welfare.
Geometry Propriety Limited (n.d.). Geocoded National Address File. Hobart, Geometry Pty Ltd.
Goldberg, D. W., Wilson, J. P. and Knoblock, C. A. (2006). From Text to Geographic Coordinates: The Current State of Geocoding. URISA. Accessed 6th May 2008,
Grubesic, T. H. and Matisziw, T. C. (2006). On the use of ZIP codes and ZIP code tabulation areas (ZCTAs) for the spatial analysis of epidemiological data. International Journal of Health Geographics 5 (58): 1-15.
Han, H. J., Sunderland, N., Kendall, E., Gudes, O. and Henniker, G. (2010). Chronic disease, geographic location and socioeconomic disadvantage as obstacles to equitable access to e-health. Health Information Management Journal 39 (2): 30-36.
Hippocrates (c. 400 BC). Corpus Hippocraticum.
Hurley, S. E., Saunders, T. M., Nivas, R., Hertz, A. and Reynolds, P. (2003). Post Office Box Addresses: A Challenge for Geographic Information System-Based Studies. Epidemiology 14 (4): 386-391.
Khan, O. A. (2003). Geographic Information Systems and Health Applications. Hershey, Pennsylvania, Idea Group Publishing.
Kreisfeld, R. and Harrison, J. E. (2007). Use of multiple causes of death data for identifying and reporting injury mortality. Canberra, Australian Institute of Health and Welfare.
Krieger, N. (2003). Place, Space and Health: GIS and Epidemiology. Epidemiology 14 (4): 384-385.
Krieger, N., Waterman, P., Lemieux, K., Zierler, S. and Hogan, J. W. (2001). On the Wrong Side of the Tracts? Evaluating the Accuracy of Geocoding in Public Health Research. American Journal of Public Health 91 (7): 1114-1116.
Kruger, D. J., Brady, J. S. and Shirley, L. A. (2008). Using GIS to facilitate community-based Public Health Planning of Diabetes Intervention Efforts. Health Promotion Practice 9 (1): 76-81.
Kruger, E., Tennant, M. and George, R. (2011). Application of geographic information systems to the analysis of private dental practices distribution in Western Australia. The International Electronic Journal of Rural and Remote Health Research, Education, Practice and Policy: Rural and Remote Health 11 (1736): 1-9.
Lake, S. (2002). Applications for Spatial Technology in the Health Industry: Using Geography to link Patients, Illness and Healthcare Facilities. Melbourne, ESRI Australia.
Mapdata Sciences (2008). QuickLocate 3 Desktop Geocoder User Manual. Greenwich NSW, MapData Sciences.
Mazumdar, S., Rushton, G., Smith, B. J., Zimmerman, D. L. and Donham, K. J. (2008). Geocoding accuracy and the recovery of relationships between environmental exposures and health. International Journal of Health Geographics 7 (13): 1-18.
McCloskey, D. (2007). Geographic information systems and health: Key applications of place for health analytics. Australasian Epidemiologist 14 (1): 16-19.
McKenzie, K., Cheng, L. and Walker, S. M. (2009). Correlates of undefined cause of injury coded mortality data in Australia. Health Information Management Journal 38 (1): 8-14.
National Coroners Information System (2007a). National Coroners Information System: Business Plan 2006/2007. Melbourne, Victorian Institute of Forensic Medicine.
National Coroners Information System (2007b). National Coroners Information System: Geocoding Guidelines - March 2007. Melbourne, Victorian Institute of Forensic Medicine.
National Public Health Partnership (2005). The national injury prevention and safety promotion plan: 2004-14. Canberra, National Public Health Partnership.
Ratcliffe, J. H. (2001). On the accuracy of TIGER-type geocoded address data in relation to cadastral and census areal units. International Journal of Geographical Information Science 15 (5): 473-485.
Roberts, R., Robinson, K. and Williamson, D. (2002), in Gardner, H. and Barraclough, S. (2nd ed). Health Information Policy. South Melbourne, Oxford University Press.
Rothman, K. J. (1996). Lessons From John Graunt. The Lancet 347 (8993): 37-39.
Simpson, D. S., Roberts, T., Walker, C., Cooper, K. D. and O'Brien, F. (2005). Using statistical process control (SPC) chart techniques to support data quality and information proficiency: The underpinning structure of high-quality health care. Quality in Primary Care 13 (1): 37-43.
Skelly, C., Black, W., Hearnden, M., Eyles, R. and Weinstein, P. (2002). Disease Surveillance in Rural Communities is Compromised by Address Geocoding Uncertainty: A Case Study of Campylobacteriosis. Australian Journal of Rural Health 10: 87-93.
Snow, J. (1855). On the mode of communication of Cholera (2nd ed). London, John Churchill - New Burlington Street.
Soo, I. H.-Y., Lam, M. K., Rust, J. and Madden, R. (2009). Do we have enough information? How ICD-10-AM Activity codes measure up. Health Information Management Journal 38 (1): 22-34.
Standards Australia (1994). AS4212-1994 (Superseded): Geographic Information Systems - Data dictionary for transfer of street addressing information. Sydney, Standards Australia.
Standards Australia (2006). AS4590-2006: Interchange of Client Information. Sydney, Standards Australia.
Summerhayes, R., Holder, P., Beard, J., Morgan, G., Christen, P., Willmore, A. and Churches, T. (2006). Automated Geocoding of Routinely Collected Health Data in New South Wales. Sydney, New South Wales Department of Health.
Vine, M. F., Degnan, D. and Hanchette, C. (1997). Geographic Information Systems: Their Use in Environmental Epidemiologic Research. Environmental Health Perspectives 105 (6): 598-605.
Whitsel, E. A., Quibrera, P. M., Smith, R. L., Catallier, D. J., Liao, D., Henley, A. C. and Heiss, G. (2006). Accuracy of commercial geocoding: Assessment and implications. Epidemiological Perspectives & Innovations 3 (8): 1-12.
World Health Organisation (2003). Improving Data Quality: A Guide for Developing Countries. Manila, World Health Organisation.
Yu, L. (1996). Development and Evaluation of a Framework for Assessing the Efficiency and Accuracy of Street Address Geocoding Strategies. PhD, State University of New York at Albany.