
THE UTILITY OF GEOSPATIAL DATA AND INFORMATION USED IN GEOGRAPHIC INFORMATION SYSTEMS (GIS): AN EXPLORATORY STUDY INTO THE FACTORS THAT CONTRIBUTE TO GEOSPATIAL INFORMATION UTILITY

By William L. Meeks

Bachelor of Science, United States Naval Academy, Annapolis, Maryland, 1976
Master of Business Administration, The George Washington University, Washington, DC, 1997

A Dissertation Submitted to The Faculty of the School of Business of The George Washington University in partial fulfillment of the requirements for the degree of Doctor of Philosophy

Dissertation directed by: Subhasish Dasgupta, Associate Professor Department of Information Systems and Technology Management

UMI Number: 3291997

UMI Microform 3291997 Copyright 2008 by ProQuest Information and Learning Company. All rights reserved. This microform edition is protected against unauthorized copying under Title 17, United States Code.

ProQuest Information and Learning Company 300 North Zeeb Road P.O. Box 1346 Ann Arbor, MI 48106-1346

The School of Business of The George Washington University certifies that William L. Meeks has passed the Final Examination for the degree of Doctor of Philosophy as of 10 May 2007. This is the final and approved form of the dissertation.

THE UTILITY OF GEOSPATIAL DATA AND INFORMATION USED IN GEOGRAPHIC INFORMATION SYSTEMS (GIS): AN EXPLORATORY STUDY INTO THE FACTORS THAT CONTRIBUTE TO GEOSPATIAL INFORMATION UTILITY

William L. Meeks

Dissertation Research Committee:
Subhasish Dasgupta, Associate Professor of Information Systems and Technology Management, Dissertation Director
Mary J. Granger, Professor of Information Systems and Technology Management, Committee Member
Srinivas Y. Prasad, Associate Professor of Decision Sciences, Committee Member


© Copyright 2007 by William L. Meeks All rights reserved


Dedication

To my wife Maxine, who has endured more lectures about geographic information systems and geospatial data than she would ever expect in a dozen lifetimes, but who lovingly supported me all the way through this journey; to my parents, Tom and Jacquie Meeks, and my brothers Don Meeks and Tom Marshall and their respective families; to my oldest daughters Katie and Megan, and their respective families (including my three granddaughters); to my youngest daughters Emma and Avery, who wonder what all the fuss is about; and to my many close and distant friends too numerous to mention, all of whom stood by me through thick and thin and never wavered in their support, encouragement, and love, do I wholeheartedly dedicate this work. Of mention also are Joe and Claudine Donovan, Maxine’s parents, who relentlessly hounded me to finish and finish well, more than anyone else. I also dedicate this work to them. I completely “get” that ‘no man is an island’ and that ‘it takes a village to raise a child’ (this last is relevant because many say I have not grown up yet, though I have already passed my 5th decade); so I publicly acknowledge that my successes over the years are due in large part to the loving support of those who stand with me on the bulwarks and keep the lions at bay.

I love my life. I love my family.


Acknowledgements

I thank all the members of my committee for all their guidance, support, and “tough love” throughout this research, and more importantly, throughout my broader scholarly journey. I especially thank Dr. Subhasish Dasgupta for his support. Shuba has been much more than a PhD advocate; he has truly been a mentor in the fullest measure of the term. I will owe him an eternal debt of gratitude for his guidance, patience, support, and humor. I am honored to now call him a peer; though, I admit I will always be “Grasshopper” to his “Master” (from the old television show, “Kung Fu”). I look forward to a long and fruitful collaborative career together: publishing, speaking, and thinking great thoughts, advancing many different aspects of the information systems field.

I also thank the other examining committee members for their individual contributions to my learning and success; each contributed in different but complementary ways, in the right form, at the right time. Thanks to Dr. Mary Granger for her no-nonsense style. Thanks to Dr. Srinivas Prasad for his dispassionate queries and gentle prodding. Thanks to Dr. Meliha Haddad for her fresh insights. Thanks to Dr. Refik Soyer for his calm demeanor as the committee chair. And, thanks to Dr. Erik Winslow for his guiding hand over the years, during both my MBA days and my PhD journey. He has inspired me for a long time. Finally, this dissertation could not have been completed without the patient and enduring support of Ms. Elizabeth Huff in the doctoral office; I owe her a debt of humility and grace. To the entire School of Business community, thanks for the ride!


Abstract of Dissertation

The acquisition and use of data and information in information systems, particularly geographic information systems (GIS), can be costly in terms of time, money, and intellectual capital expended. In 1738, Bernoulli wrote of the importance of considering utility over price in determining a good’s value. In the geospatial information field, it is customary to evaluate data quality and data accuracy; however, until now little attention has been paid to systematic utility assessments of geospatial data and information ingested into or outputted from GIS. A geospatial information utility assessment methodology is needed for two reasons: to improve the efficiency of GIS resources applied to geospatial data acquisition activities and to increase the effectiveness of GIS users by providing them a method for measuring the utility of their GIS-based analyses. There is a plethora of literature on data and information quality, and on generalized models for utility; by contrast, literature on information utility is scarce. This exploratory research therefore develops a validated survey instrument that uncovers users’ perceptions of the factors that make up the utility of information used in GIS. In the utility literature, abstract definitions of utility abound, as do methods (functions) for using utility; however, the components or factors of utility are not addressed. This research treats utility as a second-order construct. A research model that defines geospatial information utility in terms of quality (i.e., attribute ‘goodness’) and context (i.e., attributes in use) factors is presented. The research is conducted via a pilot study and a main study to validate the model and to refine the instrument. The data provide the correlations of the named quality and context factors.


Table of Contents

Dedication .......... iv
Acknowledgements .......... v
Abstract of Dissertation .......... vi
List of Figures .......... ix
List of Tables .......... x
Chapter 1: Introduction .......... 1
    Overview .......... 1
    Theoretical Perspectives .......... 8
    Statement of Purpose .......... 12
    Organization of the Document .......... 16
Chapter 2: Theoretical Foundation .......... 18
    Overview .......... 18
    The Relationship between Data and Information .......... 21
    The Concept of Information Quality .......... 25
    The Concept of Utility .......... 34
    The Concept of Information Utility .......... 39
    Quality and Utility Issues Pertaining to Geospatial Data and Information .......... 41
    Other Terminology and Related Concepts .......... 46
    Summary of Literature Review .......... 58
Chapter 3: Methodology .......... 60
    Overview .......... 60
    Basic Research Model .......... 62
    Developing a Candidate Factor Set for the Analysis .......... 67
    Research Design and Methods .......... 73
    Survey Design and Measurement of Preference Data .......... 74
    Data Collection Approach .......... 76
    Data Analysis Approach .......... 77
    Overview of Respondent Sources and the Sample Pool .......... 79
    Validity Issues and Concerns .......... 81
    Expected Results .......... 84
    Limitations .......... 86
Chapter 4: Data Analysis and Results .......... 87
    Overview .......... 87
    Discussion of Pilot Study .......... 88
    Characterization of Main Study Sample Population .......... 89
    Overview of the Main Study .......... 96
    Discussion of Main Study Results .......... 97
    Discussion of Instrument Validation: Validity and Reliability .......... 108
Chapter 5: Conclusions, Implications, and Recommendations .......... 114
    Overview .......... 114
    Discussion of Research Results .......... 114
    Benefits and Implications of this Study .......... 117
    Limitations of this Study .......... 118
    Recommendations for Future Research .......... 120
References .......... 122


List of Figures

2.1 The Path to Studying Data and Information Utility in GIS .......... 20
2.2 Analysis Levels of Information Transmission .......... 22
2.3 English’s Model of Relationships between Data and Information .......... 24
2.4 Information Pyramid .......... 47
3.1 General Research Model for Exploring the Utility of Geospatial Data .......... 62
3.2 Conventions for Labeling the General Research Model .......... 85
5.1 Assessed Research Model with Correlated Quality- and Context-based Factors .......... 115

List of Tables

3.1 Candidate Utility Factors and Their Sources .......... 68
3.2 Meeks and Dasgupta Modified Factor Set (2001) .......... 71
3.3 A Preliminary Estimate of the Binning of the Initial Candidate Factor Set into Context- or Quality-based Factors .......... 84
4.1 Summary Statistics about the Respondent Pool .......... 92
4.2 Distribution of Respondents’ Ages versus Years of Experience with GIS .......... 95
4.3 Summary of Responses to Four Experience Questions About Cancelled or Modified GIS Analyses .......... 95
4.4 Labeling of Likert Scale Response Categories .......... 96
4.5 Rotated Factor Pattern for 37-item Instrument .......... 101
4.6 Communalities of the 37-item Instrument .......... 104
4.7 Corrected Correlation Matrix for 20-item Instrument .......... 107

Chapter 1: Introduction

Overview

Geographic information systems (GIS) are a class of information systems (IS) devoted to answering a wide range of place-, space-, and time-oriented questions. Pressman (1997) describes information systems as being comprised of software, hardware, people, databases, documentation, and procedures, though this description is based upon a software-engineering point of view.

Buede (2000), writing more generically about systems, defines them to be “a set of components (subsystems, segments) acting together to achieve a set of common objectives via the accomplishment of a set of tasks.” He also specifically defines components such as hardware, software, other physical entities, and humans as working together as designed.

Zwass (1998) describes management information systems (MIS) as being comprised of data input devices; data from internal and/or external sources; processors working on data based on analytical requirements, algorithms, and procedures; storage devices and media; and output devices for information presentation or visualization.

Based on the above and the relationship between IS and GIS, in this research GIS are defined as follows:

GIS = f {Software, Hardware, Operators, Tools, Procedures, Data, Analytical Requirements}

This functional view should generically be interpreted to mean GIS systems—like all other IS—are comprised of operating systems and application software hosted on one or more hardware suites or platforms, able to support analyses by operators using software and non-software tools and procedures on primary and secondary source data, when driven by analytical requirements and constraints.

Most GIS also contain functionality to enable automated or semi-automated connectivity from the system to remote data sources containing multiple data types and formats. GIS differ from other types of IS primarily in the details of their data sources and types; their tools, procedures, and operator capabilities and training; and the analytical requirements that drive their use, which always have at their core spatial, geographic, or spatiotemporal (i.e., space, place, or space/place and time) components.

There are many operative and valid definitions for GIS. The old aphorism, ‘to a carpenter with a hammer, every problem looks like a nail’, is useful in considering the multi-perspective views of Burrough and McDonnell (1998), which state that we can take any or all of a tool-box, database, or organizational view of the use and utility of GIS. GIS can be viewed mechanistically, in terms of how we use them (i.e., the toolbox view); what we do with them (i.e., the database view); or how we interact with them and the results of their processes (i.e., the organizational view). Given the ‘carpenter with a hammer’ aphorism and Burrough and McDonnell’s views of GIS, GIS managers and practitioners are able to avoid the ‘every problem looks the same’ trap.

Though similar to all information systems in principle, GIS differ from other IS because of the types of data that they are able to input, process, and output. Whereas most IS handle structured (and sometimes unstructured) textual and numeric information, GIS handle these data plus a broader array of data types and formats to aid in spatial manipulation and graphical representation of data inputs and outputs. In addition to handling textual and numeric data, which are principally used in GIS to add semantic attribute richness to spatial placement of data content, the main GIS-specific data types include raster and vector data. These two data types represent fundamentally different ways to (1) represent spatial (primarily earth-surface or land cover) features in computer databases and (2) conceptualize how to represent the surface of the earth. Couclelis (1992) posits that raster data treat the earth’s surface as a jigsaw puzzle of different represented surface features. This representation is essentially one feature deep. Vector data, on the other hand, are able to represent the earth’s surface as stackable layers of multiple features. To illustrate the difference, it may be useful to consider a common example. A paper road map typically shows four, five, or six colors to represent standard features: green for vegetation, black or red for roads (depending on road size), blue for water (rivers, streams, lakes, etc.), black for labeling, yellow for areas representing cities and towns, and sometimes brown for topographical relief (i.e., contour lines of constant elevation for hills and depressions). The limitation of the medium is that the surface of the paper can only depict one color for any given space on the map (hence, the ‘jigsaw puzzle’ metaphor). However, in reality features ‘stack’ on top of one another. For example, a roadbed sits on top of the earth, and may cut through a forest. Where the road crosses a river, the roadbed sits on top of a bridge structure, which is anchored to the shore, and which also sits on ‘top of’ the stream beneath it; the stream even sits on top of the earth (streambed) beneath it.
Interestingly, each of these ‘stacked’ features can and often does contain attribute information (e.g., for the stream mentioned earlier: stream width, depth, average current speed, bank height and slope, bottom surface, and fordability) within associated database tables that the GIS appropriately associates with the feature under review. Couclelis (1992) calls vector data a “club sandwich of data layers”; ergo, vector data models represent the earth more realistically. So, where a raster is more pleasing to view and intrinsically easier to interpret, a vector model may contain more information. It is important to note that this is not to say that one data type is inherently better than the other. GIS capabilities and users’ analytical needs determine the best data type at any particular time for any particular analysis.

Raster data are formed by creating a standard grid surface (e.g., squares, rectangles, hexagons, etc.) where the grid cells are nominally coded to be of one type or another. There are several different coding schemes, each with its own benefits and limitations (Burrough & Frank, 1996; Burrough & McDonnell, 1998; Lillesand & Kiefer, 2000; Longley, Goodchild, Maguire, & Rhind, 2001). Pixel (picture element) arrays are examples of rasters. For example, a raster display could show land cover as vegetation of a nominally standardized type, water areas color-coded by depth, or areas of man-made development (as opposed to natural topography and vegetation). Despite positional and attribute errors induced by coding, there are many uses for raster data. Since raster displays more closely resemble ‘traditional’ map displays (often created by digitally scanning previously prepared maps), they are often easier for analysts, experts, and lay people alike to relate to. Rasters are often used as map background displays that allow other graphical, textual, or vector data to be overlaid upon them.

When considering the differences between geographical or spatial ‘objects of interest’ displayed in either raster or vector form, whether they are areas (e.g., fields), lines (e.g., roads), or points (e.g., cell phone towers), rasters tend to be best at representing area objects (Longley, Goodchild, Maguire, & Rhind, 2001).

Vector data are an alternative data structure/data type. Vectors are points and line segments, often connected together to form geometric shapes (the term polygon is frequently used to describe vector-based area shapes). To make vectors conform more closely to ground truth, the line segments necessarily become increasingly shorter, with the endpoints of the segments conforming as closely as possible to a trace of the boundary of the shape (e.g., a stand of trees of a particular type, a body of water, etc.). As with raster coding schemes, errors are very often induced in the coding process (often called “digitizing”) when real-world physical objects are represented in vector model form (Decker, 2001; Raper, 2000).
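To make the raster/vector contrast concrete, the following minimal sketch represents the same hypothetical scene both ways. The class codes, coordinates, and attribute values are invented for illustration; production GIS use far richer encodings.

```python
# Raster model: a grid of cells, each nominally coded with exactly one
# feature class ("one feature deep", like the jigsaw-puzzle paper map).
VEGETATION, WATER, ROAD = 1, 2, 3
raster = [
    [VEGETATION, VEGETATION, WATER],
    [VEGETATION, ROAD,       WATER],
    [ROAD,       ROAD,       WATER],
]

# Vector model: each feature is its own geometry, so features can
# overlap ("stack"), and each carries its own attribute record in an
# associated table.
vector_layers = [
    {"type": "polygon",
     "geometry": [(0.0, 0.0), (3.0, 0.0), (3.0, 3.0), (0.0, 3.0)],
     "attributes": {"class": "vegetation"}},
    {"type": "line",
     "geometry": [(0.5, 0.0), (2.5, 3.0)],
     "attributes": {"class": "road", "width_m": 7.5}},
]

# The raster answers "what single class occupies this cell?"; the
# vector answers "which features pass through here, with what attributes?"
print(raster[1][1])                    # 3 (the ROAD code wins the cell)
print(vector_layers[1]["attributes"])  # the road's full attribute record
```

Note how the road and the vegetation polygon coexist in the vector model, whereas each raster cell forced a single choice.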

Other forms of grid data include elevation data recorded at some regular or irregular “post spacing” intervals (i.e., tabular arrays). Elevation data enable GIS to display terrain maps in three dimensions by allowing either raster or vector data to be “draped” over the elevation grid. Elevation data are often represented in databases as digital elevation models (DEM) or as elevation layers in digital terrain data (DTD) (Campbell, 2002; Lillesand & Kiefer, 2000).
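As a hedged illustration of how such an elevation grid is queried when other data are “draped” over it, the following sketch bilinearly interpolates between hypothetical posts on a regular 10 m spacing. Real DEM handling involves datums, projections, and edge cases omitted here.

```python
# Hypothetical 3x3 elevation grid (metres); posts every 10 m.
dem = [
    [100.0, 110.0, 120.0],
    [105.0, 115.0, 125.0],
    [110.0, 120.0, 130.0],
]
POST_SPACING = 10.0  # metres between elevation posts

def elevation_at(x, y):
    """Bilinear interpolation between the four posts surrounding (x, y)."""
    col, row = x / POST_SPACING, y / POST_SPACING
    c0 = min(int(col), len(dem[0]) - 2)  # clamp so c0+1 stays in range
    r0 = min(int(row), len(dem) - 2)
    fc, fr = col - c0, row - r0          # fractional position in the cell
    top = dem[r0][c0] * (1 - fc) + dem[r0][c0 + 1] * fc
    bot = dem[r0 + 1][c0] * (1 - fc) + dem[r0 + 1][c0 + 1] * fc
    return top * (1 - fr) + bot * fr

print(elevation_at(0.0, 0.0))  # 100.0 (exactly on a post)
print(elevation_at(5.0, 5.0))  # 107.5 (midpoint of four posts)
```

A draping operation simply evaluates this kind of lookup for every raster cell or vector vertex being rendered.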


In the geospatial information field, it is customary to evaluate data quality and data accuracy (Burrough & McDonnell, 1998; Congalton & Green, 1999; Congalton & Plourde, 2002; Longley, Goodchild, Maguire, & Rhind, 2001); however, until now little attention has been paid to assessing the utility of the geospatial data and information ingested[1] into or outputted from GIS. Systematic geospatial information utility assessment methodologies are needed for two reasons: to improve the efficiency of GIS resources applied to data acquisition activities and to increase the effectiveness of GIS users by providing them a method for considering how to increase the utility of their GIS-based analyses.

As described in Meeks and Dasgupta (2004, 2005), the genesis of early geospatial information utility research was a focus on the data-centric aspects of improving organizational decision making through examining the data quality and utility issues that support decision making. Burgess, Gray, and Fiddian (2007) use a similar algorithmic approach and a similar treatment of relevant factors.

This data orientation does not supplant other views of the use or utility of GIS analyses, but rather provides a context for understanding the limitations of ‘good’ or ‘bad’ data and information used in GIS, where ‘good’ and ‘bad’ have many different context-specific meanings. Some authors and practitioners see these comparisons as based on absolute, or externally derived and validated, scales, whereas others (Nebert, 2001; Reichardt, 2001; VanDyke, 2001) note that distinctions between ‘good’ and ‘bad’ data or information are relative, dynamic, and highly correlated to the analysis-specific constructs used to guide GIS use and GIS-based decision making.

[1] In the GIS and geosciences fields, “ingest” is the more common verb for describing, from the point of view of the system, the act of inputting data into the system (GIS, etc.); i.e., more commonly, the system ingests the data rather than the operator inputting it. There is no definitive source for this usage, though it is very common; most likely it arose because remotely sensed image data and geospatial map datasets are comparatively larger than the datasets ‘average’ workstation users handle within their systems and applications, so the system more often performs the act of ingesting (versus inputting) the data autonomously or semi-autonomously.

A comment on selected terminology commonly used within the geospatial sciences field: in the context of this research, data and information are often referred to as being ‘processed’ within GIS as shorthand for refining, focusing, or otherwise providing clearer meaning. Harvey (2002) provides the best and clearest description of this concept in Bossler (2002b):

“In many ways, processing geospatial data is analogous to processing wood. Both data and wood are raw materials for a vast variety of products. Processing transforms materials into products. In the carpenter’s case, the raw material is the diverse types of wood available and the tools are the sanders, saws, drills, and planers. Spatial data processing begins with data material and uses a vast array of Geographic Information System (GIS) tools to transform data into various products” (F. J. Harvey, 2002, p. 450).

This research uses the convention found in Bossler (2002b) of broadly using the term ‘geosciences fields’ to comprise all of: remote sensing (i.e., data collection from a distance), data geolocation (e.g., as exemplified by the common and standard technologies involved in the Global Positioning System (GPS)), and spatial data analysis and presentation (i.e., as typified by the use of GIS or GIS-like mapping technologies and tools). This is a context statement only, as the scope of this research is limited to data quality and utility issues endemic to the use of GIS technologies. A related topic is geostatistics. A discussion of geostatistics concepts is also out of scope for this research; however, it is acknowledged that geostatistics exerts an ever greater influence on trends and developments involving data ‘processing’ within GIS, as many GIS analysis tools rely on geostatistical methods.

Theoretical Perspectives

The use of GIS and a corresponding use of spatial and geographic analytical tools in non-GIS applications (e.g., mapping capabilities found in broad-based applications such as Microsoft Excel™) continues to rise:

“For GIS, the faster speeds have allowed much more refined databases, analysis, modeling, visualization, mapping features, and user interfaces. GIS applications and its user base grew rapidly in the 1990s and the early 21st century. It has become connected with global positioning systems, the Internet, and mobile technologies. With multiplying applications, it continues to find new uses every year.” (Pick, 2005, p. x)

Where previously GIS users ingested and analyzed earth science data for specialized earth sciences purposes, with the growing recognition that as much as 80% of all data contain spatial information content in some form (Bossler, 2002a; Longley, Goodchild, Maguire, & Rhind, 2001), the use of GIS is moving into the mainstream in organizations. With new emphasis on systematizing decision making (Clemen & Reilly, 2001; Hoch & Kunreuther, 2001), there is an awareness that analysis of the spatial and temporal content in business data improves organizational decision making.

The quality of GIS use—to users, analysts, and decision makers—is quite frequently considered on the basis of the attractiveness and general usability of visual presentations (Burrough & McDonnell, 1998). However, there is more to developing substantive results from MIS and GIS than delivering the final information presentation. Further, acquiring data is costly (Lawrence, 1999). Therefore, one critical issue for MIS is evaluating the quality of input data used to support analyses and decision making. With rapid growth in the availability of data ingested into GIS, users of GIS at all levels of sophistication and experience face the data quality issue. And, though not explicitly addressed in the literature, beyond the data quality issue lies another: if a user must learn something about the quality of the data he or she would obtain, then perhaps some aspect of a quality assessment should be based on how useful those data will be in use. In GIS, data quality is most often considered consistent or synonymous with data accuracy and data resolution (Congalton & Green, 1999; Congalton & Plourde, 2002). Accuracy is comprised of content accuracy and locational accuracy. Resolution pertains to a system’s ability to resolve detail (e.g., the ability to discriminate fine spatial, spectral, or radiometric detail) (Campbell, 2002).
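To make the two accuracy components concrete, here is a minimal, hypothetical sketch; the checkpoint coordinates and class labels are invented for illustration, and real assessments follow the error-matrix and checkpoint procedures described in the works cited above.

```python
import math

# Locational accuracy: root-mean-square error between surveyed ("true")
# checkpoint positions and the same points as recorded in the dataset.
checkpoints = [((100.0, 200.0), (101.0, 199.0)),
               ((300.0, 400.0), (299.0, 402.0))]   # (true_xy, mapped_xy)
rmse = math.sqrt(sum((tx - mx) ** 2 + (ty - my) ** 2
                     for (tx, ty), (mx, my) in checkpoints) / len(checkpoints))

# Content accuracy: the fraction of sampled cells whose mapped class
# matches the ground-truth class (the diagonal of an error matrix).
truth  = ["water", "road", "water", "veg", "veg"]
mapped = ["water", "road", "veg",   "veg", "veg"]
overall_accuracy = sum(t == m for t, m in zip(truth, mapped)) / len(truth)

print(round(rmse, 3), overall_accuracy)  # 1.871 0.8
```

Both numbers are measured against fixed external standards (surveyed points, ground truth), which is precisely what limits their reach when the question is usefulness rather than correctness.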


Therefore, data quality and utility assessments are both important. Spatial data quality assessments are normally conducted as data accuracy assessments (Congalton & Green, 1999; Congalton & Plourde, 2002). However, data accuracy assessments are limited in scope. Where data accuracy focuses on assessing data against a fixed standard, information utility focuses on the relevance of data and information to users, on a use-by-use basis, weighing what is needed to effectively solve an information “problem” (i.e., a GIS-based analysis, in this context) against what is reasonably available within all resource constraints. Currently there exists no comprehensive and systematic way for users of GIS to assess the utility of their source data with respect to their planned analyses. This is a topical area ripe for exploration (Bruin, Bregt, & Ven, 2001) and remains the driving basis for this research stream.

As mentioned, the most common proxy for geospatial information utility remains data quality (in its many forms). Users have been trained for years that assessing data quality is sufficient to determine the utility of source data within their GIS. However, a mismatch commonly arises between the quality of data available for use and the quality of data needed. This mismatch may result whenever available data are of either higher or lower quality than that needed. Both cases impact users and managers because of the cost to acquire and use data and information.

Two papers (Meeks & Dasgupta, 2004, 2005) explore a data-centric view of GIS use and develop two themes:


(1) Data and information are used in ever more innovative and creative ways, driving the need for rigor and structure when considering the difference between quality and utility as they apply to IS- or GIS-based or -supported analyses.

(2) Geospatial data types (e.g., vector, raster, and grid) and sources (e.g., remotely sensed data), once the exclusive realm of the geosciences fields, are migrating to more mainstream uses to support organizational decision making.

The caution for managers and analysts when using data types and sources with which they are unfamiliar, or with which they have grown overly familiar, is to remain aware of the limitations of the data under different quality and utility constructs. In order to conduct this exploratory research, the following assumptions are necessary and will be explored through the research:

A1: Users would benefit from a systematic or algorithmic method to assess the utility of their geospatial data and information.

A2: Once the key factors and components of utility are determined, one or more systematic and algorithmic methods can be developed to assess or estimate users’ information utility ‘scores’ based upon the geospatial data available as compared against GIS-based analytical requirements.
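A hypothetical sketch of the kind of algorithmic method assumption A2 anticipates: score a candidate dataset against analysis requirements on quality- and context-based factors, then combine the scores with requirement weights. The factor names, weights, and scores below are illustrative only; they are not the validated factor set developed in this research.

```python
# Importance of each factor to a particular analysis (weights sum to 1).
requirement_weights = {
    "positional_accuracy":  0.4,  # quality-based factor
    "currency":             0.3,  # quality-based factor
    "coverage_of_aoi":      0.2,  # context-based factor
    "format_compatibility": 0.1,  # context-based factor
}

def utility_score(factor_scores):
    """Weighted sum of per-factor scores, each assumed to lie in [0, 1]."""
    return sum(requirement_weights[f] * s for f, s in factor_scores.items())

# A candidate dataset: very accurate and fully covering the area of
# interest, but somewhat dated.
candidate_dataset = {
    "positional_accuracy": 0.9,
    "currency": 0.5,
    "coverage_of_aoi": 1.0,
    "format_compatibility": 1.0,
}
print(round(utility_score(candidate_dataset), 2))  # 0.81
```

The point of the sketch is only that such a score is analysis-specific: change the weights (the analytical requirements) and the same dataset yields a different utility, which is exactly the context-dependence this research sets out to explore.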


Statement of Purpose

The purpose of this research is to explore geographic information systems (GIS) users’ attitudes about the dimensions of the data and information they import into their GIS, particularly as these dimensions may affect either static or dynamic attribute values for the data sets used. The research is focused on parsing apart quality-based and context-based utility issues as they pertain to the data and information used in GIS. The research is limited to the input side, that is, the data and information entering GIS. The output problem is related, and interesting, but is kept out of scope of this research to provide clarity and focus to the survey questions tendered to GIS users and other GIS stakeholders.

To further define its scope, this study is narrowed to determining the attributes of data or information utility that pertain to geospatial data, supporting the development of a validated survey instrument that examines quality and utility factors amongst GIS users. Therefore, the central focus of this research is to determine users’ attitudes about how useful and relevant their source data are to their geospatial analyses.

Given the assumptions addressed previously, this research explores the concept of utility applied to geospatial information and data through interaction with targeted sets of GIS users. This approach leads to two overarching research questions:


Question 1: What are GIS users’ and stakeholders’ opinions about the semantic differences between quality and utility as they pertain to geospatial information and data used in GIS?

Question 2: What are GIS users’ and stakeholders’ opinions about the components of utility as they pertain to geospatial information and data used in GIS?

To initiate a community-wide discussion on quality and utility, this research begins with the dictionary definition of utility (http://www.m-w.com/cgi-bin/dictionary), which derives from the Latin utilis, meaning useful. For something to be useful, it must also be relevant to an intended use. Therefore, in this research, utility is defined to mean useful and relevant. As described in the literature review, utility is a broad and inclusive concept. To further narrow the scope of this research, the concept of utility is restricted to its application to information as a good. In this research, information utility is abbreviated as IU. This leads to defining and exploring concepts, issues, and terminology related to geospatial information utility (GeoIU).

In Meeks and Dasgupta (2004), information utility was defined to be an estimator of the adequacy, relevance, and usefulness of geospatial information to the users of GIS.
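This definition can be made concrete with a small sketch. The weighted-average form, the component names as function parameters, and the [0, 1] scaling below are illustrative assumptions; the dissertation does not prescribe a formula for combining adequacy, relevance, and usefulness into a single estimate:

```python
# Hypothetical sketch only: one way an information utility (IU) estimate could
# combine the three components named in the Meeks and Dasgupta (2004) definition.
# The equal default weights and the [0, 1] component scales are assumptions.

def geo_iu(adequacy: float, relevance: float, usefulness: float,
           weights=(1 / 3, 1 / 3, 1 / 3)) -> float:
    """Combine three component scores in [0, 1] into a single utility estimate."""
    components = (adequacy, relevance, usefulness)
    if not all(0.0 <= c <= 1.0 for c in components):
        raise ValueError("component scores must lie in [0, 1]")
    return sum(w * c for w, c in zip(weights, components))

# Example: a dataset that is highly relevant but only moderately adequate.
score = geo_iu(adequacy=0.6, relevance=0.9, usefulness=0.75)
```

An analyst who values relevance over the other components could shift the weights accordingly; the point of the sketch is that any such scheme forces the components and their trade-offs to be made explicit.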

The dearth of research into the utility of information, or into the utility of geospatial data and information used in GIS, constitutes a research gap. The absence of something does not prove its opposite; however, since there are no current geospatial information utility estimation tools, three de facto user practices are suggested with respect to the degree to which users of geospatial data do, or are able to, assess the utility of their geospatial data and information:

(1) Some GIS users ignore quality and utility issues of their ingested data, with respect to their GIS analysis (i.e., available data is used irrespective of its pedigree, quality, or utility).

(2) Some GIS users apply ad hoc or other user-defined heuristics to attempt to account for their data and information quality and utility issues.

(3) Some GIS users use proxies for estimating the utility of their ingested data and information, as these data and information support their GIS analyses.

These three de facto practices form a continuum of sorts, from doing nothing with respect to assessing and acting on the utility of data and information ingested into GIS for subsequent analysis, to using current ‘best practices’, such as performing an accuracy assessment as a proxy for conducting a more rigorous or systematic utility assessment. This research explored these statements by exploring users’ attitudes about quality and utility applied to geospatial data used in GIS-based analyses. The aim of this research is to begin to lay the groundwork for a robust and systematic geosciences community dialogue about quality and utility issues.


Chapter 2 provides the theoretical foundations of information, quality, and utility constructs. This examination leads to two broad conclusions: (1) quality is the current dominant paradigm; utility is largely ignored in the literature; and (2) quality is expressed in many different ways, which seem to be converging into two separable streams (i.e., attribute-based measures of quality and use-based measures of quality). Though not explicitly addressed in the literature, the emergence of these two quality approaches (i.e., focusing upon either the inherent attributes of the data and information or upon the uses of the data and information) is the basis for this approach to defining utility along the use-based or context-based quality lines (as thought of within the existing quality paradigm). As stated earlier, one of the thrusts of this research is to extend these constructs from broader (i.e., more generic) data and information realms, as they are used in any “____ information system” (xIS), to the more specific case of geospatial data and information ingested, processed, analyzed, and reported using GIS. This approach is consistent with a general systems construct of inputs, processes, outputs, and feedback, where ingested data refers to inputs, processed or analyzed data refers to processes, and reported data refers to outputs (and perhaps feedback).

Within any “xIS” case, one function of the “x” system is to transform data (in permitted forms) into information (in usable forms). In the case of GIS, spatial, geographic, and geospatial data are presumed to be ‘processed’ (i.e., having a contextual construct applied to the data for the purpose of imputing or extracting additional meanings from the data within their context) into geospatial information for many different uses.


This is important because there is little published research about the utility of geospatial data and the application of utility theory to data and information issues within GIS (Nebert, 2001; Reichardt, 2001; VanDyke, 2001). To summarize, this study will help GIS users and geosciences practitioners and researchers to address the need for, or the impact of not having, systematic methods to consider the utility of their GIS source data to their analytical requirements.

Organization of the Document

This dissertation is organized as follows: Chapter 1 introduces the topic of quality, context, and utility applied to information used in GIS. The introduction includes an overview of key functionalities, information requirements, and limitations of GIS; the statement of purpose for this research; and an overview of the theoretical perspectives and key concepts required in this research.

Chapter 2 provides the theoretical foundations for this research by summarizing the literature that addresses information quality, utility, and information utility, and the relationship between quality and utility issues pertaining to geospatial data and information. Chapter 3 provides the research methodology. This chapter provides a basic research model relating utility, quality, and context, considering utility to be a second-order construct comprised of quality factors and context-based factors. This chapter also addresses the rationale and methods for conducting an exploratory factor analysis as a means for uncovering the factors that correlate with either quality or context. Finally, the details of the research design and the data collection and analysis are addressed in Chapter 3. Chapter 4 provides the results from both the pilot and main studies. This chapter includes pilot study results; characterization of the sample population and descriptive statistics from the main study; a discussion of main study results, including validation of the basic research model; and a discussion of instrument validation results. Chapter 5 presents the research findings, provides conclusions, offers implications for the GIS data community and GIS-related organizations, provides information about the study’s limitations, and proposes opportunities for future research.


Chapter 2: Theoretical Foundation

Overview

This research is important because users and managers of GIS use the outputs of GIS-driven or -supported analyses to support many types of decision making. The old aphorism from the early days of computer programming and automated data processing, about the main limitation of computers being “garbage in, garbage out,” has never been more apt. In the information age, in the Internet society, in the days of widely-available, always-on wideband data connectivity, more and more people are accessing or creating ever-increasing stores of data and information. This is true in the broader domain of information systems (IS), as reflected in widespread adoption of information and communications technologies, and within the geospatial data analysis domain. Redman (1998) reported on some of the practical impacts for enterprises, such as impacts to revenues and expenses, of data and information quality problems. According to Isaaks and Srivastava (1989, p. 109), “Information extracted from a data set or any inference made about the population from which the data originate can only be as good as the original data.” There are data scrubbing and data estimating routines that mitigate the worst effects of bad or low-quality data; nevertheless, the point stands: bad data may make for bad analyses.

In this research, the increased access issue is defined to be comprised of: (1) increased ease of access to data and information sources and (2) increased quantities of data and information accessed, ingested into, processed by, and outputted from GIS. GIS users and managers are using larger volumes of easily accessed and manipulated data in their analyses and decision making with little or no assurance about the quality or utility of these data and information. Thinking ‘garbage in, garbage out’, these GIS users, analysts, and managers should then have little confidence about the quality of their analyses and decisions if they cannot make a declarative or probabilistic statement about the confidence they have in the data and information their analyses are based upon. A byproduct of the growth-in-access issue is the broadening of the base of users now using GIS and other MIS able to ingest and manipulate geospatial data and information. The geosciences fields (i.e., sensing, positioning, and analyzing) were once the near-exclusive domain of highly-trained or specifically-educated geospatial professionals, academics, and researchers. However, over the last 10 years or so, with increased accessibility to data (and information) and increased usability of tools and software applications, which have greatly lowered the barriers to entry, each of these fields has seen rapid growth of entrants at the low and middle tiers (Bossler, 2002b; Pick, 2005).

This research approaches several streams of thought about information quality and utility, considering them as a coherent whole, and then applies this stream of thought to the geosciences domain. Specifically, this research seeks to unify three orientations:

(1) Extending prior research into data and information quality for the generic information systems (IS) or management information systems (MIS) domains to the more narrowly defined domain of geographic information systems (GIS);


(2) Extending research into data and information utility for the generic IS and MIS domains to the GIS domain; and

(3) Examining the relationship between quality and utility at the GIS and geospatial information levels.

At the outset, considering the unit of analysis of data and information types, these extensions of data quality (DQ) and information quality (IQ) to the GIS and geospatial data and information domains commence at a type-independent, generic ‘data are data’ and ‘information is information’ level. This generic view of the unit of analysis is useful. This research also uses two data-to-information metaphors: the “information pyramid” and the “information puzzle.” The purpose of these metaphors is to provide context for a more detailed level of analysis when considering the specific issues of data and information quality and utility within GIS. Figure 2.1 depicts this relationship:

[Figure 2.1 is a two-layer schematic. Above a dotted line (domain independent): information systems use and produce data and information, whose quality and utility users need to know. Below the line (domain specific): GIS use and produce geospatial data and information, whose GeoQuality and GeoUtility users need to know.]

Figure 2.1 The path to studying the quality and utility of geospatial information in GIS


Therefore, the first step is to examine data and information quality and utility issues at the highest level (i.e., the IS level), which is essentially a data type-independent review; the next step is to research data and information quality and utility issues for the specific data and information types endemic to the GIS/geospatial sciences domain. The key terms and concepts found in this chapter include: data quality, information quality, information utility, information relevance, data accuracy, and information accuracy. Figure 2.1 above shows the higher order (parent) source of concepts to be reviewed and researched above the dotted line and the more detailed (domain specific) application of those concepts to this research.

The two arrows that descend from “Data and Information” and “Quality & Utility” to “Geospatial Data & Information” and “GeoQuality & GeoUtility,” respectively, indicate the parent-child relationship used here.

The Relationship between Data and Information

Capurro and Hjorland (2003) focus on the semantics and etymological roots of information. They offer semantic precision where it has been lacking. They side with Machlup and Mansfield (1983) in holding that the context and constructs required and applied by the users of information need to be defined and considered. It is not the error-free transmission of data and information that matters most to organizational scientists, managers, and decision makers, but rather that message content be received, understood, and assimilated by the recipient. Shannon’s seminal works (Shannon, 1948; Shannon & Weaver, 1949/1972) on information theory, as brilliant as they were and enduring as they are, focus too narrowly on the mechanics of information transfer.


Professor John Artz at George Washington University is fond of pointing out the difference between techne and telos. Techne refers to the mechanical or technical aspects of a thing: how something is achieved. Telos, on the other hand, refers to the final understanding of, or the achievement of, a larger purposeful goal. These are useful concepts here. Reichwald (1993, cited in Wigand, Picot, & Reichwald, 1997) also provides a useful semiotic framework that links senders and receivers of information, shown in Figure 2.2:

Figure 2.2 Reichwald’s (1993) Analysis Levels of Information Transmission. Source: Wigand, Picot and Reichwald (1997)

These three semiotic components are important. The syntactic level represents the physical transmission of the “symbols” (i.e., text, data, information, etc.) of the communication system in place. The semantic level represents the assignment of meaning to the transmitted symbols. The pragmatic level represents the inclusion of the sender’s intention (i.e., context) along with the semantic information. Wigand et al. (1997, p. 61) make a distinction between data and information: “Data represent meanings which are not directly applicable to a purpose. Information, in contrast, is relevant for certain actions.” Not explicitly stated, but inferred by this author, is this clarification: data are the transmitted symbols, without context, and information is the meaning derived from context applied to transmitted symbols. Given either the Wigand definition or the Meeks clarification, one point is that the contextual meaning of a message may be accurately communicated in spite of some level of errors induced in transmission, if the context is understood (from a prior communication, or from elsewhere in the currently transmitted message).

Stamper (1996, 2001) takes a semiotic approach similar to Wigand, Picot, and Reichwald (1997). However, Stamper considers six semiotic levels that parse more finely the meaning contained in the Wigand model. In addition to the Syntactic, Semantic, and Pragmatic levels of Wigand et al. (1997), Stamper adds the Physical, Empiric, and Social levels. Stamper also explicitly organizes his six-level semiotic model into two categories: the Technical Layer and the Human Layer.

The Technical Layer levels (i.e., Physical, Empiric, and Syntactic); Wigand, Picot, and Reichwald’s Syntactic level; and Shannon’s mathematical theory of communication all fall outside the scope of this paper; however, they provide a context for focusing on the content-specific issues inherent in the data and information themselves. As implied in the previous chapter, the purpose of this research is to explore users’ interaction with GIS and their needs regarding changing or static data dimensions (e.g., currency, accuracy) and the associated attribute values of each researched dimension for the geospatial data and information used in GIS.


Bovee (2004) posits that the information field remains rather open and inconclusive. He provides a useful treatment of the issue of defining information. Where Bovee (2004, p. 5) declares, “Central to any discussion of information quality is the definition of information. However, few authors have attempted this, leaving the meaning of information to be assumed or inferred by the reader,” he seems to contradict himself, as he spends a score of pages offering multiple definitions and perspectives on the meaning of information. It seems many attempts have been made, though no clear definitions dominate. Bovee (2004) further identifies several broad information-defining perspectives: the use of information in the sciences (Machlup & Mansfield, 1983), information as a socio-political force (Braman, 1989), information as a property of the universe (Stonier, 1991), data as indistinguishable from information (Machlup & Mansfield, 1983), and a product-oriented MIS view (L. P. English, 1999). Citing English (1999), Bovee provides the graphic in Figure 2.3, illustrating the flow from data to wisdom:

Figure 2.3 English’s (1999) model of the relationships between data and information. Source: Bovee (2004)


The descriptors at the bottom of the Figure 2.3 graphic provide an excellent source for this research’s focus on a context-based approach to evaluating information.

This research provides two metaphors about data and information: the information pyramid and the information puzzle. The former is more relational in nature and is more useful in describing the relationship between data and information. The latter is more descriptive in nature and asserts that the assessment of data or information quality (and later, once established, utility) is not binary in any ‘on-off’, ‘good-bad’, or ‘useful-not-useful’ mode. This metaphor for information and information quality also accepts, and even agrees with, the reality of multiple definitional frameworks for data and information. Linking the problem of relating data to information, Longley, Goodchild, Maguire, and Rhind (2001) provide a useful framework for decision-making support using GIS examples: data are raw geographic facts, and information is comprised of assembled and analyzed facts (data).

The Concept of Information Quality

Eppler (2003) provides a treatment of the multiple issues associated with analyzing information quality. He summarizes eight explicit IQ frameworks published between 1986 and 2000. These frameworks include:


A model of value-added information services comprised of five information quality criteria (Taylor, 1986). Taylor’s quality criteria include: accuracy, comprehensiveness, currency, reliability, and validity.

A framework oriented on information production/publication, comprised of seven quality criteria (Russ-Mohl, 1994). Russ-Mohl’s quality criteria include: comprehensibility, currency, objectivity, relevance, reduction of complexity, transparency/reflexivity, and interactivity.

Information as product, comprised of seven quality criteria or dimensions, versus information as process, comprised of five additional quality criteria (Lesca & Lesca, 1995). Lesca and Lesca’s information-as-product quality criteria include: comprehensibility, relevancy, usefulness, completeness, representation, coherence, and clarity. The information-as-process quality criteria include: objectivity, interactivity/feedback, trustworthiness, accessibility, and credibility.

A more complex framework that parses information quality criteria into 4 top-level categories and 15 subordinate dimensions (R. Wang & Strong, 1996). Wang and Strong’s quality criteria include: accuracy, interpretability, timeliness, objectivity, relevancy, ease of understanding, completeness, concise representation, believability, accessibility, reputation, security, value-added, amount of information, and consistency.


An even more complex framework that parses 27 criteria into eight dimensions and three views (T.C. Redman, 1996). Redman’s quality criteria include: accuracy, comprehensiveness, currency/cycle time, relevance, essentialness, completeness, precision of domains, clarity of definitions, obtainability, consistency of values, consistency of representation, semantic consistency, structural consistency, homogeneity, naturalness, identifiability, attribute granularity, minimum redundancy, robustness/reaction to change, flexibility/reaction to change, appropriateness of formats, and interpretability.

Another complex framework that parses 18 information quality criteria into six quality dimensions (Koniger & Rathmayer, 1998). The quality criteria include: preciseness, objectivity, trustworthiness, accessibility, security, relevance, added value, timeliness, information content, interpretability, understandability, conciseness, consistency, existence of metadata, appropriateness of metadata, existence of structure, appropriateness of structure, and understandability.

The first framework that explicitly mentions the relationship between information quality and its value to its target group (audience) is very narrow in scope. This framework is comprised of just six quality criteria (Alexander & Tate, 1999). The quality criteria include: accuracy, currency, objectivity, audience orientation, authority, and interaction/navigation.


The final framework is a two-dimensional framework with 15 quality criteria. The focus is on inherent and pragmatic information quality (L. English, 1999). English’s quality criteria include: accuracy to reality, timeliness, validity, usability, rightness/fact completeness, completeness of values, definition conformance, contextual clarity, accessibility, accuracy to source of data, precision, non-duplication, equivalence of redundant or distributed data, and derivation integrity.

In a broad sense, the essence of each of these frameworks is its contribution to the lexicon and taxonomy of the somewhat fuzzy field of information (and data) quality. As with most works, they have shortcomings; either the biases of the creators or the limitation of what was then the extent of the body of knowledge. But also as with most efforts to extend the body of knowledge in any field or enterprise, they each nudge the information “ball” a little closer to the goal line. Additional amplifying information about information quality is provided in the paragraphs that follow; however, it is appropriate to point out that for the above eight frameworks, there is frequent commingling of terms, concepts, and perspectives.

The eight frameworks Eppler assembled provide 105 quality criteria in one form or another; however, because of expected duplicate uses of the same (or very similar) terms, there are only 37 rows in a term-mapping matrix. Each row represents the first use of a term or concept, serving as the mechanism to identify a specific information quality criterion; therefore, 37 rows suggest there are only 37 “original” ideas represented. This mapping proved useful for several reasons:

- To examine the included frameworks in temporal order
- To examine the totality of the criteria listed (i.e., 105 terms)
- To examine the breadth of the criteria listed (i.e., 37 ‘first-use’, or different, terms)
- To examine the depth of the criteria listed (i.e., how many terms are repeated)
- To examine the terms themselves for insight into the over-time evolution of concepts related to quality
- To examine the terms for being overextended (i.e., a term purported to be a “quality” term that may actually serve another purpose better)

In looking for agreement and disagreement (or originality) among the authors of the eight frameworks, these key terms recur, irrespective of each framework’s particular focus:

- Timeliness or currency is listed 7 of 8 times;
- Comprehensiveness or comprehensibility is listed 6 of 8 times;
- Accuracy (the first term in the matrix) is listed 5 of 8 times;
- Objectivity is listed 5 of 8 times;
- Relevance (or relevancy) is listed 5 of 8 times;
- Completeness is listed 5 of 8 times;
- Accessibility is listed 5 of 8 times;
- Usability or usefulness is listed 4 of 8 times;
- Clarity or contextual understanding is listed 4 of 8 times;
- Credibility or accuracy of the source is listed 4 of 8 times.

Beyond their use in these eight information quality frameworks, these terms occur in the broader information and information quality literature. The above 10 terms (concepts) represent fewer than 30% of the 37 entries in the matrix, but nearly 50% of the 105 separate instances of term occurrence in the matrix. The point is not necessarily these rudimentary statistics about term usage. After all, there is certainly some room for interpretation, as the contextual meaning of some of the terms is suspect in certain instances. The point of the frameworks’ groupings of terms is that there is a good degree of agreement on many of the key attributes, but there are also several different ways (read, quality criteria) in which individual authors more narrowly approach the information quality issues pertaining to problems within their particular areas of interest. In other instances, very specific parsings of the terms’ meanings must be intended. In two frameworks, there are multiple terms each with meanings very close to other terms in the same framework, implying some frameworks mean the terms in very narrowly defined senses, whereas others provide terms within a very broad context.
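The bookkeeping behind such a term-mapping matrix can be sketched as a frequency count. The framework-to-criteria mapping below is an abbreviated, hypothetical stand-in for the full matrix (the real one spans 8 frameworks, 105 occurrences, and 37 unique terms), and the synonym normalization (e.g., folding “currency” into “timeliness”) is assumed for illustration:

```python
from collections import Counter

# Abbreviated, hypothetical stand-in for the term-mapping matrix: each framework
# maps to a shortened list of its criteria after synonym normalization.
frameworks = {
    "Taylor 1986": ["accuracy", "comprehensiveness", "timeliness", "reliability", "validity"],
    "Russ-Mohl 1994": ["comprehensiveness", "timeliness", "objectivity", "relevance"],
    "Wang/Strong 1996": ["accuracy", "timeliness", "objectivity", "relevance",
                         "completeness", "accessibility"],
    "Redman 1996": ["accuracy", "comprehensiveness", "timeliness", "relevance", "completeness"],
}

# Flatten all criterion occurrences across frameworks and count them.
occurrences = [term for criteria in frameworks.values() for term in criteria]
counts = Counter(occurrences)

total_occurrences = len(occurrences)  # analogue of the 105 instances
unique_terms = len(counts)            # analogue of the 37 'first-use' rows

# Share of all occurrences captured by the few most frequent terms.
top_three = counts.most_common(3)
coverage = sum(n for _, n in top_three) / total_occurrences
```

In this toy matrix, 20 occurrences collapse to 9 unique terms and the three most frequent terms cover half of all occurrences, mirroring the shape of the statistics reported above.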

A similar review of information quality literature by Parssian (2002) reveals that accuracy, consistency, completeness, and timeliness are widely considered to be four of the most important attributes of data or information quality (Padman & Tzourakis, 1997; Gardyn, 1997; Kahn, Strong, & Wang, 1997; Burzynski, 1998).

However, other terms on the top 10 list and elsewhere in the information quality framework matrix provide context of how a piece of information may be used. Timeliness, the term listed the most frequently, is also one that is highly subjective and could yield variable quality ratings dependent on the use of the piece of information.

For example, currency (or timeliness) can be quantified in objective terms: when was a particular piece of information last updated? But the meaning attached to any answer, such as “yesterday” or “25 years ago,” is highly subjective and depends on the need for, or use of, the information (which now begins the utility discussion). A regional historian may want to chronicle the population and land-use growth of a region over the last 200 years, using approximately 25-year increments (where the data are available) to illustrate how, when, and where substantive changes occurred. Some GIS datasets, by contrast, are updated with key attributes (e.g., new roads, construction) on a near-real-time basis, allowing local planning or road maps to be published several times per year. The historian does not need “several times per year” currency, while the municipal tax planner would certainly need plats and other maps that support tax planning and collection to be updated far more often than once every 25 years. This leads to a perspective of information as a product or commodity. As with other products and commodities, some information is highly perishable (e.g., daily newspaper accounts), “aging” very rapidly, while other information ages much more slowly.
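The historian-versus-tax-planner contrast can be sketched as a toy scoring rule. The function, its full-score-then-decay shape, and the numbers below are illustrative assumptions, not an established GIS currency metric:

```python
# Hypothetical sketch: the same dataset age yields different currency utility
# depending on the user's required update interval. The decay rule is an
# illustrative assumption.

def currency_utility(age_years: float, required_interval_years: float) -> float:
    """Score in [0, 1]: full utility if the data are at least as current as the
    use requires, decaying toward 0 as the data grow staler than that."""
    if age_years <= required_interval_years:
        return 1.0
    return required_interval_years / age_years

dataset_age = 10.0  # dataset last updated 10 years ago

historian = currency_utility(dataset_age, required_interval_years=25.0)   # 1.0
tax_planner = currency_utility(dataset_age, required_interval_years=0.5)  # 0.05
```

The same objectively measured age (10 years) scores full marks for the historian and nearly zero for the tax planner, which is the subjectivity the paragraph above describes.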


Kahn et al. (2002) take both a product- and a service-oriented framework approach to data quality, wherein a product-oriented approach focuses on data whose quality conforms to product specifications and a service-oriented approach focuses on meeting customers’ needs or expectations. Though this research does not necessarily note the difference between product and service orientations, the inclusion of a focus on satisfying users’ needs is useful. Further, Price and Shanks (2004) explore the fundamental nature of information quality as it pertains to organizational decision making. Both these approaches are useful to the information quality/utility researcher; though thought to be quality-focused, by considering the user’s needs they aid the move toward developing an information utility model. Burgess, Gray, and Fiddian (2006) discuss the link between information quality criteria and users’ information needs, albeit in a service-oriented construct.

Much of what drives information quality derives from the more general quality domain. This is a natural extension of going back to the fundamentals of any given domain. However, a further challenge is that a large portion of the quality literature comes from the manufacturing world. For example, Garvin (1987, 1988) writes about quality for the management community, but mostly with a manufacturing focus. Nevertheless, useful to this research is Garvin’s (1988) assertion that, when taking a product-based view, quality should be considered objective and measurable.


Hakim (2007) refers to the multidimensionality of information quality and emphasizes that individuals must certainly have differing needs and wants with respect to the information they process, or that they ingest into their systems, and thus, their quality needs differ with their uses.

Based on the foregoing, in this research, information quality is defined to mean the collection of objective attributes that describe or score the goodness of a piece of (or source of) information. These attributes must be objective in order to permit the quality scores to be transferable. These attributes must also only describe intrinsic capabilities of the information in question in order to remain neutral of the “sender” and “receiver” of the information, and to remain independent of the use to which the information will be put.
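One way to read this working definition is as a record of intrinsic, use-independent attributes. The sketch below is a hypothetical rendering: the chosen attributes are drawn from the recurring framework terms reviewed earlier, and the averaging rule is an assumption, not part of the definition:

```python
from dataclasses import dataclass

# Hypothetical rendering of the working definition above: information quality as
# a collection of objective, intrinsic attributes, recorded independently of any
# particular sender, receiver, or intended use.

@dataclass(frozen=True)
class IntrinsicQuality:
    accuracy: float      # agreement with ground truth, in [0, 1]
    completeness: float  # fraction of required records/attributes present
    consistency: float   # fraction of records free of internal contradictions
    age_years: float     # time since last update (objective; its meaning is use-dependent)

def quality_score(q: IntrinsicQuality) -> float:
    """Average the bounded attributes. Age is deliberately excluded: it is
    objective to record, but interpreting it requires a use, which an
    intrinsic quality score must not encode."""
    return (q.accuracy + q.completeness + q.consistency) / 3

q = IntrinsicQuality(accuracy=0.95, completeness=0.8, consistency=0.9, age_years=2.0)
score = quality_score(q)
```

Because the attributes are use-independent, the same record can be transferred between users, which is exactly the transferability the definition requires; utility, by contrast, would have to be recomputed for each use.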

Though geospatial data and information are a subset of the broader IS data and information quality and utility issues, a special case of geospatial data (most often thought to be maps and map-like data sets) are electro-optical and radar-generated image data. Chapters 14 and 15 of Barrett and Saleh (2004) focus on image quality issues. Though their focus is image quality assessments, this research deals with image quality and utility within the research construct that follows in the next chapter.


The Concept of Utility

Utility is written about most extensively in the contexts of economics and operations research. Economists use the language of preferences, happiness, rationality, and satisfaction. These are all value-laden terms: utility only has meaning within the context of satisfying the needs, desires, or wants of the person or organization that must be satisfied.

Giddings (1891) provides a very early theoretical discrimination between subjective utility, which focuses on wants, and objective utility, which focuses on the factors of production. Lange (1934) points out two fundamental assumptions needed to develop utility functions as the vehicle for quantifying human choices: (1) that even if utility is not directly measurable, an individual is able to make binary comparisons between different bundles of goods and to indicate which is preferable, and (2) that an individual can determine whether changed utility due to a transition in their bundle of goods is greater than, less than, or equal to a similar change in another bundle of goods.
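Lange’s two assumptions can be sketched in code. This is an illustrative sketch only: the Cobb-Douglas-style `utility` function and the bundle values are hypothetical stand-ins for whatever representation an individual’s preferences might admit; nothing here is prescribed by Lange (1934).

```python
# Illustrative sketch (hypothetical utility function, not from the source):
# Lange's (1934) two assumptions about choices over two-good bundles.

def utility(bundle):
    """Hypothetical Cobb-Douglas utility for a bundle (x, y) of two goods."""
    x, y = bundle
    return (x ** 0.5) * (y ** 0.5)

def prefers(a, b):
    """Assumption 1: an individual can make a binary comparison of bundles."""
    return utility(a) > utility(b)

def compare_changes(old_a, new_a, old_b, new_b):
    """Assumption 2: an individual can say whether the utility change from one
    transition in a bundle is greater than, less than, or equal to another's."""
    delta_a = utility(new_a) - utility(old_a)
    delta_b = utility(new_b) - utility(old_b)
    if delta_a > delta_b:
        return "greater"
    if delta_a < delta_b:
        return "less"
    return "equal"

# Binary comparison: (4, 4) vs. (2, 8) happen to yield equal utility here,
# so neither bundle is strictly preferred.
print(prefers((4, 4), (2, 8)))                           # False
print(compare_changes((1, 1), (4, 4), (1, 1), (2, 2)))   # greater
```

The point of the sketch is only that both assumptions require comparisons (of bundles, or of changes), never a directly observed cardinal measurement.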

These descriptors lead one to ask, and try to answer: what is utility? A definitional construct is natural but difficult to achieve in certain topical domains. Fisher (1968) explores many of these concepts, positing that utility is hard to pin down, yet averring that there is no suitable substitute. Therefore, utility is often left deliberately abstract in many economics classes. In fact, economics professors often coin their own noun, “utils,” to aid group discussions of the concept of utility without ever having to define exactly how to compute tradeoffs. One trend: utility is about tradeoffs. One semantic trade-off is that, to provide clarity in the literature, many authors prefer to describe these tradeoffs in binary, “this or that” terms; however, rarely in the ‘real world’ are choices limited to two variables. Baumol (1951) addresses some of the key concepts posited in Neumann-Morgenstern (1947), specifically pointing out not just the trade-off dynamic as the core of utility preferences, but also the importance of context:

“People speak of the utility of a hat, but a hat can only have utility…if it is neither complementary with, nor a substitute for, any other item (including another hat?). This implies that at best we can hope only to get a unique measure of utility of (or, indeed, an ordering of preference of) the totality of one’s possessions in various situations.” (Baumol, 1951, p. 62)

Friedman and Savage (1948) and Neumann and Morgenstern (1947) argue that maximization of utility is a core tenet of utility research into consumer/individual decision-maker behavior. Neumann and Morgenstern (1947) focus on numerical utility based upon maximized expected value. Friedman and Savage (1948) sharpen the Neumann and Morgenstern work by narrowing some of the phrasing of the arguments. Also of interest are discussions in both works not just of the assessment of utility, but also of the differences in utility between different sets of choices.

DeGroot (1983) identifies two components of decision problems under uncertainty: decision makers’ subjective probabilities, which reflect knowledge and beliefs associated with the decision being made; and subjective utilities, which reflect preferences and choices. DeGroot (1983) assumes that the decision maker can define his subjective utility functions.

Economists also use the language of defining and measuring marginal utility, or of maximizing utility. These extend utility even further into the realm of relativism: utility itself is not as important as the change in utility. To define utility so that it can be measured and compared, economists then use the language of utility functions. These functions are, in fact, binary trade-off curves illustrating how, as the quantity of product x declines, it can be substituted by increasing product y. Jacob Viner (1968) ties utility theory to value: “The utility theory of value is primarily an attempt to explain price determination in psychological terms.” As previously mentioned, value, value theory, and information value are out of scope for this research. Given that “value” is out of scope, the reason this link is relevant is psychological impact: users’ psychology, extended to users’ wants, desires, and needs, again points to the value of users’ context in determining the utility of something. Extending this relationship, Handy (1970) writes of the impact of rationality on utility, which leads back to the general economic premise that consumers act rationally to maximize their overall utility in product consumption. Handy (1970) defines utility in economic theory as the “ability to satisfy a want.” Bernstein (1996) points out that utility theory requires people to be able to measure their utility (preferences) and to guide their decisions based upon these measurements.
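The substitution behavior such trade-off (indifference) curves describe can be illustrated with a small sketch. The Cobb-Douglas functional form and the exponents below are assumptions chosen for illustration, not anything prescribed by the literature cited above.

```python
# Illustrative sketch (hypothetical functional form): an indifference curve for
# a Cobb-Douglas utility U(x, y) = x**A * y**B, showing how declining amounts
# of product x must be offset by increasing product y to hold utility constant.

A, B = 0.5, 0.5  # hypothetical preference exponents

def y_for_constant_utility(u, x):
    """Solve U(x, y) = u for y, i.e., y = (u / x**A) ** (1/B)."""
    return (u / (x ** A)) ** (1.0 / B)

TARGET_U = 4.0  # stay on the indifference curve where utility equals 4
for x in (16.0, 8.0, 4.0, 2.0):
    y = y_for_constant_utility(TARGET_U, x)
    print(f"x={x:>5}  ->  y={y:.1f}")  # y rises as x falls
```

Each printed pair sits on the same curve: the consumer is indifferent between them, which is exactly the binary trade-off the text describes.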

Schmidt (1998) and Hakansson (1970) provide examples of structured, mathematical approaches to defining utility both semantically and formulaically. Bell and Farquhar (1986) provide a similar treatment and analyze the axiomatic approaches of Neumann and Morgenstern (1947) as a foundation for much of theoretical expected utility theory. Schmidt (1998) provides a good survey of (then) recent utility research, and further distills from the literature the basic components of utility theory under risk: consequence (the impact of an occurrence), probability (the likelihood that an occurrence will happen), and preference (the choices in the tradeoff). The last factor is the one most relevant to this research; the first two link to the specific application of Schmidt’s lecture notes to the insurance industry.
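Schmidt’s three components combine naturally in the standard expected-utility form EU = Σ pᵢ·u(cᵢ). The sketch below is illustrative only: the concave (risk-averse) preference function and the two lotteries are hypothetical choices, not examples from Schmidt (1998).

```python
# Illustrative sketch: consequence, probability, and preference combined in
# the standard expected-utility form. The preference function u is a
# hypothetical concave (risk-averse) choice.

import math

def u(consequence):
    """Hypothetical risk-averse preference function (concave)."""
    return math.log(1 + consequence)

def expected_utility(lottery):
    """lottery: list of (probability, consequence) pairs summing to 1."""
    return sum(p * u(c) for p, c in lottery)

sure_thing = [(1.0, 50)]             # a consequence of 50, for certain
gamble     = [(0.5, 0), (0.5, 100)]  # fair coin: consequence 0 or 100

# Preference is the choice revealed by the trade-off between the two options.
best = max([sure_thing, gamble], key=expected_utility)
print(best is sure_thing)  # True: a concave u prefers the certain consequence
```

With a concave u, the certain 50 beats the fair gamble over 0 and 100 even though both have the same expected monetary value; that gap between expected value and expected utility is precisely where the preference component does its work.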

Luce (2000) describes the classical approach to measuring utility: through observed behavior. He relates the differences between “classical” and newly “fashionable” approaches to measuring utility across the psychological, economic, and other disciplines. This treatment is an extension of the numeric expression of utility found in Schmidt (1998). Luce (2000) and Quiggin (1993) address rank-dependent utility as a method for structuring the utility problem while providing tradeoff consistency. This concept is related to the concept of rank-order selection represented in the psychological literature (Thurstone, 1959).

Iverson and Luce (1998), Regenwetter (1996), Falmagne (1978), and Block and Marschak (1960) discuss probabilistic utility models, termed random utility models. The details of these approaches are too structured for this exploratory research; however, the implications are relevant: once the factors comprising geospatial information utility are uncovered and correlated (i.e., this research), subsequent research into the measurement of utility must certainly center on defining probabilistic utility functions.

Aleskerov and Monjardet (2002) provide a treatment of context-dependent threshold functions for determining and maximizing utility in choice making. The treatment includes the notion of n-ary (non-binary) preferences. The centerpiece of the discussion is the notion of choice thresholds; more relevant to this research is the inclusion of the context-specific application.

Keeney and Raiffa (1976) provide a five-step utility assessment approach in support of unidimensional decision making. Related aspects of utility assessment methods and concepts are further articulated in Fishburn (1965, 1967) and Hull et al. (1973). Farquhar (1984) extends the Keeney and Raiffa (1976) approach and assesses several utility assessment methods, including preference comparison, estimation of indifference points, probability equivalence, value equivalence, certainty equivalence, paired-gamble, and some hybrids that combine more than one method. These methods provide different tailored approaches to mathematically assessing utility once preference or indifference curves (functions) are developed, but not which factors ought to be included in the assessment in order to make domain-relevant utility judgments.

Farquhar (1975) identifies some of the key features of multiattribute utility models, illustrating how these models are used to support decision outcomes involving several attributes and how these utility functions can be decomposed for analysis. Farquhar (1978) also addresses the relationships between interdependent decision criteria as reflected in multiattribute, multi-criteria decision making. Payne, Laughhunn, and Crum (1984) provide a useful treatment of the relationship between multiattribute utility models and the framing of the decisions they support. They note with special interest the role of reference points for aspiration towards particular utility goals. Harvey (1993) takes a methodological approach to discussing the relationships between utility functions, in this case the assumptions that underlie multiplicative and additive utility functions.

Finally, Friedman (1955) posits that, broadly speaking, utility is not measurable; to be useful, utility must be more narrowly and specifically defined. He declares utility to be a “neutral concept” that only becomes useful when defined within a specific context.

The Concept of Information Utility

As the close of the section on information quality describes, some of the information in the quality domain is contradictory or unfocused. However, many of the named attributes (or ‘quality criteria’ in the language of the Eppler framework survey) that violate the information quality definition proposed above are candidates for inclusion in the (emerging) geospatial information utility domain.

Just as there are no consensus definitions for data and information, according to Wang et al. (R. Y. Wang, Storey, & Firth, 1995), the subjective nature of how data quality dimensions and associated criteria are perceived or represented in many different contexts creates the same difficulty for coalescing the information quality and data quality fields. Richard Wang’s articles (R. Wang & Strong, 1996; R. Y. Wang, 1998; R. Y. Wang, Reddy, & Kon, 1995; R. Y. Wang, Storey, & Firth, 1995) reflect his and his colleagues’ focus on data quality and their approaches to understanding the subjective nature of attribute-based frameworks for data and information quality. The coverage of the subjective nature of data and information quality lends credence to the notion that it may be time to begin parsing the information quality domain into (two or more) smaller but more narrowly (and accurately) defined domains:



•	The information quality domain to articulate and clarify the objective measures and attributes of the quality of data and information; specifically, quality should be about intrinsic grading of data and information, independent of its uses.



•	The information utility domain to articulate and discuss the subjective measures and attributes of data and information as they pertain to the utility, or use of, data and information in organizations. The utility domain should be context-oriented.

In this way the parsing of information quality into (1) information quality, based on objective measurements and attributes (motivated by Garvin, 1988), and (2) information utility, based on subjective attributes, is explored. Garvin supports this concept, if not this terminology, with his user-based definition of quality that focuses on “fitness for use” (Garvin, 1988, p. 43).


Motro and Anokhin (2004) take a similar utility-oriented focus when assessing the quality of information used in information systems. They also look at the most prevalent quality attributes (e.g., accuracy, currency, availability) and others, such as costs; then they focus on being able to say something about the performance of source data. This line of thinking leads to being able to develop utility functions for evaluating source data. And being able to develop sets of utility functions for evaluating source data logically leads to being able to develop a model for generalizing utility about information and data.

Beyond thinking about “what” information utility might be (or ought to be), some authors (Schlee, 1990) posit “how” information utility might be composed. Kifer and Gehrke (2006) approach the problem of measuring the utility of information from a data management perspective. They acknowledge the paucity of structured approaches to measuring information utility and provide some insight into metric development.

Quality and Utility Issues Pertaining to Geospatial Data and Information

Burrough and McDonnell (1998) provide one of the seminal texts in the GIS field. In chapter 9, on errors and quality control, they provide their own quality framework pertaining to spatial data and information. Reminiscent of the eight frameworks described in Eppler (2003), the Burrough and McDonnell (1998, p. 223) framework includes seven factors (previously called dimensions or criteria) and offers detailed components or explanations to put these factors in context. Five of the seven factors (criteria) appear repeatedly: currency, completeness, consistency, accessibility, and accuracy and precision (some authors parse these apart and some combine them together).

Fotheringham et al, (A. S. Fotheringham, Brunsdon, & Charlton, 2000; A. S. Fotheringham, Wegener, & European Science Foundation., 2000; S. Fotheringham & Rogerson, 1994) point out data and information complexities analysts face when using GIS. Fotheringham and Rogerson (1994, p. 88) outline seven data space interactions:

1. geographic space only
2. geographic space and time space
3. geographic space and attribute space
4. geographic space, attribute space, and time space
5. time space only
6. time space and attribute space
7. attribute space only

Here, geographic space refers to spatial or geographic data, i.e., locational information within the data set in question; time space refers to temporal iterations of the geospatial data or additional temporal attributes relevant to GIS analyses, i.e., temporal details contained within the dataset; and attribute space refers to all other attribute data.

It is important here to point out two possibly conflicting perspectives with respect to attribute information. Properly, most or all of these attributes should be considered as attributes such as are normally contained in the metadata about any particular dataset. Metadata are commonly defined to be data about data. The point here is that it is common to provide information (admittedly to widely varying degrees of completeness) such as is contained in the previous section about accuracy, currency, etc., within geospatial datasets’ metadata. Whether actually performed or not for structured and unstructured non-geospatial data and information, it is becoming more and more common to standardize and populate metadata fields for geospatial data and information. The “other” perspective is not about being one step removed from the attribute information about the dataset (or piece of information) as a whole, but rather about understanding that, particularly within geospatial (and spatio-temporal) datasets, the “features”3 contained within the dataset have their own attributes.

Referring back to their seven data spaces, Fotheringham and Rogerson (1994) point out that the ordering of these data spaces is not relevant. Rather, the important point is that GIS, as a subset of MIS, ingest, use, and output different types of data (and, by extension, information). Each of these data types has its own quality and utility issues, and the GIS user must eventually be able to make quality and utility assessments in an integrated way.

McCullagh (1998) points out analysis challenges with data sets of different types applied to different kinds of problems. Only a limited amount of published work on information quality focuses on the form the data or information take. Geospatial analyses using GIS, as with other knowledge work and problem solving activities, often permit multiple approaches or interpretations. McCullagh (1998) compares analytical requirements (e.g., size of spatial structure) and reports on advantages or disadvantages of using one data type or another. In this instance, the comparison is between rectangular grids and triangulated grids, but the point is made: for many different analytical requirements, there may be many different preferences for one data type or another, one data format or another, or one accuracy assessment or another.

3 Geospatial features could be the streams, roads, buildings, etc. found on maps and embedded within electronic map datasets. Each of these features has its own attributes: e.g., streams have width, depth, current flow rate, bank height, among many other attributes.

Morrison (2002, p. 507) provides a mapping of key spatial data quality issues to their corresponding elements within two international spatial data quality initiatives (the Comite Europeen de Normalisation Technical Committee (CEN/TC) and the International Organization for Standardization (ISO) technical committee 211). The six quality dimensions Morrison discusses are: attribute accuracy, completeness, logical consistency, positional accuracy, temporal accuracy, and lineage. He further identifies, for the ISO/TC 211 framework, the first four elements as quantitative and the last as non-quantitative. Although the chapter is titled Spatial Data Quality, the focus is on resolving both quality and utility issues.

As discussed throughout this chapter, some of these so-called quality criteria also address issues of utility. Marsden (2000) provided key inspiration about some of the mechanics of, and motivation for, researching geospatial data utility. Obermeier (2001) also provided key data about users’ utility information, albeit from a government agency data producer’s point of view. Duncan et al. (Duncan, Heidbreder, Hammack, & Szpak, 1997) report on the use of the Product-Source Prediction Capability (PSPC) as a model for analyzing RADARSAT data. Though the data sets are limited in scope, the paper provides insight into some preliminary government attempts to address the geospatial utility problem.

Thinking back to general utility functions and their limitations and uses, Keeney (1971, 1972) explores the multiattribute problem in detail. This approach is useful for this research insofar as the preponderance of utility literature addresses unidimensional or binary (paired-choice) utility approaches as simplified means of addressing the utility function problem. Meeks and Dasgupta (2004) used a multiattribute model to propose a weighted, multiattribute utility assessment schema. Hull et al. (Hull, Moore, & Thomas, 1973) point out the justifications for linear and additive utility functions. Some models are very restrictive and not necessarily representative of the kinds of attributes that decision makers find themselves considering when making organizational decisions. For example, in the simplest linear utility model, all attributes must be measured in the same units: this is useful when all attributes can easily be denominated in the same terms (e.g., monetary terms). However, this restriction is not particularly useful in the geospatial information domain. Meeks and Dasgupta (2004) specifically limited the nominated set of utility factors to accuracy (both horizontal and vertical), currency, data type, and datum. This scope limitation was knowingly non-inclusive and was done to explore simplified versions of the linear multiattribute utility model.
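The simplified weighted, additive multiattribute model discussed above can be sketched as follows. The factor names come from the Meeks and Dasgupta (2004) factor set as summarized in the text; the weights, the 0-to-1 scoring scale, and the sample dataset values are hypothetical illustrations, not values from that study. Normalizing every score to a common [0, 1] scale is what satisfies the "same units" restriction of the simplest linear model noted above.

```python
# Illustrative sketch: a weighted, additive multiattribute utility function in
# the spirit of Meeks and Dasgupta (2004). Weights and scores are hypothetical.

WEIGHTS = {                      # hypothetical user-assigned importance weights
    "horizontal_accuracy": 0.30,
    "vertical_accuracy":   0.20,
    "currency":            0.25,
    "data_type":           0.15,
    "datum":               0.10,
}

def linear_utility(scores, weights=WEIGHTS):
    """Additive model U = sum(w_i * s_i), each score s_i normalized to [0, 1].
    The additive form assumes all attributes are scored in common units."""
    assert abs(sum(weights.values()) - 1.0) < 1e-9, "weights must sum to 1"
    return sum(weights[factor] * scores[factor] for factor in weights)

# Hypothetical normalized scores for one candidate geospatial dataset
dataset = {
    "horizontal_accuracy": 0.9,
    "vertical_accuracy":   0.6,
    "currency":            0.8,
    "data_type":           1.0,
    "datum":               0.5,
}
print(round(linear_utility(dataset), 2))  # 0.79
```

A GIS user could score several candidate datasets this way and select the one with the highest utility; the restrictiveness the text mentions shows up in the assumptions baked into the weights and the common scale.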


Other Terminology and Related Concepts

This section provides key terminology related to the conduct of this research. As this research is first and foremost a study about the effective use of geospatial data in GIS, this research and this section focus on communicating geospatial data-related issues. Kuhn (1979) wrote: “Metaphors play an essential role in establishing links between scientific language and the world.” Kuhn (1962, 1973) also pertains. This research is not groundbreaking, nor is it necessarily paradigm shifting in a Kuhnian sense. However, it does address a problem unaddressed in the information systems and GIS literature: to begin exploring “how” to measure information utility by defining the factors that ought to be included in these measurements from the user community’s point of view. This allows GIS users to gain greater insights into the usefulness and relevance of the data and information used in their GIS so they can be more effective in the execution of their GIS analyses through the maximization of their geospatial information utility. From this insight, GIS users and stakeholders can communicate more effectively. Maasen and Weingart (2000) provide an excellent treatment of the use of metaphors in the social sciences. Two important metaphors guide this research: the “information pyramid” and the “information puzzle.” The information pyramid is an information processing metaphor and the information puzzle is an information context metaphor.


The Information Pyramid: A Data–to–Information–to–Knowledge Paradigm

The concept of the information pyramid (as found in several information technology textbooks and other popular books on information systems) holds there is a rough pyramidal hierarchy that links data, information, and knowledge together. Decker (2001) provides a useful treatment as a prelude to discussing geospatial data types, sources, and their uses. In the pyramid metaphor (Figure 2.4), the height dimension refers to increasing degrees of abstraction and complexity (attained through the processing and aggregation of more detailed, transactional data from levels below). The base (horizontal) dimension refers to the breadth or volume of occurrences of problems/objects (e.g., data, information, etc.).

[Figure 2.4: Information Pyramid. Levels, from bottom to top: Noise, Data, Information, Knowledge, Insight, Wisdom; transformation upward occurs via “processing” by providing additional context.]

Thus, at the lowest level, the data are more atomic (e.g., individual transaction data), which means there is less abstraction while there is significantly more volume of these types of data. This metaphor specifically addresses the following:

•	Data are pieces of “information” at the most atomic level
•	Data can occur without any specific context
•	Data most frequently occur at the transactional (lowest) level of operations
•	The most atomic data occur with the greatest volume or frequency
•	Data are transformed into information by applying one or more contexts
•	Some data are eliminated through processing into information
•	Some data are combined together when processed into information
•	Information can be processed into knowledge by the application of more complex contexts (constructs)
•	Information most frequently occurs at the managerial (middle) levels of organizations
•	Knowledge results from complex information processing and abstraction
•	Knowledge most frequently occurs at the strategic or decision making (highest) levels of organizations

For example, a data element with a value of “20” could mean many things: the length in feet of a room where people have gathered, the number of minutes a meeting should last, the number of people gathered, the number of months’ worth of sales being discussed at the meeting, the temperature of the room in °C, or the number of dollars of spare change found in peoples’ pockets or purses. As shown, context is critical to applying understanding to raw data. By applying contexts—common or highly specialized—such as space, time, population, temperature, currency, etc., raw data are processed into useful information. The three levels described here (data, information, and knowledge) are not the only semantic labels applied, though they are the most commonly used. Others often include intelligence, insight, and wisdom (normally found “above” knowledge in the pyramid) and sometimes “noise,” which is found “below” data on the pyramid and is interference with the collection of data, or distortion of data, such that some values or meanings are not readily discernible.
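The pyramid’s “processing” step (applying a context to raw data) can be sketched minimally. The context names and units below are hypothetical illustrations echoing the “20” example in the text.

```python
# Illustrative sketch (hypothetical contexts): the same raw datum becomes
# different information depending on the context applied to it.

RAW_DATUM = 20  # an uninterpreted value, meaningless without context

CONTEXTS = {    # hypothetical contexts that could be applied to the datum
    "room_length":      "feet",
    "meeting_duration": "minutes",
    "attendance":       "people",
    "temperature":      "degrees C",
}

def to_information(datum, context):
    """Apply a context to raw data, yielding interpretable information."""
    unit = CONTEXTS[context]
    return f"{datum} {unit} ({context})"

print(to_information(RAW_DATUM, "temperature"))  # 20 degrees C (temperature)
print(to_information(RAW_DATUM, "attendance"))   # 20 people (attendance)
```

The point is only that the information content lives in the pairing of datum and context, not in the datum alone.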

The Information Pyramid, as a data–to–information–to–knowledge paradigm, is a useful construct in general and when applied to GIS and the geospatial sciences specifically. Whenever the information “processing” term is used in this research, it relies on this construct: to add meaning through the application of analytical contexts. This is important because through finding or developing semantic and synaptic relationships between information contexts (most often by user definition), grouped into concepts and constructs (either discovered or created) in organizations, knowledge to support organizational decision making is created and used. This provides a logical basis for specifically calling out data, information, and knowledge separately—whenever needed and appropriate—or of combining these elements back together when it is convenient or necessary to do so for clarity.

The Information Puzzle: A How-Data-Fits-Together Paradigm

The information puzzle is another useful metaphor for communicating the ways in which data, information, and knowledge are “pieced together” to form ever more complete “pictures” of one or more concepts or ideas, which are used to support individual or organizational problem solving and decision making. This is a useful metaphor because:

•	It reflects the reality that in problem solving and decision making—as with puzzle building—it is often possible to get a sense of the picture that is emerging before all the pieces are found (discovered) and placed into the puzzle. Astute observers are able to make probabilistic estimates or judgments of what will finally emerge before it finally does, permitting early or preventative action towards an end.

•	It reflects the reality that in problem solving and decision making—as with puzzle building—not all pieces are equal in value, quality, or utility; nor are they equally accessible, thus forcing data and information tradeoffs because of these limitations. In table-top puzzle building, the edges and corners are easy to identify by their distinctive shapes, which often stand out from the other pieces. Finding these pieces permits a framing of the puzzle-building problem, thereby conferring some extra value or utility on these pieces. Other pieces may also be distinctive in shape or color, making them easy to spot (the accessibility problem) and easier to insert into the emerging puzzle solution (perhaps increasing their value or utility to the user). These kinds of pieces (framing or high-value by distinction) often offer crucial aids to advancing understanding about the underlying “picture” being worked on: permitting understanding about what the puzzle means. Likewise, in organizational problem solving and decision making, it may be possible to identify some or all of the critical pieces that comprise the solution to a problem or decision, or that bound (frame) the problem or decision.

•	It reflects the reality that in problem solving and decision making—as with puzzle building—not all pieces are equal in the cost to acquire or use. In puzzle building, cost to acquire may mean the time to find or fit a particular piece; that is, some pieces are harder to find or fit than others. Likewise, in organizational problem solving and decision making some pieces are more difficult (or more costly) to acquire than others, and once acquired, their use may require additional investigation or other analytical effort before they can be put to use.

The information puzzle metaphor is powerful because of its aptitude for simplifying organizational data and information problems related to problem solving and decision making; however, it has two artificialities. First, in table-top puzzle building the number of pieces of information needed to complete the puzzle is fixed: if you purchase a 1,000-piece puzzle, you can reasonably assume that it will require finding and correctly using exactly 1,000 pieces of information to complete the puzzle. In organizational problem-solving or decision-making, that is rarely the case. The number of pieces needed to complete an organizational “puzzle” may not be known exactly, or with any a priori certainty. In fact, in many organizational problem solving situations, since there are multiple ways to solve the problem, there may be multiple solution sets of required pieces of information—whether the domain of those solution sets is known or not. Second, in organizational life, it rarely occurs that, even if the solution set of all required pieces of information is completely known, all known pieces can be acquired.

These two data-to-information processing paradigms (metaphors) are: the information pyramid, which describes the frequency-of-occurrence versus abstraction-complexity relationship between data, information, and knowledge; and the information puzzle, which describes the information piece “completeness” problem. Together they constitute the epistemological foundation for pursuing this research. And, in fact, both are representative of the use of data and information within GIS-based analyses.

Spatial Data

Spatial data are data having any type of “where” (i.e., place or space) orientation. Spatial data are a very broad category of data types, sources, and uses. Spatial data can be one-dimensional and describe the location of objects as points in space (e.g., lighthouses, small buildings on small-scale maps, cell phone radio towers, etc.) or lines (e.g., roads, building boundaries, etc.). Spatial data can also be two-dimensional and describe area objects (e.g., city boundaries, land plat diagrams, the extent of a grove of trees or a crop planting pattern, etc.). Or, finally, spatial data can be three-dimensional and support the development of more ‘realistic’ objects such as those used in modeling constructs. These data can also be represented over time, providing up to four-dimensional representations. Spatial data that include a time component are often called spatiotemporal data. Examples of spatial data include: astronomical star charts, a photograph of the Apollo 11 lunar landing site, satellite weather maps, city street maps, organizational sales territory boundaries, building blueprints, department store layout diagrams, telecommunications network topology diagrams, surgical diagrams of a body prepared for surgery, whorls in fingerprints, and protozoan cells on microscope slides. Spatial data—in this instance, attribute information about spatially arranged objects—exist independent of scale, dimension, scope, and time; these are all representation artifacts. However, to be visualized and perceived, they are often represented in these ways (e.g., scale must be applied) for ease of use and improved understanding.


Geographic Data

Geographic data are a class or subset of spatial data which address “where” and “how far” kinds of questions and which specifically include an absolute or relative georeferencing frame. Geo means earth-referenced; therefore, to georeference means to relate one spatial object to another by using some form of common coordinate system that is tied to the earth in some way. Examples of geographic data include: locations of planned cellular telephone towers; delivery trucks en route being tracked with wireless global positioning system (GPS) transmitters; and latitudes and longitudes of oil tankers at sea en route to petroleum refineries from Middle Eastern oil fields. Unlike spatial data, which can (but may not) exist independent of scale, dimension, and time, geographic data can only exist within the boundaries of the georeferencing construct applied to them.

Geospatial Data

Geospatial data have both spatial and geographic components. Geospatial is becoming the more commonly used term, supplanting geographic data, because it is more inclusive and comprehensive. This is the default term within this study, except where one of the other terms is specifically more appropriate.

Temporal Data

Temporal data are data having a time-based orientation. In a GIS context, temporal content is normally used in conjunction with geospatial data. Temporal data can answer "when" questions (e.g., start and stop times) or "how long" questions (e.g., duration intervals). These data are very useful in GIS analyses, and in non-GIS business analyses, because of the need by analysts and managers in many disciplines to detect and report on changes over time.

Spatiotemporal Data

Spatiotemporal data combine spatial (or geospatial) and temporal aspects of the subject being investigated or analyzed. This is more a semantic concept than one that can be directly quantified; the term is becoming "shorthand" for considering spatial (or geospatial) and temporal data within a common construct.

Usefulness

From Merriam-Webster Online (http://www.m-w.com/cgi-bin/dictionary): "Usefulness is the quality of having utility and especially practical worth or applicability." In this research, usefulness means able to use (i.e., within a GIS for a specific analysis or other purpose). An analogy illustrates the point: an auto mechanic may find all the tools in his toolbox useful (at one time or another) for performing different kinds of repairs to cars; in fact, he may not be able to perform the repairs without the tools. To GIS users, the tools in their "toolboxes" are the hardware (workstations); software (operating systems, applications, etc.); connectivity to data sources and other analysts and users; primary and secondary data sources; and analysis-specific algorithms and analytical procedures. These components comprise the constituent parts of GIS. All these tools are useful at one time or another.

Relevance

From Merriam-Webster Online: "Relevance means related to the matter at hand; the ability (as of an information retrieval system) to retrieve material that satisfies the needs of the user." In this research, relevance means want to use because it is related to a specific GIS analysis or other user-defined purpose. Continuing the mechanic analogy, depending on the specific repair to be made, all tools in the toolbox may be useful (at some time or another), but only certain tools may be relevant for particular repairs. For GIS users, selecting the appropriate "tool," or the specific use of the tool, may vary from analysis to analysis. Depending on the complexity of any given analytical task, more or fewer of the GIS tools in general, and the data construct tools specifically, may be relevant to any given analysis.

Data Type

Burrough and McDonnell (1998) identify three types of data formats: the media format (e.g., tape, floppy disk, etc.); the data format (i.e., the way in which entities are spatially represented: the points, lines, and polygons of vector data, or the continuous fields or grids of raster data); and the locational and attribute format, including scale and projection. In this research, the second format is called data type, which includes but is not limited to: vector, raster, elevation, and textual data types.

Data Coverage

Data coverage pertains to the areal extent or span of coverage for a GIS user's particular area of interest for any given use of geospatial data in GIS. In GeoIU terms, area of interest (AOI) is a useful construct for permitting users to identify the geographical areas for which they desire dataset coverage.

Data Currency

Data currency pertains to the age of the data that are available or being sought. A map or digital dataset will normally not have a single date (age) associated with its production or distribution. Data currency is made complex by the multiple dates of source collection, data processing and finishing, archiving, editing/updating, and map/dataset reproduction and distribution. Beyond the different dates arising from the different functions and activities involved in data production, in practice this is made more complex by the fact that a map is rarely made from a single source dataset collected (and then subsequently processed) at one time. GIS users need to be able to specify different currency needs based on different analytical needs (uses). Some users need highly current data, some need highly accurate data, and some need both. However, due to the costs and limitations of data acquisition and processing, there normally exists a dynamic tension between allowing more time to obtain or process more accurate data (or information) versus permitting less time (for acquisition and processing) in order to obtain more current data (or information). Thus, an accuracy versus currency tradeoff is often one of many binary utility/preference tradeoffs encountered in GIS analyses.

Data Lineage

Data lineage pertains to the sources and methods involved in data collection, processing, production, and distribution for use (Smith, 1993). Specific issues in determining data lineage include the sources or identities of the organizations or individuals performing data collection, verification, processing, archiving, and any other value-adding process that forms or transforms the data. Beyond the credentials of the people or organizations who are the sources of data, the methods by which they perform their value-adding activities are also part of the lineage (which speaks to the quality) of the data.

Data Quality

In this research, when applied to geospatial data, data quality focuses on attribute-based quality factors (see Chapter 2), and specifically includes geospatial data accuracy and resolution issues. Geospatial data accuracy normally comprises both locational and content accuracy. Locational accuracy is normally specified in both horizontal and vertical dimensions, represented in circular error probable (CEP) and linear error probable (LEP) terms, respectively. Content accuracy pertains to the ability to correctly assign attribute values or other semantic labeling. Geospatial resolution is normally thought to pertain to the degree of fineness to which a sensor can resolve detail, or to which a dataset can describe an object in question. As with data currency, GIS users must be able to make choices about where their data quality priorities lie.
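As a hedged illustration of locational accuracy, CEP can be estimated from a sample of horizontal position errors as the radius containing 50% of the errors. The sketch below uses the median radial error as that estimator (an assumption of this example, not a method prescribed in the text):

```python
# Estimate circular error probable (CEP) from observed horizontal
# position offsets: the radius within which half the errors fall,
# approximated here by the median radial error.
import math

def cep_50(errors_xy):
    """Median radial error of (dx, dy) horizontal offsets (same units as input)."""
    radii = sorted(math.hypot(dx, dy) for dx, dy in errors_xy)
    n = len(radii)
    mid = n // 2
    return radii[mid] if n % 2 else 0.5 * (radii[mid - 1] + radii[mid])

# Hypothetical offsets (meters) between surveyed truth and dataset positions.
offsets = [(1.0, 0.0), (0.0, 2.0), (3.0, 4.0), (1.0, 1.0), (0.5, 0.5)]
cep = cep_50(offsets)  # radius containing half the observed errors
```

A vertical (LEP) analogue would apply the same median idea to one-dimensional elevation errors rather than radial horizontal offsets.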

Data versus Datum

In normal English usage, data is the plural form of datum, which Merriam-Webster Online defines as: "something used as a basis for calculating or measuring." However, in the geospatial sciences fields, datum has a very specific meaning that is not necessarily the singular form of data. A datum is a mathematical representation of the surface of the three-dimensional geoid (i.e., the surface-varying oblate spheroid) that is the earth (Lillesand & Kiefer, 2000). By convention in the geosciences field and in this research, the plural of this sense of datum is not data; it is datums. Similarly, though a somewhat less rigid field convention, as within the broader information systems and technology fields, data can be both singular and plural.

Summary of Literature Review

There is a dearth of literature pertaining to information utility as a separate construct, to utility as a set of criteria or factors, or to any articulated theories for researching and analyzing utility as it pertains to data and information used within IS and GIS. This dearth presents an opportunity that this research addresses.

The relatively rich field of information quality often presumes to encompass utility as an embedded concept, normally through the identification of utility or relevance (a proxy for utility) as one of many criteria for determining information quality. As GIS are a subset of all IS, this research uses constructs from IS in a contextual way to shape the GIS quality-utility dialogue. The extant information quality literature takes a "quality is comprised of multiple criteria" approach.

In this research, the 'information value as loose proxy for utility' construct is considered out of scope. This research seeks to add to the existing body of literature on quality and utility by making a definitional distinction between the two and by surveying users about their perceived quality and utility needs. Value is a construct that will be dealt with in subsequent research.


Chapter 3: Methodology

Overview

This chapter describes the methodology and methods employed in this research in order to study the core research problem of exploring GIS users’ and stakeholders’ perceptions about the utility of the geospatial data and information used in GIS-based analyses and decision making. This chapter is organized into the following sections: description and discussion of the basic research model, a candidate set of exploratory quality and context factors considered relevant to a discussion of geospatial information utility, research design and methods, survey design and measurement, data collection and analysis approaches, a discussion of the sample respondent pool, an examination of validity issues, expected results, and limitations.

After reviewing the literature, it is apparent researchers and academics have drifted into defining information quality as the sum of its attributes, without ever adequately addressing the concept as a whole: very much reminiscent of the story of the six blind men attempting to describe an elephant after each encounters only one key feature (appendage, etc.). The field is relatively new and wide open, so anyone with something to say can say it. The shortcoming of this situation is the patchwork way in which progress moves forward: anyone with something to say can develop a framework that perhaps fits within narrowly described, specific instances, but which fails to help develop consensus in the field.

Thinking like an engineer performing an iterative analysis, this research successively iterates between broadening and narrowing the scope of the term, cycling back to the higher-order term quality to avoid missing anything. For example, when people commonly think of the quality of something, they think about its goodness, or lack thereof. This reflects several of the eight definitions of quality found on Merriam-Webster Online (http://www.m-w.com/cgi-bin/dictionary?book=Dictionary&va=quality). These definitions (e.g., 1. essential character, 2. grade (as in degree of excellence), 3. rank (as in social status)) and synonyms such as property, character, feature, and attribute speak to inherent (intrinsic) descriptors that let us grade the thing under review. By extending the dictionary definitions of quality to information quality, the information quality framework summary terms such as accuracy and precision seem like descriptors of inherent properties that could be used to grade the quality of a piece of information. This grading of the quality of a piece of information should be independent of any use of the information. Something could simultaneously be of high quality and low utility, or vice versa.

Economists write of the utility of a good or service in the context of happiness or pleasure, where "pleasure is the satisfaction or utility derived from the consumption of goods or services" (Peterson, 1973, p. 279). A way to consider the benefit of a piece of information is to measure or estimate the utility of the use of that piece of information. This leads to the utility of the informed decision: what benefit does the organization (or the manager) derive from the addition of one or more pieces of information in making a decision? Economists speak of maximizing utility. This research assumes GIS users want to maximize the utility of the answers that result from their analyses, which leads to the assumption that they would seek to maximize the utility of the source data and information they use in those analyses. Therefore, information utility relates to the usefulness or relevance of a piece of information. This paper defines information utility to mean the collection of attributes that describe or score the use of a piece of (or source of) information. These attributes may, and most probably will always be, subjective in nature in order to permit the utility scores to be directly applicable to user-defined problems and analyses.

Basic Research Model

The general research model for this research is shown in Figure 3.1:

[Figure 3.1: General Research Model for Exploring Utility of Geospatial Data. Geospatial Information Utility is comprised of Quality (attribute-based factors) and Context (use-based factors).]


Using the model shown in Figure 3.1, this research explores the concept that the utility of geospatial information is comprised of both quality and context factors. Quality factors are considered to be attribute-based. That is, these factors describe objective, directly measurable features of the data or information ingested into, processed within, or outputted from a GIS. These objective features constitute the qualities of the different 'pieces' of data or information used in a GIS construct. A simple example: an automobile contains a 320 cubic inch engine. The engine size is a feature of the automobile that can be objectively measured and reported. On the other hand, context factors are considered to be use-based, where use-based means 'in-use' or 'intended-use'. Recalling the automobile engine, information describing the engine's size can be a defined quality (attribute-based) feature of the car; at the same time, it can be assumed the owner or operator of the vehicle obtained or operates the vehicle for one or more intended purposes.

It is natural to further assume there exists a many-to-many relationship between the many vehicle features (as discrete components of the vehicle as a whole 'system') and the one or more intended purposes for which the vehicle is operated. Is the vehicle a transportation system conveying the owner/operator or one or more passengers from place to place? Is the vehicle a delivery system used by the owner/operator to convey goods or other materials from place to place? Is the vehicle a recreational system of some type (e.g., a race car)? Can the vehicle be operated for more than one purpose, either concurrently or sequentially? These kinds of questions are important because systems and data users need to examine their rationale for wanting to know the content of the data and information they ingest into their systems. This is at the heart of the meaning of utility, and it leads the manager or the user to deciding how much trouble or cost to expend in order to obtain and use the data content.

For all of the uses or intended uses of geospatial data and information, any specific use may focus on both: (1) selected sets of qualities (attributes) germane to the specific use at hand, and (2) specifically desired values (i.e., data or information content) for the selected attributes for that use. The combination of any or all selected attributes (e.g., accuracy, currency, data format) and specifically desired attribute values (e.g., CEP-90, data less than 6 months old, vector versus raster data type) constitutes the degree to which qualities and contexts combine to provide either more or less satisfaction with the data and information available, depending on the need for use. This use-based satisfaction with data and information is the basis for the exploration of the concept of information utility and geospatial information utility. Where attributes are described in the foregoing, they are what the factor analysis literature refers to as factors to be examined and condensed or summarized. What this research calls the 'factors' of quality and context, as they relate to geospatial information utility, Hair et al. (Hair, Anderson, Tatham, & Black, 1998) describe as perceived dimensions. They are perceived because they are not directly measurable and because their use and utility are based on the perceptions of the users. This is the point of this research: to explore GIS users' perceptions about the candidate factors (identified in the next section) as they pertain to quality, context, and utility in GIS analyses. This conforms to the utility literature that focuses on user-derived utility functions based on users' needs for particular qualities or attributes and particular attribute values. Hair et al. (1998) also point out that several different terms, such as factors or dimensions, are used more or less specifically, depending upon the multivariate technique used to explore the constructs. In the multivariate data analysis sense, factor is the better term; in the data, information, and database realm, attribute is the better term. This research uses both to refer to the same things (e.g., accuracy, currency).
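The way selected attributes and desired attribute values combine into use-based satisfaction might be sketched as follows (the function, weights, and thresholds are hypothetical illustrations, not part of the survey instrument):

```python
# Hypothetical sketch of combining selected attributes and desired
# attribute values into a use-based satisfaction score for one dataset.
def satisfaction(dataset, requirements):
    """Weighted fraction of requirements the dataset meets for one use."""
    met = total = 0.0
    for attr, (predicate, weight) in requirements.items():
        total += weight
        if predicate(dataset.get(attr)):
            met += weight
    return met / total if total else 0.0

dataset = {"accuracy": "CEP-90", "age_months": 4, "data_type": "vector"}
requirements = {
    # Each requirement: (predicate on the attribute value, importance weight).
    "accuracy": (lambda v: v == "CEP-90", 3.0),              # must meet CEP-90
    "age_months": (lambda v: v is not None and v < 6, 2.0),  # under 6 months old
    "data_type": (lambda v: v == "raster", 1.0),             # raster preferred
}
score = satisfaction(dataset, requirements)  # meets 5.0 of 6.0 weighted needs
```

A different analysis would supply different predicates and weights against the same dataset attributes, which is precisely the quality (attribute) versus context (use) distinction the model draws.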

For the previous automobile-in-use example 'passenger transportation,' the set of attributes that are most useful and relevant may center on passenger-carrying capacity and then passenger comfort and safety (for private transportation) or vehicle durability and economical operation (for commercial or public transportation). This is not to say there may not be many other attributes (qualities) to help describe the quality, context, and utility of the vehicle in this particular use, but the example is about finding or describing the most useful or relevant attributes (qualities). For the automobile-in-use example 'cargo delivery,' a completely different set of attributes (qualities) may be more germane; this attribute set will most likely center less on passenger comfort and more on cargo capacity or movement. From this discussion, it should be clear that a system's intended-for-use purpose or purposes, and their attending uses in operation, bear direct relation to the attributes (qualities) that are most germane and to the associated attribute values needed (and measured in use) for each attribute. Any attribute, such as engine size, has much more meaning in a utility context once it is known whether an engine of a certain size is required or desired (where engine size is a proxy for 'available power') for the specific intended use under study. Given the two examples, passenger transportation and cargo delivery, for each different use there may be certain minimum engine sizes desired or required so that the vehicle has sufficient power to accomplish its primary or other tasks (i.e., transportation, delivery, etc.). On the other hand, too much engine size may be inefficient when compared against economical operation of the vehicle (e.g., due to fuel costs, maintenance costs, etc.).

The foregoing provides a simple illustration of the hypothesized differences between attribute-based and use-based factors. However, this dissertation is not about vehicles as systems, but rather about exploring GIS users' and stakeholders' perceptions about their quality and utility needs as they pertain to the data and information ingested into, processed within, or outputted from their GIS. Therefore, to put these distinctions back into a GIS context: for a 'piece' of information ingested into a GIS to support a specified decision-making analysis, an attribute (quality or context factor), or objective feature, might be the age of that information, which can be objectively measured and reported. However, how 'old' is too old? And how 'fresh' need a piece of information be? These are more examples of perceptual questions GIS users must answer for any specific analysis in question. If a geologist seeks to study the topography and rock formations of the Grand Canyon in Arizona, given the age of the rock formations and the underlying topography, a map (in either paper or digital form) could be years or decades old and still be suitable for the geologist's current needs. On the other hand, a municipal tax planner, in preparation for conducting an annual tax assessment of private, commercial, and municipal lands within the municipality's jurisdiction, may require a map that has been issued/reissued or updated no less often than annually or semi-annually. The age of any given map or digital geospatial dataset is an independent, objective, knowable, and reportable factor, without regard to its use: this is the attribute-based aspect referred to as the quality of the data. For any given GIS-based analysis, data currency may or may not be a critical factor. And, if it is a required or desired factor for a specific analysis, the specific attribute value (or preferred range of values) may vary analysis by analysis. As shown, the intended use of the data provides context, which is necessary in order to make a subjective judgment about whether or not the data are useful or relevant for the intended purpose. This discussion is pertinent in the methodology chapter, even if presented in similar form in preceding chapters, because it provides the logic for how the survey instrument is developed.

An essential fact of this research driving the exploratory approach is the relative dearth of utility literature pertaining to geospatial data and information that describes the factors comprising quality, context, and utility. However, this is not an obstacle but an opportunity: an opportunity to contribute something useful to the geospatial sciences fields by initiating a dialogue amongst users, academics, and other stakeholders about how data and information used in GIS may be viewed and evaluated in terms of their potential fitness for use (within GIS or in GIS-based decision analyses).

Developing a Candidate Factor Set for the Analysis

Though the basic research model appears very simple, the complexity (and interest) enters the dialogue at the point of nominating candidate factors for the exploratory factor analysis. As discussed in Chapter 2, the eight information quality models presented in Eppler (2003) provide an excellent and comprehensive starting point for exploring quality and context factors leading to insight into the construct that is utility. By aggregating the eight models together and removing duplications, the initial survey instrument was developed with a candidate factor set comprised of the factors shown in Table 3.1:

Table 3.1: Candidate Factor Set and Their Sources

Factor | Source(s) | Variable Code
Accessibility (includes obtainability) | (Lesca & Lesca, 1995), (R. Wang & Strong, 1996), (T.C. Redman, 1996), (Koniger & Rathmayer, 1998), (L. English, 1999) | ACC
Accuracy (includes content accuracy, fact completeness, and definitional conformance) | (Taylor, 1986), (R. Wang & Strong, 1996), (T.C. Redman, 1996), (Alexander & Tate, 1999), (L. English, 1999) | ACU
Amount of information (context) | (R. Wang & Strong, 1996) | AMT
Clarity | (Lesca & Lesca, 1995), (T.C. Redman, 1996), (L. English, 1999) | CLR
Coherence | (Lesca & Lesca, 1995) | COH
Completeness (includes content context) | (Lesca & Lesca, 1995), (R. Wang & Strong, 1996), (T.C. Redman, 1996), (Koniger & Rathmayer, 1998), (L. English, 1999) | CPT
Comprehensiveness | (Taylor, 1986), (T.C. Redman, 1996) | CPR
Comprehensibility (includes ease of understanding) | (Russ-Mohl, 1994), (Lesca & Lesca, 1995) | CMP
Conciseness | (R. Wang & Strong, 1996), (Koniger & Rathmayer, 1998) | CNC
Consistency | (R. Wang & Strong, 1996), (T.C. Redman, 1996), (Koniger & Rathmayer, 1998) | CNS
Currency (includes timeliness) | (Taylor, 1986), (Russ-Mohl, 1994), (R. Wang & Strong, 1996), (Koniger & Rathmayer, 1998), (Alexander & Tate, 1999), (L. English, 1999) | CUR
Interactivity | (Russ-Mohl, 1994), (Lesca & Lesca, 1995) | ITA
Interpretability | (R. Wang & Strong, 1996), (Koniger & Rathmayer, 1998) | INT
Minimum redundancy (includes non-duplication) | (T.C. Redman, 1996), (L. English, 1999) | RED
Objectivity | (Russ-Mohl, 1994), (Lesca & Lesca, 1995), (R. Wang & Strong, 1996), (Koniger & Rathmayer, 1998), (Alexander & Tate, 1999) | OBJ
Precision (includes resolution, attribute granularity, and preciseness) | (T.C. Redman, 1996), (L. English, 1999), (Koniger & Rathmayer, 1998), (Meeks & Dasgupta, 2004) | PRE
Relevance (includes relevancy) | (Russ-Mohl, 1994), (Lesca & Lesca, 1995), (R. Wang & Strong, 1996), (T.C. Redman, 1996), (Koniger & Rathmayer, 1998) | RLV
Reduction of complexity | (Russ-Mohl, 1994) | SIM
Reliability | (Taylor, 1986) | REL
Representativeness | (Lesca & Lesca, 1995) | REP
Security | (R. Wang & Strong, 1996), (Koniger & Rathmayer, 1998) | SEC
Transparency | (Russ-Mohl, 1994) | TRN
Trustworthiness (includes believability, credibility, reputation, and source lineage) | (Lesca & Lesca, 1995), (R. Wang & Strong, 1996) | TRS
Understandability | (Koniger & Rathmayer, 1998) | UND
Usability | (L. English, 1999) | USA
Usefulness | (Lesca & Lesca, 1995) | USF
Validity | (Taylor, 1986), (Meeks & Dasgupta, 2004), (L. English, 1999) | VAL
Value-added context | (R. Wang & Strong, 1996) | CNT

Meeks and Dasgupta (2004) proposed an abbreviated utility construct comprised of: data type, area coverage, currency, quality, lineage, and "other". These six factors were further comprised of 14 measurable variables, each of which has multiple attributes. These six prior factors are expanded to the next level of detail here to illustrate how GIS users may be thinking about the present factors being surveyed. Where appropriate, these prior factors were incorporated into the current factor set:

Table 3.2: Meeks-Dasgupta Modified Factor Set (2001)

Factor | Variables | Sample Attributes
Data Type | Spatial | Vector, raster, grid, elevation, spectral
Data Type | Non-spatial | Structured, textual, taxonomical
Data Type | Temporal | Past, present, future
Coverage | Environment | Land, sea, air, space, urban, special
Coverage | Type | Hydrography, land use, land cover, boundaries, population, transportation, utilities, soils
Coverage | Scale/View | Small, medium, large, high resolution
Currency | Data currency | Collection, processing, archiving
Currency | Metadata currency | Created, updated, reported
Quality | Accuracy | Content, horizontal, vertical, temporal, spectral
Quality | Resolution | Spatial, spectral, radiometric, temporal
Lineage | Sources | Collection, processing, archiving
Lineage | Methods | Remote sensing, field collection, data vendors, digitized analog data, converted databases
Other | Datums | Global, regional, local
Other | Projections | Universal Transverse Mercator, Lambert Conic Conformal

The organization of this prior construct evolved from two focus groups conducted under contract to the National Imagery and Mapping Agency (NIMA), now known as the National Geospatial-Intelligence Agency (NGA), in 2001 and 2002. The choice of constructs to be evaluated under contract was directed by Mr. Joe Obermeier (2001), the Contracting Officer's Technical Representative (COTR). That investigation was undertaken as a proof-of-concept demonstration that a notional utility model could be developed and turned into a prototype web-based application for use by government-supported data producers and users desiring to estimate the utility and adequacy of their GIS-based analyses. That research was severely limited; its purpose was to use a small, manageable set of four to six "most common" factors as a mechanism for communicating information utility issues with government personnel involved in either:

(1) Supply-side data production or validation
(2) Demand-side data delivery or use

This contract-sponsored research was then spun off as preliminary research for Meeks and Dasgupta (2004). It is relevant because the previous work laid the groundwork for exploring geospatial data and information quality and context factors as users use them to support their analyses. The prior research, heavily literature- and focus group-based, is extended here to provide greater structure, depth, and rigor. This extension requires an approach that considers a much broader and more complete set of factors for the analysis.


Research Design and Methods

As mentioned previously, this is survey-building research based upon an exploratory factor analysis oriented on GIS users' and other stakeholders' attitudes about the factors comprising their geospatial data and information quality and context needs. To summarize, through the GIS users' attitude-based factor analysis, this research develops and refines a survey instrument for follow-on use within the field.

As shown in the basic research model, this research treats information utility as a second-order construct. The basic model shows Geospatial Information Utility to be comprised of both quality factors and context factors. The research design used a two-stage process to gather and analyze respondent data:

(1) The pilot study employed two focus groups of six and seven GIS users, respectively, to examine the candidate factor set and winnow the 28 factors into a more manageable set of 18-20 factors. The purpose of the pilot study is to assist in developing a more user-oriented set of questions for the survey instrument. The winnowing of factors was designed to occur in two ways: by eliminating factors not suitable for analysis and by combining factors too similar to be discerned by the GIS users surveyed. Bottom line: the pilot study focused on developing and refining the survey instrument. This aided in assessments of convergent and discriminant validity.


(2) The main study was conducted as survey research, surveying respondents about their attitudes concerning the assignment of the factors to either the quality construct or the context construct. This research used a 7-point Likert scale for capturing responses. Bottom line: the main study, conducted as a web-based survey, focused on validating the binning of quality and utility dimensions (factors) and on refining the survey instrument for future use in GIS information utility and quality research.

This research initially focused on the 28 factors shown above, as modified by the pilot study into the final survey instrument contained in Appendix A. The survey questions (items) were designed so that a factor analysis of responses would determine, via the correlation matrix, whether any given factor is a quality factor or a context factor.

Survey Design and Measurement of Preference Data

The survey instrument elicited attitude/preference responses from GIS users and other GIS stakeholders related to their use of various GIS data attributes—as the variables under study—as either being predominantly quality-oriented or predominantly utility-oriented. This survey instrument uses a 7-point Likert scale for assigning responses to questions. Jolliffe (1986) points out the importance of question quality in achieving response quality for analyses.


Since this is survey instrument-building and testing research, the survey design is critical. Jolliffe (1986) provides an excellent overview of survey design issues. Critical in any research is an understanding of the types of variables being studied and the type of data they represent when the survey is conducted and analyzed. Of the four types of survey data (nominal, ordinal, interval, and ratio), this research primarily collects and analyzes ordinal data. Stouffer et al. (1973) discuss the limitations and challenges of collecting and analyzing attitude data from respondents. Despite the natural limitations of attitude and opinion measurement and analysis, in the context of this exploration into the meaning and composition of utility applied to geospatial information used within GIS, this is a valid and useful approach, as users' attitudes about what constitutes something of utility to them will be based on 'educated' or 'experienced' opinion.

Mitchell (1999) addresses the conduct, utility and limitations of factor analysis in psychological research.

In the section on attitude measurement, Thurstone (1959) cautions researchers about creating bias through the way in which attitude questions are framed. However, he affirms the validity of the techniques. Rea and Parker (1992) discuss the steps in moving from the draft survey instrument to the pretest (called the pilot study in this research), wherein three factors are critical: questionnaire clarity, comprehensiveness, and acceptability. Osgood, Suci, and Tannenbaum (1957) identify a common criticism of attitude research: the inability to predict behavior. They point out that supporters of attitude research use the language of "a disposition toward certain classes of behaviors" (Osgood, Suci, & Tannenbaum, 1957, p. 198) as the benefit of this type of research.

Shaw and Wright (1967) evaluate several sample scales and attitude measurement constructs, which this research applies to the development of a Geospatial Information Utility survey instrument. The preliminary survey instrument is contained in Appendix A.

Data Collection Approach

As mentioned, the data collection approach is based upon a two-stage pilot study and a main study. Based upon guidance in Trochim (2001, p. 132), this research relied on survey questionnaires offered to selected respondents via email announcements (i.e., a link to the web-based survey) and URL postings on professional or trade association web sites. The respondents come from trade and professional association membership or listserv member lists. The rationale for this approach is the need to collect a large number of responses to cover the range of responses for all included factors. According to Hair, Anderson, Tatham, and Black (1998), a good rule of thumb is to get responses in a ratio of no less than five to ten times the number of factors involved in the study and, if possible, up to 20 times the number of factors. The respondent pool is based upon the estimate of reducing the factor set to 18-20 factors during the pilot study. Therefore, using a ratio of 6-10 times the number of factors (Kerlinger, 1978) as the target respondent number, with a high-end estimate of 20 final factors resulting from the pilot study, this research needs no less than 120 and prefers 200 respondents to the main study survey. Then, given historically modest response rates to mailed-out surveys, this research uses a best-case 7.5% estimated response rate and must approach no less than 3,000 potential respondents. According to Somers, Nelson, and Karimi (2003) and Rai, Lang, and Welker (2002), Somers et al.'s 12.19% response rate to their mail-out survey was typical. Response rate remains a concern.
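The respondent-pool arithmetic above can be sketched as follows; the function name is illustrative, while the factor count, ratios, and 7.5% response rate are the figures stated in the text.

```python
# Respondent-pool planning from the responses-per-factor rule of thumb
# (Hair, Anderson, Tatham, & Black, 1998; Kerlinger, 1978).
def plan_sample(n_factors, ratio_low=6, ratio_high=10, response_rate=0.075):
    """Return (minimum responses, preferred responses, contacts to approach)."""
    min_resp = n_factors * ratio_low       # lower bound on usable responses
    pref_resp = n_factors * ratio_high     # preferred response count
    contacts = pref_resp / response_rate   # pool size at the assumed response rate
    return min_resp, pref_resp, round(contacts)

# High-end estimate of 20 final factors at a best-case 7.5% response rate:
print(plan_sample(20))  # -> (120, 200, 2667), i.e., approach roughly 3,000
```

The rounded contact figure of 2,667 is consistent with the "no less than 3,000" planning target once attrition and undeliverable addresses are allowed for.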

Data Analysis Approach

Harman (1968) provides a useful treatment of the factor analysis to be used in this research, pointing out the different ways this multivariate method is useful. According to Harman (1968, p. 145), "…a principal aim is to attain scientific parsimony or economy of description." As mentioned before, the candidate factor set will be reduced through the pilot study and the survey instrument re-drafted.

Oppenheim (1966) discusses the reliability of the Likert scale and supports its use despite its limitations (e.g., no interval or ratio measurements are possible). As mentioned, this research uses a 7-point Likert scale.

Doll and Torkzadeh (1988) provide an excellent example of survey instrument-building research in the related field of End-User Computing Satisfaction (EUCS). One could argue that measuring end-user computing satisfaction is somewhat akin to measuring GIS user attitudes towards the quality and utility of the data they ingest into or output from their GIS. Doll and Torkzadeh (1988) conducted an exploratory factor analysis, used their pilot study to modify their instrument, examined many forms of validity, and assessed the reliability of the instrument. Somers, Nelson, and Karimi (2003) provide an excellent example of how the Doll and Torkzadeh EUCS instrument is applied to a specific research problem or domain, in this case, surveying enterprise resource planning (ERP) users.

According to Hair, Anderson, Tatham, and Black (1998) and Harris (2001), a component factor analysis can be used for either data summarization or data reduction. "From the data summarization perspective, factor analysis provides the researcher with a clear understanding of which variables may act in concert together and how many variables may actually be expected to have impacts in the analysis" (Hair, Anderson, Tatham, & Black, 1998, p. 96).

Data analysis will include descriptive statistics and demographic statistics summarizing the respondent pool, showing their distribution by industry and their position within their industry. Through the factor analysis, a rotated factor matrix showing the correlations between the survey items within factor categories is the basis for developing a correlation matrix of all measures, to include the means and variances by item. A concern is being able to adequately control or characterize the nature of GIS use by respondents. Doll and Torkzadeh found their 618 respondents were reporting on about 250 different applications hosted on their PC-based workstations. However, they were still able to establish good reliability. Because the total set of possible GIS uses is a much smaller set than the response options available to Doll and Torkzadeh (1988) respondents, this research does not foresee a problem with respondents' GIS use variability, expecting far fewer than the 250 different applications represented in the Doll and Torkzadeh (1988) response set.


A final issue is statistically (analytically) dealing with non-respondents. Both not having enough respondents to adequately perform the factor loading and increased bias from high proportions of non-responses must be dealt with. In the previous section on data collection, the issue of minimizing non-responses was addressed from a mechanics point of view (e.g., using follow-up reminders to respondents via email, postcard, etc.). The effect of missing data is complicated by the degree of randomness of the missing data. Van Belle (2002) identifies three degrees of 'missingness': missing completely at random (MCAR), missing at random (MAR), and non-ignorable missingness, in which the data analysis results are affected by the missing data. For this research, the persistence with which non-respondents are pursued will depend on the nature of the missing data.
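The three mechanisms can be illustrated with a small simulation; the variables and probabilities below are invented for illustration and are not drawn from the dissertation's data. The complete-case mean stays unbiased under MCAR, but drifts once missingness depends on an observed covariate (MAR, unless that covariate is adjusted for) or on the missing value itself (non-ignorable).

```python
import random

random.seed(7)
# Each simulated respondent: an observed age, and GIS experience correlated with it.
people = []
for _ in range(20_000):
    age = random.uniform(22, 60)
    exp = max(0.0, (age - 22) * 0.5 + random.gauss(0, 2))
    people.append((age, exp))

full_mean = sum(e for _, e in people) / len(people)

# MCAR: a flat 30% of experience answers are missing, independent of everything.
mcar = [e for _, e in people if random.random() > 0.30]

# MAR: missingness depends only on the OBSERVED age (younger respondents skip
# the item more often), not on the experience value itself.
mar = [e for a, e in people if random.random() > (0.6 if a < 35 else 0.1)]

# Non-ignorable: missingness depends on the unobserved value itself
# (highly experienced respondents decline to answer).
noni = [e for _, e in people if random.random() > (0.6 if e > 12 else 0.1)]

mcar_mean = sum(mcar) / len(mcar)   # close to full_mean: ignorable
mar_mean = sum(mar) / len(mar)      # biased upward unless age is adjusted for
noni_mean = sum(noni) / len(noni)   # biased downward: non-ignorable
print(round(full_mean, 2), round(mcar_mean, 2),
      round(mar_mean, 2), round(noni_mean, 2))
```

Under MCAR the complete-case results can be trusted; under the other two mechanisms, how aggressively non-respondents are pursued (or how the missingness is modeled) matters, which is the point of the passage above.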

Overview of Respondent Sources and the Sample Pool

GIS (and the other geosciences technologies) have become near-ubiquitous technologies. At the 'low-end', web applications such as mapquest.com, mapblast.com, googlemap.com, and randmcnally.com, among others, make rudimentary mapping and map analysis technologies available to anyone with internet access. Likewise, mapping functions found as add-ins in basic workplace productivity software (e.g., MS Excel™) provide relatively easy-to-use mapping functionality to office workers seeking to produce geographically-oriented displays from their business data. These users of mapping-like software applications are, strictly speaking, not GIS users for this study's purposes. These so-called "low-end" applications and their user pool are too large and diverse to be useful, and are thus beyond the scope of this research. Further, casual users of geospatial data using these low-complexity applications most likely do not have the necessary background or understanding of "true" GIS functionalities and capabilities to be able to address geospatial data quality and utility issues well. This research considers GIS to be built-for-purpose software applications (e.g., Environmental Systems Research Institute's Arc-GIS family of GIS software applications) or dedicated hardware/software workstation combinations. The users of these kinds of GIS are the targets for this research.

Just as there are many different uses for GIS-supported analyses and decision making, there are likewise many different kinds of GIS users and stakeholders. To provide some control over user credibility, this research makes an a priori generalization about user 'type' based on the source. This research will use organizational sources such as the Institute of Electrical and Electronics Engineers (IEEE) Geoscience and Remote Sensing Society (GRSS) as a source of theoretical (i.e., academic and professional) users/stakeholders. The IEEE GRSS website (http://www.grss-ieee.org/menu.taf?menu=aboutus) gives this description of its membership (disclaimer: I am also a member):

"Members of GRSS come from both engineering and scientific disciplinary backgrounds. Those with engineering backgrounds often support geoscientific investigations with the design and development of hardware and data processing techniques, thereby requiring of them familiarity in areas such as geophysics, geology, hydrology, meteorology, etc. Conversely, discipline scientists find in GRSS a forum for the dissemination and evaluation of remote sensing related work in these areas. This fusion of geoscientific and engineering disciplines gives GRSS a unique interdisciplinary character and an exciting role in furthering remote sensing science and technology."

Secondly, this research will also target commercial practitioner sources such as the Open Geospatial Consortium (OGC) to reach practicing GIS users/stakeholders and the system or data vendors that support them. Finally, this research will approach local engineering and technology organizations or firms involved with GIS-supported analyses and decision making. Many local defense contractors have organizations performing GIS analyses for government customers. There are also several commercial firms that provide GIS analytical, data production, or evaluation support to other industries.

The survey instrument contains sufficient demographic questions to elicit user/stakeholder type characterization as a control for the response set.

Validity Issues and Concerns

A critical aspect of the research design and of the survey instrument design is the concept of validity. This research is concerned first with the validity of the geospatial information utility construct as posited in the basic research model. The soundness of this model is critical to final development and validation of a useful, validated survey instrument. It follows that a critical part of any survey-building research is the instrument validation process. The pilot test constitutes a pre-test of the instrument, aiding in both content and face validity (Ali, 2005). Construct validity, the degree to which the research design reflects reality, is addressed through the literature review and particularly through the use of expert panels (Lawshe, 1975), or in this case through the use of the focus groups in the pilot study. Further, in the analysis of the main study data, a modified multi-trait multi-method (MTMM) correlation matrix will provide a mechanism for further assessing construct validity. Two specific construct validity concerns are convergent and discriminant validity (D. Straub, Boudreau, & Gefen, 2004; Trochim, 2001). Convergent validity addresses the degree to which an operationalization reflects (or converges on) other related constructs. Discriminant validity addresses the degree to which the operationalization does not converge on dissimilar constructs that it should reflect as being different. As the prevailing information quality paradigm is to cluster all factors uncovered in the literature within the broad category "quality", this research seeks to parse this previously unitary construct (quality) into two components (quality and context) of the new second-order construct, utility. The MTMM correlation matrix will permit assessment of convergent and discriminant validity by examining the clustering and variability of the correlations.
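A minimal numeric sketch of this kind of correlation-matrix check follows; the item names and groupings are hypothetical, not the actual survey items. Items measuring the same construct should correlate strongly with each other (convergent validity) and weakly with items measuring the other construct (discriminant validity).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
quality = rng.normal(size=n)   # latent quality-based construct
context = rng.normal(size=n)   # latent context-based construct

# Two items per construct: each is the latent score plus measurement noise.
items = np.column_stack([
    quality + 0.4 * rng.normal(size=n),  # q1
    quality + 0.4 * rng.normal(size=n),  # q2
    context + 0.4 * rng.normal(size=n),  # c1
    context + 0.4 * rng.normal(size=n),  # c2
])
R = np.corrcoef(items, rowvar=False)

within = (R[0, 1] + R[2, 3]) / 2      # same-construct (convergent) correlations
between = np.abs(R[:2, 2:]).mean()    # cross-construct (discriminant) correlations
print(round(within, 2), round(between, 2))  # within should dominate between
```

In MTMM terms, a "matrix violation" is a cell where this ordering reverses, i.e., a cross-construct correlation exceeding a same-construct one.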

As mentioned earlier, with many different types of geospatial data users and stakeholders in the field, reaching a broad variety of them in a meaningful, representational way may pose a limitation of this study, limiting its generalizability or external validity. This limitation is addressed by using multiple types of sources (e.g., professional and trade associations and domain-specific listservs) to populate the sample pool. To provide respondent characterization and control, demographic questions are included in the general information section of the survey to identify the breadth of the survey respondents by user type. To make the results more generalizable, the study will seek to reach the broadest set of GIS user types possible.

Doll and Torkzadeh (1988) review validity and reliability issues in survey instrument-building research, pointing out the weakness of ambiguous factor terms. They also point out, in the case of Ives, Olson, and Baroudi (1983), a survey instrument that has gained wide acceptance and use but that has experienced some criticism over the reliability impacts of a small sample size relative to the factor count (a 7:1 ratio) and the use of indirect users. This research must also deal with indirect users (people who have a stake in the use of GIS or geospatial data and information) and must account for their effects. On the first count, sample size, this research is targeting a response-to-factor ratio of 10:1. On the indirect user issue, this research will collect demographic information to determine whether a respondent's proximity to direct GIS use (e.g., a manager is indirect, a user is direct) has a moderating effect on survey responses.

Straub et al. (2004) provide an excellent treatment of validity in IS research, based on a supposition that rigor in information systems suffers from a field-wide inattention to validity issues. Their paper provides many valuable heuristics connecting validities to analytical techniques. For example, MTMM can be used for discriminant validity where there is a relatively low number of matrix violations. For internal reliability, Cronbach's alpha should be above .60 for exploratory research such as this. For convergent validity, the criteria are Eigenvalues of at least 1 and loadings of at least .40 (.50 preferred) (Hair, Anderson, Tatham, & Black, 1998); items that do not load properly will be dropped.
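Cronbach's alpha is straightforward to compute directly from the item score matrix; a sketch follows, using simulated 7-point Likert responses for a hypothetical four-item scale (the data are illustrative, not the study's).

```python
import numpy as np

def cronbach_alpha(items):
    """Cronbach's alpha for an (n_respondents, k_items) array of scored responses."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1).sum()  # sum of individual item variances
    total_var = items.sum(axis=1).var(ddof=1)    # variance of the summed scale
    return k / (k - 1) * (1 - item_vars / total_var)

# Four simulated 7-point Likert items driven by one latent attitude:
rng = np.random.default_rng(1)
latent = rng.normal(4, 1, size=300)
resp = np.clip(np.round(latent[:, None] + rng.normal(0, 0.8, size=(300, 4))), 1, 7)
alpha = cronbach_alpha(resp)
print(round(alpha, 2))  # comfortably above the .60 exploratory threshold
```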

Expected Results

This research is expected to eliminate several preliminary factors, either through the pilot study/pre-test or through insufficient factor loading in the main study. Further, this research is expected to find that the factors explored will successfully cluster as either quality- or context-based factors. Finally, though this research may have some difficulty with respondent variability, it is expected that respondent experience level will be a moderating factor on the responses received. Table 3.3 depicts the preliminary expectation for the distribution of the 28 candidate factors as loaded as context-oriented (use-based) or quality-oriented (attribute-based), or as likely to be deleted through the pilot study:

[Table 3.3 bins the 28 initial candidate factors as context-oriented (use-based), quality-oriented (attribute-based), or likely to be deleted, and marks with an asterisk the ten factors that occur most frequently in the literature. The candidate factors are: Accessibility; Accuracy; Amount of information; Clarity; Coherence; Completeness; Comprehensibility; Comprehensiveness; Conciseness; Consistency; Currency (including timeliness); Interactivity; Interpretability; Minimum redundancy; Objectivity; Precision; Reduction of complexity; Relevance; Reliability; Representation; Security; Transparency; Trustworthiness (e.g., lineage); Understandability; Usability; Usefulness; Validity; Value-added context.]

Table 3.3 Preliminary Estimate of the Binning of the Initial Candidate Factor Set into Context- or Quality-oriented Factors

Factors annotated with an asterisk in Table 3.3 above are the top-10 factors that occur most frequently in the literature. Each of these factors is defined in the glossary contained in Appendix B. A model labeling convention, used herein for the purposes of this research only, is shown in Figure 3.2.

[Figure 3.2 diagrams the labeling convention: the concept of interest sits atop its contributing (1st-order) constructs, which in turn have subordinate (2nd-order) constructs.]

Figure 3.2 Conventions for labeling the General Research Model

It is important to note that the factor analysis techniques used in this research, based heavily upon Doll and Torkzadeh (1988) and Hair, Anderson, et al. (1998), explicitly seek to load the dimensions shown above in Table 3.3, which are defined here as the 2nd-order constructs in the research model, upon the 1st-order constructs posited in the research model. In some research, the items are loaded to characterize the dimensions (constructs) of the lower part of the model. In this research, survey items were specifically developed to identify these loadings as a means to appropriately bin the 2nd-order dimensions on the 1st-order factors (constructs) extracted from the analysis.


Limitations

Limitation 1: Sampling limitations based on GIS user type variability. As there are many (tens or hundreds of) types of GIS users and many millions of GIS users, this research is limited in how representative of all GIS users the sample can be, which in turn limits the generalizability of the research.

Limitation 2: Sampling limitations based on respondent variability. As the pool of GIS users and stakeholders is widely diverse due to user background experiences and education, GIS users of the same type (e.g., agricultural modeling) but different experience and education backgrounds may give widely different responses, confounding data correlation when clustering the factors.

Limitation 3: Geospatial data construct. The geosciences field is complex and rapidly changing. The components that comprise any research-structured data construct must necessarily face scope limits.


Chapter 4: Data Analysis and Results

Overview

This chapter provides the results from both the pilot and main studies. This chapter includes pilot study results, characterization of the sample population and descriptive statistics from the main study, a discussion of main study results, and a discussion of instrument validation results.

As described in the previous section, a pilot study and a main study were conducted. The pilot study was conducted in the form of a focus group of industry professionals and managers to focus on the items to be surveyed and the refinement of the language used to define the items. The main study was conducted as a survey of GIS industry practitioners, managers, and in some cases academics. The use of these two studies in tandem permitted an MTMM analysis.

The main study was conducted via a web-based survey located on the George Washington University web survey server (https://www.gwu.edu/~survey/index.cfm?SURVEY_ID=5810). As mentioned in previous chapters, the outcome of this research is a GIS user-defined binning of geospatial data and information factors (dimensions) into the two constituent parts of geospatial utility, as posited in the model in the previous chapter: quality-based static factors and context-based dynamic factors. The method used to achieve this examination of geospatial utility factors was exploratory factor analysis. This method permits development and validation of the proposed survey instrument, which is the essence of this research: to develop a survey instrument focused on identifying context-based factors as a means of exploring Utility as a construct comprised of Quality- and Context-based factors.

Discussion of Pilot Study

The pilot study was conducted as two focus groups with current GIS professionals. The focus group format was chosen in order to interview the pilot study participants with respect to factor selection (inclusion and exclusion) and item wording (clarity). The Appendix B Glossary, though based on GIS community literature, was specifically discussed during the pilot study, both in terms of common usage of the selected terms and to eliminate confusion between factor definitions and item purposes (i.e., the type of response desired or the purpose for including any given item). As discussed in the previous chapters, several of the factors derived from the literature review reflect slight differences in meaning, or presume nuanced differences in the approaches or interests of prior information quality frameworks' authors. While these subtle differences in meaning (terminology) or purpose were certainly important to prior researchers, in the context of vetting factor definitions and item wordings and purposes with the study group participants, it became obvious there is some modicum of skepticism amongst practitioners about the practicality to them of the nuances of language used in academic-centric research. In each instance of the focus group, time at the beginning of the session was necessarily spent on overcoming a practitioner-academic language hurdle.

In addition to clarifying language in the items, the pilot study also helped reduce the survey instrument from 28 dimensions (candidate factors to be binned into either quality or context factors) to 19 dimensions to survey explicitly and from 92 items to 75 items.

Characterization of Main Study Sample Population

Announcement emails were sent to trade (e.g., Open GIS Consortium (OGC)) and professional (e.g., IEEE Geoscience and Remote Sensing Society (GRSS)) associations, to GIS-related listservs (e.g., GIS-Café, GIS Talk, and GeoList), and directly to GIS-related companies (456 email addresses, representing 178 companies). After three weeks, 137 respondents had completed the survey. As the GWU web survey tool requires all submissions to be complete (when set up that way, it does not permit unanswered questions), there were no missing data points for the cases collected. However, 137 respondents was considered a low response rate, so reminder emails and additional emails to GIS-related companies were sent out in a second wave, with an added line in the announcement email emphasizing the request for recipients to forward the announcement to other GIS professionals in their organizations. This constituted adoption of a "snowball" sampling methodology in order to increase the total number of respondents while staying within the target population.

It is difficult to precisely calculate the response rate, as the exact number of potential respondents is not known. However, some organization sizes and listserv memberships are known. In the previous chapter, the stated sample size goal was to reach 3,000 GIS professionals in hopes of getting 200-300 respondents. Given published email list memberships and the use of organizational contact emails at GIS firms around the globe, it is likely that more than 3,250 potential respondents were reached, either in the original mass mailing or in the follow-up and expansion to additional GIS companies. Using the low-end estimate of 3,250 professionals reached and the 163 responses received, an estimated response rate of at most 5.0% was achieved. Kaplowitz, Hadlock, and Levine (2004) note that multimode contact (outreach) methods improve web response rates. When researching 19,890 students at Michigan State University, they found web/email-only response rates greater than 20%, as compared to mail survey response rates of 31.5%. They also noted higher response rates correlated with younger respondents. Further, their population was organizationally homogenous (i.e., a university student body) and very youthful (70% of respondents under the age of 24). Contrasted with their results, the achieved response rate of ~5% for this research was disappointingly low. However, the respondent population in this research is likely more diverse than a university student body, certainly more geographically dispersed, and considerably older given the demographic data in Table 4.2.

Additionally, because IEEE is an international professional organization, and through direct contact with GIS-related companies, it was possible to know something about the companies (e.g., services provided, where headquartered or principally located, size, and customer focus). The sample pool was very international in flavor. Email announcements were sent to companies or organizations in 32 countries; however, the US, India, UK, Canada, and Australia predominated. It is likely the majority of the respondents are from the US. However, country of origin was not a demographic factor surveyed.

Demographic data of two types were collected: (1) Attributes about the respondents (e.g., age, education level, industry, work level), (2) Attributes about the respondents’ GIS analysis experiences (e.g., estimated annual workload, estimated number of analyses affected by geospatial data or metadata quality or utility issues, specifically as pertains to characterization of analyses cancelled or modified because of data quality issues).

Summary characteristics of the respondent pool are included in Table 4.1 below.


Table 4.1 Summary Statistics about the Respondent Pool

AGE: >45: 35 (21.5%); 40-44: 39 (23.9%); 35-39: 30 (18.4%); 30-34: 30 (18.4%); 25-29: 25 (15.3%); 20-24: 4 (2.5%)

EDUC: Doctoral: 11 (7.0%); Masters: 80 (51.0%); Bachelors: 66 (42.0%); Associates: 0 (0.0%); No Resp: 0 (0.0%)

YRS EXP: >25: 5 (3.1%); 20-24: 5 (3.1%); 15-19: 44 (27.0%); 10-14: 59 (36.2%); 5-9: 39 (23.9%); 0-4: 11 (6.7%)

ORG LEVEL: Senior Mgt: 2 (1.2%); Middle Mgt: 27 (16.6%); Line Supervisor: 16 (9.8%); Entry level: 8 (4.9%); Other prof staff: 110 (67.5%); Other: 0 (0.0%)

DOMAIN: Agriculture: 22 (13.5%); Civil Engineering: 32 (19.6%); Defense or Intell: 8 (4.9%); Emerg Mgt: 0 (0.0%); Environmental: 36 (22.1%); Health, Planning or Admin: 1 (0.6%); Housing: 0 (0.0%); Human Services: 0 (0.0%); Land Use Plan: 11 (6.7%); Law Enforcement: 7 (4.3%); Parks/Recreation: 0 (0.0%); Tax Planning: 0 (0.0%); Telecomms: 0 (0.0%); Transportation: 1 (0.6%); Other: 45 (27.6%)

SECTOR: Public - Federal: 0 (0.0%); Public - State: 39 (23.9%); Public - Local: 38 (23.3%); Private - Data User: 57 (35.0%); Private - Data Prov: 10 (6.1%); Academic: 8 (4.9%); Other: 9 (5.5%); No response: 2 (1.2%)

Given the federal government's interest in digital mapping and the use of GIS, it is a curious fact that no survey respondents identified themselves as part of the federal workforce. However, this curiosity may be an artifact of one of three conditions:

(1) There is a natural inclination to expect a noticeable or significant number of federal employees in the respondent pool because the US federal workforce is located so close to the university that sponsored this research. However, any bias towards expecting federal workers in detectable numbers may be ameliorated by the non-local (i.e., national and international) way in which survey respondents were solicited via emails and the web-based survey instrument.

(2) The demographic questions at the front of the survey do not explicitly allow for the condition where a respondent is employed by a private organization but works directly for a government agency. In fact, in the geospatial sciences field, most federal agencies or departments that use GIS in the conduct of their missions do so with a high percentage of outsourced contractors located either on-site or off-site. It remains a very real possibility that contractor-employed GIS users responded to the demographic sector-of-employment questions by identifying with the organization that legally employs them rather than with the federal agencies or departments that contractually (and functionally) guide their work output.

(3) Trade associations, professional web resources (e.g., GIS Café), and mailing lists of commercial GIS organizations were the foundation of the outreach to solicit survey respondents. It is possible these resources are more attractive to industry professionals or their parent organizations, which depend on networking for the visibility useful in finding new work, than to government workers who do not "need" that visibility because they do not face the "find new work" paradigm. If few government employees were reachable via the outreach mechanisms used in this research, then given the low (~5%) response rate, it is entirely believable that none would respond to the repeated survey solicitation requests.

However it occurred, no survey respondents identified themselves as belonging to the federal workforce. Therefore, it is a limitation of the research, affecting generalizability, that federal GIS users may have different geospatial data and information context and utility needs, and thus these results may not adequately reflect their needs or desires for characterizing geospatial data and information within the posited model. In follow-on research, detailed demographic responses about employer–contractor–customer relationships will be elicited to ensure functional sector or domain representations are not missed or misrepresented.


When comparing respondents' ages versus their years of experience, nothing unexpected about how people progress in the GIS field was detected. A distribution of years of experience with GIS or geospatial data versus respondents' age is shown in Table 4.2:

AGE \ YRS EXP    0-4    5-9   10-14   15-19   20-24    >25   Total
20-24              4      -       -       -       -      -       4
25-29              7     18       -       -       -      -      25
30-34              -      8      20       2       -      -      30
35-39              -     10      16       3       1      -      30
40-44              -      1       7      28       2      1      39
>45                -      2      16      11       2      4      35
Total             11     39      59      44       5      5     163

Table 4.2 Distribution of Respondents' Ages versus Years of Experience with GIS

Respondents' ages and years of experience are interesting insofar as the more experienced a respondent is in the field, the more likely it is they will have encountered different aspects of the data and metadata quality and utility issue as it pertains to their ability to successfully complete their GIS-based analyses. Conversely, the more junior a professional is in the field, particularly the youngest ones with 0-4 years of experience, the more likely it is they perform a more limited set of analysis types, or work on a more limited set of domain problems. The problem of satisfactorily completing assigned GIS-based analyses based on available data and metadata was reported by nearly all respondents. To the four "experience" questions about cancelled or substantively modified GIS-based analyses due to limited availability or substandard data and metadata, some surprising results were obtained:

% of GIS analyses cancelled due to data availability:                      10%
% of GIS analyses substantively modified due to data availability:         25%
% of GIS analyses cancelled due to substandard data quality:                8%
% of GIS analyses substantively modified due to substandard data quality:  19%

Table 4.3 Summary of Responses to Four Experience Questions About Cancelled or Modified GIS Analyses

These results themselves are important insofar as they highlight the data quality/data utility problems GIS practitioners face.

Overview of the Main Study

The survey instrument as shown in Appendix A was administered as described above. The 75 items proved too many; however, one aspect of exploratory factor analysis is to reduce the dimensionality of the data, and that proved an essential part of this research. While all survey responses were collected anonymously, several respondents elected to email additional textual comments after completing the survey. Some were complimentary and others disapproved of the rigid wording used in some of the items. In one email, the respondent reported perceiving the 1-7 Likert scale as not being useful, believing many of the questions could have been reduced to binary "yes or no" question-responses, though no specifics about which questions could have been asked differently were provided. The Likert scale, as administered, permitted these seven responses:

7  Strongly agree
6  Moderately agree
5  Mildly agree
4  Neutral
3  Mildly disagree
2  Moderately disagree
1  Strongly disagree

Table 4.4 Labeling of Likert Scale Response Categories

Appendix C provides a summary of the survey questions contained in the administered instrument, including the field codes used in the SAS JMP 5.1 software. Appendix D

provides distributions of responses to all 75 survey items. For each of the 19 dimensions queried for categorization in the factor analysis, there was one criterion question. These criterion questions directly addressed context-based, dynamic (variable) responses based upon dynamic GIS analysis needs. That is, based upon the basic research model, it is expected that for several of the 19 surveyed dimensions, for the context/use-based factor, the actual attribute values desired (i.e., what should be in the mind of the respondent when answering the question) will vary based upon the specifics of any given GIS analysis assignment. These questions generally took the form, "My needs for _____ (fill in each of the 19 dimensions being examined) change based upon my changing GIS analysis needs." These 19 questions form the core of the analysis to differentiate between quality-based and context-based factors. Further, there are 18 other key questions, which when added to the 19 core questions make up 37 items as the first iteration of reducing the original 75-item survey instrument to a more manageable size. Through the factor analysis described in this dissertation, the 37-item instrument was further refined and reduced in scope to a final 20 items.

Discussion of Main Study Results

The pilot study reduced 28 dimensions and 92 items to 19 dimensions and 75 items. Using the sample of 163 responses, the data were analyzed; principal component analysis served as the extraction technique, and VARIMAX provided the rotation method.
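In outline, that extraction-and-rotation step amounts to an eigen-decomposition of the item correlation matrix followed by an orthogonal rotation of the leading loadings. The sketch below is illustrative Python using only NumPy; the hand-rolled varimax routine and the random placeholder matrix (standing in for the real 163 × 37 response data, which were analyzed in SAS JMP 5.1) are assumptions for illustration, not the procedure actually run in this research.

```python
import numpy as np

def varimax(loadings, gamma=1.0, max_iter=100, tol=1e-6):
    """Orthogonal varimax rotation of a p x k factor-loading matrix."""
    p, k = loadings.shape
    R = np.eye(k)
    d = 0.0
    for _ in range(max_iter):
        L = loadings @ R
        # SVD of the gradient of the varimax criterion
        u, s, vt = np.linalg.svd(
            loadings.T @ (L ** 3 - (gamma / p) * L @ np.diag((L ** 2).sum(axis=0)))
        )
        R = u @ vt
        d_new = s.sum()
        if d_new < d * (1 + tol):
            break
        d = d_new
    return loadings @ R

# Placeholder data standing in for the real 163 x 37 response matrix.
rng = np.random.default_rng(0)
X = rng.normal(size=(163, 37))
Rcorr = np.corrcoef(X, rowvar=False)        # 37 x 37 item correlation matrix
eigvals, eigvecs = np.linalg.eigh(Rcorr)    # eigh returns ascending order
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
k = 2                                        # two principal factors extracted
unrotated = eigvecs[:, :k] * np.sqrt(eigvals[:k])  # principal-component loadings
rotated = varimax(unrotated)                 # 37 x 2 rotated pattern
```

Because the rotation is orthogonal, each item's communality (sum of squared loadings) is unchanged by the rotation; only the distribution of loading across the two factors changes.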

An important design and analysis decision when performing an exploratory factor analysis is the choice of how many factors to extract and upon which to load the items being analyzed. The University of Texas Research Consulting web site 9 approaches this objectively, providing five methods in their Frequently Asked Questions (FAQ): (1) select only those factors with eigenvalues greater than 1.0; (2) examine a scree plot (which graphically plots the eigenvalue for each factor number) and select the number of factors that present to the left of the “elbow” in the curve (i.e., once the plot flattens, there is no useful gain in variance explained by increasing the number of factors); (3) decide a priori what amount of variance explained is sufficient for the study at hand, then during analysis select the number of factors that satisfies this threshold; (4) use the low-error approach, which advises continuing to extract factors as long as the residual values for each new factor are greater than .10; and finally, (5) use a chi-square test, though this approach has several limitations or constraints. Both Doll and Torkzadeh (1988) and Hair, Anderson, et al. (1998) take a more subjective approach to the number of factors to extract, expecting the researcher to make research-based or model-based judgments that simultaneously consider percent variance explained while also considering the content and construct validity issues inherent in making sure the number of factors extracted comports with the model under study and is thus theoretically explainable. For example, Doll and Torkzadeh (1988) began their research by extracting three factors because that was the number that presented eigenvalues greater than 1.0; however, they found labeling three extracted factors too ambiguous, whereupon they increased the number of extracted factors to five, resulting in a structure that was more interpretable.
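The first three of those rules of thumb reduce to simple arithmetic on the eigenvalues of the item correlation matrix. A minimal Python sketch, using a hypothetical set of eigenvalues for a 10-item instrument:

```python
import numpy as np

def kaiser_count(eigvals):
    """Method (1): retain factors with eigenvalue greater than 1.0."""
    return int(np.sum(np.asarray(eigvals) > 1.0))

def cumulative_variance(eigvals):
    """Method (3): cumulative proportion of variance explained per factor count."""
    ev = np.sort(np.asarray(eigvals, dtype=float))[::-1]
    return np.cumsum(ev) / ev.sum()

# Hypothetical eigenvalues from a 10-item instrument (not this study's data)
eigvals = [4.2, 2.1, 0.9, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1]
print(kaiser_count(eigvals))             # 2 factors under the Kaiser rule
print(cumulative_variance(eigvals)[:2])  # proportion explained by one and two factors
```

Under the Kaiser rule these hypothetical data yield two factors, which together explain 63% of the variance; the scree “elbow” of method (2) would be read off a plot of the same sorted eigenvalue sequence.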
Likewise, to quote Hair, Anderson, et al. (1998, p. 114), “…the ability to assign some meaning to the factors, or to interpret the nature of the variables, becomes an extremely important consideration in determining the number of factors to extract.”

9. http://www.utexas.edu/its/rc/answers/general/gen16.html (retrieved 4/15/2007)


Based on the basic research model, content and construct validity, together with the competing validities (convergent and discriminant validity), drove this research to a preliminary, default number of higher-order factors to extract (initially and finally two extracted factors, though three and four extracted factors were also examined), based upon the two theoretical branches of the geospatial utility model. This choice permitted one extracted factor for Quality-based dimensions (i.e., those quality- or context-based dimensions that “prefer” (load on) static attribute values, the needs for which are not expected to change much over a wide range of GIS-based analyses) and another for Context-based dimensions (i.e., those quality- or context-based dimensions that “prefer” (load on) dynamic attribute values that are expected to change across a wide range of GIS-based analyses). Though the model seems most theoretically “pure” with just two factors extracted, based on guidance contained in Doll and Torkzadeh (1988) and Hair, Anderson, et al. (1998), three and four extracted factors were also considered and rejected.

Thus, the exploratory factor analysis conducted in this research successfully served two purposes:

(1) It provided the mechanism to bin respondents’ inputs regarding their use of, or need for, the surveyed dimensions of geospatial data and information used most commonly in their GIS analyses into principal factors based upon the correlated loadings of their collective responses.


(2) It permitted further modification of the survey instrument through rejection of dimensions that did not adequately load on the principal factors extracted as described previously. The decision to use two principal factors, vice three or four, was predicated first on the model being explored and secondly on the distribution of the dominant factors extracted as additional principal factors were either included or excluded. At two extracted factors, a total of 60% of the total variance is explained by the model. The marginal increase in total variance explained (an extra 3–5%, for three or four factors, respectively) came at a cost of increasing ambiguity in the dimensions contributing to the model. There were slight statistical improvements at the cost of decreased semantic clarity. Upon reviewing the data, the choice to revert to two principal extracted factors made the most sense given the model and the research design.

Table 4.5, the rotated factor pattern matrix for the 37-item instrument, follows. Analysis of this table reveals the clear positive correlations between the items that load on either factor 1 (the Context factor) or factor 2 (the Quality factor). Those items that correlate positively with either the Context or the Quality factor mostly also exhibited negative correlations for the opposite factor. These results agree with the model as well. Given the definitional construct of the model, it is unlikely that an item can load strongly positive on more than one principal factor. This result is somewhat more clear-cut than what Doll and Torkzadeh (1988) observed: due to some multiple item-factor loadings, they observed some ambiguity, with certain items loading simultaneously on multiple principal factors. The factor loadings for the two principal factors extracted from the 37 items are shown in Table 4.5:

ITEM CODE   CONTEXT   QUALITY
RLV3        0.8854    -0.0632
CPT3        0.8834     0.1285
CMP3        0.7981     0.1662
PRE3        0.7971    -0.1339
CNS3        0.7881     0.0893
INT3        0.7715    -0.2041
PRE6        0.7693    -0.1438
CUR9        0.7505    -0.1835
REP3        0.7070    -0.2106
USF3        0.7066    -0.2647
COH3        0.7061     0.0454
OBJ3        0.6645     0.0161
SCY3        0.6586    -0.2123
REL3        0.6580    -0.1736
USA3        0.6514    -0.1640
CUR4        0.5996    -0.0396
TRS6        0.5729    -0.1720
ACU5        0.5705     0.1210
ACU7        0.4941    -0.0266
RLV1        0.0382     0.7651
CPT1       -0.3052     0.7580
CNS1       -0.3584     0.7523
INT1       -0.1762     0.7224
COH1       -0.2444     0.7074
CMP1       -0.1063     0.6734
USF1       -0.2572     0.6532
REL1       -0.1525     0.6324
TRS1        0.1058     0.5782
CUR1        0.1717     0.5592
PRE1        0.0181     0.5187
PRE4       -0.3896     0.5029
OBJ1        0.1508     0.4938
SCY1        0.0267     0.4851
REP1       -0.0783     0.4762
ACU2       -0.0961     0.4669
AMT2        0.3070     0.4421
CUR6       -0.1235     0.3653

Table 4.5 Rotated Factor Pattern for the 37-item Instrument
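The claim that no item loads strongly positive on more than one principal factor can be checked mechanically. A small illustrative Python sketch, using three rows copied from Table 4.5; the function name and the |.40| cutoff are choices made here for illustration, not part of the original analysis:

```python
import numpy as np

def cross_loaders(codes, loadings, cutoff=0.40):
    """Flag items whose absolute loading exceeds the cutoff on more than one factor."""
    L = np.abs(np.asarray(loadings, dtype=float))
    return [c for c, row in zip(codes, L) if (row >= cutoff).sum() > 1]

codes = ["RLV3", "CPT1", "PRE4"]
loads = [[0.8854, -0.0632],   # loads cleanly on Context
         [-0.3052, 0.7580],   # loads cleanly on Quality
         [-0.3896, 0.5029]]   # the nearest thing to a cross-loading in Table 4.5
print(cross_loaders(codes, loads))  # [] at the .40 cutoff
```

At a looser |.30| cutoff, CPT1 and PRE4 would be flagged, which mirrors the observation above that items correlating positively with one factor often carry modest negative loadings on the other.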


Selecting two factors to rotate on was the logical starting point given the basic research model posited for this research. It is a truism of factor analysis that the fewer the factors the less the total variation explained by the factors, but the more distinct the differences between the factors. Conversely, if incrementally more factors are sought and rotated, the more variation is explained; however, the differences between the factors begin to diminish. This trend was observed in this research; after examining three and four extracted factors, this research reverted to two extracted principal factors.

Hair, et al. (1998) discuss both practical significance and statistical significance in assessing the significance of factor loadings. Factor loadings greater than ±.30 are considered acceptable, loadings of ±.40 are considered more important, and loadings greater than ±.50 are considered practically significant. In assessing statistical significance, a technique of Hair, et al. is to compute the statistical power, assessing factor loadings versus sample size. With a sample size of 163 and a level of significance (α) of 0.05, factor loadings over .45 are acceptable. From Table 4.5 above, all of the items listed except AMT2 and CUR6 are significant.
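That sample-size-based threshold can be looked up programmatically. The table below is transcribed from the commonly cited Hair, Anderson, et al. (1998) loading/sample-size guideline (α = .05, power = .80); treat the exact values as an assumption to be checked against the source:

```python
# Hair et al. (1998) guideline: minimum sample size needed for a factor
# loading of a given magnitude to be considered significant.
HAIR_GUIDELINE = {0.30: 350, 0.35: 250, 0.40: 200, 0.45: 150, 0.50: 120,
                  0.55: 100, 0.60: 85, 0.65: 70, 0.70: 60, 0.75: 50}

def min_significant_loading(n):
    """Smallest guideline loading considered significant at sample size n."""
    for loading in sorted(HAIR_GUIDELINE):
        if n >= HAIR_GUIDELINE[loading]:
            return loading
    return None  # sample too small for any guideline loading

print(min_significant_loading(163))  # 0.45
```

At n = 163 the lookup returns .45, matching the threshold applied above.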

The XXX3 items (e.g., REL3, CPT3, CMP3, and PRE3) are the core questions focused on determining respondents’ perceptions about the changing nature of these dimensions in their GIS analyses as their GIS analyses change over time. These are the context-oriented dimensions as reflected in the basic research model. These items loaded as expected going into the research, and as reflected in qualitative inputs provided during the pilot study. Conversely, the XXX1 items (e.g., RLV1, CPT1, CNS1, and INT1) tend to be the attribute-oriented quality dimensions, the responses to which tend not to vary dynamically over time, or at least much less so. The loading of these dimensions on the quality factor also occurred much as expected.

Whereas the rotated factor pattern matrix displays the per-item correlations as the items load on the principal factors, it is also important to examine the total variance explained by the model. Therefore, the next step was an examination of item communalities in order to consider total variance explained. This permits further refinement of the survey instrument through the elimination of items not contributing to the model in a significant way, as measured by the total variance explained. Table 4.6 below provides the communalities resulting from the 37-item instrument. As the original 37-item factor model initially accounted for only 46% of the total variance, the elimination of “low-performing” items was examined in order to improve total variance explained. This step permitted eliminating other items, further simplifying and focusing the final survey instrument. The 20 items retained after assessing the rotated factor pattern matrix and the table of communalities are indicated in Table 4.6. As recommended in Hair, Anderson, et al. (1998), a communality threshold of .50 was used to screen out those items not worth keeping in the final instrument. However, in four cases, variables with communalities less than .50 were retained: REL3, CUR4, ACU5, and ACU7. Hair, Anderson, et al. (1998) point out that the researcher must make the final decision about the items to retain based on the model, sometimes independent of the statistical responses of the items. These four items were retained because, in prior qualitative discussions with practitioners, accuracy (ACUx) and currency (CURx) were determined to be two of the most commonly cited dimensions. Further, the information quality literature supports this decision, as these dimensions are among the 10 most commonly cited dimensions in the information quality literature.
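The communality of an item is simply the sum of its squared loadings across the extracted factors; squaring and summing CPT3’s two loadings in Table 4.5, for instance, reproduces its 0.7969 entry in Table 4.6. The screening rule, including a keep-list for the theoretically retained items, can be sketched as follows; the function name and the four-item subset are illustrative only:

```python
import numpy as np

def screen_items(codes, loadings, threshold=0.50, keep=()):
    """Return (communalities, retained codes) under the .50 screening rule."""
    loadings = np.asarray(loadings, dtype=float)
    communalities = (loadings ** 2).sum(axis=1)  # sum of squared loadings per item
    retained = [c for c, h in zip(codes, communalities)
                if h >= threshold or c in keep]
    return communalities, retained

# Illustrative subset of items, with loadings copied from Table 4.5.
codes = ["CPT3", "RLV3", "REL3", "CUR6"]
loads = [[0.8834, 0.1285],
         [0.8854, -0.0632],
         [0.6580, -0.1736],
         [-0.1235, 0.3653]]
h, kept = screen_items(codes, loads, keep={"REL3"})  # REL3 kept on theoretical grounds
print(kept)  # ['CPT3', 'RLV3', 'REL3']
```

Applied to the full 37-item table, this rule retains the 16 items at or above the .50 threshold plus the four exception items, yielding the final 20.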

ITEM CODE   COMMUNALITY
CPT3 *      0.7969
RLV3 *      0.7880
CNS1 *      0.6944
CPT1 *      0.6677
CMP3 *      0.6645
PRE3 *      0.6533
INT3 *      0.6369
CNS3 *      0.6290
PRE6 *      0.6125
CUR9 *      0.5970
RLV1 *      0.5868
USF3 *      0.5693
COH1 *      0.5602
INT1 *      0.5529
REP3 *      0.5442
COH3 *      0.5006
USF1        0.4928
SCY3        0.4789
CMP1        0.4648
REL3 *      0.4631
USA3        0.4512
OBJ3        0.4419
REL1        0.4232
PRE4        0.4047
CUR4 *      0.3611
TRS6        0.3578
TRS1        0.3455
CUR1        0.3422
ACU5 *      0.3402
AMT2        0.2897
PRE1        0.2693
OBJ1        0.2666
ACU7 *      0.2449
SCY1        0.2360
REP1        0.2329
ACU2        0.2273
CUR6        0.1487

(* indicates one of the 20 items retained in the final instrument)

Table 4.6 Table of Communalities for the Interim 37-Item Instrument


The resulting list of 20 items (indicated in Table 4.6) constitutes the final output of the factor analysis to develop a preliminary Geospatial Information Utility survey instrument. By reducing the instrument size to these 20 items, the revised factor model accounts for 60.5% of the variance detected. The Cronbach’s alpha representing the total inter-item reliability is 0.8981.
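Cronbach’s alpha can be computed directly from a raw response matrix. A minimal sketch, using a hypothetical 5-respondent, 3-item matrix in place of the actual survey data:

```python
import numpy as np

def cronbach_alpha(X):
    """Cronbach's alpha for an items matrix X (rows = respondents, cols = items)."""
    X = np.asarray(X, dtype=float)
    k = X.shape[1]
    item_vars = X.var(axis=0, ddof=1)        # variance of each item
    total_var = X.sum(axis=1).var(ddof=1)    # variance of respondents' total scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Hypothetical Likert responses (7-point scale), 5 respondents x 3 items
X = [[7, 6, 7], [5, 5, 6], [3, 4, 3], [6, 6, 7], [2, 3, 2]]
alpha = cronbach_alpha(X)  # high internal consistency for this toy data
```

For the actual 20-item, 163-respondent data, this computation yielded the reported alpha of 0.8981.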

Since this research posits a 2nd-order model for geospatial information utility, an obvious question is: what are the loadings of the individual items in the survey instrument onto the 2nd-order dimensions? Certainly, if this question were relevant, it would come before assessing the loadings of the 2nd-order dimensions onto the higher-order factors extracted. An important, if nuanced, point to make about this research design and conduct, based upon the model posited in the previous chapter, is that this research is entirely about the loadings of the 2nd-order dimensions (e.g., currency, accuracy, reliability, etc.), as represented by the survey items, on the 1st-order factors (e.g., quality, context), and not at all about the loadings of items on the 2nd-order dimensions. This point is important because it speaks both to the basis for the research design and to the nature of the outcomes empirically assessed. It is nuanced because, in a different kind of research, a researcher might be interested in developing survey items that, through their direct or indirect approach to the concepts studied, probe the specific nature of each 2nd-order dimension being examined: for example, developing or obtaining items from other survey instruments that examine the meaning of accuracy, or currency, etc., and then assessing them empirically to see how these items load on accuracy, currency, etc. However, this research has taken a nomological approach that makes this step irrelevant for two reasons: (1) via the literature review (addressing face and content validity), and as provided in the Appendix B glossary, which follows the current information quality and geospatial science domains’ use of the constructs of accuracy, currency, etc., these constructs are definitionally provided rather than empirically assessed; and (2) there is an important distinction between developing and using survey items that assess what something is (i.e., the nouns and adjectives that characterize the traits of the construct) versus how something is used (i.e., the verbs and adverbs that characterize the nature of the trait, or construct, in action). On the latter point hangs this research: to explore GIS practitioners’ uses of metadata dimensions and their need for static (one value is truly “best” or of high quality) or dynamic (attribute values for accuracy, currency, etc., change over time, based upon the individual GIS analyst’s needs) metadata that contains information about these dimensions (i.e., how accuracy, currency, etc., are used, and how important those uses are to the survey respondents as a sampling of the GIS practitioner community). To summarize, the survey items developed for this research aimed solely at this aspect of the research problem: to answer ‘use of the dimension’ questions, not to characterize the dimensions themselves, nor the relationship between the principal factors and the higher-order construct, Utility. This research was explicitly designed to explore only the loading of the items on the lower-order factors, leaving to future research the testing of the relationship between the intermediate factors and the higher-order Utility construct. Therefore, survey item development and the data analysis method deliberately set this issue aside.


Table 4.7 presents the final correlation matrix for the remaining 20 items, with associated means and variances for each item:

[20 × 20 inter-item correlation matrix omitted; the recoverable per-item means and variances follow.]

Item   Mean   Variance
ACU7   5.82   1.59
COH1   6.29   1.39
COH3   4.95   2.50
CPT1   6.18   0.84
CPT3   5.04   3.84
CMP3   5.13   3.62
CNS1   6.55   0.56
CNS3   4.80   3.16
CUR4   5.34   2.83
CUR9   4.79   2.91
INT1   6.55   0.51
INT3   4.60   4.07
PRE3   5.15   3.99
PRE6   5.25   2.94
RLV1   6.10   1.30
RLV3   4.77   2.76
REL3   4.03   3.78
REP3   5.08   3.25
USF3   4.04   4.04

Table 4.7 Corrected Correlation Matrix for the 20-item Instrument

Discussion of Instrument Validation: Validity and Reliability

The last stage of the factor analysis is validation, through which the generalizability of the model and of the instrument can be determined.

Validity

The most critical aspects of this research are the proper operationalization of the problem, the successful allocation of the 1st- and 2nd-order factors posited in the basic research model, and the development and refinement of a Geospatial Utility-oriented survey instrument, which is the ultimate aim of the research. To accomplish these three tasks successfully, the validity of the research (i.e., of the model based upon the survey results) and of the final survey instrument (i.e., of the refinement of the dimensions and items within it) for its future use must be assessed.

The essence of the validity of this research is construct validity, which is generally thought to include:

1. Face and content validity, which Trochim (2001) bundles together as Translation validity. Face validity assesses the appropriateness of the construct on the face of the concepts qualitatively examined for inclusion in the research or the model; it is a subjective assessment. Content validity further assesses the appropriateness of the research model based upon a theoretical examination of the body of knowledge within the domains of interest. The literature review (Chapter 2) and the pilot test provided a qualitative pre-test of the survey instrument, verifying both content and face validity for this research as designed and conducted. The qualitative interaction with pilot study participants provided affirmation and clarification of theoretical issues uncovered within the literature review. Straub (1989) discussed the effectiveness of content validity as established via literature reviews and expert panels, such as the focus group used in the pilot study. This model and the associated 20-item survey instrument are thus validated for face and content validity.

2. Discriminant and convergent validity, which are complementary concepts. Discriminant validity is the degree to which the separate measures being researched are distinctly different, following the principle that measures that are different should not correlate together. The pilot study focus groups supported discriminant validity by assisting in removing or combining those dimensions too closely aligned semantically to be usefully distinguishable by survey respondents during conduct of the main study. For example, though the literature uncovered separate dimensions for Trustworthiness, Credibility, and Reliability within the broader information quality stream of research, the GIS practitioners in the pilot study focus groups found these dimensions very closely aligned and the distinctions immaterial for the purpose of assessing each of them relative to the quality-versus-context taxonomy for geospatial information utility. As shown in Table 4.5, the strong separation of dimensions (as represented by the degree to which their individual items correlated) as loaded on the two extracted factors shows little confusion of the loadings across both factors; therefore, it is concluded, based upon the loadings obtained, that the two extracted factors are clearly distinguishable from one another, and discriminant validity is satisfied. Convergent validity is the degree to which measures converge correctly on the construct offered, following the principle that measures of theoretically similar items should correlate closely together. In this research, convergent validity was checked both in the pilot study, by clarification of the dimensions to survey in the main study, and through the main study, by the factor analysis, which provided an appropriate clustering of dimensions into the two first-order factors being researched, Quality and Context, by binning the second-order factors (dimensions) beneath them.

Normatively, through qualitative interaction with the pilot study focus groups, both discriminant and convergent validity were examined and then re-examined as part of the factor analysis. As mentioned above, in several instances, such as between Reliability (REL) and Trustworthiness (TRS), close semantic similarities between the measures forced removal of the lesser correlated measure in order to improve discriminant validity.

According to Trochim (2001), the Multi-trait Multi-method (MTMM) approach is useful for assessing both convergent and discriminant validity as components of construct validity. In this research, noting the inherent disadvantage of MTMM—that having a fully vetted set of empirical data for all measures from all methods employed is difficult to obtain—a modified form of the MTMM was employed. In the modified form proposed by Trochim (2001), the classic MTMM approach is employed (i.e., looking at the correlations on the diagonals); however, the “methods” dimension is reduced or eliminated as infeasible or impractical. Straub, Boudreau, and Gefen (2004), in discussing forms of validity along with their associated heuristics represented in MIS research, point out the difficulties with, and the relatively infrequent use of, MTMM in MIS research. For this study, the applicability of the MTMM lies in using multiple methods to address common methods bias and in applying the benefit of direct qualitative contact with GIS practitioners in the focus groups to the discriminant and convergent validity problems.

External validity addresses the degree to which a study’s results can be effectively and usefully generalized to a larger population. There are two essential aspects to external validity and generalizability for this research:

(1) Can the study results, in terms of the validated model relating quality and context factors as components of geospatial information utility, and the refined survey instrument, be generalized to a broader set of GIS users in other settings?

(2) Can the study results, in terms of the general notion of modeling utility in specific terms, such as context-factors relating utility to attributes in use, be generalized to broader MIS and IS fields, independent of the geospatial dimension?


In the case of (1) above, the factor loadings and correlations are strong; the model did predict, to a large degree, the loadings on the two 1st-order factors (quality factors and context factors), validating the model and suggesting it is highly representative of GIS users. Limitations on generalizability are discussed more specifically in a later section. In the case of (2) above, the general model seems suggestively strong, based on the loadings; however, the questions are very specific to the use of geospatial data within geospatial information systems. To more adequately extend these results “upwards” to the broader IS and management science communities, a similar approach, but with less GIS-centric items, must be pursued.

Reliability

Generally accepted forms of research reliability include:

1. Inter-rater or observer reliability, which assesses the effect of using different raters or observers to collect data about the same phenomenon. This is not a factor in this research; a single researcher conducted both the pilot study focus groups and the main study survey data collection and analysis.

2. Test-retest reliability, which assesses the consistency of a measure or treatment from one use to another. This form of reliability is also not relevant to this survey-building research, but will be very critical to the further continuation of this research as the developed instrument is used and refined in the field.

3. Parallel forms reliability, which assesses the consistency of similar test results repeated in the same content domain. This form of reliability is not relevant to this research, but will certainly be so in the future.

4. Internal consistency reliability, which assesses the consistency of items measured within a single test. This form of reliability is very relevant to this research; it is measured using the Cronbach’s alpha previously reported: 0.8981. This score suggests a high degree of internal consistency.


Chapter 5: Conclusions, Implications, and Recommendations

Overview

This chapter discusses the research findings, provides conclusions, offers implications for the GIS data community and GIS related organizations, provides information about the study’s limitations, and proposes opportunities for future research.

Discussion of Research Results

Going beyond the stated purpose of this research, the specific outcomes include: (1) the successful exploration of the dimensions and factors comprising Geospatial Information Utility as a construct, at the level of binning the dimensions into principal factors, and (2) the development of a validated, but preliminary, survey instrument to be used within the GIS community to make more detailed assessments about the use of quality and context dimensions by organizations and GIS analysts using geospatial data. This approach to understanding geospatial information utility is timely, as the GIS, geospatial data, and remote sensing communities, through trade and professional association venues, are currently addressing metadata standards for data inputted into GIS.


The survey response data obtained through this research provide empirical evidence of respondents’ distinctions between quality-based and context-based dimensions. Through the factor analysis and the resultant validity and reliability assessments, it was possible to further reduce the original set of 28 dimensions to a more manageable set of 12 dimensions loaded on the two first-order factors, as shown in Figure 5.1:

Geospatial Information Utility

  Quality (attribute-based factors):
    Coherence (COH), Interpretability (INT), Relevance (RLV) *, Reliability (REL)

  Context (use-based factors):
    Accuracy (ACU), Completeness (CPT), Comprehensibility (CMP) *, Consistency (CNS) *, Currency (CUR), Precision (PRE), Representativeness (REP), Usefulness (USF)

Figure 5.1 Assessed Research Model with Correlated Dimensions of Quality- and Context-oriented Factors. Note: * indicates this dimension loaded other than expected. See explanations below.

The binning of the Quality (attribute-based) and Context (use-based) dimensions generally followed pre-test expectations, except for Comprehensibility (CMP) and Consistency (CNS), which were originally thought to belong to the Quality/attribute factor. This was surprising. It is possible this is due to respondents’ confusion over the intent of the questions or over the definitions provided. It is also possible these are correct, but unexpected, results. Perhaps, as the Information Puzzle metaphor suggests, there exists a knowledge “tipping point” whereupon one can discern the appropriate meaning of something even if the data and information are incomplete (or inconsistent?). If this is true, then it suggests, for example, that consistency as a dimension is less important than expected. This issue cannot be properly clarified until the next round of research; clearly, however, results like this need to be explored.

In one other case, Relevance (RLV), the mean survey response scores were very weak, and these items likewise loaded unexpectedly on the Quality/attribute factor instead of the Context/use factor. Some respondents noted, in emails sent after they had completed the online survey, that they thought some of the dimensions were entirely subjective, whereas other dimensions, such as Currency (CUR) and Accuracy (ACU), have both subjective components (i.e., importance is a matter for the analyst’s expert judgment in some instances) and quantitative/objective aspects as well. For example, the age or accuracy of a data set or map can be directly calculated, while this is not necessarily true of dimensions such as Relevance (RLV) and Interpretability (INT), which are entirely subject to the individual respondent’s judgment.


Benefits and Implications of this Study

This research provides a model for considering the data and information inputted into GIS differently: a conceptual model for viewing these data and information within a context of how useful one or more particular data sets may be to the GIS user. In theoretical circles, this permits parsing of the previously dominant Quality paradigm into more discrete components (via the factor analysis). This is important because of the breadth and number of the dimensional constructs previously lumped underneath the single factor, Quality. In practitioner circles, this conceptual approach provides a way to begin to envision alternative algorithmic approaches to evaluating data either before acquisition (especially important when resources must be expended to acquire data for analyses within GIS), or in conjunction with use, so that an estimate of the utility of, and confidence in, the analysis output can be provided to decision makers.

As this was an exploratory study, it is certain there is much more work to do. However, a strength of this study is the willingness to extend the body of GIS data knowledge into new territory. Another strength of this study is the possibility of extending this work into other areas where the constructs used for utility need new exploration.


Limitations of this Study

Sample size: the pre-test goal was to reach no fewer than 3,000 GIS practitioners in order to obtain approximately 200-300 respondents. It proved very difficult even to obtain the 163 respondents used in the data analysis. This was surprising. In follow-up work, a different approach to obtaining the sample pool will be pursued.

Measurement scales: Though the use of Likert scales is well and long-established in both academic circles and lay/commercial usage, the accuracy of respondents’ responses is limited by the inherent structure of these scales. In some instances, survey respondents sent email with comments or suggestions about specific survey items, or about the use of the 7-point Likert scale itself. Most of those who sent email expressed an interest in being able to provide information beyond the structure of the data gathering and analysis.

Factor analyses: As a technique for data analysis and reduction, factor analysis is subject to interpretation by the individual researcher, particularly when it comes to selecting the appropriate number of factors and labeling the factors selected via the data analysis. In this research, the selection of factors was simplified by the binary nature of the underlying research model (i.e., examining factor binning in a Quality-versus-Context model) and by the clarity with which the factors loaded. However, as several judgments are required on the part of the researcher in performing factor analysis, there are some subjectivity and reliability issues associated with the technique.


Respondent bias: Via the pilot study or through the limited interactions by email with survey respondents who communicated after completing the survey, it became clear that some respondents seemed motivated for a “good” result to come from the study in as much as they claimed to see the benefits to the community of completing a study such as this.

As mentioned below, because of ongoing metadata standards definition and research in the geospatial data and information field, this is a fortuitous time to be completing a study such as this. If any respondents perceived there was a desired “best” response, this could have influenced some responses.

Generalizability: the respondents came from just a handful of the domains in which GIS analyses are performed throughout the world. GIS is growing into many different domains, perhaps to the point that the geospatial “mindset” will one day be common practice in virtually all industries; in this study, however, though 14 industry or domain choices were offered (15 counting “other”), the respondents overwhelmingly selected just 4 domains plus “other.” One limit on generalizability, therefore, is that the survey responses may be more representative of how GIS users in this narrow set of domains or industries use GIS and the data input to them than of the broader GIS community at large. Further work should broaden the respondent pool, both in size and in the breadth of the domains and industries represented.


Recommendations for Future Research

Four natural lines of investigation are worthy of further study based upon these research results: further instrument validation; extension of the research into the geospatial information metadata community; extension of the research into the streams of research working on data set assessment methodologies; and extension of the research into the broader “utility” field, beyond the scope of the GIS and geospatial data fields into IS&T and management science problems.

Instrument validation:

Further instrument validation should take two forms: (1) because this was an exploratory study into the factors comprising geospatial information utility, qualitative research within the community to ensure the dimensions and factors arrived at via this research form a correct and complete set; and (2) further quantitative studies, using other methodologies and approaches, including confirmatory factor analysis.

Extension of this research into the research streams surrounding the development and refinement of metadata standards within the geospatial data and GIS communities. Several governmental bodies (e.g., the Federal Geographic Data Committee) and professional or trade associations (e.g., the Open Geospatial Consortium) are working on metadata standards that pertain to both the content and the format of geospatial metadata. This work relies on blending multiple points of view about what metadata are for and how they are used.


This research into the organization of geospatial data into quality-based and context-based factors has a natural place within the metadata community.

Extension of this research into the research streams involved in finding appropriate and viable algorithmic methods for assessing the quality and utility of geospatial data, information, and metadata used in GIS. Different schemas exist to assess the quality of geospatial data used in GIS analyses. This research originally began with an inquiry into this research stream, focused on utility rather than quality as the defining criterion. Following further validation and application of this survey instrument, and guided by work in the metadata standards field, the algorithmic assessment of data is a natural extension of the current research.
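As a purely illustrative sketch of what such an algorithmic assessment might look like (the factor names, scores, and weights below are hypothetical assumptions, not a method proposed or validated by this research), normalized scores on quality-based and context-based factors could be combined into a single weighted utility index:

```python
def utility_score(scores, weights):
    """Weighted-sum utility index over factor scores in [0, 1].

    scores, weights: dicts keyed by factor name; weights sum to 1.
    Returns a value in [0, 1]; higher suggests greater usefulness
    of the data set for the task at hand.
    """
    assert abs(sum(weights.values()) - 1.0) < 1e-9
    return sum(weights[f] * scores[f] for f in weights)

# Hypothetical factor scores for one data set in one task context.
scores = {
    "positional_accuracy": 0.9,   # quality-based factor
    "currency": 0.6,              # quality-based factor
    "coverage_of_aoi": 0.8,       # context-based factor
    "thematic_relevance": 0.7,    # context-based factor
}
weights = {
    "positional_accuracy": 0.3,
    "currency": 0.2,
    "coverage_of_aoi": 0.3,
    "thematic_relevance": 0.2,
}
u = utility_score(scores, weights)  # 0.3*0.9 + 0.2*0.6 + 0.3*0.8 + 0.2*0.7
```

In practice the weights would themselves come from user preferences or task context, which is exactly where the survey-derived factor structure could inform such a scheme.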

Extension of this research “upwards” into a broader discussion of utility. Another interesting opportunity exists within the context of the mechanisms and constructs by which practitioners and theoreticians view the data and information input into various “generic” types of information systems. As with the GIS field, little attention has been paid to utility in this information systems-in-use context; though this research focused on information utility within the narrow GIS context, a broader examination of these constructs and methods applied to IS&T and management science in general is worthy of exploration.


References

Al-Hakim, L. (2007). Information quality factors affecting innovation process. International Journal of Information Quality, 1(2), 162-176.
Aleskerov, F., & Monjardet, B. (2002). Utility Maximization, Choice and Preference. Berlin: Springer.
Alexander, J. E., & Tate, M. A. (1999). Web wisdom: how to evaluate and create information quality on the web. Mahwah, NJ: Erlbaum.
Ali, A. S. B. (2005). An Assessment of the Impact of the Fit Among Computer Self-Efficacy, Task Characteristics and Systems Characteristics on Performance and Information Systems Utilization. George Washington University, Washington, DC.
Barrett, H. H., & Myers, K. J. (2004). Foundations of Image Science. Hoboken, NJ: Wiley Interscience.
Baumol, W. J. (1951). The Neumann-Morgenstern Utility Index--An Ordinalist View. The Journal of Political Economy, 59(1), 61-66.
Bell, D. E., & Farquhar, P. H. (1986). Perspectives on Utility Theory. Operations Research, 34(1), 179-183.
Belle, G. v. (2002). Statistical Rules of Thumb. New York: Wiley Interscience.
Bernstein, P. L. (1996). Against the Gods: the Remarkable Story of Risk. New York: John Wiley & Sons.
Block, H. D., & Marschak, J. (1960). Random orderings and stochastic theories of responses. In I. Olkin, S. Ghurye, W. Hoeffding, W. Madow & H. Mann (Eds.), Contributions to Probability and Statistics (pp. 97-132). Stanford, CA: Stanford University Press.
Bossler, J. D. (2002a). An Introduction to Geospatial Science and Technology. In J. D. Bossler (Ed.), Manual of Geospatial Science and Technology (pp. 3-7). London: Taylor and Francis.
Bossler, J. D. (Ed.). (2002b). Manual of Geospatial Science and Technology. London: Taylor and Francis.
Bovee, M. W. (2004). Information Quality: A Conceptual Framework and Empirical Validation. Unpublished PhD Dissertation, University of Kansas.


Braman, S. (1989). Defining Information: An approach for policymakers. Telecommunications Policy, 13(3), 233-242.
Bruin, S. d., Bregt, A., & Ven, M. V. d. (2001). Assessing Fitness for Use: The Expected Value of Spatial Data Sets. International Journal of Geographical Information Science, 15(5), 457-471.
Buede, D. M. (2000). The Engineering Design of Systems: Models and Methods. New York: John Wiley & Sons, Inc.
Burgess, M. S. E., Gray, W. A., & Fiddian, N. J. (2006). Quality measures and the information consumer. In L. Al-Hakim (Ed.), Challenges of Managing Information Quality in Service Organizations (pp. 213-242). Hershey, PA: Idea Press Group.
Burgess, M. S. E., Gray, W. A., & Fiddian, N. J. (2007). Using quality criteria to assist in information searching. International Journal of Information Quality, 1(1), 83-99.
Burrough, P. A., & Frank, A. U. (Eds.). (1996). Geographic Objects with Indeterminate Boundaries. London: Taylor & Francis.
Burrough, P. A., & McDonnell, R. A. (1998). Principles of Geographical Information Systems. Oxford: Oxford University Press.
Burzynski, T. (1998). Establishing the Environment for Implementation of a Data Quality Management Culture in the Military Health System. Paper presented at the 1998 Conference on Information Quality.
Campbell, J. B. (2002). Introduction to Remote Sensing (3rd ed.). New York: Guilford Press.
Capurro, R., & Hjorland, B. (2003). The Concept of Information. In B. Cronin (Ed.), Annual Review of Information Science and Technology (Vol. 37, pp. 343-411).
Clemen, R. T., & Reilly, T. (2001). Making Hard Decisions with Decision Tools. Pacific Grove, CA: Thomson Learning.
Congalton, R. G., & Green, K. (1999). Assessing the Accuracy of Remotely Sensed Data: Principles and Practices. Boca Raton: Lewis Publishers.
Congalton, R. G., & Plourde, L. C. (2002). Quality Assurance and Accuracy Assessment of Information Derived from Remotely Sensed Data. In J. D. Bossler (Ed.), Manual of Geospatial Science and Technology (pp. 349-363). London: Taylor and Francis.


Couclelis, H. (1992). People manipulate objects (but cultivate fields): beyond the raster-vector debate in GIS. In A. U. Frank, I. Campari & U. Formentini (Eds.), Theories and Methods of Spatio-Temporal Reasoning in Geographic Space (Lecture Notes in Computer Science 639, pp. 65-77). Berlin: Springer Verlag.
Decker, D. (2001). GIS data sources. New York: J. Wiley.
DeGroot, M. H. (1983). Decision Making with Uncertain Utility Functions. In B. P. Stigum & F. Wenstop (Eds.), Foundations of Utility and Risk Theory with Applications (pp. 371-384). Dordrecht: D. Reidel Publishing Company.
Doll, W. J., & Torkzadeh, G. (1988). The Measurement of End-User Computing Satisfaction. MIS Quarterly, June 1988, 259-274.
Duncan, G., Heidbreder, W., Hammack, J., & Szpak, C. (1997). Map Feature Examination of RADARSAT for geospatial utility and imagery enhancement opportunities. Proceedings of the International Society for Optical Engineering: Integrating Photogrammetric Techniques with Scene Analysis and Machine Vision III, 3072, 55-66.
English, L. P. (1999). Improving Data Warehouse and Business Information Quality: Methods for Reducing Costs and Increasing Profits. New York: John Wiley & Sons.
Eppler, M. J. (2003). Managing Information Quality: Increasing the Value of Information in Knowledge-intensive Products and Processes. Berlin: Springer-Verlag.
Falmagne, J.-C. (1978). A representation theorem for finite random scale systems. Journal of Mathematical Psychology, 18, 52-72.
Farquhar, P. H. (1975). A Fractional Hypercube Decomposition Theorem for Multiattribute Utility Functions. Operations Research, 23(5), 941-967.
Farquhar, P. H. (1984). Utility Assessment Methods. Management Science, 30(11), 1283-1300.
Fishburn, P. C. (1965). Independence in Utility Theory with Whole Product Sets. Operations Research, 13, 28-45.
Fishburn, P. C. (1967). Methods of Estimating Additive Utilities. Management Science, 13, 435-453.


Fisher, I. (1968). Is "Utility" the Most Suitable Term for the Concept It Is Used to Denote? In A. N. Page (Ed.), Utility Theory: A Book of Readings (pp. 49-51). New York: John Wiley and Sons.
Fotheringham, A. S., Brunsdon, C., & Charlton, M. (2000). Quantitative geography: perspectives on spatial data analysis. London; Thousand Oaks, CA: Sage Publications.
Fotheringham, A. S., Wegener, M., & European Science Foundation. (2000). Spatial models and GIS: new potential and new models. London; Philadelphia: Taylor & Francis.
Fotheringham, S., & Rogerson, P. (Eds.). (1994). Spatial Analysis and GIS. London: Taylor and Francis.
Friedman, M. (1955). What All is Utility? The Economic Journal, 65(259), 405-409.
Friedman, M., & Savage, L. J. (1948). The Utility Analysis of Choices Involving Risk. Journal of Political Economy, 56(4), 279-304.
Gardyn, E. (1997). A Data Quality Handbook for a Data Warehouse. Paper presented at the 1997 Conference on Information Quality, Cambridge, MA.
Garvin, D. (1987). Competing on the eight dimensions of quality. Harvard Business Review, 101-109.
Garvin, D. (1988). Managing Quality: The Strategic and Competitive Edge. New York: The Free Press/McMillan.
Giddings, F. H. (1891). The Concepts of Utility, Value, and Cost. Publications of the American Economic Association, 6(1/2), 41-43.
Hair, J. F., Anderson, R. E., Tatham, R. L., & Black, W. C. (1998). Multivariate Data Analysis (5th ed.). Upper Saddle River: Prentice Hall.
Hakansson, N. H. (1970). Friedman-Savage Utility Functions Consistent with Risk Aversion. The Quarterly Journal of Economics, 84(3), 472-487.
Handy, R. (1970). The Measurement of Values: Behavioral Science and Philosophical Approaches. St Louis: Warren H. Green, Inc.
Harman, H. H. (1968). Factor Analysis. In D. K. Whitla (Ed.), Handbook of Measurement and Assessment in Behavioral Sciences (pp. 143-170). Reading, MA: Addison-Wesley.


Harris, R. J. (2001). A Primer of Multivariate Statistics (3rd ed.). Mahwah, NJ: Lawrence Erlbaum Associates.
Harvey, C. M. (1993). Multiattribute Risk Linearity. Management Science, 39(3), 389-394.
Harvey, F. J. (2002). Processing Spatial Data. In J. D. Bossler (Ed.), Manual of Geospatial Science and Technology (pp. 450-463). London: Taylor & Francis.
Hoch, S. J., & Kunreuther, H. C. (2001). The Complex Web of Decisions. In S. J. Hoch, H. C. Kunreuther & R. E. Gunther (Eds.), Wharton on Decision Making (pp. 1-14). New York: John Wiley & Sons.
Hull, J. C., Moore, P. G., & Thomas, H. (1973). Utility and Its Measurement. Journal of the Royal Statistical Society, Series A, 136, 226-247.
Isaaks, E. H., & Srivastava, R. M. (1989). An Introduction to Applied Geostatistics. New York: Oxford University Press.
Iverson, G., & Luce, R. D. (1998). The Representational Measurement Approach to Psychophysical and Judgmental Problems. In M. H. Birnbaum (Ed.), Measurement, Judgment, and Decision Making. San Diego: Academic Press.
Ives, B., Olson, M., & Baroudi, S. (1983). The Measurement of User Information Satisfaction. Communications of the ACM, 26(10), 785-793.
Jolliffe, F. R. (1986). Survey Design and Analysis. London: Ellis Horwood Limited.
Kahn, B., Strong, D., & Wang, R. (1997). A Model for Delivering Quality Information as Product and Service. Paper presented at the 1997 Conference on Information Quality, Cambridge, MA.
Kahn, B., Strong, D., & Wang, R. (2002). Information Quality Benchmarks: Product and Service Performance. Communications of the ACM, 184-192.
Kaplowitz, M. D., Hadlock, T. D., & Levine, R. (2004). A Comparison Between Web and Mail Survey Response Rates. Public Opinion Quarterly, 68(1), 94-101.
Keeney, R. L. (1971). Utility Independence and Preferences for Multiattributed Consequences. Operations Research, 19(4), 875-893.
Keeney, R. L. (1972). Utility Functions for Multiattributed Consequences. Management Science, 18(5), 276-287.
Keeney, R. L., & Raiffa, H. (1976). Decisions with Multiple Objectives: Preferences and Value Tradeoffs. New York: Wiley.


Kerlinger, F. N. (1978). Foundations of Behavioral Research. New York: McGraw-Hill.
Kifer, D., & Gehrke, J. (2006). Injecting utility into anonymized datasets. Paper presented at the International Conference on Management of Data.
Koniger, P., & Rathmayer, W. (1998). Management unstrukturierter Informationen. Frankfurt.
Kuhn, T. S. (1962). The Structure of Scientific Revolutions. Chicago: University of Chicago Press.
Kuhn, T. S. (1973). Second Thoughts on Paradigms. In F. Suppe (Ed.), The Structure of Scientific Theories. Urbana, IL: University of Illinois Press.
Kuhn, T. S. (1979). Metaphor in science. In A. Ortony (Ed.), Metaphor and Thought (pp. 409-419). Cambridge, MA: Cambridge University Press.
Lange, O. (1934). The Determinateness of the Utility Function. The Review of Economic Studies, 1(3), 218-225.
Lawrence, D. B. (1999). The Economic Value of Information. New York: Springer-Verlag.
Lawshe, C. H. (1975). A Quantitative Approach to Content Validity. Personnel Psychology, 563-575.
Lesca, H., & Lesca, E. (1995). Gestion de l'information, qualite de l'information et performances de l'entreprise. Paris: Litec.
Lillesand, T. M., & Kiefer, R. W. (2000). Remote Sensing and Image Interpretation (4th ed.). New York: John Wiley & Sons.
Longley, P. A., Goodchild, M. F., Maguire, D. J., & Rhind, D. W. (2001). Geographic Information Systems and Science. Chichester: John Wiley & Sons.
Luce, R. D. (2000). Utility of Gains and Losses: Measurement-Theoretical and Experimental Approaches. Mahwah, NJ: Lawrence Erlbaum Associates.
Maasen, S., & Weingart, P. (2000). Metaphors and the Dynamics of Knowledge. London: Routledge.
Machlup, F., & Mansfield, U. (Eds.). (1983). The Study of Information: Interdisciplinary Messages. New York: John Wiley & Sons.


Marsden, R. (2000). UK Measurement of Geospatial Information Utility [Briefing]. London: UK Defence Imagery and Geospatial Liaison Staff.
McCullagh, M. J. (1998). Quality, Use and Visualization in Terrain Modeling. In S. Lane, K. Richards & J. Chandler (Eds.), Landform Monitoring, Modelling and Analysis (pp. 95-117). Chichester, UK: John Wiley & Sons.
Meeks, W. L., & Dasgupta, S. (2004). Geospatial information utility: an estimation of the relevance of geospatial information to users. Journal of Decision Support Systems, 38, 47-63.
Meeks, W. L., & Dasgupta, S. (2005). The Value of Using GIS and Geospatial Data to Support Organizational Decision Making. In J. B. Pick (Ed.), Geographic Information Systems in Business (pp. 175-197). Hershey, PA: Idea Group Publishing.
Mitchell, J. (1999). Measurement in Psychology. Cambridge, UK: Cambridge University Press.
Morrison, J. (2002). Spatial Data Quality. In J. D. Bossler (Ed.), Manual of Geospatial Science and Technology (pp. 500-516). London: Taylor & Francis.
Motro, A., & Anokhin, P. (2004). Information Quality in Informational Systems. Paper presented at the 2004 International Workshop on Information Quality in Information Systems, Paris.
Nebert, D. D. (2001). Discussion of the structure and need for geospatial information utility algorithm, as pertaining to USGS and US government users. In W. L. Meeks (Ed.). Reston, VA.
Neumann, J. v., & Morgenstern, O. (1947). Theory of Games and Economic Behavior. Princeton.
Obermeier, J. (2001). Discussion of Product Adequacy and Product Evaluations at NIMA, and Finding Automated Methods for Determining the Utility of Geospatial Information. In W. L. Meeks (Ed.). Bethesda, MD.
Oppenheim, A. N. (1966). Questionnaire Design and Attitude Measurement. New York: Basic Books, Inc.
Osgood, C. E., Suci, G. J., & Tannenbaum, P. (1957). The Measurement of Meaning. Urbana, IL: University of Illinois Press.
Padman, R., & Tzourakis, M. (1997). Quality Metrics for Healthcare Data: An Analytical Approach. Paper presented at the 1997 Conference on Information Quality, Cambridge, MA.


Parssian, A. H. (2002). Assessing Information Quality for Relational Databases. Unpublished manuscript.
Payne, J. W., & Laughhunn, D. J. (1984). Multiattribute Risky Choice Behavior: The Editing of Complex Prospects. Management Science: Special Issue on Multiple Criteria Decision Making, 30(11), 1350-1361.
Peterson, W. C. (1973). Elements of Economics. New York: W.G. Norton & Company.
Pick, J. B. (2005). Geographic Information Systems in Business. Hershey, PA: Idea Group Publishing.
Pressman, R. S. (1997). Software Engineering: A Practitioner's Approach (4th ed.). New York: McGraw-Hill.
Price, R. J., & Shanks, G. (2004). A Semiotic Information Quality Framework. Paper presented at the IFIP International Conference on Decision Support Systems (DSS 2004): Decision Support in an Uncertain World.
Quiggin, J. (1993). Generalized Expected Utility Theory: The Rank Dependent Model. Boston: Kluwer Academic Publishers.
Rai, A., Lang, S. S., & Welker, R. B. (2002). Assessing the validity of IS success models: An empirical test and theoretical analysis. Information Systems Research, 13(1), 50-69.
Raper, J. (2000). Multidimensional Geographic Information Science. London: Taylor & Francis.
Rea, L. M., & Parker, R. A. (1992). Designing and Conducting Survey Research: A Comprehensive Guide. San Francisco: Jossey-Bass.
Redman, T. C. (1996). Data quality for the information age. Boston: Artech House.
Redman, T. C. (1998). The Impact of Poor Data Quality on the Typical Enterprise. Communications of the ACM, 41(2), 79-82.
Regenwetter, M. (1996). Random utility representation of finite n-ary relations. Journal of Mathematical Psychology, 40, 219-234.
Reichardt, M. (2001). Discussion about the use of geospatial metadata and desirability of developing algorithmic methods to assess the utility of geospatial information used in GIS. In W. L. Meeks (Ed.). Bethesda, MD.


Reichwald, R. (1993). Die Wirtschaftlichkeit im Spannungsfeld von betriebswirtschaftlicher Theorie und Praxis (Vol. 1). Munich.
Russ-Mohl, S. (1994). Der I-Faktor. Osnabruck: Fromm.
Schlee, E. (1990). The Value of Information in Anticipated Utility Theory. Journal of Risk and Uncertainty, 3(1), 83-92.
Schmidt, U. (1998). Axiomatic Utility Theory under Risk. Berlin: Springer.
Shannon, C. (1948). A Mathematical Theory of Communication. Bell System Technical Journal, 27, 379-423, 623-656.
Shannon, C., & Weaver, W. (1949/1972). The mathematical theory of communication. Urbana, IL: The University of Illinois Press.
Shaw, M. E., & Wright, J. M. (1967). Scales for the Measurement of Attitudes. New York: McGraw-Hill Book Company.
Smith, T. (1993, 1994). On the integration of database systems and computational support for high-level modeling of spatio-temporal phenomena. Paper presented at Innovations in GIS: selected papers from the First National Conference on GIS Research UK, Keele University, England.
Somers, T. M., Nelson, K., & Karimi, J. (2003). Confirmatory Factor Analysis of the End-User Computing Satisfaction Instrument: Replication within an ERP Domain. Decision Sciences, 34(3), 595-621.
Stamper, R. (1996). Signs, Information, Norms and Systems. In B. Holmquist, P. B. Andersen, H. Klein & R. Posner (Eds.), Signs of Work: Semiosis and Information Processing in Organizations (pp. 349-397). Berlin: De Gruyter.
Stamper, R. (2001). Organizational Semiotics: Information without the Computer? In K. Liu, R. J. Clarke, P. B. Andersen & R. Stamper (Eds.), Information, Organization and Technology: Studies in Organizational Semiotics. Norwell, MA: Kluwer Academic Publishers.
Stonier, T. (1991). Toward a new theory of information. Journal of Information Science, 17(5), 257-263.
Stouffer, S. A., Guttman, L., Suchman, E. A., Lazarsfeld, P. F., Star, S. A., & Clausen, J. A. (1973). Measurement and Prediction (Vol. IV). Gloucester, MA: Princeton University Press.


Straub, D., Boudreau, M.-C., & Gefen, D. (2004). Validation Guidelines for IS Positivist Research. Communications of the Association for Information Systems, 13, 380-427.
Straub, D. W. (1989). Validating Instruments in MIS Research. MIS Quarterly, 13(2), 147-169.
Taylor, R. S. (1986). Value-added Processes in Information Systems. Norwood: Ablex.
Thurstone, L. L. (1959). The Measurement of Values. Chicago: Chicago University Press.
Trochim, W. M. K. (2001). The Research Methods Knowledge Base (2nd ed.). Cincinnati: Atomic Dog Publishing.
VanDyke, J. (2001). Email regarding need for geospatial information utility assessment for users of CIA Map Services Center. Retrieved 4/13/2001.
Viner, J. (1968). The Utility Concept in Value Theory and Its Critics. In A. N. Page (Ed.), Utility Theory: A Book of Readings (pp. 123-138). New York: John Wiley and Sons.
Wang, R., & Strong, D. M. (1996). Beyond Accuracy: What Data Quality Means to Data Consumers. Journal of Management Information Systems, 12(4), 5-33.
Wang, R. Y. (1998). A Product Perspective on Total Data Quality Management. Communications of the ACM, 41(2), 58-65.
Wang, R. Y., Reddy, M., & Kon, H. (1995). Toward Quality Data: An Attribute-Based Approach. Decision Support Systems, 13(3-4), 349-372.
Wang, R. Y., Storey, V., & Firth, C. (1995). A Framework for Analysis of Data Quality Research. IEEE Transactions on Knowledge and Data Engineering, 7(4), 623-639.
Wigand, R., Picot, A., & Reichwald, R. (1997). Information, Organization and Management. Chichester: John Wiley & Sons.
Zwass, V. (1998). Foundations of Information Systems. Boston: Irwin/McGraw-Hill.
