Integrating Disparate Data for Decision Support: An Interdisciplinary, Object-Oriented, Open Source Approach

Thornton Staples, Digital Library Research and Development, University of Virginia Library, Charlottesville, VA, USA. E-mail: [email protected]
Marc Evans, School of Engineering, University of Virginia, Charlottesville, VA, USA. E-mail: [email protected]
Bill Scherer, School of Engineering, University of Virginia, Charlottesville, VA, USA. E-mail: [email protected]
Donna Tolson, University of Virginia Library, Charlottesville, VA, USA. E-mail: [email protected]
Ryan Nelson, McIntire School of Commerce, University of Virginia, Charlottesville, VA, USA. E-mail: [email protected]
ABSTRACT
In this paper we describe a unique system for integrating disparate data that will allow for greatly improved decision support in diverse domains. The Flexible Extensible Digital Object Repository Architecture (Fedora) provides the foundation for this project by combining a flexible, use-neutral approach to organizing streams of content and metadata with disseminators that can transform the data upon demand. The goal of the project is to develop a set of open source software tools that completes the integrated environment needed to find data, prepare it, and hand it off to application software in support of decision makers in a variety of settings.
Keywords: DSS, data, integration, interdisciplinary, metadata, object-oriented, open source, Fedora
1. INTRODUCTION
The best solutions to real-world problems often require data from many resources, but the obstacles posed by obtaining and using disparate information produced by different sources are often overwhelming. Data are aggregated into different units of geography and time, they may be stored in incompatible formats, and the definitions and terms may be unfamiliar to those not intimately involved in their primary collection. The history of libraries is one of serving broad communities by building large, organized collections of information resources that can be easily discovered and utilized by a variety of audiences. Librarians have a long tradition of working closely with research communities, assembling collections that both meet and anticipate their needs, and providing the services needed to discover and use the resources. Over the last twelve years, the University of Virginia Library has been a leader in the effort to realize the potential of digital forms of information, including an active program for incorporating quantitative datasets into its collections. Most recently, an interdisciplinary team of Library, Engineering, and Commerce School personnel has been working together to develop a data architecture that can be used to standardize datasets for inclusion in a digital library.
The proposed system will provide the initial structure for a powerful “digital workbench” that will improve the accessibility and efficacy of engineering, commercial, and scientific datasets. The workbench will minimize the effort necessary to locate, prepare, and relate disparate data, and reduce or eliminate the need for datasets to be stored by multiple entities. The result will significantly reduce the cost and time to researchers when accessing disparate datasets. In addition, it could dramatically improve the quality of overall analysis by affording researchers easy access to additional data sets with which to relate their primary data, and potentially improve their models via the additional resources available. Below we describe four diverse applications of the proposed system to demonstrate the viability of, and need for, such a resource. Although the following types of analyses can be performed today, improved access to disparate datasets will allow much more timely analyses using existing techniques, and encourage experimentation with new techniques.
1.1 Transportation Engineering
The entire transportation systems engineering process involves the collection of copious amounts of data to calibrate and improve models for the development of new, or improvements to existing, transportation facilities. The initiation of any transportation engineering process involves the evaluation of the current system and the planning for a potential system or facility, such as a new road or signal system. One critical resource for the transportation engineer is up-to-date historical traffic flow characteristics. This resource affords the engineer the opportunity to study patterns of traffic over time for each existing facility. Critical associative resources, from population growth to climate trends, serve to enrich the determination of the existing state of the regional system. A new facility will spawn new behaviors on existing facilities; accurate estimates of these new behaviors cannot be made by simply examining traffic flow data in isolation, but must include external data sources to ensure some degree of realism in the estimation. Another type of analysis that would benefit from the digital workbench is done by operating transportation engineers such as transit managers or freeway safety service patrol managers. For example, transit managers evaluate long-range and short-range aspects of their network. Population, crime, and climate data would all relate to the proper placement and most appropriate physical characteristics of transit stops, while traffic flow data are useful in estimating timely transit routes and re-routes.
1.2 Environmental Management
Cities, states, and the Federal Government look to regional environmental quality estimation and assurance, and rely heavily on disparate data. Air quality sensors will provide much of the information they require, but many regions throughout the nation have yet to install these devices. If made easily accessible, other sources can provide valuable information regarding the status of the immediate environment. Traffic data alone can give an environmental engineer the ability to estimate the types and amounts of waste gasses and particulate matter generated by select classified vehicles traveling at select speeds. When traffic data are related to climatic data, estimates of ozone levels may be determined. Add to this information the geographically specific social and economic data found in the census, and human socio-economic impacts may then be evaluated; this becomes the starting point for estimating the environmental cost of the mode of transportation evaluated by the traffic data.
1.3 Homeland Security
Increasingly, government officials at all levels associated with safety and security are confronted with the task of reducing the risk of potentially life-threatening events while simultaneously allowing a population to continue its normal activities. Given improved access to diverse and disparate sets of data that relate to human and environmental characteristics, homeland security entities could evaluate regional effect and response characteristics for selected scenarios, in essence providing advance planning for a multitude of security and safety events. For instance, suppose that an emergency response team is planning how to deal with the following scenario: a powder-based hazardous material is released when a cargo vehicle collides with an aggressively driven automobile. Safety and security evaluators could examine geographic, topological, and climatologic data for potential dispersal of the material; traffic and transportation network information for effects and clean-up; and census data for population and economic effects. In essence, having much improved access to such data could afford regional safety units the ability to more readily develop holistic regional security assessments. While this scenario illustrates planning purposes, increased accessibility to disparate data could also be utilized in an active response to a real event.
1.4 Commerce
There are a host of applications of the proposed system within commerce, including marketing, finance, and logistics management. More specifically, sales forecasting and store location models would benefit greatly through the enhanced integration of geographic, census and transportation data. In addition, the emerging area of location-based mobile retailing would be a natural fit. Take for instance the scenario of a GPS-enabled mobile customer being alerted to personalized promotions by retail establishments in close proximity. Currently, all of the above research efforts are possible, but they would require weeks, if not months, of data collection activities and analysis to provide meaningful outcomes. The proposed system will 1) dramatically reduce the time and effort required to find and access the data, 2) reduce the cost of doing so, and 3) provide additional information that will significantly improve the quality of decision making.
2. DSS MOTIVATION
The proposed system will afford improved access to disparate data by providing a complete environment for the use of quantitative data, including an integrated collection of quantitative data and a powerful workbench that provides consistent access to it. The system will be built entirely on open-source software, making it easily replicable for other uses. The workbench portion alone should facilitate increased experimentation and improved analysis for the usual community of users, as well as new discovery and experimentation by communities of users who have not been accustomed to having quantitative data available for easy use. We expect significantly reduced research time and cost, and dramatically improved quality of overall analysis, through immediate access to additional data sets. With regard to the proposed system's ability to manage disparate data internally, if the overall system is embraced (with special regard to the user's workbench), there is the potential for a cultural change within disparate statistical and scientific data collection communities toward collecting and storing their datasets in formats more amenable to archiving and relating within such a data archive. In general, the primary concept is to include scientific and statistical data in a general research library, in an environment that supports power users of such data appropriately, while making the data more accessible to the general users of the library.
2.1 Support for application domains
Utilizing the initial Model Virginia testbed datasets, as well as associated datasets relevant to those domains, the proposed system will have an immediate impact on several applied domains, including:
• Transportation Engineering – core domain data on traffic flow, associated with census, land use, and climate data, serve to enrich the determination of the existing state of the regional system. Placement and operation of new facilities will improve with access to additional associative data.
• Environmental Management – non-core domain datasets, such as traffic and climatic data, can assist in the determination of air quality, and census data can assist in assessing the associated social impact.
• Homeland Security – in this new operational domain, the data associated with the management of a region become imperative in modeling and testing the overall regional human-system. Datasets such as traffic, census, and climate are of immediate value, but other datasets, such as crime, infrastructure, health, and economic data, will also be of value.
• Commerce – mixed domain datasets such as geographic, census, transportation, crime, and/or climatic data can be leveraged to enhance existing, or create altogether new, business models.
2.2 Support for research domains
Presently there exists a data glut and an information shortage with regard to engineering and science model building. Increases in computational power for workstations and personal computers are enabling analysis of more data every day. Thus, there is a need to explore the fundamental research question of how information technology could best be utilized to drive the construction of engineering and science models. The traditional modeling approach has been to specify a model, such as a finite-element model or a dynamic control model, and then use available data to determine the model parameters that produce the best fit. If additional data are available, they are used to test that model's applicability; if not, a re-substitution estimate of performance is often generated by using the same data for both parameter estimation and testing. Our hypothesis is that a more logical approach, given the availability of large amounts of data in most present-day applications, is to allow the data to dictate the model selection. A fundamental requirement that must be satisfied in order to implement such a data-driven approach is the availability of powerful computational platforms such as the proposed workbench.
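To make the contrast concrete, the following minimal sketch lets held-out error on unseen data select among candidate models, rather than committing to one model form and re-substituting the training data. It is written in Python with synthetic data; the polynomial candidate family and all names are illustrative assumptions, not part of the proposed system.

    import numpy as np

    # Synthetic stand-in for an observed process; in practice these values
    # would be extracted from datasets held in the repository.
    rng = np.random.default_rng(0)
    x = np.linspace(0.0, 1.0, 200)
    y = np.sin(2.0 * np.pi * x) + rng.normal(scale=0.2, size=x.size)

    # Hold out a validation set instead of re-substituting the training data.
    idx = rng.permutation(x.size)
    train, valid = idx[:150], idx[150:]

    def held_out_error(degree):
        """Fit one candidate model, then score it on data it never saw."""
        coeffs = np.polyfit(x[train], y[train], degree)
        residuals = y[valid] - np.polyval(coeffs, x[valid])
        return float(np.mean(residuals ** 2))

    # The data, not the modeler, pick the winning candidate.
    errors = {degree: held_out_error(degree) for degree in range(1, 10)}
    best = min(errors, key=errors.get)
    print("selected degree", best, "validation MSE", round(errors[best], 4))

The richer the pool of readily accessible data, the more candidate models can be tested honestly against observations they were not fit to, which is precisely what the proposed workbench is meant to enable.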
2.3 Other Potential Uses
The extensibility of the proposed system immediately suggests that other domains may benefit from its utilization. As mentioned earlier, the Homeland Security domain could benefit by accessing a multitude of datasets that have potential for inclusion, such as economic, health, and criminal data. A subset of Homeland Security, public safety, may find improved utility as well. From a criminal analyst's perspective, access to traffic flow and transportation data may yield interesting results. For example, a recent experiment associating crime data with the transportation network in a given city demonstrated that 80% of crimes occurred within 0.3 miles of an Interstate. An even more detailed picture could be developed utilizing traffic flow, census, and climatologic data. The initial association of crime data with transportation infrastructure required extensive time and effort to bring the data into the same computational tool to perform the analysis. The proposed system would allow such analysis, as well as the aforementioned improved analyses, in much less time and with much less effort.
3. BACKGROUND
The digital library program at the UVA Library is a leader in the international effort to define and build digital collections and the systems needed to manage and deliver them. Since 1992, the Library has aggressively pursued a program to simultaneously build large digital collections and, working closely with other groups at UVA, to train and support faculty and students to use them in their research and coursework.
3.1 The Institutional Environment
Digital initiatives in the Library include the following:
• The Geospatial and Statistical Data Center (Geostat) supports the community's need for geographic and social science data by building and acquiring large collections and by providing a very active training program in Geographic Information Systems (GIS) for classes.
• The Science & Engineering Libraries' Digital Laboratory provides support for patrons using the Library's digital science and engineering collections, as well as for students and faculty working with their own data.
• The Robertson Media Center (RMC) and its Digital Media Lab provide direct support and training to students and faculty in the creation of digital resources and in project planning. The Lab is staffed and supported jointly by the Library and the University's Information Technology and Communications division.
• The Electronic Text Center (Etext) has been an international leader in the creation of electronic text collections for over a decade. The Center has also supported numerous faculty projects that develop electronic text archives and supported instructional use of texts on handheld devices.
• The Fine Arts Library provides a service to digitize images for faculty, building and cataloging core and thematic collections of digital images for research and instruction about art and architecture.
• Rare Materials Digital Services provides digitization of primary and secondary materials from the collections of the Albert H. Small Special Collections Library to support the teaching and research mission of the University.
Other University initiatives that work closely with the Library:
• The Instructional Technology Group supports faculty development programs and annually grants fellowships for highly innovative programs to integrate technology into teaching.
• The Research Computing Support Center provides high-performance computing technology and consulting and training services particularly oriented to science and engineering.
• The Institute for Advanced Technology in the Humanities (IATH) supports faculty research projects making extensive and innovative use of digital technology.
• The Virginia Center for Digital History supports faculty development and use of digital historical data.
• The Arts and Sciences Center for Instructional Technologies develops and supports technology-enhanced learning and teaching environments in second-language acquisition and in the study of foreign languages, literatures, and cultures.
The result of these efforts is an extensive collection of humanities and social science digital resources, including electronic texts, cultural documentation images, maps, geographic data, audio, video, and statistical datasets. All of these collections are actively used and added to by a community of scholars and students. In addition to providing direct support to users and building collections, staff from the above-mentioned groups actively participate in standards efforts. Starting in 1992, UVA staff have worked with SGML, and later XML, to use structured texts both to create electronic surrogates of printed texts and to create new kinds of databases. The Text Encoding Initiative (TEI) and the Encoded Archival Description (EAD) standards are both hosted at UVA. Staff have also created new XML schemas for general digital library metadata and for descriptive models of architectural and archeological sites. They have participated in other standards efforts as well, including the Data Documentation Initiative (DDI), the Dublin Core, the VRA Core and Cataloging Cultural Objects project, and the Open Archives Initiative.
3.2 Digital Library System Development
In 1999, confident that the digital library program was very much on the right track, the managers of the Library decided that it was time to begin building the integrated library program needed by a major research institution to serve its constituencies over the next 20 years. That included not only continued building of small digital collections, but scaling the effort up to provide the necessary large collections. The digital social science collections that the Geostat Center had built needed to be extended to include more material for the sciences and engineering departments. Finally, in addition to making existing data collections available, it was clear that the Library would need to be able to archive and maintain new datasets created by researchers at UVA. After evaluating vendor systems that could manage and deliver digital collections in a sophisticated yet sustainable way, and finding none that met their requirements, the Library decided to create one. The Digital Library Research and Development Department (DLRD) was created and charged with building a system that would realize these goals with the collections already created and would position the Library well for whatever may come in the future. The Flexible Extensible Digital Object Repository Architecture (Fedora™) was the clear choice as a starting point for developing the system. Carl Lagoze and Sandy Payette of the Digital Library Research Group at Cornell University created Fedora under a National Science Foundation grant, for which they created a reference implementation of the software. UVA staff reinterpreted the architecture and built the first practical demonstration of a large-scale Fedora repository. In the initial testbed, an integrated collection of 10,000,000 electronic texts, images, finding aids, and network news poll datasets was created, delivering the content to users through a web site. The DLRD secured funding from the Andrew W. Mellon Foundation and joined forces with the Cornell group to create an open-source repository management and delivery system based on Fedora. The first version of the software was released in May of 2003.
As of the release of version 1.2 on December 23, 2003, the software had been downloaded over 3,000 times by representatives of a variety of academic, government, and commercial organizations in 37 countries. At least one library systems vendor has already announced a product built around the open-source software to be delivered early in 2004, and a variety of other groups have determined that the software will form the basis for their digital library systems. The project is on target to complete version 2.0 by the end of 2004, which will completely implement the Fedora architecture and offer a variety of management tools, content versioning, an archival audit trail for data objects, and a system for fine-grained policy enforcement that uses the Shibboleth protocol for user authentication.
3.3 Fedora
The Fedora architecture is based on digital data object models that are templates for units of content that include digital resources, metadata about the resources, and linkages to software tools that have already been configured to deliver the content in desired ways. The metadata and the digital resources, either directly controlled by or external to the repository, are all datastreams in an object model. These object models define classes of data objects, each of which has a persistent identifier (PID) that uniquely identifies that unit of content within a given repository. From the user's point of view the linkages to software tools (disseminators) are seen as behaviors of the units of content. These behaviors can be exploited to deliver content that has been prepared in a variety of ways directly to a web browser. They can also be used to prepare or configure content to be used through an external software application. In a sense, these object models can be thought of as containers that give a useful shape to information that is poured into them; if the information fits the vessel, it can immediately be used in predefined ways.
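The following conceptual sketch in Python (not the Fedora API; the class, method, and PID names are assumptions for illustration) shows the shape of such a container: a unit of content that bundles datastreams with disseminators that transform those datastreams on demand.

    from dataclasses import dataclass, field
    from typing import Callable, Dict

    @dataclass
    class DigitalObject:
        # Persistent identifier, unique within a given repository.
        pid: str
        # Named streams of content and metadata.
        datastreams: Dict[str, bytes] = field(default_factory=dict)
        # Named behaviors: linkages to tools that transform the content.
        disseminators: Dict[str, Callable[["DigitalObject"], bytes]] = field(default_factory=dict)

        def disseminate(self, behavior: str) -> bytes:
            """Run a named behavior; callers never touch raw datastreams."""
            return self.disseminators[behavior](self)

    def as_thumbnail(obj: DigitalObject) -> bytes:
        # Placeholder transformation; a real mechanism would resample the image.
        return obj.datastreams["image-high-res"][:1024]

    page = DigitalObject(pid="uva-lib:12345")  # hypothetical PID
    page.datastreams["image-high-res"] = b"...raw image bytes..."
    page.disseminators["get_thumbnail"] = as_thumbnail
    thumbnail = page.disseminate("get_thumbnail")

An object model that subscribes to a set of behaviors, as described next, amounts to a guarantee that every object built from the model fills in the same disseminator slots.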
Fedora gives us the ability to describe abstract sets of behaviors that constrain a corresponding set of specific processes or mechanisms that deliver the behavior described for a given unit of content. One abstract set of behaviors (a bdef object) can be used to constrain any number of mechanism sets (bmech objects), ensuring a standardization of behaviors for different units of content that are equivalent in type but differ in format. An object model subscribes to a set of behaviors by linking to a bdef object and pairing it with a link to an appropriate bmech object. This pair of links defines a disseminator, and an object model can contain any number of disseminators. In practical terms, this means that a specific object model can have sets of behaviors for a variety of purposes, or sets of behaviors that are equivalent in purpose but that prepare the content to be delivered to applications with different format requirements. For example, two different object models for image data can be defined, one whose content is composed of multiple files that contain the same image data at different resolutions, the other containing a single wavelet-encoded file from which any resolution can be derived. Both content models are linked to the same bdef object, but each is linked to a different bmech object to carry out each of the behaviors. Both object models have disseminators that deliver a basic set of sizes for local use, and the wavelet object may have an additional disseminator that can exploit georeferencing information about the image content. To the user, the objects behave identically, even though the underlying media files and the mechanisms for delivering those behaviors are physically different.
3.4 The Data Architecture
The data architecture for the UVA digital library describes a set of standards for the creation of digital objects and for the relationships among them that can be exploited to turn an assortment of data objects into an integrated digital library. Fedora provides the repository architecture for systematically constructing data objects and the behavior objects that relate to them, ensuring a certain level of sameness in each of the objects that can be exploited by systems, while leaving a great deal of latitude in designing different kinds of data objects for different content. In the UVA digital library architecture a few basic principles have been developed thus far. They include:
• Complex information resources are best represented by networks of related data objects. Each data object instantiates a component of the content that is appropriate to the medium and context of that component, and records its relationship to other objects. It quickly becomes clear that data design schemes for digital libraries that include all kinds of data must account for human conceptions of information resources that appear to be singular but are made up of many components of data in different media, for which different management schemes and different delivery behaviors are required. Resources such as printed books and journal articles, or descriptions of archeological sites, buildings, and artworks, may include images and/or audio and video clips in addition to some structured text that represents the resource as a whole. Any of the components of one resource may also be components of another resource in the library. Representations of resource collections further complicate the picture. When a collection of resources is described, the collection itself has content that is separate from any of its specific (child) resources. These complex resources are represented as data objects that have XML files containing PIDs for other objects in their content. A book is represented using an XML-encoded TEI file that contains transcribed text and PIDs for image objects, one for each page image of that book. The book object is associated with disseminators that give the user access to tools that can be used for the book as a whole. Each of the individual image objects is a combination of datastreams and disseminators that make the object capable of delivering any size or format of image data required. They also potentially contain the PIDs for any number of additional objects that might also have a parent relationship to the image, such as a collection of illustrations by a particular artist that comprises page images from many books. Book objects could have similar relationships to one or more collection description objects. Exploiting this principle to represent datasets as networks of two different kinds of data objects provides a great deal of flexibility in organizing the data. A dataset description object contains one content datastream consisting of an XML file that describes the hierarchical structure of the dataset and provides descriptive annotation at each level of the hierarchy. The actual data for the dataset are contained in one or more database objects, each of which contains two content datastreams: an SQL database and an XML codebook that describes the content and structure of the SQL data. PIDs for database objects are embedded at appropriate points in the hierarchy of the dataset description object. As described above, the actual data content of a large dataset would need to be broken into logical parts for efficiency of management and access. For example, weather data is collected continuously
from a changing number of geographically dispersed collection stations, which record such elements as temperature, cloud cover, and precipitation. The dataset description hierarchy would start with year, containing any data about the collection process specific to that year. The second level of the hierarchy would be location, in this case describing the collection stations for that year, including geographic information and descriptive information about the collection process that was specific to that station. The database objects would contain daily records for the station for that year, each containing the measured data fields for that day (a hypothetical sketch of such a description appears after this list). The XML files for the dataset descriptions will use the General Descriptive Modeling Scheme (GDMS) developed by the DLRD. The GDMS standard can be used to build arbitrarily complex typed hierarchical descriptions, to provide rich annotations to them, and to associate digital resources with the appropriate points in the structure. The Library has used the GDMS to create descriptive models of artworks, architectural and archeological sites, and other types of annotated collections. For this project, an ontology of types appropriate to scientific datasets will be developed, and best practices defined for use of the GDMS. The XML codebooks prototyped so far have been developed using the Data Documentation Initiative (DDI) format. The DDI was designed for social science data, with a particular concentration on datasets from surveys and opinion polls. At the low level it works for the three datasets in the prototype, but some of the high-level description sections still need adapting. The goal will be to develop a codebook format that is generally appropriate for datasets. If necessary, the DDI standard will be extended or altered, and feedback will be provided to the DDI group.
• Every data object will be tied directly or indirectly to a primary collection object. Every object that is added to the UVA digital library must be part of a primary collection. These primary collections have their own objects that describe the collection as a whole and, if applicable, have a full-text index of XML files related to the collection associated with them. These primary collection objects should be thought of as the root nodes for networks of related objects, ensuring that every object in the digital library is a part of some network. Whenever some new type of collection is added, a new primary collection object will be created. These top-level collection objects are implicitly child nodes of the digital library itself. In the case of datasets, a primary collection object will represent the Library's collection of statistical and scientific datasets as a whole. This object will have a full-text index of all of the dataset description objects, representing the point of access for users who know that they are looking for datasets. See the discussion of the user interface in section 4 below for more details.
• Each data object will always have three metadata datastreams:
  o Descriptive metadata, where the intellectual content of the object as a whole is described;
  o Administrative metadata that describe the history of the digital object, policies associated with the use of the data, and technical information about the data object; and
  o Relationship metadata that organize the relationship information, containing the PIDs of related objects and information about the nature of the relationship.
This standardization of metadata ensures that there is a set of protocols that are globally applicable to the digital library. Both users and managers of the digital library will always have one predictable way to discover and investigate an object, regardless of content type or the medium through which it is expressed.
• Disseminators will be used to standardize access to objects wherever possible. A key feature of the UVA digital library architecture is that disseminators on all objects will fall into one of three categories:
  o default behavior definitions to which every object in the digital library must subscribe;
  o collection-specific behavior definitions to which every object of a certain primary collection must subscribe; and
  o special behavior definitions that provide specialized access to a subset of objects in a given collection.
Determining these behaviors for objects associated with datasets will be one of the tasks for this project.
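To illustrate the first principle above, the following sketch shows how a dataset description for the weather example might be navigated to locate database objects. The element and attribute names, and the PIDs, are invented for illustration; they are not the published GDMS schema.

    import xml.etree.ElementTree as ET

    # A hypothetical, GDMS-like dataset description: the hierarchy runs
    # year -> station, with PIDs of database objects embedded at the leaves.
    description = """
    <dataset title="Virginia Weather Observations">
      <level type="year" value="2003">
        <level type="station" value="Charlottesville" lat="38.03" lon="-78.48">
          <database pid="uva-data:weather-2003-cho"/>
        </level>
        <level type="station" value="Norfolk" lat="36.85" lon="-76.29">
          <database pid="uva-data:weather-2003-orf"/>
        </level>
      </level>
    </dataset>
    """

    root = ET.fromstring(description)

    def database_pids(root, year):
        """Walk the description hierarchy; return the PIDs under one year."""
        for level in root.iter("level"):
            if level.get("type") == "year" and level.get("value") == year:
                return [db.get("pid") for db in level.iter("database")]
        return []

    print(database_pids(root, "2003"))
    # ['uva-data:weather-2003-cho', 'uva-data:weather-2003-orf']

A disseminator on the dataset description object can use exactly this kind of traversal to resolve a user's request down to the database objects that hold the actual data.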
In work thus far, the default behaviors include methods providing general access to objects through web interfaces. As each object model is designed, a decision is made about how to implement each of these methods for that object, ensuring that a minimal level of access is possible without having to investigate each object. One set of methods delivers the metadata in basic pre-defined ways. Because every object has the same kind of metadata, a user (human or mechanical) who knows the PID of an object can always ask for the metadata for that object in the same way. The other set of methods is used for high-level web access across the entire collection. The method get_preview always returns a minimal view of the object suitable for presentation with previews of other objects, in situations such as returning hits from a search. The method get_default_view defines a starting point for access to the object when the entire access window of a web browser is available. This method usually renders a web page that shows a first view of the content and presents menus that give the end-user access to all of the other disseminations that are possible for a given object. Some behaviors are specific to the type of object in each collection. For image objects these methods include standardizing on different views of the image when there are different object models for image objects; all image objects must be able to deliver a thumbnail view, a screen-sized view, and the maximum size available. The special behaviors are used to deliver some functions that are associated with the content of some, but not all, objects of a type. Again using the image objects as an example, some bitmap images are geo-referenced and others are not. One example is the addition of an extra set of behaviors to images of maps so that a geographically based discovery interface for maps could take advantage of them. All of these principles are currently being implemented in the first phase of the digital library at UVA. The first installments of three large primary collections, including Modern English electronic text collections, art and architecture image collections, and document collections from the UVA Special Collections Library, will become fully operational in the spring of 2004. By the start of the proposed project, the three datasets described below will be the first quantitative dataset collection integrated into Virginia's digital library.
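The default behavior set described in this section can be pictured as an abstract definition (akin to a bdef) that every object model must implement, with collection-specific classes (akin to bmechs) supplying the details. In the following Python sketch, the method names get_preview and get_default_view follow the text; everything else is an assumption for illustration.

    from abc import ABC, abstractmethod

    class DefaultBehaviors(ABC):
        @abstractmethod
        def get_preview(self) -> str:
            """Minimal view, e.g. for a list of search hits."""

        @abstractmethod
        def get_default_view(self) -> str:
            """First full-window view, with menus for other disseminations."""

    class ImageObject(DefaultBehaviors):
        def __init__(self, pid: str, title: str):
            self.pid, self.title = pid, title

        def get_preview(self) -> str:
            # A thumbnail and caption are enough for a results list.
            return f"<img src='/thumb/{self.pid}' alt='{self.title}'/>"

        def get_default_view(self) -> str:
            # A full page with menus for the object's other disseminations.
            return f"<html><body><h1>{self.title}</h1>...menus...</body></html>"

    hit = ImageObject("uva-lib:67890", "Rotunda, south elevation")  # hypothetical
    print(hit.get_preview())

Because every object model fills in the same slots, a search interface can render any mixture of hits without knowing what kinds of objects they are.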
4. DSS DEVELOPMENT
Starting in the summer of 2003, Library personnel have been working with faculty from UVA's Departments of Systems and Information Engineering and Civil Engineering to design a general approach to collection building that can be applied to all types of statistical and science datasets. In the near term, the goal is to provide easy access to a uniform collection of data resources about Virginia. The Library is well into the process of developing a large collection of such resources, but the data are currently organized in a bewildering array of formats and organizational schemes. The digital library system is now at the stage where it can be used to normalize the collections and provide an easy-to-use interface to them. The group has done much of the work to develop the first prototype of a collection. Thus far, three datasets have been included: the population census, climate information, and traffic data. Work is underway to extract data from these three sets for the six urban areas that make up the Hampton Roads region of the lower Chesapeake Bay to test out the complete design. By the time that the proposed project begins, these collections will be available to be used as an initial testbed for the software design. While the software is under development, more datasets will be added that provide appropriate examples of the data needed for modeling in the domains described in section 2. The software to be developed falls into two parts. The first involves the design and development of more sophisticated disseminators to be added to the dataset description objects and the database objects. The disseminators already on the objects will use the dataset description object content in relatively simple ways to navigate the datasets to find a specific database object, which can then be downloaded. The disseminators to be developed in this project will take advantage of the structure of the dataset description objects to allow complex aggregations and extractions from multiple database objects related to the dataset. For example, a user may want to pick a particular set of variables from the population census across a selection of counties in Virginia. Rather than having to navigate to each database object and perform the necessary extraction, the dataset description object would be able to define a single dissemination that would be carried out across a group of database objects. For traffic or weather data, where data are created for every instance or minute, a user may want to aggregate data to the hour or day. This too could be accomplished by a single dissemination carried out across a set of database objects. Fedora's design really starts to shine when manipulating classes of objects that have appropriate sets of defined behaviors and share characteristics, even if the details of the data organization differ within the class. As not all operations are appropriate for all datasets, the objects carry only the behaviors that they can support. The user
interface can use the disseminators to deliver appropriate sets of choices to the user. For example, census data, which are already aggregated, should not have disseminators that add up records, whereas the weather and traffic data are additive and many uses of them will call for aggregation. Much of the interesting research in this project centers on defining behavior definitions for classes of objects. In addition to the distinction between aggregate and instance datasets, it is already obvious that datasets are almost always organized around time, place, and subject information, but in different combinations. There seem to be patterns of dataset characteristics based on these distinctions that will give definitions to classes of objects. If there is any hope of getting a handle on datasets in a general way, such classes must be defined. The other major work of the project will be to design and build the data workbench. This will be a client-side Java application that gives the user access to the objects in the Fedora repository, including the primary collection object that represents the collection of all dataset description objects. This means that the workbench can be built to take advantage of the disseminators, essentially using the behavior objects to "plug in" custom functionality for each object at the time of use. It would give access to all discovery functions through the primary collection object, allowing users to take advantage of the standard interfaces to build and save their own annotated views of the collections, and providing ways to save often-used sets of functions. Two teams will carry out the work: a design team that analyzes the testbed data, designs the appropriate behavior definitions for the disseminators, and designs the specifications for the functionality of the workbench; and a technical team that carries out all of the programming. The core of the design team will be the PIs and the graduate research associate. Other personnel will be drawn in as appropriate to answer questions and provide advice about the use of datasets in different domains and about good user interface design. The technical team will include a full-time, high-level Java programmer and some part-time student programmers, both paid for by the grant, supplemented by DLRD staff with experience using Fedora. All software developed in the project will be made freely available under an appropriate open-source license.
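The aggregation disseminations described above can be sketched as follows. The sketch uses Python's built-in sqlite3 module as a stand-in for the SQL datastreams of database objects; the table, column, and PID names are assumptions for illustration, not the project's actual schema.

    import sqlite3

    def load_database_object(pid):
        """Stand-in for fetching a database object's SQL datastream by PID."""
        conn = sqlite3.connect(":memory:")
        conn.execute("CREATE TABLE flow (ts TEXT, station TEXT, vehicles INTEGER)")
        conn.executemany(
            "INSERT INTO flow VALUES (?, ?, ?)",
            [("2003-07-04 08:01", pid, 12), ("2003-07-04 08:02", pid, 9),
             ("2003-07-04 09:15", pid, 20)],
        )
        return conn

    def hourly_totals(pids):
        """One dissemination carried out across a group of database objects:
        minute-level traffic counts are rolled up to hourly totals."""
        totals = {}
        for pid in pids:
            conn = load_database_object(pid)
            rows = conn.execute(
                "SELECT substr(ts, 1, 13) AS hour, SUM(vehicles) "
                "FROM flow GROUP BY hour"
            )
            for hour, count in rows:
                totals[hour] = totals.get(hour, 0) + count
            conn.close()
        return totals

    print(hourly_totals(["uva-data:traffic-i64", "uva-data:traffic-i264"]))
    # {'2003-07-04 08': 42, '2003-07-04 09': 40}

Because a census database object would not subscribe to such an additive behavior, the workbench can offer aggregation only where it makes sense, exactly as described above.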
5. CONCLUSIONS
The University of Virginia has been a leader in providing digital library collections and in offering the services needed to use them. Library and Engineering School personnel have developed an information architecture, based on a hybrid XML/SQL design that exploits the strengths of both, for including quantitative data collections on an equal footing with other digital resources in a digital library. The proposed system will create an on-line environment that allows both sophisticated and novice users to discover scientific and engineering data and to extract and prepare those data for use in complex modeling scenarios. Open-source software developed for the project will both directly exploit the Fedora architecture of the digital library system, by developing specialized disseminators, and provide a client application, a "data workbench", that can be used to prepare and use data collections extracted from the Library. Specific modeling scenarios that could be applied to other domains, involving transportation, environmental, homeland security, and commercial applications, will be demonstrated using the Library's developing archive of information about the environment and society of the Commonwealth of Virginia.
REFERENCES
Payette, S. & Staples, T. (2002) The Mellon Fedora Project: Digital Library Architecture Meets XML and Web Services. Lecture Notes in Computer Science, Vol. 2458, Springer-Verlag, Berlin Heidelberg New York, 406-421. URL http://link.springer.de/link/service/series/0558/papers/2458/24580406.pdf, Accessed 9 March 2004.
Staples, T. The General Descriptive Modeling Scheme. The University of Virginia Library website, URL http://www.lib.virginia.edu/digital/reports/metadata/gdms.html, Accessed 9 March 2004.
Staples, T., Wayland, R. & Payette, S. (2003) The Fedora Project: An Open-source Digital Object Repository Management System. D-Lib Magazine, April. URL http://www.dlib.org/dlib/april03/staples/04staples.html, Accessed 9 March 2004.
The Fedora Project: An Open-source Digital Object Repository Management System. The Fedora Project website, URL http://www.fedora.info/, Accessed 9 March 2004.
UVa Metadata. The University of Virginia Library website, URL http://www.lib.virginia.edu/digital/reports/metadata.html, Accessed 9 March 2004.
COPYRIGHT
Staples, Evans, Scherer, Tolson, & Nelson © 2004. The authors grant a non-exclusive licence to publish this document in full in the DSS2004 Conference Proceedings. This document may be published on the World Wide Web, CD-ROM, in printed form, and on mirror sites on the World Wide Web. The authors assign to educational institutions a non-exclusive licence to use this document for personal use and in courses of instruction provided that the article is used in full and this copyright statement is reproduced. Any other usage is prohibited without the express permission of the authors.