Integrating Amazonic Heterogeneous Hydrometeorological Databases

Gláucia Braga e Silva, Breno Lisi Romano, Henrique Fernandes de Campos, Ricardo Godoi Vieira, Adilson Marques da Cunha, and Luiz Alberto Vieira Dias
Brazilian Aeronautics Institute of Technology
São José dos Campos, São Paulo, Brazil

Abstract

This paper describes a logical database integration process implemented for existing databases of the Brazilian National Water Agency (Agência Nacional de Águas – ANA). It covers an important part of a Brazilian Project between ANA and the Brazilian Aeronautics Institute of Technology (ITA). The integration process started with a detailed analysis of four existing databases, followed by the elaboration and integration of a logical model. The major contributions of this paper are the description of the main processes, documentation, model auditing, and human resources involved. Finally, the best practices used for applying the main data modeling and database design techniques, together with the successful use of modeling tools, are also presented.

Key Words: data dictionary, database modeling, integration process, normalization, trigram technique.

1. Introduction

This paper tackles the logical Database (DB) integration process for existing DBs at the Brazilian National Water Agency (Agência Nacional de Águas – ANA) [1], within the scope of the Amazonic Integration and Cooperation Project for Modernization of Hydrological Monitoring (Projeto de Integração e Cooperação Amazônica para a Modernização do Monitoramento Hidrológico - ICA-MMH). The Project is being developed by professors and doctoral and master's degree students from the Electronics and Computer Engineering Department of the Brazilian Aeronautics Institute of Technology (Instituto Tecnológico de Aeronáutica - ITA), together with other members of the ITA Software Engineering Research Group (Grupo de Pesquisa em Engenharia de Software - GPES) [2]. From detailed analyses of four existing Application DBs at ANA, an integrated logical model was built using applicable DB techniques. The resulting model was audited using database design and modeling tools. All procedures were documented in order to support model communication, understanding, and maintainability.

The remainder of this article is organized as follows. The second section describes the motivation of this paper. The third section presents some related works. The Integration Process is addressed in the fourth section and detailed from the fifth to the ninth sections. The tenth section describes the use of the proposed DB at ANA. Finally, the eleventh and twelfth sections present some conclusions and suggestions for future works.

2. Main Scenario

In order to accomplish its mission of monitoring Brazilian hydric resources and to provide useful information for decision-making, ANA needs to record hydrological information such as hydrometeorological measurements, water quality analyses, historic series, measurement station management data, and forecasts of critical hydrological events. This information is collected by remote measurement stations comprised of data acquisition equipment, which can be linked to Data Collection Platforms (Plataformas de Coletas de Dados - PCDs) located near bodies of water. Some examples of data collected by the data acquisition equipment are: meteorological (solar radiation incidence, air temperature, relative humidity, and precipitation), hydrological (water level and rain), and water quality (water temperature, pH, and dissolved oxygen). These measurements are taken at different geographic points, at regular or irregular intervals, or when triggered by hydrometeorological events. The collected data are available on the Internet to users of hydric resources such as institutions, corporations, neighboring countries involved, public and environmental organizations, farmers, research institutes, and other interested groups. Some of the main applications for these data are the estimation of hydric resources availability, hydrological variability analysis, climate change prediction, and the forecasting of meteorological and critical events.

ANA currently has four distinct Application DBs responsible for hydrometeorological data storage: TELEMETRIA, HIDRO, PMQA, and RMQA. The TELEMETRIA and HIDRO Application DBs store telemetric data on hydrometeorological parameters, while the PMQA and RMQA Application DBs store water quality monitoring data. Figure 1 shows the sources of data sent to these DBs.

The information flow starts with data packages acquired from: PCDs; Brazilian States Situation Rooms; or one of the four Application DBs (TELEMETRIA, HIDRO, RMQA, and PMQA). Initially, data are received through the Data Acquisition System (SAD) and forwarded to the Data Treatment System (STD) by means of a unified protocol. The STD is responsible for data domain verification, filtering, and data persistence using the proposed Application DB.
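The domain verification and filtering performed by the STD can be illustrated with a minimal sketch. The paper does not give the STD's actual rules, so the parameter names, the value ranges in DOMAINS, and the Measurement type below are assumptions made only for illustration; persistence is left out.

```python
from dataclasses import dataclass

# Hypothetical value domains; the real limits used by the STD are not given in the paper.
DOMAINS = {
    "ph": (0.0, 14.0),
    "relative_humidity": (0.0, 100.0),
}


@dataclass(frozen=True)
class Measurement:
    station: str    # hypothetical station identifier
    parameter: str  # name of the measured parameter
    value: float


def in_domain(m: Measurement) -> bool:
    """Return True when the parameter is known and its value lies inside its domain."""
    bounds = DOMAINS.get(m.parameter)
    if bounds is None:
        return False
    low, high = bounds
    return low <= m.value <= high


def filter_valid(batch):
    """Keep only the measurements that pass domain verification (the filtering step)."""
    return [m for m in batch if in_domain(m)]
```

In this sketch, a measurement of an unknown parameter is rejected rather than stored, which is one possible policy among several.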

Figure 1: Existing Application DBs from ANA

The decentralized storage of strongly related and similar information complicates ANA management processes, causes redundancies and inconsistencies, and makes information retrieval difficult. In addition, the lack of standardization in methodology and technology makes updates and maintenance difficult. Table 1 shows the Database Management Systems (DBMS; SGBD in Portuguese) used to implement the four existing Application Databases. The last table row presents the new proposed integrated Application Database and its DBMS.

Table 1: DBMS used by each Application DB

Application DB              DBMS
TELEMETRIA                  MS SQL Server 2000
HIDRO                       MS ACCESS 97
RMQA                        MS SQL Server 2000
PMQA                        MySQL 5.0
PROPOSED APPLICATION DB     Oracle 11g Spatial

In this context, the ICA-MMH Project proposes to reduce these nonconformances by developing an integrated and optimized Application DB from the existing Application DBs, using advanced techniques. The proposed Application DB will be handled by the other systems that make up the ICA-MMH System of Systems – SoS (Sistema de Sistemas – SdS) shown in Figure 2. The ICA-MMH SoS comprises the following systems: Data Acquisition System (Sistema de Aquisição de Dados - SAD); Data Treatment System (Sistema de Tratamento de Dados - STD); Monitoring, Control, and Decision Support System (Sistema de Monitoramento, Controle e Apoio à Decisão - SMCAD); Data Diffusion System (Sistema de Difusão de Dados - SDD); and the proposed Application DB System (Sistema de Banco de Dados - SDB).

Figure 2: Information flow from ICA-MMH SoS

Once stored, data and/or information are ready to be monitored and/or controlled by the SMCAD and disseminated by the SDD to users of hydric resources, who can provide feedback that will be stored in the proposed Application DB for ANA's assessments. The existing PMQA Application DB is not considered an external data source because it is no longer updated by ANA. Therefore, all data derived from this DB will be imported directly into the proposed Application DB.

3. Related Works

Liu Sheng and Garcia [3] address the need for integrating existing heterogeneous databases in a hospital to achieve operational efficiency, effectiveness in diagnostic decision making, cost economy, better risk management, and strategic planning in a competitive health care environment. This context also applies to the ICA-MMH Project because the problems are similar. In an Application DB integration like the one addressed in this research, entity identification can be determined by the correspondence between object instances from more than one DB, as mentioned in [4]. Another related work, by Reddy et al. [5], proposes a methodology to resolve conflicts of naming, scaling, types, and level of abstraction, as well as data inconsistencies, during the integration process.

4. Integration Process

The integration process started with a careful analysis of the existing Application DB documentation. Then a logical model was built and normalized. Model elements were documented by applying techniques to build a consistent data dictionary. The model was audited in order to ensure quality and reliability. All steps were documented in order to support future work. DB design and modeling tools were used to meet the expected schedule and to make development easier. A conceptual diagram of the existing Application DB integration process is shown in Figure 3.

Figure 3: Conceptual diagram for the existing Application DB Integration Process

5. Existing Application DB Detailed Analysis

The documentation of the four existing Application DBs includes reports, SQL scripts, data dictionaries, data samples, and legacy systems [6]. The documentation for the HIDRO Application DB was considered satisfactory. For the TELEMETRIA Application DB, only an SQL script and a data sample were received. Finally, for the PMQA and RMQA Application DBs, only incomplete documentation was received. The ITA Technical Team therefore began a careful analysis in order to understand the context as well as possible before developing the integration logical model. Although essential for clarifying knowledge of the existing Application DBs, this step consumed considerable time and human resources. Intersections among the existing Application Databases were identified and recorded in an Integration Sheet that relates similarities and maps the entities and attributes that could be used within an integrated model. The main questions and doubts about the existing Application Databases were also collected for later verification and validation with the ANA Technical Team. This analysis procedure was also found to be useful later on, in the phase of data loading, mapping, and/or conversion from the existing Application Databases to the proposed Application DB.

In addition, the future integrated model must be a coherent and consistent version of the previous ones with respect to their stored information.
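The Integration Sheet described above can be pictured as a list of mappings from source entities and attributes to candidate elements of the integrated model. The sketch below is only an assumption about its shape: the paper does not show the sheet, and every row, entity name, and attribute name here is invented for illustration.

```python
from collections import defaultdict

# Each row maps one source attribute to a candidate attribute of the integrated model.
# All entity and attribute names below are invented for illustration.
INTEGRATION_SHEET = [
    ("TELEMETRIA", "estacao", "nome", "station.name"),
    ("HIDRO", "estacao", "nome", "station.name"),
    ("HIDRO", "perfil", "tipo", "profile.type"),
]


def intersections(sheet):
    """Return the integrated attributes that are fed by more than one source DB."""
    sources = defaultdict(set)
    for source_db, _entity, _attribute, target in sheet:
        sources[target].add(source_db)
    return {target: dbs for target, dbs in sources.items() if len(dbs) > 1}
```

A structure of this kind can be reused unchanged in a later data loading phase, since each row already records where an integrated attribute comes from.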

6. The Proposed Application DB Modeling

This section describes the integration of the four existing Application Database Models into a single logical model, developed using the best available data modeling practices.

6.1. The Logical Integration Model

The logical integration model for the proposed Application Database represents the business domain, disregarding technical or technological details. The Entity-Relationship Model (E-R M) representing the logical model of the proposed Application DB is shown in Figure 4. In this model, the entities are categorized as follows: entities inherited from an existing DB are shown in the color of that DB (GREY for PMQA, GREEN for HIDRO, YELLOW for TELEMETRIA, and CYAN for RMQA); entities used in two or more Databases are shown in PURPLE (INTEGRATION); and entities exclusive to the new proposed Application DB, created to provide model consistency and integrity, are shown in BLUE.
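The color categorization can be expressed as a small rule over the set of source Databases in which an entity appears. The function below is a sketch of that rule as stated in the text; the function name and the use of Python sets are assumptions, not part of the modeling tool.

```python
# Colors assigned to entities inherited from a single existing DB, as listed in the text.
SOURCE_COLORS = {
    "PMQA": "GREY",
    "HIDRO": "GREEN",
    "TELEMETRIA": "YELLOW",
    "RMQA": "CYAN",
}


def entity_color(origins):
    """Return the diagram color for an entity, given the existing DBs it comes from."""
    origins = set(origins)
    unknown = origins - set(SOURCE_COLORS)
    if unknown:
        raise ValueError(f"unknown source DB(s): {sorted(unknown)}")
    if not origins:
        return "BLUE"  # entity created only for the proposed Application DB
    if len(origins) == 1:
        return SOURCE_COLORS[next(iter(origins))]
    return "PURPLE"  # INTEGRATION: entity used in two or more DBs
```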

Figure 4: The Logical Integration Model

The resulting E-R M contains 74 entities and covers the scope of all four existing Application Databases, without redundancies or inconsistencies, while following hydrological patterns and concepts. The development of the logical integration model was supported by ERwin 4.0, a database design and modeling tool provided by Computer Associates [7], which maintains an academic agreement with ITA. Figure 5 compares the numbers of entities and attributes in the proposed Application DB with those in the four existing Databases: entities drop from 103 to 74 (about 28%) and attributes from 1518 to 396 (about 74%), without loss of functionality. These reductions result from the high level of data redundancy in the four existing Databases.

Figure 5: Entities and Attributes Reductions (bar chart; ANA DBs: 1518 attributes and 103 entities; proposed DB: 396 attributes and 74 entities)

In order to provide more detailed information, Table 2 shows the total numbers of entities and attributes in each of the four existing ANA Application DBs.

Table 2: Existing Application DB Entities/Attributes

ANA Application DB    ENTITY    ATTRIBUTE
TELEMETRIA            52        951
HIDRO                 24        428
RMQA                  14        71
PMQA                  13        68

6.2. Modeling Techniques

Among the data modeling best practices applied to the development of the integration model, the use of patterns, normal forms, trigrams, and data dictionary techniques for Database development deserves emphasis [8].

6.2.1 Adopted Design Patterns. This section presents two design patterns used in this work. The first consists of the following Data Rule Technique (DRT) guidelines for data model nomenclature:

DRT#01: To use names with well defined syntaxes and semantics;
DRT#02: To avoid foreign language names that are difficult to understand and/or pronounce;
DRT#03: To avoid special characters (except the underscore);
DRT#04: To use singular names for entities/attributes;
DRT#05: To use terms that are common to business clients; and
DRT#06: To use intuitive abbreviations.

The second design pattern applied a color scheme to illustrate the mapping of entities from the four existing Application Databases (TELEMETRIA, HIDRO, RMQA, and PMQA) to the integrated version in the proposed Application DB, together with the new entities, as shown in Figure 4. This color scheme aimed to guide the ITA Technical Team during development and to help in the interaction with the ANA Technical Team.

6.2.1. Normalization. The E-R M development was based upon the best practices suggested by Date [9] for data model normalization, while also considering the impact on data storage and retrieval performance. Normalization avoided nontrivial partial, transitive, and multivalued dependencies. At the same time, it simplified the representation of observed facts and kept inconsistencies, unnecessary redundancy, and accidental loss of information under control [10]. In general, the proposed Application Database is in third and/or fifth normal form. Specific situations, however, required some entities to be downgraded to a lower normal form in order to eliminate the cost of joins in selection operations.

6.2.2. Using the trigram technique. The trigram technique consists of using a three-character string (a trigram), normally made up of the first three letters or the three most significant letters of the entity name. The trigram was used as an attribute name prefix to emphasize the entity from which the attribute originates. The adopted notation is shown in Table 3.

Table 3: The Trigram Notation

Name         Notation                              Example
Simple       «Entity Trigram»_«AttributeName»      sta_name (attribute name from the station entity)
Composite    «Entity Trigram»_«AttributeName»      sta_interval_transmission (attribute interval_transmission from the station entity)
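The trigram notation can be illustrated with a short sketch that builds prefixed attribute names. The helper names trigram and attribute_name are hypothetical; the default rule takes the first three letters of the entity name, and the optional prefix argument stands in for the case where the modeler picks the three most significant letters by hand.

```python
import re
from typing import Optional


def trigram(entity: str) -> str:
    """Return the first three letters of the entity name, lower-cased (the default rule)."""
    letters = re.sub(r"[^A-Za-z]", "", entity)
    if len(letters) < 3:
        raise ValueError(f"entity name too short for a trigram: {entity!r}")
    return letters[:3].lower()


def attribute_name(entity: str, attribute: str, prefix: Optional[str] = None) -> str:
    """Build «Entity Trigram»_«AttributeName».

    Pass prefix to override the default trigram with a hand-picked one.
    """
    tri = prefix if prefix is not None else trigram(entity)
    return f"{tri}_{attribute}"
```

With these helpers, attribute_name("station", "name") yields sta_name and attribute_name("station", "interval_transmission") yields sta_interval_transmission, matching the two examples given for the station entity.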

6.2.3. The Data Dictionary Technique. The data dictionary technique was used to detail entities, attributes, and relationships. As previously mentioned, the proposed Application DB data dictionary was built with ERwin 4.0. Within this tool, the following nomenclature was used to facilitate modeling, team communication, model documentation, and the identification of entities and attributes of the proposed Application DB. Its use was also made compatible with the color scheme shown in Figure 4.

ORIGIN DATABASE.entity name.attribute name
E.g.: HIDRO.transversalprofile.sectiontype
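As an illustration of the nomenclature above, the following sketch splits a dictionary reference into its three parts. The DictionaryRef type and the parse_ref function are hypothetical names introduced here; they are not features of the modeling tool, and restricting the origin to the four existing Databases is an assumption.

```python
from dataclasses import dataclass

# The four existing Application DBs named in the paper.
KNOWN_ORIGINS = {"TELEMETRIA", "HIDRO", "RMQA", "PMQA"}


@dataclass(frozen=True)
class DictionaryRef:
    origin: str
    entity: str
    attribute: str


def parse_ref(text: str) -> DictionaryRef:
    """Parse 'ORIGIN DATABASE.entity name.attribute name' into its three parts."""
    parts = text.strip().split(".")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected ORIGIN.entity.attribute, got {text!r}")
    origin, entity, attribute = parts
    if origin not in KNOWN_ORIGINS:
        raise ValueError(f"unknown origin database: {origin!r}")
    return DictionaryRef(origin, entity, attribute)
```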

Special characters were used by the team to flag doubts (#) and comments (!) to be checked with the ANA Technical Team. Figure 6 shows an example of the dictionary applied to the logical model, emphasizing the documentation strategy adopted for entities and attributes.

Figure 6: An example of the Data Dictionary

7. Model Auditing

An internal verification and validation of the produced logical model was carried out using database system auditing techniques. These techniques consist of verifying the model with regard to normalization, performance, and incorrect interpretation of the application context in the business domain, among other aspects. Auditing can be performed manually or automatically, using diagnostic and validation tools. In this context, a trial version of ERwin Data Validator 7.2 [11] was used to verify the structural integrity of the logical model of the proposed Application DB. Although the results were considered satisfactory at this development step, the auditing team also applied a manual procedure, which identified some problems not detected by the automated procedure. One of these problems was that some relationships between entities had not been identified; others were conceptual aspects that could not be captured automatically, such as business rules. Among the main problems identified by the automated tool, two can be mentioned: attribute names duplicated in two entities (Figure 7); and too many references to one entity (Figure 8).

Figure 7: Attributes with the same name

Figure 8: Entity with too many references

The first example, in Figure 7, shows a trigram error that was not easy to identify manually. The second, in Figure 8, shows the entity ESTACAO being excessively referenced only because it was the main entity of the model, something that could clearly be avoided.

Once a nonconformance was identified, the auditing team met with the modeling team in order to discuss the main actions to be taken, either the necessary adjustments or the justification for the noncompliance. Modelers did not take part in the auditing team, in order to ensure auditing impartiality and reliability.
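The two kinds of problem reported by the automated audit in this section (attribute names duplicated across entities, and an entity with too many references) can be sketched as simple checks over a model description. This is not how ERwin Data Validator works internally; the input shapes, the function names, and the reference limit are assumptions chosen for illustration.

```python
from collections import Counter, defaultdict


def duplicated_attribute_names(model):
    """model maps entity name -> list of attribute names.

    Return the attribute names that appear in two or more entities,
    each with the sorted list of entities that use it.
    """
    owners = defaultdict(set)
    for entity, attributes in model.items():
        for name in attributes:
            owners[name].add(entity)
    return {name: sorted(ents) for name, ents in owners.items() if len(ents) > 1}


def heavily_referenced(references, limit):
    """references is a list of (child, parent) pairs, one per foreign-key relationship.

    Return the parent entities that receive more than `limit` references.
    """
    counts = Counter(parent for _child, parent in references)
    return {parent: n for parent, n in counts.items() if n > limit}
```

A duplicated name such as a sta_ prefixed attribute appearing outside the station entity would point to a trigram error of the kind described above, while the reference count only flags candidates that a human auditor still has to judge.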

8. Documentation

Documentation was a support activity throughout the integration process. It produced a set of artifacts able to guide other activities within the scope of the proposed Application DB development. The artifacts developed include:

• Integration sheets resulting from detailed analyses of the four existing Application Databases;
• The Proposed Application DB Entity-Relationship Model;
• Guidelines for data model nomenclature;
• The Data Dictionary;
• The Auditing Report;
• The Database design and modeling tools report;
• The Activity planning; and
• Client meeting reports.

9. Human Resources Allocated

The development of the proposed DB logical model involved eleven members and was divided into two steps. In the first step, all members took part in the detailed analysis so that they could better understand the business domain. In the second step, the team was divided among four different activities, according to Table 4.

Table 4: Activity versus Number of Members Involved

Activity       Number of Members Involved
Modeling       4
Dictionary     2
Auditing       3
Documenting    2

10. The Proposed Application DB for ANA

The proposed Application DB is important to ANA because it will be used by the National Hydric Resources Information System (Sistema Nacional de Informações sobre Recursos Hídricos - SNIRH), which is responsible for Brazilian hydric resources management. The SNIRH has been developed by ANA in parallel with the ICA-MMH Project. The main benefits of the proposed Application DB are: to improve the management of Information Systems; to increase information sharing; and, mainly, to reduce duplicated data storage.

11. Conclusion

After the main difficulties of the heterogeneous database integration process were faced, the use of good data modeling practices resulted in a more effective, consistent, and nonredundant logical model. This model is a coherent solution for the client's needs in the context of hydric resources management. In this research and development Project, among the main applicable database modeling techniques, the guidance provided by the integration process description, together with the use of design patterns and of database design and modeling tools, proved to be a relevant factor in the success of this work. All process activities were well documented, the resulting model was audited, and constant interaction with the client was maintained throughout the process in order to meet the specified requirements. The resulting model stands out because it represents a milestone for the ICA-MMH Project and the foundation for the future development of other systems. The flexibility of the logical model and its documentation can facilitate future specializations and maintenance.

12. Suggestions for Future Research

Now that the logical model has been checked by ANA, the next step will be the development of a physical model in Oracle 11g Spatial. After this implementation, the ITA team will start the procedures for loading data from the four existing ANA Application DBs. At that point, data processing mechanisms must be applied in order to guarantee quality, reliability, and consistency. Data from the PMQA Application DB will be loaded only once, because this DB is no longer updated. However, some procedures must be created for loading data from the other three Application DBs, which are still operationally needed by ANA. The proposed Application DB is the core of the ICA-MMH Project. It must be made available as soon as possible, so that hydrometeorological data can be manipulated through the human interfaces being developed by other teams, which will later be integrated into the ICA-MMH System of Systems - SoS.

13. Acknowledgments

The authors would like to thank: ITA, for its incentives to technological and scientific development; ANA, for the opportunity to participate in the ICA-MMH Project; the Projects and Studies Foundation (FINEP) and the Casimiro Montenegro Filho Foundation (FCMF), for the available infrastructure and scholarships; and Computer Associates, for the academic agreement with ITA supporting its database modeling tools.

14. References

[1] Brazilian National Water Agency (ANA). Available at http://www.ana.gov.br. Last access: 10/26/2008.
[2] Software Engineering Research Group (GPES). Available at http://www.gpes.ita.br. Last access: 10/25/2008.
[3] Liu Sheng, O.R.; Garcia, H.-M.C.; Information Management in Hospitals: an Integrating Approach. Proceedings of the Ninth Annual International Conference, Phoenix, 1990.
[4] Lim, E.-P.; Srivastava, J.; Prabhakar, S.; Richardson, J.; Entity identification in database integration. Proceedings of the Ninth International Conference, Phoenix, 1993.
[5] Reddy, M.P.; Prasad, B.E.; Reddy, P.G.; Gupta, A.; A Methodology for Integration of Heterogeneous Databases. IEEE Transactions, Volume 6, Issue 6, Dec. 1994.
[6] Brazilian National Water Agency (ANA). “Sistema de Documentação de databases” (Database System documentation), Brasilia, DF, 2008.
[7] Computer Associates, AllFusion® ERwin Data Modeler. Available at: http://www.ca.com/us/products/product.aspx?ID=260.
[8] Cunha, A. M. “Class Notes CE-240 – Projeto de Sistemas de Bancos de Dados”, ITA, São José dos Campos, SP, Brazil. 2008.
[9] Date, C. J. “Introdução a Sistemas de Bancos de Dados”, Editora Campus, Rio de Janeiro, 2004.
[10] Sanches, A. R. “Class Notes - Fundamentos de Armazenamento e Manipulação de Dados”. 2005.
[11] Computer Associates, AllFusion® Data Model Validator. Available at: http://www.ca.com/us/products/product.aspx?id=1081.