A Methodology Targeted at the Insertion of Data Warehouse Technology in Corporations
Walter Pereira, Karin Becker
Faculdade de Informática – PUCRS – Porto Alegre – Brazil
{wpereira, kbecker}@inf.pucrs.br
Abstract
Corporations aiming to improve their decision processes have shown particular interest in Data Warehouse (DW) technology. Many corporations that have no tradition in the use of computer systems for decision support have to rely on a team that is qualified in the development of traditional operational systems and database technology, but inexperienced in DW development issues. Moreover, for a number of reasons (e.g. availability, costs, privacy), it is not always possible to count on external development teams or consultants. This work presents a methodology targeted at the development of DW pilot projects, which aims at the smooth adoption of DW technology by corporations. The methodology has been successfully tested in a military DW pilot project, and the results obtained so far confirm its adequacy and consistency with respect to the established goals. The paper describes the striking features of the methodology and analyses its application in a real case study.
1
Introduction
Due to business competitiveness, managers are constantly challenged to respond in a timely manner to the opportunities and threats of the marketplace with decisions that make the best use of corporate resources. On-Line Transaction Processing (OLTP) systems, though essential for performing the daily operations of corporations, offer little support, if any, to decision making. Decision Support Systems (DSS) are computer-based systems that aim to improve the effectiveness of decision making, typically by performing analytic information processing [6]. Data Warehouse (DW) systems are DSS targeted at processing large amounts of historical data in order to identify profiles, patterns, behaviors and tendencies [5], [7], [9]. Data Marts (DM) are DWs of limited data scope, designed to represent a specific business area or corporation sector (e.g. marketing, sales, finance) [1], [9]. A DM can be regarded as a logical subset of a complete DW [7]. Works such as [2], [3], [7], [9], among others, claim that a DM/DW project must be developed by staff experienced in this area. It is even argued in [3] that inexperience is one of the main causes of failure in DW projects. As a consequence, these authors propose development methodologies targeted specifically at staff experienced in DM development. However, many corporations that lack tactical and strategic information, and feel attracted to DW technology, have staff who are highly skilled in the development of traditional OLTP systems and database technology but inexperienced in the development of DM/DW. Moreover, for a number of reasons (e.g. availability, costs, privacy, deadlines), it is not always possible to count on external development teams or consultants. These corporations lack a DW/DM development methodology that takes into account the available staff and their type of experience. The Brazilian Army is an example of such a corporation.
This institution is divided into several Military Organizations (MOs) specialized in computing, with highly qualified professionals, some of them even familiar with decision support technology, although lacking practical experience. Consequently, Army decision-makers do not benefit from DSS to improve their decisions at all levels. Considering this context, the contributions of this paper are twofold. First, it proposes a methodology targeted at the development of DW pilot projects, which aims at the smooth adoption of DW technology by corporations. It considers organizations with no tradition in the use or development of DSS in general, which can count only on staff that is qualified but inexperienced in the development of DW systems. Second, it assesses the use of the proposed methodology in a military DW pilot project, based on the Military Recruitment Service (MRS). The rest of this paper is structured as follows: Section 2 presents the relevant aspects of the case study; Section 3 analyses existing DM/DW development methodologies; Section 4 discusses the striking features of the proposed methodology for pilot project development; Section 5 briefly addresses the software engineering paradigm on which the proposed methodology is based; Section 6 assesses the proposed methodology in the development of a pilot project for military activities; and finally, Section 7 draws conclusions and presents future work.
2
Case Study
2.1
The Military Recruitment Service
The Military Recruitment Service (MRS) is a complex process, supported by a set of OLTP systems. Its basic purpose is to select, from the civilians who enlist every year, those who will actually serve the nation in the Brazilian Army. Though the recruitment process is the same for the whole country, it is carried out according to responsibility areas, called Military Regions (MRs), each of them supported by an MO specialized in data processing, referred to as a Data Computing Center (DCC). The remainder of this article uses as examples data, activities and processes from the 3rd and 5th MRs, which comprise 3 states of Southern Brazil and are supported by the 1st DCC, located in Porto Alegre. However, the discussions presented in this paper can be generalized to any MR/DCC of the country. The MRS is divided into four phases, illustrated in Figure 1.
Figure 1 - MRS Phases
[Diagram: Implantation (1st phase, MSC), Selection (2nd phase, SF), Distribution (3rd phase, RF) and Military Life (4th phase) for the 3rd and 5th MRs, supported by the 1st DCC. Recruit data is stored in one file per MR (3rd MR, 5th MR). Distribution inputs: skills, recruit data, MO vacancies, parameters. Volumes: 160,000 recruits/year in total, 20,000 RFs/year, 10,000 incorporated recruits/year. 140 reports (operational/managerial) are generated by the 1st DCC.]
In Brazil, enlistment is mandatory for young men in the year of their 18th birthday. The Implantation phase starts when the young man presents himself to a Military Service Council (MSC), becoming a recruit. Basic data is collected about each recruit (e.g. name, address, education level) and sent to the corresponding DCC, which processes and stores this information in distinct files according to the origin of the data (there is one file per MR/year). Considering only recruits enlisted in the 3rd and 5th MRs, about 160,000 records are added annually to the 2 corresponding files.
The Selection phase follows. The DCCs process the respective files and generate a form for each recruit, called the Selection Form (SF). Selection forms are then completed manually with data obtained in medical and dental evaluations, aptitude and psychological tests, and interviews, which most recruits undergo. The respective files are updated after the completed selection forms are processed by the DCCs.
Distribution is the most critical phase in the recruitment process. Its goal is to select, among all recruits, those who best suit the needs of the Army. During this phase, the OLTP systems receive a number of parameters from various Army divisions physically located in Brasília, Rio de Janeiro, Porto Alegre and Curitiba. These parameters, together with the specific requirements of each MO in terms of skills (e.g. driver, mechanic, cook) and vacancies, are matched against recruit data (e.g. weight, height, physical condition), as available in the respective files. The goal is to find the recruits who present the best profile according to the standards established by the Institution. In the context of the 3rd and 5th MRs, for example, about 20,000 recruits are selected every year as a result of this matching. For each of them, a Record Form (RF) is generated, and these recruits undergo a last set of exams (e.g. medical) and interviews at the MO to which they have been assigned. Finally, recruits are either selected or dismissed from Military Service. In our example, around 10,000 recruits out of the initial 160,000 are effectively incorporated into the Brazilian Army.
In the last phase, called Military Life, all files are updated according to the results of the Distribution process, particularly the records of those recruits who were incorporated into the Brazilian Army. Their records are updated for the last time when these soldiers conclude the military service period.
2.2
The MRS OLTP Systems
The MRS OLTP systems are composed of 204 COBOL applications developed in the 1970s, which are executed according to the recruitment phases. Administrators and decision-makers at several levels of the Army feel an enormous need for managerial and tactical information on all MRS phases. A deep analysis of the possible causes of this situation revealed the following critical issues:
• Information produced by the MRS OLTP systems normally refers to a single year, and data from distinct years is rarely integrated, which limits analytic capability.
• Every year, about 140 types of reports are generated. The number and length of these reports make their analysis difficult, if not impossible, since in many cases the time it requires is not cost-effective.
• In order to generate new reports in response to a specific situation, new programs have to be developed, a task that only the DCC in São Paulo is allowed to perform. Additional factors aggravate this situation, such as bureaucracy (e.g. official letters, requests, authorizations), geographical dispersion and the delay involved in producing the reports. Consequently, ad-hoc information is hard to obtain, particularly within the time available.
• Most of the time, data is not used to outline tendencies, patterns and profiles.
It should be stressed that the demand for managerial and tactical information at the various levels and phases of the MRS is mostly met through the intuitive and unstructured knowledge of people who have accumulated experience over the years they have been involved with the recruitment process.
3
DM/DW Development Methodologies
Developing data warehouses has become a popular but exceedingly demanding and costly activity in information systems development and management [11], which limits the wider adoption of this technology [4]. Frequently, corporations are not willing to take the risks involved in the development of DM/DW projects, due to restrictions such as costs, technical knowledge and time. Consequently, many organizations start by developing a DM or a pilot project before they invest in a complete DW, in order to gain experience, to show users the value of decision support information, or to provide a proof of concept for the organization's directors or committee [9]. Moreover, proven methodologies targeted at DM/DW development are still lacking. The few authors who have translated their own experience into a set of DM/DW development guidelines, referred to in this work as "development methodologies", recommend that DM/DW construction be carried out by experienced staff [7], [9]. These reasons help explain why few works address the issue of DM/DW development by staff that is technically skilled (e.g. highly qualified in OLTP systems development and database technology) but has no practical experience in DW development. Though the case study addressed in this paper focuses specifically on the military context, this problem is commonly observed in many organizations. Indeed, many enterprises and institutions could benefit from the use of DSS to support their decisions, and they might even be tempted to develop and adopt such systems. However, they lack within their own personnel the experienced staff recommended by DW development methodologies, and may not be able to make use of an outside development team for reasons such as costs, privacy, deadlines or availability.
These organizations demand a development methodology that suits their needs and constraints, and that helps convert the experience and knowledge the in-house staff has in the development of traditional OLTP systems and associated technology into DW development experience, in a smooth transition. In order to meet these requirements, it is our opinion that such a DW/DM development methodology has to present the following properties:
• It has to take into account development by staff that lacks practical experience in DW development.
• It has to be complete, addressing the project from its very initial stage up to its conclusion.
• All phases have to be well detailed, making it possible and easy to perform all required activities.
• It has to include a phase specifically targeted at the test and experimentation of various underlying DW aspects, such as tools, architecture, infrastructure and modeling alternatives.
Several works were analyzed in the light of these requirements, in particular [2], [3], [7], [9]. Among these, [7] and [9] were considered the most complete and detailed. Moreover, it was observed that although the sets of phases they propose are slightly different, in essence,
their work is similar in terms of components, functionality and activities addressed. Therefore, they were used as a conceptual framework for the present work, from which we have extracted the criteria to compare and evaluate existing methodologies. The results are summarized in Table 1.

Table 1 - DW Methodologies Comparison

Criterion                     [2]            [3]               [7]             [9]
Completeness                  No             Yes               Yes             Yes
Inexperience                  No             No                No              No
Phases (detail level)         Not detailed   Not detailed      Very detailed   Detailed
Technical infrastructure      No             No                Yes             Yes
Database design               Not detailed   Not detailed      Very detailed   Detailed
User applications             Not detailed   Not detailed      Very detailed   Very detailed
Data audit                    Not detailed   Not detailed      Detailed        Detailed
Use, support and extension    Not detailed   Poorly detailed   Very detailed   Detailed

Phases
  [2]: DM scoping; Project plan; DM implementation
  [3]: Planning; Design and implementation; Support and enhancement
  [7]: Project planning; Business requirement definition; Technical architecture design; Product selection & installation; Dimensional modeling; Physical design; Data staging design & development; End-user application specification; End-user application development; Deployment; Maintenance & growth
  [9]: Planning; Gathering data requirements and modeling; Physical database design and development; Data sourcing, integration, and mapping; Populating the data warehouse; Automating the data management process; Creating the starter set of reports; Data validation and testing; Training; Rollout
Data architecture
  [2]: No
  [3], [7], [9]: enterprise DW, dependent DM, independent DM and integrated DB; generic, independent DM, dependent DM feeding a DW; data access (ROLAP and MOLAP)
Functional architecture
  [2]: No
  [3]: No
  [7]: data back room, presentation servers, front room, services and metadata
  [9]: data integration, DW, transformation, data architecture and metadata
Dimensional modeling
  [2]: No
  [3]: No
  [7]: star and snowflake
  [9]: star and snowflake, multiple fact tables, outboard tables and multistar schemas
Pilot project
  [2]: No
  [3]: No
  [7]: proof of concept (Product selection & installation)
  [9]: proof of concept, and architecture and infrastructure, before deployment of the DW
As for inexperience and experimentation, it is worth mentioning that [9] is the only one that explicitly recommends the construction of a pilot project before the actual development of a DM/DW, suggesting two distinct types of pilot projects: proof of concept, and architecture and infrastructure. The former is intended to show administrators and decision-makers in general how a DM/DW can be useful to support their decision activities. The latter is targeted at verifying how all DW components work together, as well as understanding and gaining experience in all life cycle development phases, before the actual DM/DW construction occurs. However, despite the emphasis given to pilot project construction, [9] does not integrate this aspect into any phase of its DM/DW construction methodology, nor does it provide a detailed, systematic set of guidelines for pilot project development. As can be observed in Table 1, no available methodology fully addresses the set of requirements adopted for this work. In the rest of this paper, we propose a number of adaptations such that inexperience can be taken into account in a DW/DM development methodology.
4
A Methodology for the Insertion of DW Technology in Corporations
The methodology proposed in this work is an adaptation of existing ones, in particular [7], [9], to cope with a common and relevant problem: the development of a first DW project by a corporation's in-house team that lacks practical experience in such development. As already mentioned, these organizations demand a development methodology that helps convert the knowledge the in-house staff has in the development of traditional OLTP systems and associated technology into DW development experience, in a smooth process. An example of such a context is the Brazilian Army, in which the proposed methodology was assessed and refined. The key idea is the use of a pilot project, of the kind referred to in [9] as an architecture and infrastructure pilot project. The original contribution of the present work is that it integrates the development of a pilot project into a set of phases that later result in the development of the actual DW/DM, and that it provides a set of detailed guidelines for pilot project development. The overall methodology lifecycle is divided into 3 major stages, as depicted in Figure 2.
Figure 2 - Proposed Methodology Life-Cycle
[Diagram: three stages. Experimentation: scoping preliminary requirements gathering, preliminary planning, detailed requirements gathering, prototyping. Definition: project definition and pilot project planning updating. Execution: actual DM/DW pilot project. Also shown: DM/DW pilot project management.]
The Experimentation Stage constitutes a distinctive feature of the proposed methodology, and prototyping is its most important phase. The prototyping phase in fact involves the development of a pilot pre-project, in which a restricted set of data is considered throughout all development steps, in order to examine issues such as architecture, dimensional modeling techniques and database design. Among the benefits expected from prototyping are:
• To establish initial contact with the various DM/DW development techniques.
• To understand the complexity involved in DM/DW development.
• To gain experience with new tools and technologies.
• To learn the different tasks involved in each DM/DW development phase.
• To learn how to assign time schedules for the execution of the various underlying tasks.
• To learn about and practice the design of an analytic database, and carry out activities such as mapping, extraction, transformation and data load.
• To establish continuous interaction with users, in order to gather decision requirements and provide users with useful information.
The basic idea is that, by the end of the Experimentation Stage, the team has acquired enough knowledge and technical skills to make better and more reliable decisions on the various issues involved in the actual pilot project to be developed. The Definition Stage is aimed at reducing risks and uncertainties as much as possible, and at transforming the experience obtained in the previous stage into more accurate definitions to guide the DW pilot project construction up to its conclusion. Finally, the Execution Stage is dedicated to the development of the actual DW pilot project. As already mentioned, most phases and underlying activities are extracted from existing methodologies, [7], [9] in particular. The contribution of this work is the redistribution of these activities, such that the development of a pilot project by an inexperienced team is made possible, with the expected benefits.
4.1
Experimentation Stage
4.1.1 Scoping preliminary requirements gathering. The first step of the experimentation stage is targeted at gathering as much data as possible from the development environment and using the obtained results as input for the next phase [2]. This phase is subdivided into six independent modules, allowing the respective data to be collected in parallel, according to Table 2.
Table 2 - Scoping Requirements Gathering Modules
MODULE: OBJECTIVE
Environment and organizational influence: to gather human factors related to the organizational environment, such as people, interests, expectations and influences.
Requirements gathering: to understand and gather business requirements (e.g. cost estimate, services, hardware, software, tools, personnel, business area and DM scope).
Database design: preliminary gathering of which business requirements will be translated into physical database objects (e.g. dimension and fact tables, granularity, aggregates), and definition of who is going to perform each activity.
Data sourcing: to determine which data sources are necessary to enable data extraction, transformation and load activities in the DM.
Data delivery: to gather requirements concerning data presentation to decision-makers, as well as the necessary tools and/or applications (e.g. predefined and ad-hoc reports, interfaces).
Administration and support: to gather the administration and support activities necessary for the project before, during and after DM development.
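The six modules of Table 2 can be pictured as an independent checklist whose findings are later consolidated as input to the next phase. The following is only a minimal sketch under that reading: the module names come from Table 2, while the class, the function names and the sample findings are hypothetical and are not part of the methodology itself.

```python
from dataclasses import dataclass, field

# The six scoping modules named in Table 2.
MODULE_NAMES = [
    "Environment and organizational influence",
    "Requirements gathering",
    "Database design",
    "Data sourcing",
    "Data delivery",
    "Administration and support",
]


@dataclass
class ScopingModule:
    """One independent scoping module and the findings recorded for it."""
    name: str
    findings: list = field(default_factory=list)


def new_scoping_phase():
    """Create an empty checklist with one entry per module."""
    return {name: ScopingModule(name) for name in MODULE_NAMES}


def record(phase, module_name, finding):
    """Record a finding; modules are independent, so any order is fine."""
    phase[module_name].findings.append(finding)


def pending(phase):
    """Names of modules for which nothing has been gathered yet."""
    return [name for name, module in phase.items() if not module.findings]


def consolidate(phase):
    """Group all findings by module, as input for the next phase."""
    return {name: list(module.findings) for name, module in phase.items()}


# Hypothetical findings, for illustration only.
phase = new_scoping_phase()
record(phase, "Data sourcing", "Yearly recruit files, one per MR")
record(phase, "Data delivery", "Predefined and ad-hoc reports")
still_pending = pending(phase)
```

Because each module keeps its own list of findings, different team members can fill them in independently, and `pending` shows at a glance which modules still need attention.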
4.1.2 Preliminary planning. In this phase, all the requirements gathered in the previous phase are consolidated in a document called the Project Plan. The Project Plan can be structured according to the same topics as the scoping requirements gathering phase, and it can contain additional items, such as project viability analysis and success criteria. The Project Plan has to be submitted to the project sponsor, to the corporation board and to users, so that a judicious analysis of all topics, particularities and activities can be made. At the end of this analysis, the Project Plan can be approved, adjusted or even rejected, in which case the development project is aborted.
4.1.3 Detailed requirements gathering. If the Project Plan is approved, possibly with adjustments, the requirements gathering initiated in the scoping phase is intensified. It is also the moment for updating the Project Plan on issues such as development staff and schedule, data architecture, operational architecture and application strategies for its activities and processes, infrastructure and dimensional modeling techniques that can be adopted, and database design. The results of this phase establish the basis for the remainder of the process, and consequently, the quality of the system depends heavily on the level of detail and the precision of the requirements gathered in this phase.
4.1.4 Prototyping. This is the main phase of the experimentation stage, and a striking feature of the methodology proposed in this work. This phase has the same objectives presented by [9] for architecture and infrastructure pilot projects, but it has a different scope. First, the results of the phase are prototypes, in which several alternatives for the actual DW pilot project design can be considered, tested and assessed. Second, as already mentioned, although [9] recognizes the value of experimentation through pilot projects, it does not insert this concept into a development methodology, nor does it provide precise guidelines for their construction. The prototyping phase overcomes this limitation by highlighting a number of issues that must be considered, with the corresponding activities. The main modules of this phase are depicted in Figure 3. Each module can be executed as many times as necessary or viable, given the available financial resources, personnel, deadlines, products to be tested, etc.
The basic idea is that, given the requirements and definitions resulting from the previous phases, one or more prototypes are constructed. This allows intensive experimentation and testing of technologies, architectures, tools, etc., enabling the team to acquire experience and to eliminate or reduce most of the risks and uncertainties that normally jeopardize the development process. Owing to space limitations, the underlying modules are only briefly discussed below; further details can be obtained in [8].
Figure 3 - The prototyping phase
[Diagram: modules laid out over time, between the Detailed requirements gathering phase and the Definition Stage: planning, data architecture, functional architecture, infrastructure, product test, dimensional modeling, DB design, functional architecture execution, final user applications, data audit, and use, support and extension (support, use, extension), together with logical DB review, physical DB review, DW tuning and management.]
• Planning: in this module a concise plan is made for the different activities to be executed in the construction of a prototype. It has to contain a clear definition of the prototype objective and business area, the staff involved (with the corresponding tasks and responsibilities), deadlines, and the definition of the most important reports (including their structure and presentation).
• Data architecture: it is aimed at the choice of an architecture to organize data in the prototype (e.g. centralized, dependent DM, independent DM [1], [3], [9]).
• Functional architecture: definition of the main functionality required from the data warehouse. It involves the establishment of an overall plan that describes the data flow from source systems up to end users, representing the main warehouse elements (e.g. source systems, data organization area, organization server, users), data flows and corresponding services (e.g. extraction, transformation, data load) [7], [9].
• Infrastructure: it is aimed at the definition of the technical resources (e.g. hardware, software, communication, advisors, training) required to support the defined functional architecture. Notice that the defined functional architecture has a heavy influence on the choice of the technical resources necessary to support it, and vice-versa.
• Products test: it involves the testing of available tools to be used in the development of the prototype. The goal is to allow the staff to become familiar with the features of the tested products, such that functionality and performance can be evaluated. One must be particularly careful with time-related performance issues (e.g. retrieval time, extraction time, data load and transportation time). Notice that a loop is depicted in Figure 3 for this module, indicating that it can be re-executed several times. In particular, it is interesting to test different products under the same conditions (e.g. same infrastructure, data modeling, data amount), or to change scenarios in order to test the performance of the same product under different situations. Through the re-execution of this module, the staff should be able, by the end of the prototyping phase, to decide which tools are most suited to the situation at hand. In particular, the staff has to verify whether the functionality available in a tool can cope with extraction, data load, transformation, etc., or if additional applications must be built to handle these aspects. In the latter case, the details of these applications must be specified (e.g. programming language, staff, deadlines).
• Dimensional modeling: it involves the choice of a multidimensional structure (i.e. star, snowflake or existing variations) and its implementation, considering the requirements for the selected business area.
• DB design: encompasses all activities required to prepare the analytic database to receive the prototype data, such as database size estimation, physical structure creation, database physical partitioning, data protection (backup and recovery), data security, index creation, database tuning and index review [7].
• Functional architecture execution: the objective of this module is to perform the activities specified in the functional architecture module. During its execution, data is mapped and extracted from the source data, transformed and loaded into the prototype analytic database, using either generic tools or applications specifically designed for the task, as revealed by the assessment performed in the products test module.
• Final user applications: it involves the construction of an initial set of reports necessary to support decision-makers [7], based on the specifications resulting from the planning module.
• Data audit: the objective of this module is to verify and guarantee the quality of data stored in the analytic database. Data can be validated through a defined set of reports, development of procedures to verify the correctness of data extraction/transformation/load procedures, verification of data domains, record summarization, analysis of log files, etc.
• Use, support and extension: its aim is to support the daily activities of DW use, assuring its availability and continuous performance; to attend to and assist the expansion of DW use, enabling the inclusion of new applications, users and data; and to help keep the system constantly updated, so that it provides appropriate support for decision making [3].
• Management: it involves the follow-up of the activities defined for prototype development, as defined in the planning module [7].
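To make the dimensional modeling, DB design, functional architecture execution and data audit modules more concrete, the sketch below builds a very small prototype of the kind this phase calls for. It is an illustration under stated assumptions, not the implementation used in the case study: SQLite (through Python's standard `sqlite3` module) stands in for whatever analytic database is under test, and the table names, column names and sample figures are all hypothetical.

```python
import sqlite3

# Hypothetical source records standing in for yearly MR files
# (illustrative figures, not real MRS data).
SOURCE_ROWS = [
    # (year, military_region, education_level, recruits)
    (1999, "3rd MR", "primary", 120),
    (1999, "3rd MR", "secondary", 80),
    (1999, "5th MR", "primary", 95),
    (2000, "3rd MR", "secondary", 110),
    (2000, "5th MR", "primary", 70),
]


def build_star_schema(conn):
    """Dimensional modeling / DB design: one fact table, three dimensions."""
    conn.executescript("""
        CREATE TABLE dim_year (year_id INTEGER PRIMARY KEY, year INTEGER UNIQUE);
        CREATE TABLE dim_region (region_id INTEGER PRIMARY KEY, name TEXT UNIQUE);
        CREATE TABLE dim_education (education_id INTEGER PRIMARY KEY, level TEXT UNIQUE);
        CREATE TABLE fact_enlistment (
            year_id INTEGER REFERENCES dim_year(year_id),
            region_id INTEGER REFERENCES dim_region(region_id),
            education_id INTEGER REFERENCES dim_education(education_id),
            recruits INTEGER NOT NULL
        );
    """)


def dimension_key(conn, table, key_col, value_col, value):
    """Return the surrogate key of a dimension value, inserting it if new."""
    conn.execute(f"INSERT OR IGNORE INTO {table} ({value_col}) VALUES (?)", (value,))
    row = conn.execute(
        f"SELECT {key_col} FROM {table} WHERE {value_col} = ?", (value,)
    ).fetchone()
    return row[0]


def load(conn, rows):
    """Functional architecture execution: map source rows and load the fact table."""
    for year, region, education, recruits in rows:
        conn.execute(
            "INSERT INTO fact_enlistment VALUES (?, ?, ?, ?)",
            (
                dimension_key(conn, "dim_year", "year_id", "year", year),
                dimension_key(conn, "dim_region", "region_id", "name", region),
                dimension_key(conn, "dim_education", "education_id", "level", education),
                recruits,
            ),
        )
    conn.commit()


def audit(conn, rows):
    """Data audit: compare the row count and the measure total with the source."""
    count, total = conn.execute(
        "SELECT COUNT(*), SUM(recruits) FROM fact_enlistment"
    ).fetchone()
    return count == len(rows) and total == sum(r[3] for r in rows)


def recruits_by_year(conn):
    """Final user application: a report that integrates data across years."""
    return conn.execute("""
        SELECT y.year, SUM(f.recruits)
        FROM fact_enlistment f JOIN dim_year y ON y.year_id = f.year_id
        GROUP BY y.year ORDER BY y.year
    """).fetchall()


conn = sqlite3.connect(":memory:")
build_star_schema(conn)
load(conn, SOURCE_ROWS)
audit_ok = audit(conn, SOURCE_ROWS)
yearly = recruits_by_year(conn)
```

Keeping each module as a separate function mirrors the idea that modules can be re-executed independently: a different schema variant or load procedure can be swapped in and assessed against the same source rows and the same audit.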
4.2
Definition Stage
The Definition Stage, composed of a single phase called Project definition and pilot project planning updating, is another distinctive feature of the methodology proposed in this work. This stage is of utmost importance, given that all results obtained through experimentation in the previous phases, particularly prototyping, are carefully analyzed and transformed into a set of definitions for conducting the remainder of the project. In other words, this phase is responsible for reconciling the uncertainties that existed before the beginning of project development with the knowledge, experience and insights obtained through the Experimentation Stage. In this way, the staff can select, among all the design and technological alternatives considered, the ones that present the best cost/benefit for the corporation. The definitions of this stage update the Project Plan, whose initial version was elaborated in the preliminary planning phase. Several issues of the Project Plan are reviewed, such as development staff, together with their corresponding tasks, responsibilities and deadlines; schedules, including initial and conclusion dates; required/viable financial investment; pilot project business area and purpose; etc. Additionally, several definitions are included or reviewed, such as data and functional architectures, together with application strategies for their activities and processes; infrastructure; the selected dimensional modeling technique and the specification of various modeling issues (e.g. dimension tables and their attributes, fact table and its variables, data granularity, use of aggregates); details on database design; required user applications, together with the type of reports (i.e. predefined and ad-hoc), navigation structures and target decision makers; data updating strategies, etc.
4.3
Execution Stage
The last stage, Execution, is also composed of a single phase, named after the goal pursued here: Actual DM/DW Pilot Project Development. The development activities and guidelines underlying this phase were extracted mainly from [3], [7], [9], but adaptations were required, given that most uncertainties have already been eliminated through repeated prototype development and planning review. Indeed, although the development of the actual pilot project includes new issues to be considered, particularly a larger data volume and increased complexity in process application (e.g. extraction, transformation), it is considered that most of these difficulties were eliminated through the experience, knowledge and insight acquired during the former stages. The development modules of this stage, depicted in Figure 4, are similar to the ones of the prototyping phase (Figure 3), except for some modules that have been eliminated, namely planning, data architecture, functional architecture and products test, which were decided upon in the Definition Stage.
Figure 4 - Execution Stage Modules (modules shown along a time axis, after the Experimentation and Definition Stages: infrastructure; dimensional modeling; DB design; functional architecture execution; final user applications; logical DB review; physical DB review; data audit; use, support and extension; DW tuning; management)
5
The Paradigm of the Proposed Development Methodology
The proposed development methodology is based on the spiral model paradigm of software engineering, in which prototyping is combined with classic life cycle elements in an evolutionary approach [10]. The methodology involves four logical cycles of tests and experimentation, as depicted in Figure 5.
Figure 5 - Test and Experimentation Cycles (nested cycles, from outermost to innermost: DM/DW pilot project, Experimentation Stage, prototyping phase, product test module)
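As a rough illustration of how the four cycles of Figure 5 nest inside each other, the following Python sketch models a pilot project as a container of Experimentation Stage executions, each holding prototyping phases, each of which triggers some product test module runs. The class names, the function `count_cycles` and the numbers of repetitions are hypothetical and chosen only for illustration; they are not part of the methodology itself.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Hypothetical model of the four nested cycles; names and counts are illustrative.

@dataclass
class PrototypingPhase:
    product_tests: int  # product test module runs triggered by this phase


@dataclass
class ExperimentationStage:
    prototyping_phases: List[PrototypingPhase] = field(default_factory=list)


@dataclass
class PilotProject:
    experimentation_stages: List[ExperimentationStage] = field(default_factory=list)


def count_cycles(projects: List[PilotProject]) -> Dict[str, int]:
    """Count how many times each of the four cycles was executed."""
    counts = {
        "pilot_project": 0,
        "experimentation_stage": 0,
        "prototyping_phase": 0,
        "product_test": 0,
    }
    for project in projects:                              # outermost cycle
        counts["pilot_project"] += 1
        for stage in project.experimentation_stages:      # second cycle
            counts["experimentation_stage"] += 1
            for phase in stage.prototyping_phases:        # third cycle
                counts["prototyping_phase"] += 1
                counts["product_test"] += phase.product_tests  # innermost cycle
    return counts


# One pilot project, one Experimentation Stage, two prototyping phases
# (the product test counts 3 and 2 are invented for the example).
example = [PilotProject([ExperimentationStage([PrototypingPhase(3), PrototypingPhase(2)])])]
print(count_cycles(example))
# {'pilot_project': 1, 'experimentation_stage': 1, 'prototyping_phase': 2, 'product_test': 5}
```

Each inner loop runs entirely within one iteration of the loop that encloses it, which is one way to read the statement that every cycle contains the ones inside it.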
The first cycle consists of the pilot project as a whole, whose objectives and benefits are extensively highlighted in [9], as discussed in the previous sections. As a result of its development, the pilot project can evolve into a complete DM/DW, receive adjustments, be executed again or even be aborted, in case the corporation considers that concluding the project is not viable. Each time a project is developed, the Experimentation Stage (Figure 2) can be executed one or more times, according to the need and the availability of personnel, time and other resources. Additionally, each time the Experimentation Stage is executed, it may trigger the execution of the prototyping phase as many times as required. Finally, each prototyping phase might trigger the execution of one or more products test modules (Figure 3). These four cycles were established with various purposes, among them the acquisition of experience by the development staff and the development of a strong and continuous interaction between the project staff and final users, transforming the latter into effective participants in the various stages of the pilot project construction. Most importantly, the cycles presented in Figure 5 are organized in such a way that each one of them subsumes the inner ones, gradually reducing risks, uncertainties and staff inexperience, and consequently increasing, little by little, the chances of success for the DM/DW pilot project.
6
A Pilot Project Targeted at Supporting Military Activities
The methodology discussed in the previous section was applied to the MRS case study (Section 2) in order to assess its consistency and suitability. The Army is a typical example of the organizations considered in this work: (1) it has no tradition in the development/use of DSS, but it is attracted to DW technology due to the expected benefits for its decision activities; (2) it has a staff qualified in the development of OLTP systems and associated technology, but lacking practical experience in the development of DW; and (3) the use of an external development team or consultants is not viable due to factors such as costs, availability and privacy. Due to space limitations, only the most significant aspects of the case study are highlighted here, but a complete description can be found in [10].
The in-house staff initially assigned to the project consisted of four people, all of them highly skilled in the development of OLTP systems. The team was composed of a manager and a developer, both with a fair theoretical knowledge of DW technology but no practical experience in the development of such systems, plus one expert in OLTP development and one database administrator, both with no knowledge of DSS. None of these people was initially involved with the MRS system.
The most important results of the scoping requirements gathering phase were: (1) a general overview of the managerial and operational aspects that compose the MRS context and the associated OLTP systems, and (2) the assessment of the viability of the pilot project, particularly due to the availability of adequate technical infrastructure (e.g. personnel, hardware, software), a propitious organizational environment (e.g. decision-makers interested in the project and its results), and the availability of historical data. All results were recorded in a Project Plan.
During the preliminary planning phase, the Project Plan was submitted to decision-makers at different levels of the Army and, after the necessary adjustments, the project was approved and authorized.
During the detailed requirements gathering phase, the gathering of requirements initiated in the previous phases was intensified by means of interviews and facilitation sessions conducted in Porto Alegre, São Paulo, Brasília and Rio de Janeiro, through meetings or other means of communication (e.g. telephone, email). Available reports, usual decision procedures, the OLTP systems involved (including the COBOL code), etc., were studied. The results obtained allowed a review of the Project Plan, as discussed in Section 4.1. The following aspects were observed in practice: (1) given that the pilot project depends directly on the quality of the requirements gathered, the team responsible for this task has to be carefully selected, instructed and monitored; (2) rigorous management and monitoring are required, in order to avoid losing the focus of the efforts spent on the project; (3) in this stage, it is fundamental to study and assess the possible technological alternatives that can be adopted in the following phase, such as data architecture, infrastructure, database design, etc.
During the prototyping phase, two prototypes (called “A” and “B”) were developed for a reduced but significant data set. Prototype “A” made use of two distinct Regional Records files, corresponding to approximately 160,000 records, whereas for prototype “B” five Regional Records files were employed, involving nearly 370,000 records. The construction of the prototypes involved a number of experiments and variations in terms of architecture, tools, infrastructure and other issues. Among the most important results and lessons learned, we can mention: (1) the development staff acquired considerable experience in DM/DW construction by performing the various activities underlying DW development several times, as distributed in the modules of the prototyping phase, which confirmed the utility and consistency of the proposed methodology for dealing with inexperienced teams; (2) the centralized data architecture is easier to construct and allows centralized control; (3) the functional architecture proposed by [7] could be implemented adequately due to the amount of detail provided by those authors, and this kind of architecture presented very good performance; (4) the dimensional star modeling technique revealed a number of advantages, such as efficiency in data retrieval, intuitiveness for final users, and simplified metadata understanding and navigation by developers and final users; (5) considering the various tools available for testing (Oracle Data Mart Suite, Microsoft SQL Server 7.0, and Cognos Impromptu and Power Play), the MRS particularities and the available technical infrastructure, the tool that best suited the overall needs of the pilot project was Data Mart Suite (DMS), whereas Impromptu and Power Play were more suitable for the data presentation phase; (6) with the use of the DM Report and DM Discoverer tools (DMS), together with Impromptu and Power Play, it was perfectly possible not only to generate the same administrative reports presently produced by the MRS OLTP applications, but also to obtain managerial information in response to users' ad hoc queries, with the advantage of high speed and flexibility in report creation and update; (7) the finest possible granularity, i.e. one fact table record for each recruit, had to be adopted in order to generate all required reports; (8) besides the use of standard tools, specific applications had to be developed for the transformation activities and for loading data into the analytic database, as well as for the data audit module; (9) the hardware, software and network infrastructure that presented the best performance, considering the available resources, was an NT server physically accessing the database located on an AIX server via a 10 Mbps network using the TCP/IP protocol, as depicted in Figure 6.
DW DEVELOPMENT. Hardware: Pentium II 333 MHz, 96 MB RAM, 8.4 GB HD, DVD. Software: Windows NT 4.0, DMS Client, Impromptu and Power Play.
NT SERVER. Hardware: Pentium II 400 MHz, 128 MB RAM, 3 x 4.3 GB SCSI HD, DVD. Software: Windows NT 4.0, DMS Server.
AIX SERVER. Hardware: RISC 6000 F50, 256 MB RAM, 5 x 4.3 GB SSA + 1 x 4.3 GB SCSI HD. Software: IBM AIX 4.3, Oracle DB 8.0.5.
USER. Hardware: several. Software: DM Discoverer/Report, Impromptu/Power Play.
Network links: (a) TCP/IP, 10 Mbps (shared); (b) TCP/IP, 10 Mbps (dedicated).
Figure 6 - Technical infrastructure
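The lesson on granularity among the prototyping results can be illustrated with a small Python sketch. It assumes hypothetical recruit attributes (`city`, `year`, `fit`, `wishes_to_serve`) and invented sample rows; these are not the actual MRS data model. The point it tries to show is that recruit-level facts can be grouped in any way a report needs, whereas a table pre-aggregated by city and year no longer carries the attributes needed for other questions.

```python
from collections import Counter

# Hypothetical recruit-level facts: one row per enlisted recruit (finest grain).
# Attribute names and values are illustrative, not the actual MRS model.
recruit_facts = [
    {"city": "Porto Alegre", "year": 1996, "fit": True,  "wishes_to_serve": True},
    {"city": "Porto Alegre", "year": 1996, "fit": True,  "wishes_to_serve": False},
    {"city": "Porto Alegre", "year": 1997, "fit": False, "wishes_to_serve": True},
    {"city": "Curitiba",     "year": 1997, "fit": True,  "wishes_to_serve": True},
]

# A coarser grain: number of enlisted recruits per (city, year).
# This is enough for one fixed report, but fitness and wish to serve are lost.
enlisted_per_city_year = Counter((r["city"], r["year"]) for r in recruit_facts)


def fit_volunteers(facts, city):
    """Count fit recruits in a city who declared the wish to serve.

    This can only be answered from the recruit-level rows, because the
    pre-aggregated counter above does not keep these two attributes.
    """
    return sum(
        1 for r in facts
        if r["city"] == city and r["fit"] and r["wishes_to_serve"]
    )


print(enlisted_per_city_year[("Porto Alegre", 1996)])  # 2
print(fit_volunteers(recruit_facts, "Porto Alegre"))   # 1
```

The trade-off is storage and processing volume, since the finest grain keeps one record per recruit instead of one per group.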
In the Definition Stage, all knowledge, experience and insight acquired through the prototyping phase were used to update the Project Plan. The main project definitions adopted were: (1) inclusion of another database administrator in the development staff, due to the continuous adjustments, reorganization, partitioning, etc. of the analytic database; (2) update of the project schedule, extended by two months, totaling 8 months for the whole pilot project development; (3) adoption of a centralized data architecture; (4) use of the functional architecture proposed by [7], with DMS tools for data extraction and part of the data transformation activities, applications written in Oracle PL/SQL for the other data transformation and loading activities as well as for the data audit, and Oracle and Cognos tools for data presentation activities; (5) adoption of the technical infrastructure specified in Figure 6; (6) use of the star dimensional modeling technique, with the preliminary definition of a model involving one fact table and twenty-five dimension tables; (7) adoption of the finest possible granularity, so that each enlisted recruit would be represented as a fact; (8) aggregates would not be considered for the pilot project; (9) files from the 3rd and 5th MR for the years 1996 to 1999 would be used, representing 800,000 records that require a minimum storage space of 2 GB; (10) data security would be enforced through authorization and authentication procedures; (11) data protection would be performed through a daily, weekly and monthly backup policy; (12) data access would initially be available to the main decision-makers without restrictions.
In the Execution Stage, all project definitions and experience acquired in the previous phases were employed. This phase was therefore greatly facilitated, considering that practically all development procedures had already been considered and tested previously, mainly during the prototyping phase. As a result, it was possible to generate the same administrative reports presently produced by the MRS OLTP systems, as well as additional ones, in response to decision-makers' ad hoc demands. It should be stressed that decision-makers were definitely impressed by the ease and speed with which the required information could be produced by the DW. It is also worth noting that, initially, demands came from managers involved in the MRS context, such as “the number of apt recruits who declared the wish to serve in the period from 1996 to 1999, for the city of Porto Alegre”, “the number of recruits submitted to medical and dental tests in the period from 1996 to 1999, in Porto Alegre”, etc.
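To make the first of these demands concrete, the sketch below shows how such an ad hoc request might be expressed against a small star schema, using Python's standard `sqlite3` module instead of the Oracle and Cognos tools used in the project. The table and column names (`fact_recruit`, `dim_city`, `dim_year`, `dim_fitness`, `wishes_to_serve`) and all sample rows are assumptions made for illustration; the actual pilot model had one fact table and twenty-five dimension tables, which are not reproduced here.

```python
import sqlite3

# In-memory database with a tiny, hypothetical star schema:
# one recruit-level fact table and three of the many possible dimensions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_city    (city_key INTEGER PRIMARY KEY, city_name TEXT);
CREATE TABLE dim_year    (year_key INTEGER PRIMARY KEY, year INTEGER);
CREATE TABLE dim_fitness (fitness_key INTEGER PRIMARY KEY, status TEXT);
CREATE TABLE fact_recruit (
    recruit_id      INTEGER PRIMARY KEY,   -- one row per enlisted recruit
    city_key        INTEGER REFERENCES dim_city(city_key),
    year_key        INTEGER REFERENCES dim_year(year_key),
    fitness_key     INTEGER REFERENCES dim_fitness(fitness_key),
    wishes_to_serve INTEGER                -- 1 = declared the wish to serve
);
""")

conn.executemany("INSERT INTO dim_city VALUES (?, ?)",
                 [(1, "Porto Alegre"), (2, "Curitiba")])
conn.executemany("INSERT INTO dim_year VALUES (?, ?)",
                 [(1, 1995), (2, 1996), (3, 1999)])
conn.executemany("INSERT INTO dim_fitness VALUES (?, ?)",
                 [(1, "apt"), (2, "not apt")])

# Invented sample facts: (recruit_id, city, year, fitness, wishes_to_serve).
conn.executemany("INSERT INTO fact_recruit VALUES (?, ?, ?, ?, ?)", [
    (1, 1, 2, 1, 1),  # Porto Alegre, 1996, apt, wishes to serve
    (2, 1, 3, 1, 1),  # Porto Alegre, 1999, apt, wishes to serve
    (3, 1, 3, 2, 1),  # Porto Alegre, 1999, not apt
    (4, 1, 1, 1, 1),  # Porto Alegre, 1995: outside the requested period
    (5, 2, 2, 1, 1),  # Curitiba: other city
    (6, 1, 2, 1, 0),  # Porto Alegre, 1996, apt, no wish to serve
])

# Ad hoc request: apt recruits who declared the wish to serve,
# from 1996 to 1999, in Porto Alegre.
query = """
SELECT COUNT(*)
FROM fact_recruit AS f
JOIN dim_city    AS c ON c.city_key = f.city_key
JOIN dim_year    AS y ON y.year_key = f.year_key
JOIN dim_fitness AS s ON s.fitness_key = f.fitness_key
WHERE c.city_name = ?
  AND y.year BETWEEN ? AND ?
  AND s.status = 'apt'
  AND f.wishes_to_serve = 1
"""
(count,) = conn.execute(query, ("Porto Alegre", 1996, 1999)).fetchone()
print(count)  # 2
```

Because the facts are kept at recruit level and every filter is a dimension attribute, a different request (another city, period or status) only changes the query parameters or the `WHERE` clause.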
Without the DW, these requests would have to be answered by new applications over the OLTP systems, involving a great deal of bureaucracy and a minimum delay of 10 days. Using the developed DW, the requested reports could be delivered in approximately one hour. Later, we were authorized to provide information to external organizations that contacted the Recruitment Service to obtain statistical data about Brazilian youth. Two examples can be mentioned: (a) the Pontifical Catholic University of Paraná asked the 5th MR for anthropometric and clinical statistical data (e.g. height, weight, foot and head size, waist size, clinical diagnosis, etc.) on the youth enlisted in the three Southern States of Brazil over the last six years, for a scientific study in the areas of physiology and sports nutrition; (b) the Education Office of the State of Rio Grande do Sul, through the 3rd MR, has requested statistical data on the educational level of young people who enlisted in Porto Alegre over the last years.
7
Conclusions and Future Work
This paper presented a DW/DM development methodology to cope with a common and relevant problem: the development of a first DW project in a corporation by an in-house team that lacks practical experience in such development. These organizations need a development methodology that helps convert, in a smooth process, all the knowledge the in-house staff has of the development of traditional OLTP systems and associated technology into DW development experience. The proposed methodology is an adaptation of existing ones that takes the inexperience factor into account, mainly through the experimentation and definition stages. The experimentation stage constitutes a large “laboratory”, where the staff gains experience and eliminates uncertainties about the project through the execution of a number of intensive and repetitive tests. Although the objectives of this stage are not new, the contribution of this work lies in its insertion into a complete methodology, and in the detailing of a systematic approach to prototype development. The definition stage, another contribution of this work, allows the transformation of these experiences into project definitions, thus reducing the uncertainties and surprises traditionally involved in the actual pilot project development.
The methodology was applied with great success in the Brazilian Army, where the MRS was used as a case study. It was possible to assess the consistency and contribution of the proposed methodology for the problems addressed in this paper. The staff had no difficulty in developing the actual pilot project after going through the two initial stages. Additionally, the DW pilot project produced was fully approved by decision-makers, and demonstrated to the organization's decision-makers the value of decision support systems.
This work presented the first steps towards a complete methodology. Future work includes its validation in other case studies, the evaluation of how staff experience evolves and how the corresponding methodological needs vary, the extension of pilot projects to complete DW systems, tools for supporting the development process, and metadata, among others.
References
[1] Bontempo, Charles & Zagelow, George. The IBM Data Warehouse Architecture. Communications of the ACM, 41 (9): 38-48. Sept. 1998.
[2] Dyché, Jill. Scoping Your Data Mart Implementation. DBMS. Aug. 1998.
[3] Gardner, Stephen R. Building the Data Warehouse. Communications of the ACM, 41 (9): 52-60. Sept. 1998.
[4] Gray, Paul & Watson, Hugh J. Decision Support in the Data Warehouse. New Jersey, Prentice Hall PTR, 1998.
[5] Inmon, William H. How to Build a Data Warehouse. Rio de Janeiro, Editora Campus, 1997 (in Portuguese).
[6] Keen, Peter G. W. & Morton, Michael S. Scott. Decision Support Systems: An Organizational Perspective. Addison-Wesley Publishing Company, 1978.
[7] Kimball, Ralph; Reeves, Laura; Ross, Margy & Thornthwaite, Warren. The Data Warehouse Lifecycle Toolkit: Expert Methods for Designing, Developing, and Deploying Data Warehouses. New York, John Wiley & Sons, 1998.
[8] Pereira, Walter Adel Leite. A Methodology Targeted at the Insertion of Data Warehouse Technology in Corporations. MSc. Dissertation. Porto Alegre, PUCRS (in Portuguese).
[9] Poe, Vidette; Klauer, Patricia & Brobst, Stephen. Building a Data Warehouse for Decision Support. New Jersey, Prentice Hall PTR, 1998.
[10] Pressman, Roger S. Software Engineering: A Practitioner's Approach. McGraw Hill, 1991.
[11] Sen, Aru & Jacob, Varghese S. Industrial-Strength Data Warehousing. Communications of the ACM, 41 (9): 29-31. Sept. 1998.