Improving data quality
Growing in Maturity

Frank Boterenbrood MSc
10-03-2010
Management Summary

Windesheim aims to become a near zero-latency organization: an organization that is able to respond promptly to events in its environment. However, unexpected errors hinder the implementation of near zero-latency business process technologies. These errors are caused by poor data quality. The main business problem triggering this research is that poor data quality inhibits Windesheim's ability to become a near real-time organization. Closer examination reveals a serious business impact of poor data quality, defined by student (customer) dissatisfaction, inefficient process execution, loss of image and loss of control.

Earlier research revealed that poor data quality is caused by applications not checking input values, and by information objects having different values and definitions in different business domains, which in turn is caused by a departmental view on data instead of a more holistic, business-process-wide view on information. The migration from a departmental view on data towards a holistic view on information is characterized as a growth in maturity. Not only does the impact of rapid changes in technology force Windesheim to grow in maturity, the migration from the data processing era to the information era is driven by international developments too. As part of this migration, a natural crisis, the technological discontinuity, has to be overcome.

In this research, the relation between organizational maturity and data quality is sculpted into an instrument predicting required organizational change. The (CMMI-based) instrument defines five levels of data quality maturity, ranging from 1) Initial through 2) Managed, 3) Defined, 4) Quantitatively Managed and 5) Optimizing. For each level, based on proven theories, process areas, goals and metrics are defined. Using this instrument, current and required data quality and the corresponding organizational maturity were investigated for Windesheim's main business process, study management. Current and required maturity levels were assessed by observing the process areas and goals currently implemented and by linking required goals with the business rules of study management. It was found that, in this domain, Windesheim has currently reached data quality maturity level one, and that to satisfy both the business rules and the ambition to become a near zero-latency organization, data quality maturity level three is a minimal prerequisite. Some data quality maturity level four goals will have to be reached as well. To reach the required goals, a three-staged migration path is recommended1:

1. Reach data quality maturity level two (Managed) first, by repairing the current database and creating reports for data quality monitoring purposes by means of well-defined projects;
2. Reach data quality maturity level three (Defined) by putting a lasting programme in place, adapting Educator (preventing errors by checking data quality at the input functions and simultaneously reducing complexity by simplifying functionalities), empowering staff, making teachers responsible for the complete process cycle, creating near real-time interfaces based on standard application interfaces, and handling the technological discontinuity;
3. Implement the required level four (Quantitatively Managed) goals by establishing and communicating strict deadlines within the study process.
This will clear the way for Windesheim to become a near zero-latency organization, improve study management process efficiency, reduce the cost of error detection and recovery, and improve customer (student) satisfaction. Taking into account the benefits of this outcome for Windesheim, I advise management to decide on implementing the recommendations made in this research.
1 Detailed advice is available in paragraph 5.6.2
The Organization

Windesheim is a university of professional education, located in Zwolle, currently serving more than 17,000 students. The organization is controlled by the Board of Directors, which directs the departmental management. The number of employees is 1,800, 900 of whom are teaching staff. At board level, Windesheim and VU University Amsterdam are closely related. As a result of this cooperation, Windesheim offers some master's programmes in Zwolle and has recently started the Honours College Zwolle, a college aimed at serving international and 'high potential' students2.
[Figure: context diagram relating Windesheim's Board of Directors, 11 schools and 6 service departments to business partners, collaborating schools, the VU-Windesheim cooperation, students and accreditation bodies]
Figure 01: Windesheim Context Diagram
2 Instellingsplan 2007 – 2012, Besluit nummer 441 College van Bestuur van Windesheim
Parties involved

Author: Frank Boterenbrood, Waardeel 1f, 8332 BB Steenwijk
E-mail: [email protected]

Supervisor: Albert Paans
E-mail: [email protected]

Supervisor: Rob Keemink
E-mail: [email protected]
Preface

Surely, it is hard to find anything less inspiring than data quality. But look at it this way: there the data sits in the application's database, waiting to be retrieved, combined, processed and transformed into useful information. This is its moment of glory, the moment when it shines at the user interface, or perhaps even the management dashboard, delivered by information services conforming to service level agreements and processed by applets and modules conforming to well-established and glorious architectural patterns and styles, only to find that it is in error, flawed, outdated, misplaced...

Data, most literally, is the foundation on which information systems are built, like piling creates a foundation for (Dutch) houses. There is nothing sexy about a concrete pillar. It is hammered into the ground and remains invisible for eons to come. However, if it isn't there, or if there is something wrong with it, the construction it is supposed to support will inevitably come tumbling down.

Today, every business operation relies on its information systems. And with these information systems, organizations create and consume immense amounts of data. If the data are flawed, time and money may be lost in equally large quantities, causing at the very least embarrassment and loss of reputation. Today, every business, every leader, every consumer has a vested interest in the quality of data. This is true for Windesheim too.

This research investigates the relation between data quality and the maturity of an organization, in particular the maturity of a higher education organization. Yet the results are not confined to education. What has been found here may well be applicable in other organizations. It is my hope, therefore, that this research may contribute to improved data quality in a much broader context. For when data is flawed, no investment in modern and exciting technologies may undo the damage, while once data is fit for use, or has a quality even beyond that, the capabilities of data to support and improve business are hard to overestimate.

Acknowledgements

First, I thank my beloved spouse Carin, who in the past years has supported me in my study efforts by enduring many hours of loneliness and reduced attention. I would like to thank Rob Keemink, who has invested a large amount of time and money in my study, and defended this investment despite many financial cut-backs and management discussions. Thanks go to Albert Paans too, who was assigned the burden of being the official constituent for this research, and who invested a lot of his time in studying and debating the results I put forward, which greatly contributed to the quality of the research. I would like to thank Maarten Westerduin, for trusting me not to lose track of the Windesheim School of Information Sciences priorities. Also, I would like to extend my gratitude to my colleagues of Bedrijfskundige Informatica, who on so many occasions enabled my study and graduation by taking on extra duties where I was not able to fulfill them. And last but most certainly not least, I would like to thank Marlies van Steenbergen, Theo Thiadens and Arjen de Graaf for the time they invested in, and the light they shone on, the WDQM and data quality in higher education in general.
1. Table of contents

Management Summary .......... 2
The Organization .......... 3
Parties involved .......... 4
Preface .......... 5
1. Table of contents .......... 6
2. Exploring data quality in higher education .......... 9
   2.1 Project Introduction .......... 9
       2.1.1 Windesheim's Mission .......... 9
       2.1.2 Windesheim's Information Technology .......... 10
   2.2 Business Problem description .......... 11
       2.2.1 Indications .......... 11
       2.2.2 Consequences .......... 11
       2.2.3 Business Problem .......... 12
   2.3 Cause analysis .......... 12
       2.3.1 Technical / functional causes .......... 13
       2.3.2 Process design causes .......... 13
       2.3.3 Organizational causes .......... 13
       2.3.4 Growing pains .......... 14
       2.3.5 Perspective .......... 15
       2.3.6 Past, current and future situation .......... 15
       2.3.7 Summary .......... 17
   2.4 Research Problem .......... 17
   2.5 Stakeholder Analysis .......... 18
   2.6 Project Relevance .......... 20
       2.6.1 Stakeholder Relevance .......... 20
       2.6.2 Business Relevance .......... 20
       2.6.3 Relevance to Science .......... 20
3. Conceptual Research Design .......... 21
   3.1 Theoretical approach and focus .......... 21
       3.1.1 Focus .......... 21
       3.1.2 Maturity revisited .......... 21
       3.1.3 A vision on Maturity .......... 22
       3.1.4 What is data quality? .......... 22
       3.1.5 A vision on Data Quality .......... 24
   3.2 Research Goal .......... 24
   3.3 Research Model .......... 24
   3.4 Research Questions .......... 25
       3.4.1 Main questions .......... 25
       3.4.2 Sub questions for main question 1 .......... 25
       3.4.3 Sub questions for main question 2 .......... 26
       3.4.4 Sub questions for main question 3 .......... 26
       3.4.5 Sub questions for main question 4 .......... 26
   3.5 Concepts used .......... 27
4. Technical Research Design .......... 28
   4.1 Research Material .......... 28
   4.2 Research Strategy .......... 29
       4.2.1 Strategy .......... 29
       4.2.2 Reliability .......... 29
       4.2.3 Validity .......... 29
       4.2.4 Scope .......... 29
5. Research Execution .......... 30
   5.1 Correlation between data quality and maturity .......... 30
       5.1.1 Maturity, a brief history .......... 30
       5.1.2 Maturity levels .......... 30
       5.1.3 Process Areas .......... 31
       5.1.4 Identifying relevant process areas .......... 32
       5.1.5 Windesheim Data Quality Maturity Model .......... 37
       5.1.6 Alternative views on data quality maturity .......... 40
       5.1.7 Conclusion .......... 43
   5.2 Data Quality Attributes .......... 43
       5.2.1 Dimensions of data quality .......... 43
       5.2.2 Data Quality Dimensions Discussed .......... 45
       5.2.3 WDQM Goals .......... 50
       5.2.4 (Time)related dimensions .......... 52
   5.3 Business rules .......... 53
       5.3.1 Business rules, a definition .......... 53
       5.3.2 Study management .......... 54
       5.3.3 Business rule mining .......... 55
   5.4 Current data quality maturity level study management domain .......... 55
       5.4.1 Interview results .......... 56
       5.4.2 Current Maturity .......... 56
       5.4.3 Current data quality dimension's attribute values .......... 57
       5.4.4 Conclusion .......... 59
   5.5 Required data quality maturity level study management domain .......... 60
       5.5.1 Workshop results .......... 60
       5.5.2 Discussion .......... 61
       5.5.3 Initial Research Problem .......... 61
       5.5.4 A data quality maturity level three (Defined) organization .......... 62
       5.5.5 Level 4 (quantitatively managed) requirements .......... 62
   5.6 Growing from current to required maturity .......... 63
       5.6.1 Gap analysis .......... 63
       5.6.2 Migration .......... 65
   5.7 Concluding .......... 68
       5.7.1 Conclusion .......... 68
       5.7.2 Recommendations .......... 69
       5.7.3 Stakeholder Value .......... 70
       5.7.4 Achieved Reliability and Validity .......... 70
       5.7.5 Scientific Value and Innovativeness .......... 71
       5.7.6 Generalisation .......... 71
       5.7.7 Research Questions Answered .......... 71
       5.7.8 Recommendation on further research .......... 73
       5.7.9 Reflection .......... 74
6. Appendices .......... 75
   6.1 Interview Report Windesheim Integration Team .......... 75
   6.2 Interview Report WDQM Marlies van Steenbergen .......... 77
   6.3 Interview Report Data Quality in Education Th. J.G. Thiadens .......... 78
   6.4 Interview WDQM dimensions Report Arjen de Graaf .......... 80
   6.5 Interview report Current Data Quality Educator Gerrit Vissinga .......... 81
   6.6 Interview report Current Data Quality Educator Gert IJszenga .......... 83
   6.7 Interview report Current Data Quality Educator Gert IJszenga Continued .......... 84
   6.8 Interview report Current Data Quality Educator Klaas Haasjes .......... 86
   6.9 Interview report Current Data Quality Educator Louis Klomp .......... 87
   6.10 Interview report Current Data Quality Educator Viola van Drogen .......... 89
   6.11 Data Quality Workshop .......... 91
   6.12 Business rules according to the Windesheim Educational Standards .......... 95
   6.13 Detailed Business Rules .......... 96
   6.14 Project Flow .......... 98
   6.15 Literature .......... 99
   6.16 List of figures and tables .......... 103
   6.17 Glossary .......... 103
2. Exploring data quality in higher education

2.1 Project Introduction
2.1.1 Windesheim's Mission

Windesheim's mission statement: "As an institution in higher education in the Netherlands, Windesheim offers a broad choice and is foremost a social venture. Windesheim is a community in which active and knowledgeable individuals meet. Windesheim is an innovative knowledge and expertise centre, challenging individuals to develop themselves into valuable and self-confident professionals. Integration of three primary processes, education, research and social entrepreneurship, results in excellent opportunities for the dispersion of knowledge. Windesheim offers tailored education and supports individual study careers. Competences and personal planning are the foundations for each individual student. In the areas of research and social entrepreneurship, Windesheim distinguishes itself by the implementation of knowledge exchange centers in Zwolle and participation in regional knowledge networks3."
3 Instellingsplan 2007 – 2012, Besluit nummer 441 College van Bestuur van Windesheim
2.1.2 Windesheim's Information Technology

As indicated by Figure 2, the Windesheim application landscape has become rather intertwined over the years.

[Figure: diagram of the Windesheim application landscape, mapping the many internal systems (among them Campus, Educaat, HRM, Finance, Decos, Untis, Vubis, Blackboard and Webcats) and external parties (among them Min v O&W, IBG, VU, LOI, SURFspot, KvK, HBO-Raad and Min v Justitie) and their data flows; the legend distinguishes database couplings, manual couplings, authorization links, internal systems, external parties and phased-out systems]
Figure 02: the Windesheim application landscape (Windesheim, 2004)
The figure demonstrates that it is the interfaces between (clusters of) applications that cause complexity. Almost every connection requires manual intervention; therefore every data transfer represents a delay in business processes. To reduce integration complexity and increase business service levels, the implementation of a service oriented architecture was initiated in 2005. One of the drivers was that Windesheim aims to become a near real-time organization. An example is given by the enrollment process: as soon as students are enrolled for a study, access to all campus-wide and study related student information services is to be granted quickly. Today, this process takes days; implementing real-time, event-driven communication patterns is believed to reduce processing time to minutes, perhaps mere seconds. To design and implement the service based interfaces, a System Integration task force was installed. This task force, currently employing three professionals, is part of the IT department, yet it is governed by the Windesheim Information Manager.

[Figure: organization chart relating the CIO, the IT department, Information Management and System Integration]
Figure 03: IT service department and system integration organization
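To make the intended enrollment pattern concrete, the following minimal sketch (in Python, with hypothetical names; it illustrates the publish/subscribe idea, not Windesheim's actual interface design) shows how a single enrollment event could replace a chain of batch transfers: every subscribing campus service reacts the moment the event is published.

from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class EnrollmentEvent:
    student_id: str
    study: str

class EventBus:
    """In-memory stand-in for a real message broker."""
    def __init__(self) -> None:
        self._subscribers: Dict[str, List[Callable[[EnrollmentEvent], None]]] = {}

    def subscribe(self, topic: str, handler: Callable[[EnrollmentEvent], None]) -> None:
        self._subscribers.setdefault(topic, []).append(handler)

    def publish(self, topic: str, event: EnrollmentEvent) -> None:
        # Every subscribed service is notified at once; no nightly batch file.
        for handler in self._subscribers.get(topic, []):
            handler(event)

bus = EventBus()
bus.subscribe("student.enrolled", lambda e: print(f"portal: create account for {e.student_id}"))
bus.subscribe("student.enrolled", lambda e: print(f"library: issue card to {e.student_id}"))
bus.publish("student.enrolled", EnrollmentEvent("s123456", "Informatica"))

In such a design, the latency of the enrollment chain is bounded by the slowest subscriber rather than by the batch schedule. Note, however, that the pattern propagates poor data quality just as fast, which is precisely why the data must be correct at the source.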
2.2 Business Problem description

2.2.1 Indications

In the past, the system integration task force encountered integration problems caused by unexpected and puzzling values in data fields. Triggered by these observations, in 2007 the quality of the database of one information system was investigated4. The investigation revealed that in some cases the values of fields could not be explained, or were used to indicate specific situations. Business rules defining these situations and explaining the odd values were not documented. As a result, the accounting of costs of facilities delivered was unsure at best. Upon completion, operations had corrected the issues found, and Windesheim was advised to document business rules, formalize data management accordingly and implement a closed-loop data quality process. Surprisingly, shortly after this result was reached, the integration team encountered the same errors all over again. And in addition to these existing issues, every new data source added to the integration architecture introduced new and unexpected data quality problems5. Issues found today include (but are not limited to):

- Enrolment of students results in duplicate accounts;
- Painful mistakes, like sending notifications to deceased students;
- Due to database corruption, management reports are rendered useless;
- Sometimes fields contain text strings stating that 'Debbie has to solve this problem';
- Names of students are completely missing, student addresses are incorrect, information is entered in wrong fields;
- Location (room) numbers are missing or contain special, unexpected codes;
- Data is outdated, or is valid in / refers to different time periods between information systems;
- It was found that in at least one instance, lack of data quality caused a class to be scheduled in a stair case6.

4 Adviesbrief gegevenskwaliteit database facility office 2007
5 See appendix 1: Interview report system integration team Windesheim 2009
6 Fact Finding Roostersysteem Windesheim.doc, versie 1.2, 26 september 2007
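As an illustration of the first symptom in the list above, the sketch below (Python, with invented example data, not the actual enrollment logic) shows why duplicate accounts arise when records are compared literally, and how normalizing a natural key before comparison exposes them.

accounts = [
    {"id": 1, "name": "J. de Vries", "birthdate": "1990-04-01"},
    {"id": 2, "name": "j de vries", "birthdate": "1990-04-01"},   # same student, re-enrolled
]

def natural_key(account: dict) -> tuple:
    # Literal string comparison treats these names as different persons;
    # normalization (case folding, stripping punctuation) reveals the duplicate.
    normalized = "".join(ch for ch in account["name"].lower() if ch.isalnum())
    return (normalized, account["birthdate"])

seen = {}
for account in accounts:
    key = natural_key(account)
    if key in seen:
        print(f"possible duplicate: account {account['id']} matches account {seen[key]}")
    seen.setdefault(key, account["id"])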
2.2.2 Consequences

Consequences of errors in data may be severe. In Enterprise Knowledge Management, David Loshin ties the election problems of 2000 in Florida, USA directly to poor data quality (Loshin, Enterprise Knowledge Management, the data quality approach, 2001). Loshin identifies an operational, a tactical and a strategic impact domain suffering from poor data quality. In the operational domain, costs are associated with detection, correction, rollback, rework and prevention of errors, warranty, reduction of business and loss of customers (Loshin, Enterprise Knowledge Management, the data quality approach, 2001). In the tactical and strategic domains, decisions may be delayed or based on external or alternative data sources, hampering change processes. Business opportunities may be lost, business units get misaligned, and management loses confidence in their management information systems (Loshin, Enterprise Knowledge Management, the data quality approach, 2001).
2.2.3 Business Problem

The initial problem triggering this research is that a lack of data quality threatens the implementation of a service oriented architecture. The main business problem is that poor data quality inhibits Windesheim's ability to become a near real-time organization. Looking further, and along the lines of David Loshin's observations, areas where poor data quality has an impact on Windesheim's business goals may be identified in both the operational and the tactical domain:

Operational domain:

1. Today, students expect any organization they encounter to be a real-time organization. Banks, insurance companies, web shops: they all offer near zero-latency business services. So why can't Windesheim? Not being able to live up to modern expectations may cause Windesheim to obtain a reduced score in rankings published by the HBO-raad7. A reduced score may make students decide to go and study elsewhere (loss of customers), resulting in a decline in income.
2. Currently, batch files transferring data between applications are checked manually on a daily basis. And yet, from time to time, errors are propagated between applications. Detection, correction, rollback and rework associated with poor data quality cause serious overhead, reducing the organization's efficiency.
3. Poor data quality is a cause of mistakes in Windesheim's external relations. Some mistakes are more painful than others, yet all of these mistakes damage Windesheim's image as a trustworthy knowledge partner in the region. This may cause business opportunities to be lost. And even where this is less the case: being an institution largely funded by public money, Windesheim has a responsibility to be precise and correct in interacting with customers, constituents and society in general.
Tactical domain:

4. Business intelligence retrieved from questionable data is uncertain at best. As a consequence, Business Activity Monitoring is hampered, which in turn means that the margin of error in daily processes is unknown. It also means that monitoring progress on achieving business targets is affected as well.

2.3 Cause analysis

The initial research in 20078 had a narrow scope in exploring data quality: it was confined to exploring data quality in only one application. However, the application observed supported (and still supports) facility management, in education a very relevant secondary process, directly supporting and influencing education itself. Secondly, having a rather narrow scope, the research dug very deep into the problem, extracting interesting conclusions from the application's database. Based on this research, technical and process design causes were identified. However, organizational causes remained untouched. Therefore, in this paragraph, the technical/functional and process design causes found are mentioned, and organizational causes are explored more deeply. At the end of this paragraph, a summary is presented.

7 HBO-Monitor, http://www.hbo-raad.nl/onderwijs/kwaliteit
8 Adviesbrief gegevenskwaliteit database facility office 2007
2.3.1 Technical / functional causes

The first observation was that the COTS9 application used for supporting facility management was a complicated one indeed. It was found that, to implement specific requirements, special database fields were used for which the application offered no input checks. The content of these fields therefore depended on input being checked for correctness manually. It was found that in some cases those fields were used to signal special situations, i.e. they were overloaded (Loshin, Enterprise Knowledge Management, the data quality approach, 2001). In other cases, values were missing or inexplicable. The investigation revealed that the database structure of the application was not fully utilized. As a result, in many cases checks on correctness and consistency were not present, allowing errors in input data to exist. Not only did manual input cause flaws in data quality; processing batch files received from adjacent applications introduced errors as well. In the Windesheim application landscape, business objects have different names and formats across applications. A course, for instance, is named course, module, (variant) onderwijseenheid, or vak. In various applications data with respect to a course is entered, enriched, updated and transferred to the next application. The dispersed nature of the underlying information landscape obstructs the actual view on the current status of a course (Windesheim, 2004).

9 Commercial Off The Shelf
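The missing input checks are easy to picture. The sketch below (Python; the room number format is an invented example, not the application's actual rule) shows the kind of field-level constraint the COTS application did not enforce, which is what allowed fields to be overloaded with free text in the first place.

import re

ROOM_PATTERN = re.compile(r"^[A-Z]\d\.\d{2}$")   # assumed format: building letter, floor, room

def validate_room(value: str) -> str:
    """Accept only well-formed room numbers, rejecting overloaded free text."""
    if not ROOM_PATTERN.match(value):
        raise ValueError(f"not a valid room number: {value!r}")
    return value

validate_room("C2.14")                            # accepted
try:
    validate_room("Debbie has to solve this")     # the overloaded value is caught at entry
except ValueError as err:
    print(err)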
2.3.2 Process design causes

What was found during the initial research was that operations (functioneel beheer) had a very narrow view of its scope. Instead of using the application's standard reporting facilities, some reports were created using self-made front-end applications, compensating for (correcting) errors in data. In storing and processing data, business rules were known and applied, yet not documented. To prevent input errors, only specialized personnel were allowed to perform certain tasks. In general, data management was found to be characterized by a departmental view, lacking a more holistic view on (the role of data in) the Windesheim business process. Within the boundaries of the individual department, measures were taken to compensate for the lack of data quality, hiding the issue from local management: 'My data are OK'. As a result, technical issues are not dealt with, since to management they are invisible.
2.3.3 Organizational causes

Why is it that, when research reveals that the bills sent for facilities delivered are unsure at best, the news is seemingly accepted in stoic fashion? Does Windesheim management have a disregard for accountability? That is not very likely. To understand Windesheim's position on information processing, a historical view is needed. In 1986, Windesheim university started as a merger of 12 regional institutions in higher education (Broers, 2007). At the time, in order to gain support for the merger, an agreement was made that the management of faculties and facilities was to be decentralized (Broers, 2007). It took a management crisis in 1992 for the new institution to realize that the benefits of a fusion could be harvested only if old individual values were replaced by new, common goals. A more centralized model was introduced in 1995; staff and technological support were organized in centralized service centers. In the years that followed, the walls that divided the once so mighty and independent faculties were steadily reduced, while the independence (and size) of the service centers grew (Broers, 2007).
2.3.4 Growing pains

In 1979, Nolan argued that an organization and its use of information have to grow in maturity (Nolan, March-April 1979). Nolan defined six stages (initially four) of maturity. In his vision, no stage could be skipped. In every stage, a predictable type of crisis would signal the transition to the next stage. In a more recent publication, Architecture and Demassing took the place of the original columns 5 and 6 (Data Administration and Maturity) (Tan, 2003).
Figure 04: Nolan’s stage model
Upon closer examination, it can be argued that the Windesheim faculties are currently well on their way into the integration phase; in a broader perspective, higher education is moving from the data processing era to the information era. This can be observed in institutions striving for integration, developing shared minors, i.e. education crossing the borders of a faculty. However, the supporting service centers are still lingering in the control phase, as indicated by an ongoing isolated view on information systems10. The Windesheim application landscape is still very task oriented, with separate systems supporting individual business functions. The focus lies on supporting individual tasks, not on the integrated business process. With faculties tearing down their walls and service centers staying put, the organization is in danger of losing its balance. It is foreseeable that the service centers will have to make the transition from the control stage to the integration stage as well. In 1992, in Germany, Richard P. Marble applied Nolan's stage model to the transition East German industry went through, and described the transition as follows: "... management realizes a need to emphasize central planning. The attention that the computer resource finally receives leads to a change in management thinking – now regarding their task as one of managing the data resources of the organization, not the computer resources." (Marble, 1992)
10 As shown in Figure 2: the Windesheim application landscape (Windesheim, 2004)
With this transition, a firm crisis is to be expected. This should not be seen as a loss of control, but merely as a change of paradigms. During this change of paradigms, organizations take a step back and have to rethink many of their existing strategies and principles. This backwards movement is called a discontinuity (Zee, 2001). In Figure 5, Van der Zee extends the Nolan model with a third era (the network era), accompanied by two crises: a technological discontinuity and an organizational discontinuity.
Figure 05: Eras and discontinuities (Zee, 2001)
The technological discontinuity is observed by English: "... many CIOs, by and large are NOT Chief Information Officers — but Chief Information Technology Officers. Falling into the techno-trap of believing their job was to put in place the information technology infrastructure, their job was then to build or acquire and deploy hardware, networks, and applications, period. Few CIOs saw and understood the Information-Age paradigm. ... The Information-Age purpose of the CIO has always been to deliver Quality, Just-In-Time Information" (English, 2009). Currently, Windesheim is exploring Nolan's phase 4, and with it, the technological discontinuity crisis must be overcome. In this discontinuity, Van der Zee (2001) places the focal point on both technological and organizational changes, such as the emergence of IT governance, with the need for ICT to be present in the boardroom (represented by the CIO) (Zee, 2001).
2.3.5 Perspective

It seems that at Windesheim, the perceived data quality problem is not merely a technological issue, nor is it an issue of just getting the processes right. If this were the case, the earlier data quality project would have had more lasting results. It seems to be a case of guiding Windesheim's service centers through the technological discontinuity.
2.3.6 Past, current and future situation

Are the changes Windesheim is currently experiencing a phase that will quickly blow over, or are they part of a greater scheme? Will Windesheim continue to grow in maturity, or is it likely that the organization will fall back into the control stage again? To find out what is going on, not only the recent past of Windesheim is of interest; the whole picture of higher education in Europe needs attention.
With Europe not yet divided by national borders and Latin being the language of choice in medieval universities, universities attracted Wanderstudenten from all over Europe: 'Until the eighteenth century the European university was an European institution, reflecting European values of intellectual freedom and of a borderless community' (Vught & Huisman, 2009). This all changed when territorial states arose, installing national frameworks. From the eighteenth century up until the dawn of the twenty-first century, national borders and policies effectively resulted in 'national science'. It was not until the 1980s that the first EU policy initiatives appeared. In the second half of the 1990s, a myriad of programmes and declarations were spawned, aimed at 'making Europe the most competitive and dynamic knowledge based economy in the world' (European Council, Lisbon, 2000) and at creating 'the European Higher Education Area' (Sorbonne Declaration 1998 & Bologna Declaration 1999). Currently, 46 European nations are involved in this process, including the Netherlands (Vught & Huisman, 2009).

This process causes a landslide in the area of hogescholen (universities of applied science). The clear-cut distinction between hogescholen and universiteiten has begun to blur since hogescholen started to offer both Bachelor and Master degrees and started conducting scientific research11, where previously only Bachelor degrees were offered and scientific research was strictly reserved for universities. But more importantly, a search for transparency was spawned: 'This, coinciding with increasing pressure from professional organizations and external regulatory bodies to control what was being taught ... led towards the standardization of curricula' (Vught & Huisman, 2009). In the future, it is to be expected that a generally defined common (curriculum and study logistics) framework will both ensure transparency and yet acknowledge diversification (Vught & Huisman, 2009).

How does this all translate to Windesheim? The Five Forces model of Porter may help in finding an answer to that question. The Five Forces model is an outside-in business unit strategy tool used to analyse the attractiveness (value) of an industry structure (Porter & Millar, 1985). When we project this model on Windesheim, we find the following forces shaping an institution like Windesheim. First of all, institutions compete with each other for the attention of the student (Rivalry among existing firms). This is shown by the constant attention of institutions to the quality statistics released by the HBO-raad. Secondly, the student has a great deal of influence (Bargaining Power of Buyers). The student's opinion about the quality of education is constantly measured and published, and in response, courses and schedules are revised. As a result of European and national developments, students are highly mobile, strongly increasing their bargaining power. Thirdly, commercial 'substitutes' do exist: commercial organizations offer qualifications that rival recognized titles. For instance, in IT, employees holding Microsoft certificates are in high demand, rivaling employees holding bachelor's or master's degrees.
11 By means of the lectorate. (HBO-raad Lectorenplatform, 2006)
And finally, potential entrants (like DNV-CIBIT) fill niche markets. Although the titles they offer are internationally recognized, the courses they offer do not fit governmental approval and therefore are not subsidized.12

Windesheim, being a university of professional education, is in the midst of this turmoil. Windesheim faculties are aligning themselves with European strategies: implementing a Minor/Major educational model, jointly developing minors and even offering those minors to students of other institutions (trying to 'lure' them away), and for some minors introducing English as the general language used in classes. It seems that the Wanderstudent is reinstated, but this time in unmatched masses, forcing the institution to synchronize education in an international setting and to be as attractive as possible. Pan-European developments force institutions to prepare for intimate inter-institutional cooperation. In this volatile environment, Windesheim does not have the luxury of not growing in maturity.

12 Interestingly, the fifth force, Threat of new entrants, is rather unknown in education. The emergence of new institutions is highly regulated, and care is taken that new institutions do not compete with existing institutions in the region.
2.3.7 Summary

In this chapter, it has been found that:

- Windesheim strives to become a near zero-latency organization;
- Surprising errors hamper technical initiatives to implement near zero-latency business process technologies;
- These errors are caused by poor data quality;
- Closer examination reveals a serious business impact of poor data quality, defined by student (customer) dissatisfaction, inefficient process execution, loss of image and loss of control;
- Poor data quality is caused by applications not checking input values, and by information objects having different values and definitions in different business domains;
- This in turn is caused by a departmental view on data instead of a more holistic, business-process-wide view on information;
- International developments force Windesheim to grow in maturity, migrating from the data processing era to the information era;
- As part of this migration, a natural crisis, the technological discontinuity, has to be overcome;
- In this crisis, the organization is to develop a holistic view on information.
2.4 Research Problem

In the past, causes of data quality issues have been identified and countermeasures described on the technical, functional and process design level. This vision needs to be extended by exploring the relation between the structures defining maturity and data quality within the context of a Dutch institution of higher education, in particular Windesheim, and more precisely, the Windesheim service centers. Here, the focus is on crossing the border between the control and integration stages in Nolan's stage model (Nolan, March-April 1979) (Tan, 2003), overcoming the technological discontinuity (Zee, 2001). Extending the technical/functional vision on data quality raises a myriad of questions. What impact on data quality will overcoming the technological discontinuity have? Will a growth in maturity be enough to solve the data quality issues identified? What exactly does 'growing in maturity' mean? What are the consequences for the organization of Windesheim? Do the consequences found align with Windesheim's strategic developments? What will the response of Windesheim's management be? What arguments will spawn interest in improving data quality? Is there a danger of falling back into the comfort of the data processing era?

By extending the research beyond the technical and functional domain, the research enters the domain of information as a subject of organizational and political forces, and of using information as a strategic instrument. It has become a problem of strategic alignment. The research problem may therefore be summarized as: At Windesheim, what defines the border between the control and integration stage? What are the positive and negative correlations between structures defining organizational maturity and attributes defining data quality, enabling Windesheim to become a near zero-latency organization?
2.5 Stakeholder Analysis

Board
  Role: To set and guard Windesheim's strategy
  Concern: Control of finance, quality of the institution and strategy. Alignment of the institution with national and international developments.
  Relation to the problem: Loss of image and loss of students will hint at loss of control. Inefficient business processes impose a financial drain on the organization.

Information Manager
  Role: To implement and guard a coherent view on information
  Concern: Correctness of data
  Relation to the problem: Changing from a localized to an integral view on data may be a cause of concern. Poor data quality may cause inefficient business processes.

Science
  Role: To extend the human knowledge base
  Concern: Validity and reliability of knowledge
  Relation to the problem: New knowledge may be discovered, existing theories validated.

Security Manager
  Role: To prevent unauthorised disclosure or manipulation of information
  Concern: Availability, integrity and confidentiality of data
  Relation to the problem: Poor data quality obstructs integrity and availability.

CIO
  Role: To safeguard undisturbed and reliable information delivery in business processes
  Concern: Secure and correct use of data. Enabling future change.
  Relation to the problem: Poor data quality may cause inefficient business processes, loss of image and loss of students.

Management
  Role: To implement change and control daily business processes
  Concern: Budgeting and effectiveness (Baida, 2002), reliability of data
  Relation to the problem: Poor data quality cripples effective, reliable management. Changing from a localized to an integral view on data may be a cause of concern.

Students
  Role: To be educated
  Concern: Findability, security, reliability, availability, timeliness
  Relation to the problem: Poor data quality causes student names to be misspelled or missing altogether, resulting in loss of trust.

Staff
  Role: To educate
  Concern: Security, reliability, timeliness
  Relation to the problem: Poor data quality results in students complaining, and in complicated registration and planning processes.

Operations
  Role: To ensure operational IT
  Concern: Manageability, correctness of data
  Relation to the problem: Poor data quality causes applications to abort and time to be spent on debugging.

Functional Support
  Role: To ensure operational applications
  Concern: Correctness of data
  Relation to the problem: Poor data quality leads to manually identifying, correcting and rolling back errors on a daily basis.

System Integration
  Role: To ensure near real-time service-based system integration
  Concern: Correctness and availability of data
  Relation to the problem: Poor data quality causes application interfaces to abort and time to be spent on debugging.
Table 01: Stakeholder analysis
For management (Board, CIO, general management, information management), solving the data quality problem will be based on a cost/benefit assessment. Operations and Functional Support will be willing to participate in solving the problem, provided care is taken where personal interests are involved. Figure 6 presents a graphical representation of the stakeholders and their relation to the proposed data quality research project.
[Figure: diagram positioning the project stakeholders: Board, CIO, Information Manager, Security Manager, Management, Science, Students, Staff, Operations, Functional Support and System Integration]
Figure 06: project stakeholders
Stakeholders committed to this project are the CIO, the Information Manager and Science. The CIO and the Information Manager are the financier and constituent of this research, respectively. (To be) involved in the project are (members of) Functional Support, Operations, System Integration and the Security Manager, since the results of this research are likely to be of direct interest to these stakeholders and because of specific knowledge within these groups. Management holds a somewhat special place: IT management is likely to be involved, other management may be affected. Other stakeholders affected by any advice resulting from this research are students, staff and the Board.
2.6 Project Relevance

2.6.1 Stakeholder Relevance

Relevance for stakeholders is discussed in the previous paragraph.
2.6.2 Business Relevance

Currently, education at Windesheim is embarking on a journey towards a higher level of maturity, and the service centers have to join this movement. However, the destination of this journey is not clear to everyone, and for others the road ahead is unknown. This research will shed light on this migration by offering knowledge on what Windesheim might look like when data processing is replaced by a more integral view on information. In the long run, this paradigm shift will enable Windesheim to stay in sync with (inter)national processes. In the short term, it will increase efficiency, student satisfaction and management control, and prevent loss of image.
2.6.3 Relevance to Science

In the field of data quality, many publications, services and even tools are available. Publications look at data quality from a technical point of view, suggesting valid input checks and database constraints as a solution. Business processes are recognized to be part of the equation too, and efforts are made to point out that processes need to be implemented as a closed loop, automatically correcting errors (Batini & Scannapieco, 1998) (Loshin, Enterprise Knowledge Management, the data quality approach, 2001) (McGilvray, 2008) (Lee, Pipino, Funk, & Wang, 2006). However, in the field of education, academic research relating (loss of) business data quality to business maturity has not been identified. The US National Center for Education Statistics has set up a data quality task force, offering advice to staff members of educational institutions on creating a Culture of Data Quality (Data Quality Task Force, 2004). This publication is aimed at the field of statistics, and its underlying research is unknown, yet the recommendations presented in the report may prove useful. One study dealing with data quality in e-business has been found (Data Quality and Data Alignment in E-business) (Vermeer, 2001). This research defined a context for data quality and established a formal relation between data quality in EDI messages and business process quality. Finally, the research presented a method for establishing data quality in business chains: DAL (Data Alignment through Logistics) (Vermeer, 2001). The definition of the context of data quality and its relation to business process quality delivers strong support for the research at hand.
3. Conceptual Research Design

3.1 Theoretical approach and focus
3.1.1 Focus

The field to explore as defined by the Research Problem is broad. This research will focus on identifying the relation between organizational maturity and the required level of data quality, as this has been identified as the root cause of the business problem at hand.
3.1.2 Maturity revisited

Before enthusiastically embarking on a journey into the unknown, can additional proof be obtained pointing towards a link between data quality and organizational maturity? In "Data Quality and Data Alignment in E-business", Ir. Bas H.P.J. Vermeer (2001) addressed issues resulting from distributed data management: "... two problems arise in a multiple database situation: a translation problem and a distribution problem. The translation problem arises because the same fact may be differently structured at different locations. Therefore, schema translation is necessary to map the structure of the source schema to the structure of the manufacturer's schema. This results in a mapping schema between the source schema and the receiver's schema that is used every time a fact in the source database is updated. The distribution problem arises because each fact update is first translated and then transported over a network to a limited set of users, where it is finally interpreted and stored in the receiver's database. During translation and interpretation, mapping errors may occur, which results in loss of data quality. During transportation, the data may get delayed, damaged, or delivered to the wrong recipient, resulting in inconsistencies among different locations." (Vermeer, 2001)

Thus, having a localized view on data, distributing and transforming data objects throughout an application landscape introduces a translation and a distribution problem. Then why not develop a single, integrated view on data? Why not just implement an ERP package? Dale L. Goodhue et al. (Sept 1992) question the common belief that data integration always results in positive benefits for an organization. It was shown that creating one integrated solution is simply not feasible in many organizations. Data integration may have positive effects in terms of improved efficiency where subunits are highly aligned, yet in unstable, volatile environments, striving for data integration will not result in tangible benefits: "... This model suggests that the benefits of data integration will outweigh costs only under certain circumstances, and probably not for all the data the organization uses. Therefore, MIS researchers and practitioners should consider the need for better conceptualization and methods for implementing 'partial integration' in organizations" (Goodhue, Wybo, & Kirsch, Sept 1992).
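Vermeer's translation problem can be made tangible with the Windesheim course example from paragraph 2.3.1. The sketch below (Python, with assumed field names; the mapping schemas are illustrative, not Windesheim's actual ones) maps two source vocabularies onto one canonical schema and shows that every unmapped field is a place where data quality silently leaks away.

CANONICAL_FIELDS = {"course_code", "course_name", "credits"}

# One mapping schema per source application, as in Vermeer's schema translation.
MAPPINGS = {
    "educator": {"vak_code": "course_code", "vak_naam": "course_name", "ects": "credits"},
    "rooster": {"module_id": "course_code", "module": "course_name", "punten": "credits"},
}

def translate(source: str, record: dict) -> dict:
    """Map a source record onto the canonical course schema."""
    mapping = MAPPINGS[source]
    result = {mapping[field]: value for field, value in record.items() if field in mapping}
    missing = CANONICAL_FIELDS - result.keys()
    if missing:
        # An incomplete mapping is exactly where translation errors originate.
        raise ValueError(f"{source}: unmapped canonical fields {sorted(missing)}")
    return result

print(translate("educator", {"vak_code": "BI42", "vak_naam": "Databases", "ects": 5}))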
Conclusions:

- In terms of data quality, it is best if only one single view on corporate-wide data definitions exists;
- Only organizations that are able to successfully align their subunits are likely to achieve business benefits from data integration;
- Even with alignment, complete data integration is not likely to be achieved.

Even without striving for data integration, the research done by Vermeer and Goodhue et al. hints that aligning business units (i.e. observing the whole business value chain instead of localized departmental processes) is an important prerequisite for achieving improved data quality.
3.1.3 A vision on Maturity

Maturity may be defined by stages (Nolan, March-April 1979), yet more recent theories tend to embrace levels as the measure of maturity: BPMM (Object Management Group, 2008), CMMI (Software Engineering Institute, 2009), ISO 15504 / SPICE (Hendriks, 2000), where each level is defined by certain structures. In this research, maturity is defined as an attribute of an organizational process, organized in maturity levels, each defined by certain structures being in place, referred to as maturity structures.
3.1.4 What is data quality?

Indeed, what is data quality? Even though multiple definitions and standards exist on quality in general, this is less the case for data quality. It seems as if the idea that "the computer never lies" still holds some ground. Even T. William (Bill) Olle, in "The Codasyl Approach to Data Base Management" (Olle, 1978), did not make any remarks regarding the relationship between database management and data quality. This is remarkable, since a database management system may be regarded as the technical guardian of data quality!

Data, business rules and business processes are closely linked (Besouw, 2009). Over about four decades, in most businesses the data and business rules have been materialized in the form of automated information systems. Those information systems aim to reflect reality as closely as possible. But what we find in the real world is that reality is in constant flux and information systems are trying to cope. There is a natural gap in time between the situation in reality and the registration of that situation in an information system. The problem this time lapse introduces was unwittingly recognized by T. William Olle with respect to the book he had written: "The time factor is in itself a problem because the CODASYL specifications are changing inexorably as the years go by. The book reflects as accurately as possible the most recently published specifications at the time of writing." (Olle, 1978) What is true for the written word might be true for information systems too.

The struggle of information systems to stay aligned with reality is one of the topics in 'De (on)betrouwbaarheid van informatie'13 (Bakker, 2006).

13 The (un)reliability of information

Take for instance the dynamics of the Dutch population: "According to the CBS, in October 2004 the Netherlands housed 16.258.032 citizens, of which 8.045.914 male and 8.212.118 female… But what makes us believe that we are capable of assessing the number of
citizens with this accuracy? In the year 2000, for instance, 206.619 people were born and 140.527 deceased. At what moment in time was that exact number of citizens determined? Wait an hour and the number has changed!" (Bakker, 2006)14

Bakker not only demonstrated that it is impossible to make a headcount in a dynamically changing system with a high degree of accuracy, he also argued that in fact no data at all is ever exactly correct. When, for instance, one sets off to measure the coastline of Great Britain, one will find that using precise measurements results in a considerably longer coastline being measured compared to the use of coarse methods (Bakker, 2006). And then again, every measurement has a certain degree of uncertainty, a measurement error. It is simply impossible to measure a physical object exactly (Bakker, 2006). Therefore, it is important to establish a threshold defining the acceptable degree of uncertainty.

To establish such a threshold and guard the compliance of data quality, the Data Management Association introduces the Data Quality Management function: "Data Quality Management – Planning, implementation and control activities that apply quality management techniques to measure, assess, improve and ensure the fitness of data for use." (Mosley, 2008) This definition points the way to the definition of the right threshold: data should be fit for use. Arvix, a Dutch company dedicated to the improvement of data quality, seems to agree: "The quality (of data) is closely related to its use" (Arvix, 2009). In addition, Frans Besouw translates fit for use into the ability of data to support business rules (Besouw, 2009). In the vision of Arvix, data quality reveals the capability of data to be successfully utilized over a prolonged period of time (Arvix, 2009). Apparently, fit for use is a measure that is likely to change over time, as business rules evolve over time.

An example can be found in banking. Two decades ago, banks sending us an account transaction overview once a week was regarded as acceptable. The most recent transactions included on this overview were about half a week old, as was the account total shown. A decade ago, private banking customers were enabled to monitor all transactions on-line. Online access implies on-time information, and a delay in actuality of not more than one day was seen as acceptable. Today, however, customers are able to monitor their accounts in real time. In the last ten years, the actuality of information that is perceived to be fit for use in private banking has shrunk from days to minutes.

Quality can be measured. ISO 9126 offers a standard for the evaluation of software quality. An extension of the ISO 9126 quality standard is the Quint quality model (Zeist, Hendriks, Paulussen, & Trieneken, 1996). However, these quality standards are aimed at measuring integrated information system quality. To specifically target data quality in a given situation, in "Kwaliteit van softwareprodukten, Praktijkervaringen met een kwaliteitsmodel"15 (Zeist, Hendriks, Paulussen, & Trieneken, 1996), the already extended ISO model was extended even further by adding two new quality attributes to the Quint model: Database Accuracy and Database Actuality. Verreck, de Graaf and van der Sanden even express the quality of data in terms of attributes. They propose to define quality as a function of Reliability and Relevance, Q = R², and redefine this as 'lasting usability' (Verreck, Graaf, & Sanden, 2005).
14 Translated from Dutch
15 Quality of software products, hands-on experiences with a quality model
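As a minimal sketch of how such a fit-for-use threshold could be operationalized, consider the following; the attribute names and limits are hypothetical, chosen only for illustration. Each quality attribute is given an acceptable degree of uncertainty, and data is considered fit for use only while the measured values stay within those limits:

```python
# A minimal sketch: operationalizing "fit for use" as per-attribute
# thresholds. Attribute names and limits are hypothetical.

THRESHOLDS = {
    "grade_registration_delay_days": 10,  # acceptable actuality
    "address_error_rate": 0.02,           # acceptable inaccuracy
}

def fit_for_use(measured: dict) -> bool:
    """True while every measured quality value stays within its threshold."""
    return all(measured[attribute] <= limit
               for attribute, limit in THRESHOLDS.items())

print(fit_for_use({"grade_registration_delay_days": 7,
                   "address_error_rate": 0.01}))   # True
print(fit_for_use({"grade_registration_delay_days": 15,
                   "address_error_rate": 0.01}))   # False: delay too high
```

As business rules evolve, only the threshold table has to change, which matches the observation above that what counts as fit for use shifts over time.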
3.1.5 A vision on Data Quality

We started off with the discovery that many problems in Windesheim's IT were caused by poor data quality. In many publications, data quality is treated as a purely technological issue. What we found is that this vision needs to be extended by exploring the relation between structures defining maturity and data quality. We have now discovered that data quality is not an absolute value, but a question of defining the right threshold:
- Data is inaccurate by nature;
- When data inaccuracy exceeds a certain threshold, quality becomes flawed;
- The threshold is defined by data being fit for use;
- In general, fit for use can be seen as the ability of data to support business rules;
- Fit for use can be operationalized by means of quality attributes;
- For every specific situation, appropriate attributes are to be defined;
- For these attributes, and therefore for the data quality threshold, the values perceived as acceptable evolve in time, as business rules evolve in time.
3.2 Research Goal

The goal of this research is to contribute to the improvement of data quality at Windesheim by analyzing the gap between the current and required data quality threshold and the corresponding current and required maturity, identifying positive and negative correlations between data quality attributes and structures defining maturity.
3.3 Research Model
[Figure 07: Research Model (diagram). Elements: Theories on Data Quality, Theories on Maturity of Organizations, Theories on Maturity of Business Processes, Theories on Ensuring Quality in Business Processes and Benchmark (a); Maturity and data quality instrument (b); Stakeholders Involved and View on required threshold and maturity (c); Current data quality threshold and Current maturity (d); Current threshold and maturity (e); Advice (f).]
An analysis of theories on data quality and maturity, backed by the exploration of an external implementation (external benchmark) (a), results in a conceptual model (maturity and data quality instrument) (b), which will be discussed by an expert group of the stakeholders involved16. This will lead to a populated conceptual model (view on required threshold and maturity) (c). An assessment of the current data quality threshold and current maturity (d) results in a description of the current situation (e). Confronting the validated view with the description of the current situation leads to a gap analysis (f).
3.4 Research Questions

The main research questions are found by decomposition of the research model.
3.4.1 Main questions

1. Observing theories on maturity and data quality, and external benchmarks, what positive and negative correlations between structures defining maturity and data quality attributes may be found?
2. What values of data quality attributes will define the required data quality threshold, and therefore the required maturity structures, at Windesheim?
3. What are the current organizational maturity and the current values of data quality attributes?
4. Finally, the central research question: what is the gap between the current maturity structures & data quality threshold and the required maturity structures & data quality threshold, in the light of enabling Windesheim to become a near zero latency organization?

Sub questions are found by examining the chart of concepts used, described in the next paragraph. To avoid dispersion of research questions, the sub questions are described first, and the concepts used later.
3.4.2 Sub questions for main question 1

Both by decomposing the main question and by interpreting the embossed part of the concepts used (next paragraph), the following sub questions are found:
1. What structures define maturity?
   a. What levels of maturity exist?
   b. What maturity structures in the field of organizational structure, process, technology, information and staff describe each level?
2. In higher education, what positive and negative correlations between maturity and data quality may be found?
   a. For this research, what is the relevant set of business rules?
   b. How will this set of business rules evolve in time?
   c. What data quality attributes are relevant for these business rules?
   d. What values of data quality attributes correlate with each level of maturity?
   e. What do process quality theories describe about positive correlations between quality and maturity?
   f. What do process quality theories describe about negative correlations between quality and maturity?
   g. Are those observations consistent?
16 As identified in stakeholder analysis: figure 6
3.4.3 Sub questions for main question 2

1. To support the business rules identified earlier, what values should data quality attributes have?
2. What level of maturity is required to enable those data quality attribute values?
3. What organizational structure, process, technology, information and staff criteria define the maturity found?

3.4.4 Sub questions for main question 3

No further decomposition is required.

3.4.5 Sub questions for main question 4

1. What is the gap between the current and required organizational structure, process, technology, information and staff criteria?
2. What conclusions and recommendations may be derived from this gap?
3.5 Concepts used
[Figure 08: Concepts Used (diagram). Elements: Maturity, described by Maturity Levels and defined by Structure, Process, Technology, Information and Staff Criteria; Data Quality, defined by Data Quality Attribute Values, Fit for Use and Business Rule Support, evolving in Time; and the Correlation between both, described by Organizational Maturity Theories, Process Maturity Theories, Process Quality Theories and Data Quality Theories.]
The main concept in this research is that there is a correlation to be discovered between Organizational Maturity and Data Quality. A quick scan of BPMM (Object Management Group, 2008), CMMI (Software Engineering Institute, 2009) and ISO 15504 / Spice (Hendriks, 2000) reveals that maturity levels seem to include criteria related to the Structures, Systems and Staff factors of McKinsey's 7-factor model (Pascale, Peters, & Waterman, 2009). Process, Technology and Information criteria all define the Systems factor. At this stage, the Systems factor may be expected to offer a link between maturity (information quality) and data quality attribute values. Data quality attribute values are fit for use if they offer support for business rules, a condition which evolves in time. The correlation may be derived from organizational maturity theories, process maturity theories, process quality theories (six sigma, www.sixsigma.nl) and data quality theories. Process quality theories are expected to offer a second link between maturity and data quality; a link between process quality and process maturity has already been identified (Gack, 2009). At this point, it is assumed that a certain level of maturity is defined by a set of structure, process, technology, information and staff criteria. It is also assumed that information criteria and data quality attribute values can be linked, and that data quality theories will support the links found. These assumptions are to be validated in this research.
4. Technical Research Design
4.1 Research Material
For each research question, the research object, source, retrieving method and comments are listed below.

1. What levels of maturity do exist?
   Object: Maturity. Source: Literature. Method: Desk research. Comment: Much has been published on this topic.
2. What structures in the field of organizational structure, process, technology, information and staff describe each level?
   Object: Maturity. Source: Literature. Method: Desk research. Comment: Much has been published on this topic.
3. At this moment, which business rules are affected by lack of data quality?
   Object: Affected Windesheim business rules. Source: Stakeholders (operations, integration team); Windesheim documentation. Method: Interviews; desk research. Comment: Integration team and operations have latent knowledge on business rules; research on data quality and Windesheim business rules had been done already.
4. What data quality attributes are relevant for these business rules?
   Object: Relevant data quality attributes. Source: Stakeholders (operations, integration team); literature. Method: Interviews; desk research. Comment: Integration team and operations have latent knowledge on business rules and required data quality.
5. What values of data quality attributes correlate with each level of maturity?
   Object: Correlation between maturity levels and data quality attribute values. Source: Literature on maturity and literature on data quality; publications and research; external specialists. Method: Desk research; interviews. Comment: Some research indicating a link between quality and maturity has been identified already; at Arvix, a company specialised in data quality, interest in this research may be raised; dr. Theo Thiadens, lector ICT Governance at Fontys, has already agreed to an interview.
6. What do process quality theories describe about positive correlations between quality and maturity?
   Object: See previous question. Source: Literature on process quality. Method: See previous question. Comment: See previous question.
7. What do process quality theories describe about negative correlations between quality and maturity?
   Object, source, method, comment: see previous question.
8. Are those observations consistent?
   Object: Results from previous questions are compared and analysed. Source: None. Method: Analysis.
9. To support the business rules identified earlier, what values should the data quality attributes have?
   Object: Required data quality threshold. Source: Stakeholders involved (figure 6). Method: Workshop.
10. What level of maturity is required to enable those data quality attribute values?
    Object: Maturity required. Source: Correlation found will be used. Method: Analysis.
11. What structure, process, technology, information and staff criteria define the maturity found?
    Object: Required values of maturity elements. Source: Theories described earlier. Method: Substitution.
12. What are the current organizational maturity and current values of data quality attributes?
    Object: Operational values of maturity elements and data quality. Source: Stakeholders (operations, integration team); Windesheim documentation. Method: Interviews; desk research. Comment: Observing both maturity and data quality improves reliability.
13. What is the gap between the current and required structure, process, technology, information and staff criteria?
    Object: Results from previous questions are compared and analysed. Source: None. Method: Analysis.
14. What conclusions may be derived from this gap?
    Object: Result from previous questions is analysed. Source: Theories identified earlier. Method: Analysis.
15. What recommendations may be defined?
    Object: Result from previous questions is analysed. Source: Theories identified earlier. Method: Analysis.
Table 02: Research Material
4.2 Research Strategy
4.2.1 Strategy

This research is characterized by a grounded theory approach, based on desk research. To improve reliability and validity, a survey was conducted by interviewing specialists in the field and within Windesheim. The subjects covered by the survey were maturity levels, process quality elements, data quality attribute values and the correlation between them. Interviewees were presented with statements and conclusions derived from publications and literature, and asked whether these were in line with their experience, using examples of real-world situations. The results were used to validate the hypothesis that maturity structures and data quality are related. External participants were chosen based on their expertise in dealing with data quality and maturity in general. Internal participants in interviews and the workshop were chosen based on their experience with data quality in the business domain, both from the viewpoint of operations and of user departments. Care was taken to include participants from a department where data quality was perceived to be troublesome and from a department where data quality issues were perceived to be successfully resolved.
4.2.2 Reliability

To reliably discover a relation between variables (i.e. data quality and maturity structures), a quantitative approach is required. This research, however, was qualitative in nature. Multiple theories on maturity and quality were discussed and balanced. The results were cross-checked by means of a survey amongst specialists. Population of quality attribute values was performed in a workshop involving Windesheim specialists, enabling them to reflect on the process and results. The rigor of the study and triangulation ensure reliability. However, results are less detailed compared to results gained from a quantitative approach.
4.2.3 Validity

In this project plan, it has been found that multiple theories point towards a required gain in maturity. It is therefore a valid approach to look for a relation between data quality attribute values and maturity structures. In this research, literature and publications of theories and research were explored to validate this hypothesis. This relation was discussed with specialists in a limited survey. Building on multiple accepted sources, reflection on results acquired and open discussion ensure internal validity, while applying the grounded theory approach ensures external validity.
4.2.4 Scope

This research explored the gap between required and current maturity at Windesheim. The gap analysis is focused on a specific business domain: study management. This business domain was chosen in close cooperation with the CIO and the Information Manager. The main goal of study management is to manage major, minor and course definitions, present those definitions to other business domains like scheduling & study planning, and to manage study progress.
5. Research Execution

This chapter presents the observations obtained by executing the research according to the research plan. Multiple maturity models defined in publications and literature are compared. After combining, normalizing and transforming the results, the Windesheim Data Quality Maturity model (WDQM) is created. Dimensions of data quality are explored, leading to a description of the relation between data quality maturity levels and data quality dimensions and attributes. Business rules are harvested from Windesheim business and IT documents, focused on the Windesheim business domain of study design, education, assessment and grading. Based on these business rules, best fitting data quality attribute values are defined, leading to an analysis of the required data quality maturity.
5.1 Correlation between data quality and maturity

The next paragraphs explore the first research question: what positive and negative correlations between structures defining maturity and data quality attributes may be found? To find this relation, theories on maturity and data quality are explored.
5.1.1 Maturity, a brief history

The first effort to formalize a maturity model was triggered by problems with delivering complex software systems for the US Department of Defense (DoD), mainly in connection with the Strategic Defense Initiative (SDI). Originally, the Capability Maturity Model (CMM) was developed as a tool to assess software suppliers. Development started in 1986 at the Software Engineering Institute (SEI) of Carnegie Mellon University and led to the Software Process Maturity Framework in 1987. In 1991, this resulted in the publication of CMM as the Capability Maturity Model v1.0. Based on experience with the use of this model, a new version 1.1 was published in 1993 (Kneuper, 2008). The five-stage maturity model immediately caught the attention of developers worldwide. In 2002, Brett Champlin, senior lecturer at Roosevelt University, counted over 120 maturity models, all derived from or inspired by the initial CMM (Champlin, 2002). To integrate multiple viewpoints, the Capability Maturity Model Integration (CMMI) version 1.0 was published in 2000. This model was developed even further, resulting in CMMI version 1.2 in 2006, offering three constellations which extend the area of applicability of CMMI to development (CMMI-DEV), acquisition (CMMI-ACQ) and services (CMMI-SVC) (Kneuper, 2008).
5.1.2 Maturity levels

CMM, its successor CMMI and their derivatives are based on common structures, the most well-known of which is perhaps the definition of Maturity Levels introduced by Crosby (1980). Currently, five levels are agreed upon (Kneuper, 2008) (Curtis, Hefley, & Miller, 2009):
1. Initial: no structures are in place at all; activities are performed on an ad-hoc basis;
2. Managed: processes are characterized by the project;
3. Defined: processes are defined by the organization;
4. Quantitatively managed: processes are measured and controlled;
5. Optimizing: the focus is on continuous process improvement.
Some maturity models recognize the five-level structure, yet assign different labels. An example is Master Data Management (Loshin, Master Data Management, 2008), in which the levels are labeled 1 Initial, 2 Reactive, 3 Managed, 4 Proactive and 5 Strategic Performance successively. In Automotive Spice, six levels of maturity are recognized, starting at level 0 (0 Incomplete, 1 Performed, 2 Managed, 3 Established, 4 Predictable, 5 Optimizing) (Hoermann, Mueller, Dittmann, & Zimmer, 2008). This seems to compensate for criticism that the step between CMMI level 1 and CMMI level 2 is too big (Kneuper, 2008). The Organizational Project Management Maturity Model (OPM3), however, skips the first level (Initial) altogether, so that four levels remain (SMCI: Standardize, Measure, Control and continuously Improve) (Project Management Institute, 2008). In this research, however, the level structure of CMMI, currently accepted as the standard, will be adopted.

5.1.3 Process Areas

The second important structure is the definition of Process Areas. A process area is a cluster of related practices in an area that, when implemented collectively, satisfies a set of goals considered important for making improvement in that area. Examples of process areas are project planning, organizational training, and causal analysis & resolution (Kneuper, 2008). At maturity level 1, processes are characterized as ad hoc or even chaotic; therefore, no process areas are assigned to maturity level 1 (Kneuper, 2008). In successive levels, process areas accumulate: in order to reach managed maturity, all process areas of level 2 have to be mastered, and all process areas of both levels 2 and 3 have to be mastered in order to reach defined maturity (Kneuper, 2008), as illustrated in the sketch below.

Each process area is defined by Goals. Goals guide the implementation of process areas within the context of each stage. For each goal, practices to reach that goal are associated. In total, CMMI defines up to 48 goals and 512 practices (Kneuper, 2008). In addition, People CMM, for instance, identifies 22 process areas, each defined by its own set of goals and practices to reach those goals (Curtis, Hefley, & Miller, 2009). This in itself poses a problem: combining multiple maturity models to identify the relevant maturity structures in the field of organizational structure, process, technology, information and staff may lead to a list of hundreds of process areas, goals and practices. Such a cluster of elements cannot be analyzed in the time available. An alternative approach is required.

Caballero and Piattini have created a CMMI-based data quality maturity model: Caldea (Caballero & Piattini, 2003). This model recognizes five maturity levels (Initial, Definition, Integration, Quantitative Management and Optimizing) and, for levels two to five, defines data quality activities and goals. The model is aimed at constructing and supporting a Data Management Process within an organization. At this point, it would be most helpful to simply adopt the Caldea model, implement the data quality activities and operationalize the associated variables. Unfortunately, the Caldea model is described at a high abstraction level, omitting implementation details and leaving out specifications of maturity structures and dimensions. And, since its conception in 2003, many theories on data quality have been published, incorporating recent developments in IT not (fully) present at the time Caldea was described. Therefore, Caldea simply is not specific enough to be directly applicable, and is likely to be outdated. However, the guidelines Caldea offers may well lead the way in constructing a more specific and up-to-date data quality maturity model.
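The accumulation rule mentioned above lends itself to a compact illustration. The sketch below is my own (the process-area sets are hypothetical and heavily abbreviated); it shows that a single unmastered process area at a lower level blocks every higher level:

```python
# A minimal sketch of the CMMI accumulation rule: level N is reached only
# when all process areas of levels 2 through N are mastered. The
# process-area sets are hypothetical and heavily abbreviated.

LEVEL_AREAS = {
    2: {"requirements management", "project planning"},
    3: {"organizational training", "risk management"},
    4: {"quantitative project management"},
    5: {"causal analysis & resolution"},
}

def maturity_level(mastered: set) -> int:
    level = 1  # level 1, Initial, requires nothing
    for lvl in sorted(LEVEL_AREAS):
        if LEVEL_AREAS[lvl] <= mastered:  # all areas of this level mastered?
            level = lvl
        else:
            break  # a gap at this level blocks all higher levels
    return level

print(maturity_level({"requirements management", "project planning"}))  # 2
print(maturity_level({"organizational training", "risk management"}))   # 1
```

Note the second example: mastering only level three areas still leaves the organization at level one, because the level two foundation is missing.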
5.1.4 Identifying relevant process areas

How can maturity structures be identified efficiently without overlooking important elements? The approach adopted here is to:
1. Identify data quality improvement measures by literature study and interviews;
2. Assign those measures to organizational structure, process, technology, information and staff, thus creating a balanced view;
3. Assign the resulting set of measures to maturity levels by linking each measure to a specific process area and/or practice and, again, balance the result.

In the next paragraphs, the results of this approach are presented.

Data quality improvement measures

A wide range of measures is discussed in literature, ranging from proper database design to instating data governance and data quality management. It may easily be overlooked, yet it makes perfect sense: when the design is flawed, the system built according to that design can hardly be expected to deliver high quality output. To prevent data quality issues from arising in the first place, Batini, Scannapieco and others stress the importance of good database design (standardization / normalization) and data integration (Batini & Scannapieco, 1998) (Fishman, 2009). Design and development call for a separation between development, test and production environments, for one would not want test and development activities to interfere with production processes and data. Such an environment is characterized by the ROTAP17 abbreviation. Another characteristic of building proper information systems is the elimination of manual activities. As pointed out by Thiadens in his interview (Appendix 6.3), manual interaction may account for up to 5 percent of data quality faults (Starreveld, Leeuwen, & Nimwegen, 2004). When improving data quality, reducing manual intervention is therefore paramount. When data quality issues arise, a problem solving approach is required, including root cause analysis, data profiling and cleaning, source rating, schema matching and cleaning, business rule matching and new data acquisition (Verreck, Graaf, & Sanden, 2005) (Besouw, 2009) (McGilvray, 2008) (Batini & Scannapieco, 1998).
- Root cause analysis is a technique to identify the underlying root cause: the primary source resulting in the problems experienced.
- Data profiling originated as a set of algorithms for statistical analysis and assessment of the quality of data values within a data set, as well as for exploring relationships that exist between value collections within and across data sets. For each column in a table, a data profiling tool provides a frequency distribution of the different values, offering insight into the type and use of each column. Cross-column analysis can expose embedded value dependencies, whereas inter-table analysis explores overlapping value sets that may represent foreign key relationships between entities.
- Source rating has the goal of rating sources on the basis of the quality of data they provide to other sources.
- Schema matching takes two schemas as input and produces a mapping between semantically correspondent elements of the two schemas.
- Schema cleaning provides rules for transforming a conceptual schema in order to achieve or optimize a given set of qualities.
- Business rule matching is the art of comparing data values found with valid values according to business rules. For instance, a person can either be male or female; therefore, a database field named 'gender' containing a value other than male or female is suspect.
- New data acquisition is an activity in which suspect data is replaced by newly retrieved data.

(Loshin, 2008) (Verreck, Graaf, & Sanden, 2005) (Besouw, 2009) (McGilvray, 2008) (Batini & Scannapieco, 1998)
17 Research, Ontwikkel (Design), Test, Acceptation and Production environments
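Business rule matching, as defined above, is straightforward to sketch. The rules and record below are hypothetical (a Dutch 1-10 grading scale is assumed) and are not Windesheim's actual business rules:

```python
# A minimal sketch of business rule matching: values are compared with the
# valid values a business rule allows; violations are flagged as suspect,
# candidates for new data acquisition. Rules and data are hypothetical.

RULES = {
    "gender": lambda v: v in {"male", "female"},
    "grade":  lambda v: 1.0 <= v <= 10.0,  # Dutch grading scale assumed
}

def suspect_fields(record: dict) -> list:
    """Return the fields of a record that violate a business rule."""
    return [field for field, is_valid in RULES.items()
            if field in record and not is_valid(record[field])]

print(suspect_fields({"gender": "unknown", "grade": 7.5}))  # ['gender']
```

Run over a whole table, such checks yield exactly the kind of data quality report that level two problem solving relies on.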
Solving individual data quality issues is referred to as 'small q' by (Besouw, 2009). Yet it is easily understood that without addressing the causes leading up to data quality issues in the first place, an organization will be problem solving continuously without ever reaching a more lasting solution. What is needed is a holistic approach to data quality, referred to as 'large Q' by (Besouw, 2009). Yang W. Lee, Leo L. Pipino, James D. Funk and Richard Y. Wang propose that data be considered a product: an information product (IP). "An IP is a collection of data elements that meet the specified requirements of a data consumer" (Lee, Pipino, Funk, & Wang, 2006). In the vision of Yang W. Lee et al., treating information as a product requires the manipulation of data to be organized as a production process, and puts data quality on the board's agenda. To reach this goal, data quality roles and responsibilities are established, data quality management procedures are in place and practical data standards are in use (Lee, Pipino, Funk, & Wang, 2006). Yang W. Lee et al. identify four fundamentals (Lee, Pipino, Funk, & Wang, 2006):

1. Understand the consumer's needs;
2. Manage information as a product of a well defined information product process;
3. Manage the life cycle of the information product;
4. Appoint an Information Product Manager.
Instating those fundamentals is also known as Master Data Management (Loshin, 2008) (Besouw, 2009) or Data Governance (Fishman, 2009). Master Data Management (or Data Governance) includes data quality Service Level Agreements (SLAs), life cycle data management and end-to-end process control. Process control implies the presence of controls: elements in the dataflow where the quality of data and process is ensured and monitored. Controls include data and specifications, technology, processes, CRUD18 roles and people & organization (work instructions and employee education) (McGilvray, 2008) (Besouw, 2009). Thiadens identified assigning responsibilities to the right people as a major contributor to data quality: "Problems in grade assignment may be solved by making the lecturer directly responsible for correct and timely grading. Lecturers are corrected by students when grade assignment is late or questionable. Registration of lecturer availability may be much improved if the lecturer is made responsible, and is given the right tools to manage this information" (Interview Thiadens, Appendix 6.3).

Considering information to be a product opens the way to applying production quality frameworks to information. One widely accepted framework is Six Sigma, a product quality improvement framework that reduces defects by improving the production process. In monitoring product quality, technology, processes, organization and staff are viewed as a whole. In Six Sigma, sigma represents the standard deviation; Six Sigma means six times sigma, indicating 3.4 defects per million opportunities (Boer, Andharia, Harteveld, Ho, Musto, & Prickel, 2006). The main instrument of Six Sigma is the continuous DMAIC quality improvement cycle (Define, Measure, Analyze, Improve, Control). In Six Sigma, Key Goal Indicators (KGIs) are defined and translated into Key Performance Indicators (KPIs) for the information manufacturing process. Controls are identified, influencing the KPIs. Thus, KGIs are measured by KPIs and managed by Controls. Finally, the process is executed and, in continuous DMAIC cycles, improved (Boer, Andharia, Harteveld, Ho, Musto, & Prickel, 2006).
18 Create Read Update Delete
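The Six Sigma arithmetic referenced above is easy to make concrete. The sketch below is illustrative only, with invented defect counts; it computes defects per million opportunities (DPMO) for a data production process and checks the result against the 3.4 DPMO target:

```python
# A minimal sketch of the Six Sigma defect metric: defects per million
# opportunities (DPMO). The figures used are invented for illustration.

def dpmo(defects: int, units: int, opportunities_per_unit: int) -> float:
    """Defects per million opportunities."""
    return defects / (units * opportunities_per_unit) * 1_000_000

# Hypothetical KPI measurement: 12 faulty attribute values found in
# 10,000 student records, each offering 25 checkable attributes.
rate = dpmo(defects=12, units=10_000, opportunities_per_unit=25)
print(f"{rate:.1f} DPMO")                    # 48.0 DPMO
print("Six Sigma target met:", rate <= 3.4)  # False
```

A KPI of this form can be tracked across DMAIC cycles to verify whether process changes actually reduce the defect rate.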
The notion of applying quality cycles to data is recognized by the Massachusetts Institute of Technology (MIT), which created the Total Data Quality Methodology (TDQM) (Lee, Pipino, Funk, & Wang, 2006). This approach is characterized by five stages:

1. Identify the problem;
2. Diagnose the problem;
3. Plan the solution;
4. Implement the solution;
5. Reflect and learn.
In addition to TDQM, Larry P. English has introduced the Total Information Quality Methodology (TIQM), identifying six processes ensuring continuous improvement of information quality (English, 2009):

P1: Assess Information Product Specification Quality;
P2: Assess Information Quality;
P3: Measure Poor Quality Information Costs & Risks;
P4: Improve Information Process Quality;
P5: Correct Data in Source and Control Redundancy;
P6: Establish the Information Quality Environment.

In TIQM, process six (P6) is an overall process, and actually the first process to be executed. While in both approaches we may recognize the recursive quality loop, TIQM more clearly recognizes data (information) to be a product. Even though both TDQM and TIQM recognize the closed quality improvement loop, it is the Six Sigma approach that offers the most recognized and widely used quality based approach. Therefore, in this research, Six Sigma practices are positioned at WDQM level five. To be able to fine-tune a process using quality cycles, process control has to be rigorous, leaving little room for workers in the process to deviate from their instructions. This is also known as operational excellence, in which the focus is on creating a process that is as efficient as feasible (Treacy & Wiersema, 1997).

Practices and structure, process, technology, information and staff

Now that practices improving data quality have been found, in this paragraph they are assigned to structure, process, technology, information and staff. As defined in paragraph 3.5, Concepts used, Structures, Systems and Staff are part of McKinsey's 7-factor model (Pascale, Peters, & Waterman, 2009). Structure deals with the way the organization is constructed (task management, coordination, hierarchy), while Process, Technology and Information criteria all define the Systems factor. Staff encompasses knowledge management, rewarding, education, morale, motivation and behavior (Pascale, Peters, & Waterman, 2009). Table 3 presents an overview.
Apply proper system design
- Structure: Project based development, project teams, project management
- Process: Proper database design, data integration
- Technology: A ROTAP environment is required
- Information: Structured
- Staff: Data modelling knowledge, domain knowledge, project management competent

Problem solving
- Structure: Ad hoc problem solving
- Process: Root cause analysis, data profiling and cleaning, source rating, schema matching and cleaning, business rule matching and new data acquisition
- Technology: Data analysis and cleaning tools
- Information: Unknown
- Staff: Analytical competent; knowledge of technology, business rules and data sources

Information as a Product
- Structure: IP Manager, demand and supply structure, data quality on the business agenda, data quality roles and responsibilities are established
- Process: Information is managed as a product of a well defined information product process, supporting data life cycle management
- Technology: Data quality analysis and reporting tools
- Information: Structured into an Information Product; subject to life cycle management; practical data standards are in use
- Staff: Commercially skilled (the customer is the consumer), understanding the customer's needs, proactive approach to changing data needs

Master Data Management
- Structure: Deliver quality according to Service Level Agreement
- Process: End-to-end process control
- Technology: Data quality controls are present
- Information: Defined, and role in life cycle (CRUD) documented
- Staff: Working according to strict instructions

Six Sigma
- Structure: Strict hierarchical
- Process: DMAIC, executed according to Key Goal Indicators, monitored by Key Performance Indicators
- Technology: Technology and information quality are observed as a whole
- Information: 3.4 defects per million opportunities
- Staff: Staff and information quality are observed as a whole
Table 03: Practices and structure, process, technology, information and staff
Please note that table 3 does not present a maturity model. In a maturity model, levels are organized in a strict hierarchy, in which process areas accumulate over successive levels. To transform the model found into a maturity model, further analysis is required. To do so, a view on maturity levels is created by evaluating multiple level-based maturity models. Table 4 combines process areas (or best practices, capabilities and activities) of several maturity models into one view: People CMM (Curtis, Hefley, & Miller, 2009), CMMI (Kneuper, 2008), the Organizational Project Management Maturity Model OPM3 (Project Management Institute, 2008), the Master Data Management Maturity Model (Loshin, 2001) and Caldea (Caballero & Piattini, 2003).
Practices and maturity levels

Level 1 Initial (processes are ad-hoc)
- CMMI process areas: none
- People CMM process areas: none
- OPM3 best practices: none
- MDM capabilities: Limited enterprise consolidation of representative models; collections of data dictionaries in various forms; limited data cleansing
- Caldea activities: none

Level 2 Managed (processes are characterized by the project)
- CMMI process areas: Requirements Management, Project Planning, Project Monitoring and Control, Supplier Agreement Management, Measurement and Analysis, Process and Product Quality Assurance, Configuration Management
- People CMM process areas: Compensation, Training & Development, Performance Management, Work Environment, Communication & Coordination, Staffing
- OPM3 best practices: Standardize Develop Project Charter process, Standardize Develop Project Management Plan process, Standardize project Collect Requirements process, Standardize project Define Scope process, …
- MDM capabilities: Application architectures for each business application; data dictionaries are collected into a single repository; initial exploration into low-level application services; review of options for information sharing; introduction of data quality management for parsing, standardization and consolidation
- Caldea activities: Data Management Project Management, Data Requirements Management, Data Quality Dimensions and Metrics Management, Data Sources and Data Targets Management, database or data warehouse development or acquisition project management

Level 3 Defined (processes are defined by the organization)
- CMMI process areas: Requirements Development, Technical Solution, Product Integration, Verification, Validation, Organizational Process Focus, Organizational Process Definition, Organizational Training, Integrated Project Management, Risk Management, Decision Analysis and Resolution
- People CMM process areas: Participatory Culture, Workgroup Development, Competency-Based Practices, Career Development, Competency Development, Workforce Planning, Competency Analysis
- OPM3 best practices: Measure Develop Project Charter process, Measure Develop Project Management Plan process, Measure project Collect Requirements process, Measure project Define Scope process, …
- MDM capabilities: Fundamental architecture for shared master data framework; defined services for integration with master data asset; data quality tools; policies and procedures for data quality management; data quality issues tracking; data standards processes
- Caldea activities: Data Quality Team Management, data quality product verification and validation, Risk and Poor Data Quality Impact Management, Data Quality Standardization Management, Organizational Processes Management

Level 4 Quantitatively managed (processes are measured and controlled)
- CMMI process areas: Organizational Process Performance, Quantitative Project Management
- People CMM process areas: Mentoring, Organizational Capability Management, Quantitative Performance Management, Competency-Based Assets, Empowered Workgroups, Competency Integration
- OPM3 best practices: Control Develop Project Charter process, Control Develop Project Management Plan process, Control project Collect Requirements process, Control project Define Scope process, …
- MDM capabilities: SOA for application architecture; centralized management of business metadata; enterprise data governance program; enterprise data standards and metadata management; proactive monitoring for data quality control feeds into governance program
- Caldea activities: Data Management Process Measurements Management

Level 5 Optimizing (continuous process improvement)
- CMMI process areas: Organizational Innovation & Deployment, Causal Analysis & Resolution
- People CMM process areas: Continuous Workforce Innovation, Organizational Performance Alignment, Continuous Capability Improvement
- OPM3 best practices: Improve Develop Project Charter process, Improve Develop Project Management Plan process, Improve project Collect Requirements process, Improve project Define Scope process, …
- MDM capabilities: Transaction integration available to internal applications; published APIs enable straight-through processing; cross-organization data governance
- Caldea activities: Causal Analysis for Defect Prevention, Organizational Development and Innovation
Table 04: A combined view on maturity.
In this view, all People CMM, all CMMI-COM and CMMI-DEV process areas and all Caldea activities are shown. With regard to OPM3 and MDM, a subset of best practices and capabilities is included, in order to present a workable overview. Using this view as a guideline, the practices identified in table 3 are assigned to specific maturity levels, resulting in table 5, the Windesheim Data Quality Maturity (WDQM) model.19 The assignment of practices to WDQM levels is discussed in the next paragraphs.
19 This table was validated in a discussion with M. van Steenbergen, lead architect at Sogeti (see appendix 6.2)
5.1.5 Windesheim Data Quality Maturity Model
Level 1 Initial (processes are ad-hoc)
- Structure: none
- Process: none
- Technology: none
- Information: Unspecified
- Staff: none

Level 2 Managed (processes are characterized by the project)
- Structure: Project based development, project teams, ad hoc problem solving
- Process: Data profiling and cleaning, source rating, schema matching and cleaning, business rule matching and new data acquisition
- Technology: Data analysis and cleaning tools; File Transfer data exchange pattern
- Information: Not trusted
- Staff: Analytical competent; knowledge of technology, business rules and data sources; data modeling knowledge

Level 3 Defined (processes are defined by the organization)
- Structure: Programme management
- Process: Root cause analysis, Requirements Development, Product Integration, Verification, Validation, data integration
- Technology: Technical Solution; a ROTAP environment is available; data integration through Remote Procedure Invocation
- Information: Fit for current use; a canonical data model supports data translations between domains
- Staff: Domain knowledge, programme management competent, data responsible

Level 4 Quantitatively managed (processes are measured and controlled)
- Structure: Information Product Manager; data quality on the business agenda; data quality roles and responsibilities are established; quality is delivered according to Service Level Agreement
- Process: Information is managed as a product of a well defined information product process, supporting data life cycle management; end-to-end process control
- Technology: Data quality analysis and reporting tools; integration patterns: Message Bus or Message Broker pattern
- Information: Structured into an Information Product; subject to life cycle management; canonical data model defines data standards as a lingua franca; data quality controls are present
- Staff: Commercially skilled (the customer is the consumer), understanding the customer's needs, proactive approach to changing data needs

Level 5 Optimizing (continuous process improvement)
- Structure: Processes are executed in a strict hierarchy
- Process: DMAIC, executed according to Key Goal Indicators, monitored by Key Performance Indicators
- Technology: Defined, and role in life cycle (CRUD) documented; technology and information quality are observed as a whole
- Information: 3.4 defects per million opportunities
- Staff: Working according to strict instructions; staff and information quality are observed as a whole
Table 05: Windesheim Data Quality Maturity model WDQM
Discussion

The structure column is characterized by growth from an ad-hoc approach, via project based development and an integrated programme management approach, to the institution of product quality management and, finally, total quality management at level five. At this level, the modus operandi for process execution is operational excellence, requiring employees in the workforce to adhere to strict standards and instructions (Treacy & Wiersema, 1997).

The process column replaces the rather limited notion of proper database design with the CMMI level three process areas Requirements Development, Technical Solution, Product Integration, Verification and Validation, indicating that the issue at this level is to design, build and implement a well functioning solution. The CMMI process area Technical Solution is positioned in the technology column. The activities mentioned under problem solving in table 3 fit maturity level two, which is characterized by ad-hoc problem solving. Root cause analysis, however, does not fit at this level, since this activity leads to solving data errors at the root of the problem. Root cause analysis is positioned at maturity level three, where it supplements requirements development, enabling integrated, robust solutions.

The technology column reveals an evolution in system integration. At level two, system integration is still designed in an ad-hoc, individual manner. At level three, Caldea positions data standardization,
while MDM mentions having defined services for integration, and according to MDM mastering level four is required for successfully building a SOA application architecture. This is reflected by the different system integration styles being utilized (Hope & Woolf, 2008). At level two, the File Transfer pattern is the dominant integration style, offering ease of integration and an excellent universal storage mechanism. At level three, the emergence of a canonical data model opens the way for a more standardized system integration, utilizing the Remote Procedure Invocation integration style (Hope & Woolf, 2008). Along the lines of MDM, at level four a common Messaging style, supported by a Message Broker or Message Bus pattern (Hope & Woolf, 2008), results in a service oriented application architecture.

In the information column, at level one, Initial, (the management of) the organization is oblivious with regard to data quality. All is assumed to be well; the state the data is in, however, remains unspecified. At the next level, data quality issues have triggered numerous attempts to repair and clean data, resulting in a decline of confidence in the reliability of the information. MDM positions a rather isolated view on data quality at level two, whereas at level three an integrated approach is supported by a fundamental architecture for shared data. Again, we may well see the emergence of a canonical data model at level three, enabling data to be transformed at the borders of each domain. At this level, data quality is fit for current use, as indicated by the presence of Caldea's risk and poor data quality management process area, whereas at level four data is seen as a product and data quality becomes future proof, and at level five data quality reaches six sigma.

Staff, finally, grows from being analytically competent (a good system programmer) to a commercially skilled worker, able to assess the data customer's needs. This reflects People CMM's professional training at level two, competence and career development at level three and the institution of empowered workgroups at level four. People CMM's definition of professional training at level two creates room for making the individual entering data responsible for the quality and, ultimately, the effects of the data entered. This, however, requires the organization to focus on the process as a whole, which initially is the case at level three. Therefore, an individual may be made responsible for his data entered at level three: data responsible.

Level five is characterized by a continuous improvement cycle. Current data quality theories do not include continuous improvement; it seems that data quality theories are focused on improving data quality to an acceptable level (fit for use). An alternative approach has to be adopted to shape level five. Both TDQM and Six Sigma are aimed at continuous process improvement. On closer inspection, however, TDQM is positioned as a project management approach for solving data quality problems in general (Lee, Pipino, Funk, & Wang, 2006, p. 64) (Kovac, Lee, & Pipino, 1997), ensuring that data errors are corrected at the data source, not at the place where they create havoc. This implies that a form of continuous improvement cycle has been defined at level two already. However, TDQM does not improve the production process itself: it springs into action once an obvious data error has been detected, and eliminates the root cause. Six Sigma, on the other hand, improves the data production process until data quality has reached an absolute maximum, surpassing the 'fit for use' boundary. Therefore, to populate level five, Optimizing, the Six Sigma fundamentals fit best.

In the remainder of this research, this model will be referred to as the Windesheim Data Quality Maturity model, or WDQM.
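The role of the canonical data model described above can be illustrated with a short sketch. It is my own; the Educator field names and codings are hypothetical. Each domain translates its records to and from the canonical model at its border:

```python
# A minimal sketch of translation at the domain border via a canonical
# data model (WDQM level three). Field names and codings are hypothetical.

CANONICAL_GENDER = {"m": "male", "v": "female"}  # Dutch source coding assumed

def educator_to_canonical(record: dict) -> dict:
    """Map a (hypothetical) Educator export to the canonical student model."""
    return {
        "student_id":     record["studentnummer"],
        "gender":         CANONICAL_GENDER[record["geslacht"]],
        "enrolled_since": record["inschrijfdatum"],  # ISO 8601 assumed
    }

print(educator_to_canonical({"studentnummer": "s123456",
                             "geslacht": "v",
                             "inschrijfdatum": "2009-09-01"}))
```

Because every domain maps only to the canonical model, adding a domain adds one translator instead of one per existing domain, which is what allows the level three and four integration styles to scale.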
Six Sigma on the other hand, improves the data production process until data quality has reached an absolute maximum, surpassing the „fit for use‟ boundary. Therefore, to populate level five, Optimizing, the Six Sigma fundamentals fit best. In the remainder of the research, this model will be referred to as the Windesheim Data Quality Maturity model, or WDQM. Data Ownership An issue that remains largely untouched so far is data ownership. From whom is the data, anyway? To be more precise, who owns the data at Windesheim? Take for instance grades assigned to assessments, made by students. Who owns that grade? Is it the student, or the student administration,
or the IT department perhaps? And what does this all mean for treating information as a product? If information is a product, and it is subject to Service Level Agreements, then who is selling what to whom?

In literature, something is said on data ownership. At level 4, in the structure column, it is found that data quality roles and responsibilities are established and an Information Product Manager instated (Lee, Pipino, Funk, & Wang, 2006). This issue is more specifically addressed by Danette McGilvray, who introduces the Data Steward (McGilvray, 2008) as a replacement for the data owner, since in her vision ownership results in a too rigid and inflexible position of stakeholders. Indeed, when interviewed, Thiadens, lector at Fontys university, identified ownership as an obstacle: "The most difficult hurdle to be solved here is to overcome the notion that information is not owned by the decentralized business units." (Interview dr. mr. ir. Th. J.G. Thiadens, Appendix 6.3)

A data steward, on the other hand, is a role, acting on behalf of and in the best interest of someone else, thus creating room to maneuver and flexibility to implement this role. Gartner seems to agree: "A data owner owns the data, much like the queen owns the land, while a data steward takes care of the data, much like a farmer takes care of the land" (Friedman, 2009). To be able to take care of data, one must have the right tools and responsibilities. A data steward is able to be effective at level 4, Quantitatively Managed, since at this level information is managed as a product, for the quality of which one can be responsible (Lee, Pipino, Funk, & Wang, 2006). The Information Product Manager is therefore positioned at level 4 and assigned the role of Data Steward. It should be noted, however, that management involvement remains crucial: "…the data steward … cannot fulfill his role as caretaker for data quality if the means to effectively influence data quality do not come with the job. Since data quality is related to organizational maturity, the means required are managerial rather than technical. To ensure data quality, one may have to be prepared to restructure the organization. Instating data stewardship without the preparedness of taking (perhaps drastic) managerial decisions, restructuring the fabric of an organization, may be in vain. There HAS to be a manager responsible for data quality with the authority to implement change." (Interview de Graaf, appendix 6.4)

Graphical presentation

[Figure 09: Graphical representation of the WDQM (pyramid):
5. Optimizing: constantly improving process and data quality in total quality cycles;
4. Quantitatively Managed: treating information as a product, handling data quality problems as product and process faults;
3. Defined: controlling the process solving data quality issues through structured system development and rigorous testing;
2. Managed: aware of data quality problems, solving data quality issues on an ad-hoc basis;
1. Initial: data quality has not yet been formally identified as the source of problems.]
5.1.6 Alternative views on data quality maturity

In the previous paragraphs, multiple maturity models were analyzed, resulting in the WDQM. The common denominator between those models is that they are all level based maturity models, using process areas (Curtis, Hefley, & Miller, 2009), (Kneuper, 2008), best practices (Project Management Institute, 2008) or capabilities (Loshin, 2001) to achieve the goals defining each level of maturity. In literature, other data quality maturity models are described, using a similar level-based description, yet lacking the definition of process areas (or best practices c.q. capabilities) and goals. Therefore, using these models as a source for analysis is difficult, if not impossible. However, now that the WDQM has been defined, what can we learn from comparing the resulting data quality maturity model with the other maturity models described in literature?

Data Quality Management Maturity Model

An alternative view on data quality maturity was developed by Kyung-Seok Ryu, Joo-Seok Park, and Jae-Hong Park (figure 10).
Figure 10: A Data Quality Management Maturity Model (Ryu, Park, & Park, 2006)
In this view on data quality maturity, data management operates at an increased level of abstraction in each successive maturity level. Where initially data is managed from a rather operational point of view (the physical database schema), at the second level a data model is present, resulting in a more integrated view on data. Next, this model is standardized using metadata standards, and finally a more holistic view is obtained utilizing a data architecture. This view on maturity takes another approach, solely focusing on the information aspect and utilizing four levels instead of the CMMI five-level approach, whereas in the WDQM, at higher maturity levels, data transforms into a product and the focus is on improving the production process. However, similarities may be observed:

Data Quality Management Maturity | Windesheim Data Quality Maturity
Lev 1: Management of physical data | Level 2: Focus on repairing physical data quality issues
Lev 2: Management of data definitions | Level 3: Focus on requirements and data design
Lev 3: Management through data standards | Level 4: Data is an IP, data standards are in use
Lev 4: Holistic data management, architecture | Level 5: Continuously improved through DMAIC
As shown, when maturity levels are aligned, levels 1 through 3 of the data quality management maturity model bear similarities to levels 2 through 4 of the Windesheim Data Quality Maturity model. The data quality management maturity model presented by Kyung-Seok Ryu et al. may enrich the information column of the WDQM (table 5).
Gartner Data Quality Maturity Model

Another data quality maturity model is defined by Gartner. Gartner recognizes five levels of maturity (Gartner, 2007): "Organizations at Level 1 have the lowest level of data quality maturity, with only a few people aware of the issues and their impact. … Organizations at Level 2 are starting to react to the need for new processes that improve the relevance of information for daily business. … Organizations at Level 3 are proactive in their data quality efforts. They have seen the value of information assets as a foundation for improved enterprise performance … At Level 4, information is part of the IT portfolio and considered an enterprise wide asset, and the data quality process becomes part of an EIM program. … Companies at Level 5 have fully evolved EIM programs for managing their information assets with the same rigor as other vital resources, such as financial and material assets." (Gartner, 2007)

Even though Gartner does not define process areas and goals for each level, characteristics defining each level are provided in a descriptive text. To analyze this description, table 6 (see next page) was created, containing both the WDQM and the characteristics from Gartner's vision on data quality maturity (Gartner, 2007). Again, similarities and differences can be observed. In Gartner's view, at maturity level three the organization is already moving beyond project based development, which leads to a somewhat confusing and less clear-cut distinction between the maturity levels managed and optimized. Also, the distinction made between an organization responding in a reactive or a proactive mode to data quality issues is interesting. Being proactive and having Enterprise Information Management (EIM) operational at level three already might be a bit steep, considering that at level three OPM3 positions projects as being measurable (not yet in control), MDM defines data quality as traceable (and positions proactive monitoring at level four), and CMMI focuses on integration (and positions quantitative management at level four) (Curtis, Hefley, & Miller, 2009), (Kneuper, 2008), (Project Management Institute, 2008), (Loshin, 2001).
Level 1

WDQM (1, Initial). Focus: Processes are ad-hoc.
- Structure: -
- Process: -
- Technology: -
- Information: Unspecified
- Staff: -

Gartner (1, Aware). Focus: Lowest level of data quality maturity.
- Structure: Within the entire organization, no person, department or business function claims responsibility for data.
- Process: When a problem with data quality is obvious, there is a tendency to ignore it and to hope that it will disappear of its own accord.
- Technology: No formal initiative to cleanse data exists, and information emerging from computers is generally held to be "correct by default."
- Information: Business users are largely unaware of a variety of data quality problems, partly because they see no benefit for themselves in keeping data clean.
- Staff: Only a few people are aware of the issues and their impact.

Level 2

WDQM (2, Managed). Focus: Processes are characterized by the project.
- Structure: Project based development, project teams, ad hoc problem solving.
- Process: Data profiling and cleaning, source rating, schema matching and cleaning, business rule matching and new data acquisition.
- Technology: Data analysis and cleaning tools. File Transfer data exchange pattern.
- Information: Not trusted.
- Staff: Analytically competent; knowledge of technology, business rules and data sources; data modeling knowledge.

Gartner (2, Reactive). Focus: Reacting to the need for new processes.
- Structure: Although field or service personnel need access to accurate operational data to perform their roles effectively, businesses take a wait-and-see approach in relation to data quality.
- Process: Starting to react to the need for new processes that improve the relevance of information for daily business.
- Technology: Application developers implement simple edits and controls to standardize data formats, check on mandatory entry fields and validate possible attribute values.
- Information: Business decisions and system transactions are regularly questioned due to suspicions about data quality.
- Staff: Employees have a general awareness that information provides a means for enabling greater business-process understanding and improvement.

Level 3

WDQM (3, Defined). Focus: Processes are defined by the organization.
- Structure: Programme management.
- Process: Root cause analysis, Requirements Development, Product Integration, Verification, Validation, Data integration.
- Technology: Technical Solution; a ROTAP environment is available; data integration through Remote Procedure Invocation.
- Information: Fit for current use; a canonical data model supports data translations between domains.
- Staff: Domain knowledge, programme management competent, data responsible.

Gartner (3, Proactive). Focus: Proactive data quality efforts.
- Structure: Organizations have seen the value of information as a foundation for enterprise performance and moved from project-level IM to a coordinated EIM strategy.
- Process: Major data quality issues are documented, but not completely remediated.
- Technology: Data quality tools, for tasks such as profiling or cleansing, are used on a project-by-project basis, but housekeeping is typically performed by the IT department or data warehouse teams.
- Information: Business analysts feel data quality issues most acutely and data quality is part of the IT charter. Levels of data quality are considered "good enough" for most tactical and strategic decision-making.
- Staff: Department managers and IT managers communicate data administration and data quality guidelines. The concept of "data ownership" is discussed.

Level 4

WDQM (4, Quantitatively Managed). Focus: Processes are measured and controlled.
- Structure: Information Product Manager; data quality on the business agenda; data quality roles and responsibilities are established; quality is delivered according to Service Level Agreement.
- Process: Information is managed as a product of a well defined information product process, supporting Data Life Cycle Management; end-to-end process control.
- Technology: Data quality analysis and reporting tools; integration patterns: Message Bus or Message Broker pattern.
- Information: Structured into an Information Product; subject to Life Cycle Management; a canonical data model defines data standards as a lingua franca; data quality controls are present.
- Staff: Commercially skilled (the customer is the consumer); understanding the customer needs; proactive approach to changing data needs.

Gartner (4, Managed). Focus: Information is an enterprise-wide asset.
- Structure: The data quality process is part of an EIM program and is now a prime concern of the IT department and a major business responsibility.
- Process: Data quality is measured and monitored at enterprise level regularly. An impact analysis links data quality to business issues and process performance.
- Information: Information is part of the IT portfolio and considered an enterprise-wide asset.
- Staff: Multiple data stewardship roles are established within the organization.

Level 5

WDQM (5, Optimizing). Focus: Continuous process improvement.
- Structure: Processes are executed in a strict hierarchy.
- Process: DMAIC, executed according to Key Goal Indicators, monitored by Key Performance Indicators.
- Technology: Defined, and role in life cycle (CRUD) documented; technology and information quality are observed as a whole.
- Information: 3.4 defects per million opportunities.
- Staff: Working according to strict instructions; staff and information quality are observed as a whole.

Gartner (5, Optimized). Focus: Fully evolved EIM programs.
- Structure: Fully evolved EIM programs for managing their information assets with the same rigor as other vital resources, such as financial and material assets.
- Process: Rigorous processes are in place: ongoing housekeeping exercises, continuous monitoring of quality levels.
- Technology: Commercial data quality software is implemented. Cleansing is performed either at the data integration layer or directly at the data source.
- Information: Data is enriched in real time by third-party providers with additional credit, demographic, sociographic, household, geospatial or market data. Unstructured mission-critical information, such as documents and policies, becomes subject to data quality controls.
- Staff: Quality metrics are attached to the compensation plans of data stewards and other employees.
Table 06: A combined view on the WDQM and the Gartner Data Quality Maturity model
At level two of Gartner's model, the emphasis lies on being able to develop the right solution, while at level three the focus shifts towards (pro-)actively monitoring and ensuring data quality (Gartner, 2007). In the WDQM, however, at level two the emphasis is on repairing data quality issues in an ad-hoc manner, whilst at level three the focus shifts toward developing more robust and better aligned solutions. Indeed, the order in which these things take place may differ depending on one's viewpoint. One may argue that, in order to experience data quality issues, one must be able to develop applications first. The WDQM is based on sound theories
on maturity, which state that a subject (data quality) is discarded first, then dealt with on an ad-hoc basis (i.e. "repaired") and only understood and implemented more robustly at maturity levels three and upwards (see figure 9). Therefore, in this research the WDQM will remain unchanged.

5.1.7 Conclusion

In the previous paragraphs a generic data quality maturity model has been established by:
1. Identifying data quality improvement practices by literature study and interview;
2. Assigning those practices to organizational structure, process, technology, information and staff, thus creating a balanced view;
3. Assigning the resulting set of practices to maturity levels by linking each measure with a specific process area, creating a maturity matrix.

Finally, the resulting Windesheim Data Quality Maturity model (WDQM) is compared with other data quality maturity models. Differences may be observed, and it is found that, based on one's viewpoint, the order of process areas in levels two and three may vary, and is therefore open for discussion. Data ownership, an important issue when discussing data quality, has hardly been mentioned in literature. It is suggested to replace data ownership by data stewardship at level 4. Now that a model for data quality maturity has been developed, the data quality threshold is to be established. In the next paragraphs, data quality attributes are defined and the domain business rules are identified.
5.2 Data Quality Attributes

In this paragraph, the search is on for answers to the following questions:
- In higher education, what positive and negative correlations between maturity and data quality may be found? What values of data quality attributes correlate with each level of maturity?
- What do process quality theories describe about positive correlations between quality and maturity?
- What do process quality theories describe about negative correlations between quality and maturity?
- Are those observations consistent?

5.2.1 Dimensions of data quality

In literature, data quality is defined by dimensions, and those dimensions in turn are measured by data quality attributes (Loshin, 2008) (Batini & Scannapieco, 1998) (McGilvray, 2008). To find the right data quality attributes, the dimensions have to be identified first. This paragraph establishes a view on the dimensions of data quality.

What are the dimensions of data quality? When we examine literature on this topic, we discover that many dimensions are defined but, unfortunately, naming and definitions vary between sources. Table 7 presents an overview. For each dimension, the table shows the definition, the source that supplied the definition and relationships with other dimensions. This relationship is either specifically supplied by the source (for instance, in the form of a formula) or it is found by comparing definitions (indicating that the dimensions are actually synonyms).
Table 7 presents a non-normalized view on the data quality dimensions found in literature. To create a more usable view, this set of dimensions will be compacted by removing duplicates and synonyms. In some cases, literature mentions dimensions that relate more to the quality of software than to the quality of data (Ease of use, Maintainability and Presentation Quality). These dimensions are omitted.

- Accessibility: Ease of attainability of the data (Lee, Pipino, Funk, & Wang, 2006). Related: Accessibility = 1 - (delivery time - input time) / (outdated time - input time).
- Accuracy, Database: Correctness of data in the database (Zeist, Hendriks, Paulussen, & Trieneken, 1996).
- Accuracy, Semantic: Closeness of value v to true value v' (Batini & Scannapieco, 1998).
- Accuracy, Syntactic: Closeness of value v to elements of the corresponding domain D (Batini & Scannapieco, 1998).
- Actuality, Database: Data in the database is in conformance with reality (Zeist, Hendriks, Paulussen, & Trieneken, 1996). Related: Accuracy, Timeliness.
- Completeness: The extent to which data are of sufficient breadth, depth and scope for the task at hand (Batini & Scannapieco, 1998).
- Completeness: The degree in which elements are not missing from a set (Lee, Pipino, Funk, & Wang, 2006).
- Consistency: Violation of semantic rules over (a set of) data-items (Batini & Scannapieco, 1998).
- Consistency: The degree in which values and formats of data elements are used in a univocal way (Lee, Pipino, Funk, & Wang, 2006).
- Consistency: A measure of the equivalence of information stored or used in various data stores, applications and systems (McGilvray, 2008).
- Currency: Concerns how promptly data are updated (Batini & Scannapieco, 1998) (Lee, Pipino, Funk, & Wang, 2006). Related: Currency = delivery time - input time + age.
- Data Coverage: A measure of the availability and comprehensiveness of data compared to the total data universe or population of interest (McGilvray, 2008). Related: Completeness.
- Decay: A measure of the rate of negative change to the data (McGilvray, 2008). Related: Timeliness.
- Duplication: A measure of unwanted duplication (McGilvray, 2008). Related: Uniqueness.
- Ease of use: A measure of the degree to which data can be accessed and used (McGilvray, 2008).
- Format Compliance: The degree in which a modeled object conforms to the set of rules bounding its representation (Loshin, 2008).
- Integrity, Data: A measure of total data quality (McGilvray, 2008).
- Integrity, Referential: The degree in which related sets of data are consistent (Chen, 1976).
- Maintainability: The degree to which data can be updated, maintained and managed (McGilvray, 2008).
- Presentation Quality: A measure of how information is presented to and collected from those who utilize it (McGilvray, 2008).
- Reliability: Free of error (Lee, Pipino, Funk, & Wang, 2006).
- Reliability: The degree in which data represent reality (Verreck, Graaf, & Sanden, 2005).
- Specifications: A measure of the existence, completeness, quality and documentation of data standards (McGilvray, 2008).
- Timeliness: Timeliness expresses how current data are for the task at hand (Batini & Scannapieco, 1998). Related: Timeliness = 1 - volatility / currency.
- Timeliness: Timeliness can be measured as the time between when information is expected and when it is readily available for use (Loshin, 2008).
- Timeliness: Or Availability: a measure of the degree to which data are current and available for use (McGilvray, 2008). Related: Availability.
- Trust: A measure of the confidence in the data quality (McGilvray, 2008). Related: Reliability.
- Uniqueness: Refers to requirements that entities are captured, represented, and referenced uniquely (Loshin, 2008). Related: Consistency = f(uniqueness).
- Usability: The total fitness of data for use (Verreck, Graaf, & Sanden, 2005). Related: Usability = Reliability * Relevance, U=R2.
- Volatility: Characterizes the frequency with which data vary in time (Batini & Scannapieco, 1998). Related: Decay, Currency.
Table 07: An overview of data quality dimensions
One data quality dimension is mentioned yet rather loosely defined: integrity. Integrity is defined as an over-all measure of data quality. In this research, data is defined to have integrity once it is fit for
use (see paragraph 3.1.4, What is data quality). This also means that usability and integrity are synonymous.

A dimension that is not explicitly mentioned in literature on data quality is security. Markus Schumacher et al. identify four data quality dimensions related to security: Confidentiality, Integrity, Availability and Accountability (Schumacher, Fernandez-Buglioni, Hybertson, Buschmann, & Sommerland, 2006). We may recognize Integrity and Availability to be part of the model already. Integrity is defined as an over-all measure of data quality, acting as a container for all other dimensions of data quality. In data quality literature, Availability is commonly known as Timeliness. Confidentiality and Accountability are added to the list of data quality dimensions. Confidentiality is the property that data is disclosed only as intended by the enterprise (Schumacher, Fernandez-Buglioni, Hybertson, Buschmann, & Sommerland, 2006), while Accountability is the property that actions affecting enterprise assets can be traced to the actor responsible for the action (Schumacher, Fernandez-Buglioni, Hybertson, Buschmann, & Sommerland, 2006).

Timeliness is defined by (Batini & Scannapieco, 1998) as 1 - volatility / currency. Volatility being a frequency and Currency being a timeframe, it is proposed in this research to replace this by a simpler equation: Timeliness = Volatility * Currency. In this case, when Currency is shorter than the Volatility period ('wavelength'), Timeliness < 1; when Currency is longer than the Volatility period, Timeliness > 1. The analysis results in table 8: Dimensions of data quality.

- Accessibility: Ease of attainability of the data. Related: Accessibility = 1 - (delivery time - input time) / (outdated time - input time).
- Accountability: The property that actions affecting enterprise assets can be traced to the actor responsible for the action. Related: Security.
- Accuracy: Closeness of value v to true value v'.
- Completeness: The degree in which elements are not missing from a set.
- Confidentiality: The property that data is disclosed only as intended by the enterprise. Related: Security.
- Consistency: The degree in which values and formats of data elements are in line with semantic rules over this set of data-items. Related: Consistency = f(Uniqueness).
- Currency: Concerns how promptly data are updated. Related: Currency = delivery time - input time + age.
- Integrity, Data: The degree in which data is fit for use.
- Integrity, Referential: The degree in which related sets of data are consistent. Related: Consistency.
- Reliability: The degree in which data is perceived to represent reality.
- Specifications: A measure of the existence, completeness, quality and documentation of data standards.
- Timeliness: Or Availability: a measure of the degree to which data are current and available for use. Related: Timeliness = volatility * currency.
- Uniqueness: Refers to requirements that entities are captured, represented, and referenced uniquely.
- Volatility: Characterizes the frequency with which data vary in time.
Table 08: Dimensions of data quality
Batini points out that dimensions could be conflicting: "For instance, a list of courses published on a university web site must be timely though there could be accuracy or consistency errors and some fields specifying courses could be missing" (Batini & Scannapieco, 1998).

5.2.2 Data Quality Dimensions Discussed

Now that the final set of dimensions of data quality is identified, can individual dimensions be assigned to levels of the WDQM? In other words, can it be argued that a certain level of maturity has to be mastered in order to be able to satisfy (a group of) data quality dimensions? In this paragraph, this question is explored by identifying the measures which establish each data quality dimension, comparing these measures to WDQM process areas, thus binding the dimension to the corresponding WDQM maturity level, and finally defining the corresponding data quality attribute(s).
Accessibility

Accessibility deals with the fact that data needs to be delivered before it becomes insignificant (outdated). This makes for a rather complex, compound dimension. Accessibility is influenced by Volatility (the rate at which data changes), Timeliness (the speed at which data is available for use) and Currency (the speed at which data is updated in the system). Both Timeliness and Currency are positioned at level 4; thus Accessibility can only be guaranteed at level 4, Quantitatively Managed. Accessibility is measured by a ratio, indicating the ease of attainability of the data: Accessibility = 1 - (delivery time - input time) / (outdated time - input time) (Lee, Pipino, Funk, & Wang, 2006).

Accountability

Markus Schumacher et al. identify a series of security patterns especially focused on maintaining Accountability. Security accounting is a service area that performs four functions: capture, store, review and report data about security events (Schumacher, Fernandez-Buglioni, Hybertson, Buschmann, & Sommerland, 2006). Patterns used to execute this process are security accounting, audit service, audit trail, and intrusion detection (Schumacher, Fernandez-Buglioni, Hybertson, Buschmann, & Sommerland, 2006). To be able to implement these patterns, a view on data structures and data quality is required, as well as well-defined and independently operating ROTAP environments. It may therefore be argued that Accountability can be maintained no earlier than at maturity level 3, Defined. Attributes involved are the actors involved, the assets affected, the time, date and place of the event, and the methods used (Schumacher, Fernandez-Buglioni, Hybertson, Buschmann, & Sommerland, 2006).

Accuracy

Accuracy is about getting the data right, being as close to reality as possible. Amongst the measures ensuring accuracy are data profiling and cleaning. This means that a certain level of Accuracy can be achieved at maturity level 2, Managed, be it in a reactive manner and at high costs in terms of time and labor. At this level, flaws in accuracy will continue to return, jeopardizing reliability. At maturity level 3, Defined, by implementing robust applications utilizing various types of input checks, an acceptable level of accuracy will be achieved in a more lasting fashion. Accuracy is measured by the accuracy error, a ratio ranging between 0 and 1, indicating the number of characters, data elements or database tuples in error as a fraction of the total number of characters, data elements or database tuples (Batini & Scannapieco, 1998).

Completeness

Completeness is about getting all the data. To get all data elements in time, the business process needs to be well organized and scheduled, with all sub-processes delivering detailed information right on time. Therefore, maturity level 3, Defined, is required to effectively organize for completeness. If processes are not controlled efficiently, the process will either continue without the required data or come to a halt until the required data is delivered, and timeliness will be jeopardized. Completeness cannot be 'fixed' with data profiling and cleaning techniques, since missing data will only be available once the processes responsible for this data have delivered their output. Completeness is measured as a ratio ranging between 0 and 1, indicating the number of data elements missing as a fraction of the total number of data elements (Lee, Pipino, Funk, & Wang, 2006).
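The two ratio-style attributes just defined are simple to operationalize. The following minimal Python sketch uses hypothetical counts from a grade administration; it is an illustration only, not a measurement taken from Educator:

```python
# Minimal sketch (hypothetical counts): the accuracy error and completeness
# ratios as defined by (Batini & Scannapieco, 1998) and (Lee, Pipino, Funk,
# & Wang, 2006). Both range between 0 and 1.

def accuracy_error(items_in_error: int, total_items: int) -> float:
    """Fraction of characters, data elements or database tuples in error."""
    return items_in_error / total_items

def completeness_gap(missing_elements: int, total_elements: int) -> float:
    """Fraction of data elements missing from the set."""
    return missing_elements / total_elements

# Hypothetical figures: 12 faulty grade records out of 4800, and 3 of 240
# course descriptions missing.
print(accuracy_error(12, 4800))      # 0.0025
print(completeness_gap(3, 240))      # 0.0125
```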
Confidentiality

Confidentiality is about securing data from unauthorized access. To this end, a multitude of security patterns exists, each of which may be invoked and combined into security services according to the security levels required (Schumacher, Fernandez-Buglioni, Hybertson, Buschmann, & Sommerland, 2006). While security services do well in binding specific security patterns to specific situations, they do not help us position Confidentiality in a maturity model. However, Schumacher et al. identify three basic security access types currently in use: the access matrix, the role-based access control model (RBAC) and the multilevel model (Schumacher, Fernandez-Buglioni, Hybertson, Buschmann, & Sommerland, 2006). The most basic type, the access matrix, provides access to resources by identifying which active entity in a system may access what resources and how (Schumacher, Fernandez-Buglioni, Hybertson, Buschmann, & Sommerland, 2006). Role-based access simplifies access right management by grouping active entities into roles, and assigning generic rights for each role (Schumacher, Fernandez-Buglioni, Hybertson, Buschmann, & Sommerland, 2006). This way, once a new participant enters the organization, only the correct roles need to be added to his identity, instead of painstakingly assigning all individual rights. In multilevel security, sensitivity is defined at data level, not at resource level; users receive clearance, and the access of users with specific clearance levels to data is based on policies.

These access types may well help us position Confidentiality in the maturity model, since applying these styles requires different insights into stakeholders and processes. The simplest type, the access matrix, already requires a view on the stakeholders involved in the business process and the individual resources required to perform their tasks. This requires processes to be defined by the organization, and therefore Confidentiality can only be effectively guaranteed at maturity level 3, Defined. On top of this, more advanced modes of access management require processes to be measured and controlled, and therefore both role-based access and multilevel security can be deployed effectively once an organization has reached data quality maturity level 4, Quantitatively Managed. Security may be expressed in security services (Schumacher, Fernandez-Buglioni, Hybertson, Buschmann, & Sommerland, 2006). A minimal sketch of the access matrix and role-based styles closes this paragraph.

Consistency

Consistency is the degree in which values and formats of data elements are in line with semantic rules over the set of data-items. The first observation is that de-duplication of data may improve consistency, since data stored in multiple locations is likely to get corrupted. There is a relation between consistency and uniqueness, in that increasing uniqueness will support consistency. Therefore, consistency will benefit from a holistic view on information processing, in which attention is paid to the dispersion of data within an organization. Such a holistic view is referred to as an enterprise architecture (Lankhorst, 2005) (Boterenbrood, Hoek, & Kurk, 2005), which would seemingly fit maturity levels 4 and 5. However, consistency can also be achieved using data profiling and cleaning techniques, and therefore consistency can be achieved at maturity level 2, Managed, already. Consistency is measured as a ratio ranging between 0 and 1, indicating the number of data elements violating a specific consistency type as a fraction of the total number of data elements (Lee, Pipino, Funk, & Wang, 2006).

Currency

Currency describes how promptly data are updated and is a function of age (of the data), delivery time and input time: Currency = age + delivery time - input time. Currency is the sum of how old the
data was when it was received plus a term that measures how long the data has been in the information system (Batini & Scannapieco, 1998) (Lee, Pipino, Funk, & Wang, 2006). Currency is targeted by straight-through processing, in which near real time service oriented technologies replace cumbersome batch procedures (Pant & Juric, 2008). This is reflected in Master Data Management, where the development of a Service Oriented Architecture is firmly positioned at level 4 (Loshin, 2008). Currency can be measured in days or milliseconds. However, to be able to reach any agreement on currency, business processes need to be measured and controlled. Currency, therefore, is indeed a data quality dimension that can only effectively be implemented and discussed at maturity level 4, Quantitatively Managed. The measure for Currency is time.

Data Integrity

Data integrity is an indication of the degree in which data is fit for use: the data are integer once they are deemed fit for use. This data quality dimension therefore acts as a container dimension, covering all other aspects of data quality.

Referential Integrity

Referential integrity refers to the degree in which related sets of data are consistent. It is therefore a special instance of consistency. Referential integrity was introduced by (Chen, 1976), and is, within one database, easily enforced by the implementation of referential constraints (see the sketch following table 9 in paragraph 5.2.3). Referential integrity may therefore well be achieved at maturity level 2, Managed, where data rules enforce referential integrity. Referential Integrity is measured as a ratio ranging between 0 and 1, indicating the number of database tuples violating a specific relation type as a fraction of the total number of database tuples (Lee, Pipino, Funk, & Wang, 2006).

Reliability

Reliability is the degree in which the data is perceived to represent reality. A synonym is trust (McGilvray, 2008). According to the WDQM model, reliability is first achieved at level 3, Defined. Reliability is binary: data is trusted, or it is not.

Specifications

Specifications is a measure of the existence, completeness, quality and documentation of data standards (McGilvray, 2008). As such, specifications are required for processes like source rating, schema matching and cleaning, business rule matching and new data acquisition. De Graaf mentions insight as an important dimension of data quality (see appendix 6.4): 'Insight in data means that it is clear for an organization what data attributes are required or available, where and why these data attributes are created, what sources were used, where these attributes are used, who guards and tests the attribute, when these attributes are outdated and, once obsolete, how they are dealt with' (interview De Graaf, appendix 6.4). Insight can be seen as a result of valid specifications, and is an important prerequisite for further data quality improvement. Therefore we may expect Specifications to be present at level 2, Managed.
Specifications is binary: they are either present or absent. Incomplete, faulty or outdated specifications fall in the absent category, since they do not contribute to reliable insight.

Timeliness

Timeliness, or Availability, is a measure of the degree to which data are current and available for use (Batini & Scannapieco, 1998) (Loshin, 2001) (McGilvray, 2008). Timeliness is measured as a ratio, indicating the availability for use of the data. It is expressed as a function of Volatility and Currency: T = V * C. If Currency is larger than the Volatility 'wavelength', Timeliness becomes larger than one, meaning the data is becoming less fitting. Volatility is a fixed parameter; therefore, to increase Timeliness, Currency needs to be reduced. Since Currency is positioned at level 4, Quantitatively Managed, an effective implementation of Timeliness requires an organization to have reached level 4 as well.

Uniqueness

Uniqueness refers to requirements that entities are captured, represented, and referenced uniquely (Loshin, 2008). In the definition given by Loshin (2008), uniqueness is bound to data in a database or file system: "The dimension of uniqueness is characterized by stating that no entity exists more than once within the data set" (Loshin, 2008). This implementation of uniqueness is available at level 2, Managed, already, since data profiling tools and database constraints simply enforce this rule. For uniqueness, no attribute has been published. Therefore, it is proposed to measure uniqueness as a ratio ranging between 0 and 1, indicating the number of data elements duplicated as a fraction of the total number of data elements in a database or file.

Volatility

Volatility characterizes the frequency with which data vary in time (Batini & Scannapieco, 1998). A synonym is decay (McGilvray, 2008). Volatility is actually not so much a data quality dimension as a dimension of data itself: data IS volatile. Therefore, volatility is present at maturity level 1, Initial, be it that it is recognized by just a few specialists within the organization (interview De Graaf, appendix 6.4). At maturity level 2, Managed, volatility is recognized by business management to be a characteristic of data. At maturity level 3, Defined, systems are built with volatility in mind. The measure for Volatility is frequency.

Level 5, Optimizing

Surprisingly, literature on data quality defines no specific data quality attributes operationalizing level 5, Optimizing. At this level, all data quality process areas have already been mastered (20). Therefore, an additional theory is required, extending the reach of data quality into the field of continuous improvement. Six Sigma is such a theory. Six Sigma results in data quality being constantly improved by implementing the DMAIC cycle, controlled by Key Goal Indicators and measured by Key Performance Indicators. The whole data life cycle is observed, leading to 3.4 defects per million opportunities, in accordance with Service Level Agreements (Boer, Andharia, Harteveld, Ho, Musto, & Prickel, 2006). Thus, metrics at this level are process oriented, not strictly data quality oriented. To find metrics for this level, Six Sigma leads the way. According to Six Sigma, it is primarily the spread of errors (the unpredictability of a process) that contributes to costs (Boer, Andharia, Harteveld, Ho, Musto, & Prickel, 2006, p. 36). Therefore, a measure of quality on this level is a statistical one: the standard deviation (sigma, σ). For this, a mean and a variance are set as goals. KPIs are defined by controls like External Critical to Quality (Ext CTQ), Internal Critical to Quality (Int CTQ), Unit, Defects and Opportunities, and Population. This level relates to the over-all data quality dimension, Data Integrity. Data Integrity will approach six sigma (6σ).

(20) It is to be noted however, that both TDQM (Lee, Pipino, Funk, & Wang, 2006) and TIQM (English, 2009) support the Six Sigma DMAIC-style quality improvement cycle.
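As announced under Confidentiality, the two simpler access styles can be made concrete in a few lines. The following Python sketch is purely illustrative; the users, roles and resources are invented and do not describe Educator's actual mechanism:

```python
# Illustrative sketch of the access matrix versus role-based access control
# (RBAC). All identifiers are hypothetical.

# Access matrix: rights are maintained per (user, resource) pair.
access_matrix = {
    ("j.jansen", "grades"): {"read", "write"},
    ("p.devries", "grades"): {"read"},
}

def matrix_allows(user: str, resource: str, action: str) -> bool:
    return action in access_matrix.get((user, resource), set())

# RBAC: rights are granted per role; a new participant only needs the
# correct roles instead of painstakingly assigned individual rights.
role_rights = {
    "teacher": {("grades", "read"), ("grades", "write")},
    "student": {("grades", "read")},
}
user_roles = {"j.jansen": {"teacher"}, "p.devries": {"student"}}

def rbac_allows(user: str, resource: str, action: str) -> bool:
    return any((resource, action) in role_rights.get(role, set())
               for role in user_roles.get(user, set()))

print(matrix_allows("p.devries", "grades", "write"))  # False
print(rbac_allows("j.jansen", "grades", "write"))     # True
```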
5.2.3 WDQM Goals

The assignment of data quality dimensions and attributes to maturity levels results in the definition of goals for the WDQM process areas. Now, for each level, data quality process areas, goals and metrics are available. However, it proved impossible to base every decision on published and well accepted theories. In many cases, similarities in definitions provided all the information available, and sometimes only rigor of reasoning could shed light on where to position a dimension. Therefore, to increase validity, this model was discussed with an external expert (see appendix 6.4). Table 9 presents the goals for each maturity level in the WDQM model.
Level 1, Initial
- Volatility. Practice: data IS volatile, volatility is not yet recognized. Attribute: frequency.

Level 2, Managed
- Accuracy. Practice: data profiling and cleaning. Attribute: ratio between 0 and 1, indicating the number of characters, data elements or database tuples in error as a fraction of the total number of characters, data elements or database tuples.
- Consistency. Practice: data profiling and cleaning. Attribute: ratio between 0 and 1, indicating the number of data elements violating a specific consistency type as a fraction of the total number of data elements.
- Integrity, Referential. Practice: establish referential database constraints. Attribute: ratio between 0 and 1, indicating the number of database tuples violating a specific consistency type as a fraction of the total number of database tuples.
- Specifications. Practice: specifications engineering. Attribute: binary, specifications are either present or absent.
- Uniqueness. Practice: data profiling and establishment of database constraints. Attribute: ratio between 0 and 1, indicating the number of data elements duplicated as a fraction of the total number of data elements in a database or file.
- Volatility. Practice: volatility is recognized as a characteristic of data. Attribute: frequency.

Level 3, Defined
- Accountability. Practice: event history management. Attribute: binary, updates are accounted for or they are not.
- Accuracy. Practice: engineering of robust applications, utilizing various types of input checks. Attribute: see Accuracy, level 2.
- Completeness. Practice: business processes need to be well organized and scheduled, with all sub-processes delivering detailed information right on time. Attribute: ratio between 0 and 1, indicating the number of data elements missing as a fraction of the total number of data elements.
- Confidentiality. Practice: basic patterns, i.e. access matrix authorization. Attribute: Security Service Level.
- Reliability. Practice: level 3, Defined, is to be achieved. Attribute: binary, data is trusted or it is not.
- Volatility. Practice: build systems with volatility in mind. Attribute: frequency.

Level 4, Quantitatively Managed
- Accessibility. Practice: optimize Timeliness (i.e. Currency). Attribute: Accessibility = 1 - (delivery time - input time) / (outdated time - input time).
- Confidentiality. Practice: advanced patterns, i.e. role based access or multilevel security. Attribute: Security Service Level.
- Consistency. Practice: create an Enterprise Architecture. Attribute: ratio between 0 and 1, indicating the number of data elements violating a specific relation type as a fraction of the total number of data elements.
- Currency. Practice: design for straight-through processing; business processes need to be measured and controlled. Attribute: Currency (time) = delivery time - input time + age.
- Timeliness. Practice: Volatility is a fixed parameter, therefore to increase Timeliness, Currency needs to be reduced. Attribute: ratio as a function of volatility and currency, Timeliness = V * C.

Level 5, Optimizing
- Integrity, Data. Practice: instituting DMAIC, SLA, KGI, KPI, data life cycle management. Attribute: External Critical to Quality (Ext CTQ), Internal Critical to Quality (Int CTQ), Unit, Defects and Opportunities, and Population. Data Integrity reaches six sigma.
Table 09: WDQM Goals expressed in Data Quality Dimensions, Practices and Attributes
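The level two goals in table 9 lean heavily on database constraints. As a minimal illustration (table and column names are hypothetical, and sqlite is used only to keep the sketch self-contained), uniqueness and referential integrity can be enforced, and the corresponding error ratio measured, as follows:

```python
# Sketch of level two practices: uniqueness and referential database
# constraints, plus the referential integrity ratio from table 9.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # sqlite enforces FKs only when enabled
conn.execute("CREATE TABLE course (code TEXT PRIMARY KEY)")  # uniqueness
conn.execute("""CREATE TABLE grade (
    student_id  INTEGER,
    course_code TEXT NOT NULL REFERENCES course(code),  -- referential constraint
    grade       REAL)""")

conn.execute("INSERT INTO course VALUES ('ICT101')")
conn.execute("INSERT INTO grade VALUES (1, 'ICT101', 7.5)")
try:
    conn.execute("INSERT INTO grade VALUES (2, 'ICT999', 6.0)")  # unknown course
except sqlite3.IntegrityError as e:
    print("rejected:", e)

# Referential integrity attribute: violating tuples / total tuples (0..1).
violations = conn.execute("""SELECT COUNT(*) FROM grade
    WHERE course_code NOT IN (SELECT code FROM course)""").fetchone()[0]
total = conn.execute("SELECT COUNT(*) FROM grade").fetchone()[0]
print("referential integrity error ratio:", violations / total)  # 0.0
```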
5.2.4 Time-related dimensions

Volatility, Currency, Timeliness and Accessibility describe the interaction of time and data. These dimensions are firmly related; one dimension may actually determine the value of another. Volatility describes the frequency at which data changes in the real world, while Currency describes how promptly data are updated in an information system. Currency is age + delivery time - input time, meaning that data, before it is finally delivered, has been lingering both inside and outside the information system for a certain period. Figure 11 shows two different values for Currency: A and B. Timeliness describes the relation between Volatility and Currency: T = V * C. When Currency is smaller than the Volatility 'wavelength' (A), Timeliness is smaller than one (T < 1) and data reaches stakeholders before it changes in the real world. When Currency is larger than the Volatility wavelength (B), Timeliness is larger than one (T > 1) and data is changed in the real world before stakeholders have access to it; when delivered, such data is no longer current. Whether this is a problem is determined by Accessibility. Accessibility deals with the fact that data needs to be delivered before it becomes insignificant. It is a ratio: Accessibility = 1 - (delivery time - input time) / (outdated time - input time), in which we may well recognize the relation between Accessibility and Currency. Figure 11 shows Accessibility for Currency value A. Note that the outdated time does not necessarily have a relationship with either Volatility or Currency. A small numeric sketch follows figure 11.
[Figure: two timelines relating an update frequency (Volatility) to two Currency values A and B, each composed of age, input time and delivery time, and to the Accessibility window that closes at the outdated time]
Figure 11: Related Dimensions
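The sketch below uses hypothetical times expressed in hours; the figures are invented for illustration and do not stem from the Windesheim case:

```python
# Sketch of the time-related dimensions with hypothetical values (hours).
age           = 2.0     # age of the data when it entered the system
input_time    = 10.0    # moment the data was entered
delivery_time = 16.0    # moment the data was delivered to its user
outdated_time = 34.0    # moment the data loses its significance
volatility    = 1 / 12  # the data changes about once every 12 hours

currency = delivery_time - input_time + age          # 8.0 hours
timeliness = volatility * currency                   # 0.67 < 1: case A
accessibility = 1 - (delivery_time - input_time) / (outdated_time - input_time)
print(currency, round(timeliness, 2), round(accessibility, 2))  # 8.0 0.67 0.75
```

With a Currency of 8 hours against a Volatility 'wavelength' of 12 hours, Timeliness stays below one (case A in figure 11); stretching the delivery time beyond the wavelength would push Timeliness above one (case B).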
5.3 Business rules

This paragraph focuses on the following questions:
- In higher education, what positive and negative correlations between maturity and data quality may be found?
- For this research, what is the relevant set of business rules? How will this set of business rules evolve in time?
- What data quality attributes are relevant for these business rules?

In this paragraph, a view on business rules will be established first. Secondly, the business domain this research focuses on will be defined and scoped. Based on design documents, relevant business rules are identified. Finally, the business rules lead to the selection of relevant data quality dimensions, the variables of which are populated using a workshop.
5.3.1 Business rules, a definition

In order to find and populate the right data quality attributes, the business rules that need to be met have to be defined first. In literature, views on what business rules are differ slightly. Business rules are "a written definition of a business's policies and practices" (Agrawal, Calo, Lee, Lobo, & Verma, 2008) or "... requirements of the business that must be adhered to in order for the business to function properly" (Johnson & Jones, 2008). They encompass "... the controls, processes, mechanisms, and standard operating procedures (SOPs) that need to be followed" (Conway & Conway, 2008). In the view of D. Agrawal et al., business rules are high level descriptions guiding the behavior of an organization. Described at this level, business rules might prove not to be specific enough to obtain corresponding data quality attributes. The more specific notion, that of business rules being requirements of the business (Johnson & Jones, 2008), encompassing controls, processes, mechanisms and operating procedures (Conway & Conway, 2008), seems more fitting. At this level, they are referred to as production rules by D. Agrawal et al. In this research, the operational notion of business rules, as defined by (Johnson & Jones, 2008) (Conway & Conway, 2008), will be used.

To be meaningful, the notation of a business rule is to adhere to certain semantics: "A business rule is a compact, atomic, well-formed, declarative statement about an aspect of a business that can be expressed in terms that can be directly related to the business and its collaborator, using simple, unambiguous language that is accessible to all interested parties: business owner, business analyst, technical architect, customer, and so on. This simple language may include domain-specific jargon" (Graham, 2007). The interesting aspects of this definition are that a business rule is atomic (self-contained), well-formed (written according to specific rules) and declarative (written in a statement style vocabulary). A well-formed business rule is written in a when-then type of construct (Davis, 2009). Business rules are about making decisions, and for good decisions valid information is required, also referred to as facts (Davis, 2009).
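To illustrate the when-then construct, consider the following sketch of a single well-formed rule, rendered as executable Python. The rule and its 60-credit threshold are hypothetical illustrations, not rules taken from the WOS:

```python
# A compact, atomic, declarative when-then business rule (hypothetical).
from dataclasses import dataclass

@dataclass
class Student:
    name: str
    propaedeutic_credits: int  # a fact: valid information the decision needs

def propaedeutic_certificate_due(student: Student) -> bool:
    """When a student has earned 60 propaedeutic credits,
    then the propaedeutic certificate is issued."""
    return student.propaedeutic_credits >= 60

print(propaedeutic_certificate_due(Student("A. Jansen", 60)))  # True
```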
5.3.2 Study management

A brief history

Study management is without a doubt the most important business domain within an educational institution. In response to the emergence of the European Higher Education Area (EHEA) (Vught & Huisman, 2009), Windesheim developed a new view on study management (Broers, 2007). This new view, identified as student centered education, would offer the student more freedom in selecting education of his own choice (Broers, 2007). This, together with the adoption of the European Credit Transfer System, spawned a redesign of the curricula of the various Schools at Windesheim (Broers, 2007). At the basis of this redesign, a new didactical process was designed and standards were put in place, guiding the change process. These standards were described and accepted in the Windesheim Onderwijs Standaards (21) (WOS) (Iersel, Loo, Serail, & Smulders, 2009). In 2006, a domain architecture was designed, guiding the development and implementation of new information technology (Jansen, 2006). The domain architecture incorporated the field of education, i.e. management of the education catalogue, the study process itself (minor selection, study process and assessments), and management of grades (manage study progress), as shown in figure 12 (Jansen, 2006).
[Figure: the domain architecture shows a control process (planning & control cycle), the didactical process (request information, apply, orientate, discuss, select, study, assess, graduate, engage alumni) and supporting processes, such as developing education, managing the education catalogue, scheduling, managing assessments, managing study progress and managing student data]
Figure 12: Domain architecture student centered education Windesheim (Jansen, 2006)
In 2007, the COTS (22) application Educator was selected to support study management, and implementation of this system is currently an ongoing process.

Looking ahead

In study management, potentially interesting experiences are becoming available. After investigating the business rules, two case studies will be performed, collecting experiences from the Windesheim School of Build, Environment and Transport and the Windesheim School of Business and Economics respectively. Based on the importance of this domain to Windesheim, and the availability of
(21) Windesheim Educational Standards
(22) Commercial Off The Shelf
potentially interesting experiences in this field, this research will focus on the domain of study definition, education, assessment and grading, supported by the information system Educator.

5.3.3 Business rule mining

For study management, the WOS (Iersel, Loo, Serail, & Smulders, 2009) identifies a set of business rules in the form of high level descriptions guiding the behavior of an organization (Agrawal, Calo, Lee, Lobo, & Verma, 2008). These rules are presented in appendix 6.12. However, the abstraction level of most of these rules is too high. In order to be able to define a data quality threshold, a translation to more specific requirements (Johnson & Jones, 2008) is needed. This translation is offered by the domain architecture (Jansen, 2006), European rules on higher education (European Commission, 2005) and Educator operating instruction notes. Information on scheduling is provided by (Riet, 2009). Finally, to be useful for further analysis, the business rule notation needs to adhere to the definitions of (Graham, 2007) and (Davis, 2009). In order to identify the business rules more clearly, they are arranged according to the business processes identified in figure 12, the domain architecture for student centered education at Windesheim (Jansen, 2006). These more detailed business rules are documented in appendix 6.13. Now that the relevant business rules in the domain of study management have been identified, the current and required data quality maturity levels of the Educator domain at Windesheim may be defined.
5.4 Current data quality maturity level study management domain

In this paragraph, the following research question is answered: what are the current organizational maturity and current values of data quality attributes?

The current data quality maturity at Windesheim can be established by finding the data quality practices currently invoked and trying to establish a view on the current values of the maturity dimensions. However, it should be noted that it is easier to ascertain whether a practice is in place, which is essentially something that is either done or not, than to figure out what the value of a data quality dimension is, which in most cases requires analysis tools to establish a measurement. Therefore, discussing data quality dimensions was mainly used as a check on completeness, improving research quality and making sure no issue was overlooked. Nevertheless, the values of the data quality attributes populating the data quality dimensions are discussed at the end of this paragraph.

In interviews with stakeholders, current data quality practices and dimension values were discussed. Stakeholders include representatives of operations, functional support and process design, as well as teaching and management staff from within schools. In total, five members of staff have been interviewed. Each interview collected information on experiences and solutions first, and discussed current values of maturity dimensions later. These interviews are documented in the appendices 6.5 through 6.10.
5.4.1 Interview results

It is found that stakeholders do not always agree on the topics presented. Some think rather positively of accessibility, while others point out that some functions of Educator are seemingly unnecessarily complex and over-engineered, hampering accessibility. Also, accountability and confidentiality are regarded fitting by some, while others reveal breaches in security functions, compromising confidentiality. Interesting is the observation on the role-based access mechanism of Educator, which is deemed far too complex by one and perfectly flexible by another.

Issues interviewees agreed upon are that Educator has far too few reporting options enabling data to be monitored, and that people entering data into Educator, creating havoc down the line, should be confronted with and made to solve these problems themselves. However, some also noted that process execution and process control require separate functions. An example is the support offices checking milestones, study plans and course definitions. It is found too that the School of Business and Economics has experienced the most problems, being amongst the first to use Educator, while the School of Build, Environment and Transport, entering the Educator arena later, has learned from these experiences and strengthened its processes first, before deciding to implement Educator.

Surprisingly, even though some improvements on data validity input checks were mentioned, no interviewee believed that input checks could prevent data errors altogether. This is supported by the widespread desire to check data using reports (reactive data quality management), rather than to rely on data being checked before it is stored (proactive data quality management). The School of BET has put procedures in place guarding data quality prior to entering data into Educator. Yet, data may get corrupted unnoticed as a result of software bugs or human error. In these cases, issues are corrected once students complain.

It is commonly believed that the definition of courses is complex. The current product structure (OE and VOE combined with Semester plan and Semester variant plan) is mentioned not to be used as originally intended (the educational process and the course definitions are not in conformity). In at least one case, verification and validation were implemented using manual processes outside Educator. Timeliness and Completeness are noticed to be conflicting dimensions: in at least one case it was mentioned that, in order to satisfy timeliness, completeness of data could be sacrificed. Important milestones driving Timeliness are:
1. Validation and finalizing of course descriptions;
2. Validating and finalizing student activity plans;
3. Grading;
4. Valuating the outcome of the propaedeutics phase;
5. And, in the (near) future, printing diplomas and certificates.

5.4.2 Current Maturity

When we observe the data quality process areas of table 5, the Windesheim Data Quality Maturity model, we may observe that for 1. Structure, 2. Process, 3. Technology, 4. Information and 5. Staff some process areas are available:
1. Level 2 process areas Project based development, Project teams and Ad Hoc problem solving are present;
2. Level 2 process areas Data profiling and cleaning, Source rating, Schema matching and cleaning, Business rule matching and New data acquisition are NOT present;
3. Level 2 process area Data Analysis and Cleaning tools is NOT present; the File Transfer data exchange pattern however IS present;
4. Awareness of the relevance of data quality is present, and information is indeed not trusted;
5. Level 2 process areas Analytical competent, Knowledge of technology, business rules and data sources, and Data modeling knowledge are present.
Additionally, we may recognize data quality process areas from higher levels being discussed as well:
6. Level 3 process area Technical Solution is being discussed;
7. Level 3 process area Data Responsible is being discussed and implemented.

Given the fact that the collection of level two process areas is only partially met, we may conclude that currently, in the Educator domain, the data quality maturity of Windesheim still remains at level one (Initial).

5.4.3 Current data quality dimension attribute values

In this paragraph, table 9 (WDQM Goals expressed in Data Quality Dimensions, Practices and Attributes) is used as a basis for triangulating the current data quality maturity in the study management domain. This evaluation offers another view on current data quality maturity, validating the observations in the previous paragraph. It should be noted that, since none of the process areas Data profiling and cleaning, Source rating, Schema matching and cleaning and Business rule matching are available, exact values could not be assigned to the attributes of the data quality dimensions. However, in some cases the interviews indicated dimensions to be 'in control' while other dimensions needed more attention.

WDQM level five dimensions

That Data Integrity does not come close to six sigma does not come as a surprise. Level five of the WDQM has indeed not been met.

WDQM level four dimensions

One dimension interviewees agreed upon was Volatility. It was commonly believed that the volatility of data in the Educator domain was low. Changes in data occurred once every few weeks, months or even years. And even then, in Educator data does not actually change; in most cases new data is added, extending the information already available. It was found that study information is altered annually, or every half year in some cases. Grades are created quarterly, amounting to about 230.000 grades being registered at Windesheim each study period. Study plans are extended every six months. That said, for the School of Build, Environment and Transport, the current volatility of course definitions was still deemed too high; it seems course information is adapted every few months, while this type of data should be stable for at least three years.

A low-frequency Volatility should be good news for Currency. Currency, however, was reported to have been troublesome, caused by the instability of Educator and the manual part of the process consuming too much time. This last issue was solved by making the stakeholder entering the data responsible for dealing with the consequences of long waiting times. As a result, Currency had been improved.
Currency, Volatility and Timeliness are all related dimensions; therefore, with Volatility being comfortably low and Currency improving, Timeliness of data may be expected to be in control. However, at this moment Timeliness is still mentioned to be problematic. When new education is to be developed, development has to start well in advance of the targeted study period in order to deliver study information in time. For many, this aspect of the educational process planning is perceived as complex, and activities are commonly initiated too late.

Accessibility was little understood during the interviews, yet when Timeliness and Currency are not in control, Accessibility is not in control either. An exception to this rule may be found at the School of Build, Environment and Transport. Here, the business process served by Educator is strictly managed manually, with data being entered into the information system only after elaborate checks. The study process itself is highly standardized, resulting in more clarity for the stakeholders involved. As a result, the School of Build, Environment and Transport reports Accessibility to be in control.

The last dimensions populating level four are Consistency and Confidentiality. As a result of the system design, offering a comprehensive role-based access mechanism, Confidentiality was perceived to be fitting. One interviewee, however, noted that Educator offered some back-door entries, suggesting possible breaches in Confidentiality. Therefore, even while Educator offers level four compliant authorization mechanisms, Confidentiality is in doubt. On Consistency, it seems the situation has grown from bad to better. Educator is said to generate data codes automatically, replacing more and more manual data code definitions, thus improving Consistency. At the School of Build, Consistency was improved by rigid process design. Even so, it has been mentioned that course definitions are not consistently described throughout the system; therefore Consistency too is not met at WDQM level four.

Currently, based on data quality dimension values, data quality has not reached WDQM level four.

WDQM level three dimensions

Accountability is believed to be adequate, be it that in one situation it is found that an audit trail may be omitted. Since this situation is to be considered a manual correction of erroneous datasets, and this situation is recognized to be in decline, Accountability may be regarded fit for the current business rules.

At level three, Accuracy is guarded using application input checks. Currently, this is not the case in most instances. Again, the School of Build, Environment and Transport is less pessimistic, using strict data input procedures. Yet even here it is recognized that there is still room for improvement.

Completeness is regarded to be in control by many. However, the current deadlines in Educator's process implementation (Timeliness) are said to have a negative impact on Completeness.

Confidentiality at this level is implemented by access matrices. Even though Educator offers role-based access, the reported back-door threats may render Confidentiality inadequate.

Reliability is reported as being absent. Many inexplicable data quality issues were mentioned, reducing reliability. It was mentioned that using Educator only once in a while, and inadequate training and documentation, may well be at the source of these doubts. Often teachers make mistakes and blame the system. The absence of basic reporting facilities was mentioned as another cause of the lack of reliability. The School of Build reports to rely on its process design.

Volatility: it is not feasible to assess whether Educator was built with volatility in mind.

With the exception of Completeness, the WDQM level three dimensions have not been met.
WDQM level two dimensions

At level two, Accuracy, Consistency, Referential Integrity and Uniqueness are instated using data profiling and cleaning tools and database referential integrity constraints. The absence of these process areas does not bode well for these dimensions. Accuracy is indeed reported to have been a problem in the past; however, by making the stakeholder entering the data responsible for any problems caused further down the process, Accuracy is said to have improved greatly. Uniqueness too has been reported to be in control. Consistency has been reported to be greatly improved by replacing manual activities with automated procedures. Therefore, new data being entered into Educator may well be more accurate, consistent and unique. However, historical errors are said to still create havoc in data exchange processes. And Referential Integrity is found to be a problem, partly because the Course Catalogue structure is perceived to be complex. Therefore, until the current faults in the database have been corrected, these dimensions are still not met. On a score of 1 to 10, where 1 equals non-existent and 10 equals excellent, Specifications scores a 1.5, or 2 at most. It is safe to say that this dimension is not met. Whether volatility is recognized as a characteristic of data is unknown.

The WDQM level two dimensions have not been met completely, and therefore this level has not been reached.
5.4.4 Conclusion

Since no data quality maturity level was found to have all related dimensions properly instated, the evaluation of data quality dimension attribute values verifies that current data quality in the study management domain is at WDQM level one (Initial). We may recognize some improvement at the school of Build, Environment and Transport, due to a rather strict definition of the Educator business process. It must be noted that the improvements here came at the cost of creating an entirely new, manually managed, information system and management process shielding Educator from calamity. Table 10 offers an overview.

Data Quality Dimension | Level | Past        | Current     | Level met?
Data Integrity         | 5     | > Six Sigma | > Six Sigma | No
Accessibility          | 4     | Problematic | Improved    | No
Confidentiality        | 4     | In doubt    | In doubt    | No
Consistency            | 4     | Bad         | Better      | Yes
Currency               | 4     | Low         | Improved    | No
Timeliness             | 4     | Problematic | Improved    | No
Accountability         | 3     | Adequate    | Adequate    | Yes
Accuracy               | 3     | Problematic | Improved    | No
Completeness           | 3     | In Control  | In Control  | Yes
Confidentiality        | 3     | In doubt    | In doubt    | No
Reliability            | 3     | Absent      | Absent      | No
Accuracy               | 2     | Problematic | Improved    | Yes
Consistency            | 2     | Bad         | Better      | Yes
Referential Integrity  | 2     | Problematic | Problematic | No
Specifications         | 2     | Absent      | Absent      | No
Uniqueness             | 2     | Unknown     | Adequate    | Yes
Volatility             | n/a   | Low         | Low         | n/a

Table 10: Current data quality dimension values
5.5 Required data quality maturity level study management domain

In this paragraph, the following main research question and sub-questions are answered: What values of data quality attributes will define the required data quality threshold and therefore the required maturity structures at Windesheim?
a. To support the business rules identified earlier, what values should data quality attributes have?
b. What level of maturity is required to enable those data quality attribute values?
c. What organizational structure, process, technology, information and staff criteria define the maturity found?

The required data quality maturity level will be identified by analyzing the outcome of the data quality workshop (see appendix 6.11) and confronting this outcome with the initial research problem (see par 2.4): At Windesheim, what defines the border between the control and integration stage? What are positive and negative correlations between structures defining organizational maturity and attributes defining data quality, enabling Windesheim to become a near zero-latency organization?
5.5.1 Workshop results

To assess the data quality required, a workshop was organized, enabling stakeholders from various departments to translate their knowledge of the Educator domain and business rules into requirements23. In this workshop, specialists were requested to assign data quality dimensions to one of the four phases of the study management process, based on the requirements posed by the business rules involved. To create a functional selection process, the data quality dimensions were valued according to their position in the WDQM (see table 9), and the workshop participants were supplied with a limited amount of 'credits'. The underpinning WDQM model, however, was not revealed. For most dimensions, participants had the opportunity to choose between a 'must have' implementation, paying the full price tag for this dimension, or a 'should have' implementation, paying less but gaining a less satisfying situation. The results of this workshop are summarized in table 11. Based on table 10, the last column reveals whether a dimension has already been met at the maturity level specified.
Data Quality Dimension | Overall Level | Overall Requirement | Dim. met?
Accountability         | 3             | Must have           | Yes
Accuracy               | 3             | Must have           | No
Completeness           | 3             | Must have           | Yes
Confidentiality        | 3             | Must have           | No
Consistency            | 4             | Should have         | Yes
Currency               | 4             | Must have           | No
Referential Integrity  | 2             | Should have         | No
Reliability            | 3             | Must have           | No
Specifications         | 2             | Should have         | No
Timeliness             | 4             | Must have           | No

Table 11: Data quality dimension assessment workshop results (overall level and requirement per dimension, aggregated over the four sub-processes Manage catalogue, Create study plan, Study and Manage progress)
23 See appendix 6.11
Table 11 reveals that the study management process is divided into four sub-processes:
1. Manage catalogue, resulting in courses being published;
2. Create study plan, resulting in an updated personal activity plan;
3. Study, resulting in grades being assigned;
4. Manage progress, resulting in students receiving certificates, or study rejection letters.

The Overall columns represent the aggregated score. If, in any sub-process, a dimension is labeled 'must have', this dimension becomes mandatory for the whole domain. The reason for this is that the process is one seamless cycle in which each step and all organizational units play an equal role. It is simply not possible to have one step, assigned to a single unit, be more mature than the others. For some dimensions, participants could choose between implementations at different levels in the WDQM. This is the case for Accuracy, which can be implemented both at level 2 and at level 3. In that case, the highest level required prevails; a sketch of this aggregation follows below.
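The two aggregation rules just described are simple enough to express in a few lines. The sketch below is purely illustrative and not part of the research instrument; the dimension names are real, but the per-sub-process votes shown are hypothetical examples.

# Illustrative sketch: derives the "Overall" columns of table 11 from
# hypothetical per-sub-process workshop labels, applying the two rules above.

workshop = {
    # dimension -> per-sub-process vote: (WDQM level, requirement label)
    "Accuracy": {
        "manage_catalogue": (3, "must have"),
        "study": (2, "should have"),
    },
    "Consistency": {
        "create_study_plan": (4, "should have"),
    },
}

def overall(votes):
    """A 'must have' in any sub-process makes the dimension mandatory for
    the whole domain; when levels differ, the highest level prevails."""
    levels = [level for level, _ in votes.values()]
    labels = [label for _, label in votes.values()]
    label = "must have" if "must have" in labels else "should have"
    return max(levels), label

for dimension, votes in workshop.items():
    level, label = overall(votes)
    print(f"{dimension}: WDQM level {level}, {label}")

Run on these example votes, Accuracy aggregates to level 3, must have, and Consistency to level 4, should have, matching the corresponding rows of table 11.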
5.5.2 Discussion

It is interesting to see that, even though these are 'expensive' dimensions, the workshop results in a massive interest in WDQM level 3 data quality dimensions. All level 3 dimensions (Accountability, Accuracy, Completeness, Confidentiality and Reliability) are labeled as 'must have' requirements.

In the current situation, timing poses many problems. It is therefore no surprise that Currency and Timeliness are mentioned as 'must have' dimensions. The high demand for data being timely and current implies that data should be delivered before it gets updated (Timeliness < 1, see paragraph 5.2.4). Currency describes how promptly data are updated and is a function of the age of the data, its delivery time and its input time: Currency = age + (delivery time - input time). Timeliness is measured as a ratio, indicating the availability for use of the data, and is expressed as a function of Volatility and Currency: Timeliness = Currency / Volatility. Since Volatility is constant, Timeliness is improved by reducing Currency. And Currency is reduced by minimizing the age of data, or the gap between input time and delivery time. Therefore, the result of the workshop can be interpreted as a demand to reduce waiting times, i.e. age and (delivery time - input time). In the interviews, multiple references are made to data being entered into the system well beyond all deadlines. This is not so much a technical issue: the interviews reveal that it is related to the age of data before it is entered into the system. Therefore, actions here should aim at improving waiting times in the manual part of the study management process; the sketch below illustrates the computation.

Having Consistency defined at level four as a 'should have' is a bit surprising. It seems that the Dutch awareness of costs has played a role here, buying a high-level dimension at a fraction of the price. Table 10, paragraph 5.4.4, identified Consistency to be available already. However, the workshop leaves no room for misinterpretation: if Educator is to succeed in fully supporting the study management process, the organization needs to reach WDQM level three (defined), and for some time-related aspects, WDQM level four (quantitatively managed).
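The following minimal sketch works the Currency and Timeliness formulas through with hypothetical timestamps; the dates and the six-month volatility are illustrative assumptions, not measured Educator values.

from datetime import datetime, timedelta

# Currency = age + (delivery time - input time); Timeliness = Currency / Volatility,
# where a ratio below 1 means data is delivered before it expires.
created   = datetime(2010, 2, 1)    # hypothetical: moment the fact became true
entered   = datetime(2010, 2, 8)    # hypothetical: moment it was entered into the system
delivered = datetime(2010, 2, 10)   # hypothetical: moment it reached the consuming system
volatility = timedelta(days=180)    # assumption: study plans are extended every six months

age = entered - created             # waiting time in the manual part of the process
currency = age + (delivered - entered)
timeliness = currency / volatility  # timedelta division yields a plain ratio

print(f"Currency: {currency.days} days, Timeliness: {timeliness:.2f}")
# Here Timeliness is 0.05, well below 1; 'age' is the dominant term, which is
# why reducing manual waiting times is the main lever identified above.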
5.5.3 Initial Research Problem

This research was started as a result of Windesheim experiencing surprising problems while implementing near real-time integration solutions in its quest to become a near zero latency organization. Is an organization operating at level three of the WDQM sufficiently equipped to address this initial problem? Or is the initial research problem solved with a less far-reaching solution: does a simpler solution fit? Or is a WDQM level three implementation still not mature enough: does real-time integration call for an even more robust solution? In paragraph 2.2.1, data quality errors were identified:
- Enrolment of students results in duplicate accounts;
- Painful mistakes, like sending notifications to deceased students;
- Due to database corruption, management reports are rendered useless;
- Sometimes fields contain text strings stating that 'Debbie has to solve this problem';
- Names of students are completely missing, student addresses are incorrect, information is entered in invalid fields;
- Location (room) numbers are missing or contain special, unexpected codes;
- Data is outdated, or is valid in / refers to different time periods between information systems;
- It was found that in at least one instance, lack of data quality caused a class to be scheduled in a staircase.

These errors are mainly faults in Accuracy and Completeness. To solve these issues, Accuracy and Completeness have to be addressed. Completeness is addressed at data quality maturity level three (defined). Accuracy is available at level two (managed) already, albeit in a rather reactive manner, repairing errors once they appear in the database. This is too late, since by then these errors have propagated through the automated interfaces, causing havoc in other applications. This means that a data quality maturity level three (defined) implementation of Accuracy is required. As table 10, paragraph 5.4.4, identifies, this is currently not the case. The Master Data Management model, too, positions the definition of services for data integration at level three, yet requires organizations to reach level four for implementing a Service Oriented Architecture (Loshin, 2008). Addressing the initial research problem calls for Windesheim to organize at WDQM level three indeed, while further growth to level four is required if a fully fledged Service Oriented Architecture is to be developed.
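To make the reactive, level-two character of the current situation concrete, the sketch below shows the kind of after-the-fact check that finds the duplicate-enrolment and missing-name errors listed above. The record layout and the matching rule (same birth date and initials) are assumptions for illustration only, not Educator's or CATS's actual schema.

from collections import defaultdict

# Hypothetical student records; field names are invented for this sketch.
students = [
    {"id": 101, "surname": "Jansen",   "initials": "A.M.", "birthdate": "1990-04-12"},
    {"id": 205, "surname": "de Vries", "initials": "A.M.", "birthdate": "1990-04-12"},
    {"id": 310, "surname": "",         "initials": "K.",   "birthdate": "1991-01-30"},
]

# Reactive duplicate check: records sharing birth date and initials are flagged
# for manual review (catching maiden-name versus spouse-name double entries).
candidates = defaultdict(list)
for s in students:
    candidates[(s["birthdate"], s["initials"])].append(s)

for key, group in candidates.items():
    if len(group) > 1:
        ids = [s["id"] for s in group]
        print(f"possible duplicate accounts {ids}: same birth date/initials {key}")

# Reactive completeness check: surnames must not be empty.
for s in students:
    if not s["surname"].strip():
        print(f"student {s['id']}: surname missing")

The point of the sketch is precisely its weakness: such checks run after the data is stored, when the errors may already have propagated through the interfaces, which is why a level three, input-side implementation of Accuracy is required.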
5.5.4 A data quality maturity level three (Defined) organization

An organization acting at data quality maturity level three (Defined):
- Has a business-wide process view instead of a localized departmental view;
- Conducts effective programme management;
- Develops systems based on formal requirements;
- Has an integrated view on its products and the (quality of its) product development cycle;
- Has an integrated view on its corporate data;
- Has learned to identify and address root causes when problems emerge;
- Has data quality pro-actively guarded by technical barriers (input checks);
- Implements new functionality after rigorous testing and acceptance in separate environments;
- Connects systems using available application interfaces;
- Is provided with data which is fit for use;
- Monitors data quality using a canonical data model;
- Is serviced by staff having deep domain knowledge and being responsible for data quality.
5.5.5 Level 4 (quantitatively managed) requirements

To satisfy both the research goal and the study management process, Windesheim does not have to fully implement data quality maturity level four (quantitatively managed). However, Currency and Timeliness were mentioned as required data quality dimensions. In paragraph 5.5.2, age is discussed as influencing Timeliness and Currency. Addressing age causes an organization to enter the realm of data quality maturity level four (quantitatively managed).
There may be more compelling reasons to implement data quality at this level. When interviewed, de Graaf (see appendix 6.4) made a strong case for organizations to try and reach data quality maturity level four: "Especially beyond level three, data quality becomes a matter of special interest to organizations, opening up a whole new realm of possibilities. What we can see beyond level three in practice today are cloud computing for data quality initiatives, new business generated and successful one-on-one business models based on reliable data" (interview de Graaf, appendix 6.4). Data simply becomes more valuable once an organization manages to reach data quality maturity level four (quantitatively managed).
5.6 Growing from current to required maturity

Now that data quality, organizational maturity and the relation between the two are understood, an instrument based on this relation has been developed, and current and required maturity levels have been identified, the final research question can be addressed in this paragraph: What is the gap between current maturity structures & data quality threshold and required maturity structures & data quality threshold in the light of enabling Windesheim to become a near zero latency organization?
a. What is the gap between the current and required organizational structure, process, technology, information and staff criteria?
b. What conclusions and recommendations may be derived from this gap?
5.6.1 Gap analysis

Paragraph 5.4, Current data quality maturity level study management domain, determined the current data quality maturity level to be one (Initial), while paragraph 5.5, Required data quality maturity level study management domain, identified the required data quality maturity level to be three (Defined). This level of data quality maturity is required both to operate the Educator business domain and to be able to deploy near real-time system integration technologies successfully. In the next paragraphs, the process areas in the field of structure, process, technology, information and staff identified to be missing are discussed. This discussion is based on table 5, paragraph 5.1.4.

Structure

At level two, an organization is expected to have implemented a structured, project-based development approach. In this research, Windesheim's project management capabilities have not been evaluated. It is unknown, therefore, to what extent Windesheim has mastered project-based development. Properly assessing an organization's project management capabilities requires a separate research project, for which resources were unavailable in this project. At data quality maturity level three, projects are to be managed in relation to each other, as a programme, defined by a portfolio of projects serving a common goal. In this research, Windesheim's programme management capabilities have not been evaluated either; it is unknown to what extent Windesheim has mastered programme-based change. Again, properly assessing an organization's programme management capabilities requires a separate research project, for which resources were unavailable.
Process

At data quality maturity level two (managed), data quality is repaired reactively by implementing data profiling and cleaning activities, source rating, schema matching and cleaning, business rule matching and new data acquisition, resulting in improved accuracy, consistency and referential integrity and in up-to-date specifications. Referential Integrity and Specifications in particular are dimensions mentioned to be problematic and absent respectively; therefore these process areas require attention.

Requirements development, Product Integration, Verification and Validation are all level three related process areas, aimed at constructing a product from different components and assuring that the product complies with requirements (Ahern, Clouse, & Turner, 2008). In at least one case, verification and validation was implemented using manual processes outside Educator. Currently, it is found that the Educator process is perceived to be complex. Some structures are not used as originally intended (the current educational process and the original requirements are not in conformity). This may point towards a change in business rules. The required data quality threshold is related to business rule demands; therefore, when business rules change, data quality dimensions may well change with them. Even though workshop attendees formally agreed upon the business rules presented, during interviews the digital course catalogue was identified as an area where, in practice, these business rules may well have been redesigned.

Timeliness and Completeness are found to be conflicting. Timeliness is expressed to be required; however, Educator requires the description of a course to be complete before it is accepted. Course information is not always complete that early in the process. In many cases, information on assessments is finalized later, seemingly conflicting with Timeliness. However, scheduling and selection of courses might not have a strong relation with the final modes of assessing students' capabilities, perhaps leaving room for entering assessment information later.

Technology

To monitor data quality, data quality analysis tools are required. These tools are not available. During interviews, the absence of insight into data quality was mentioned as one of the main obstacles towards improvement. Whether the current ROTAP environments are sufficient has not been a subject of this research. Properly assessing an organization's research, development, test, acceptance and production environments and strategy requires a separate research project, for which resources were unavailable in this project.

Information

Currently, information is known to be not trusted. This is not likely to change until WDQM level three (defined) is reached. The main Key Performance Indicator of a data quality improvement programme may be that, in the end, Educator data has become fit for use and is therefore trusted.

Staff

At data quality maturity level two, staff is analytically competent, has knowledge of technology, business rules and data sources, and has data modeling knowledge. There have been no indications that these competences were missing; it is assumed, therefore, that these competences are currently present at Windesheim. At level three, staff is responsible for data quality, extending the view from a single process step to the end-to-end business process. During interviews, it was mentioned that the process in the Educator domain was perceived as complex and difficult to oversee. Activities, like entering course definitions, have to be planned well ahead of execution, enabling both scheduling and the selection of minors by students. Teachers were unaware of deadlines or unable to finalize education this early in the process.
This discussion may signal a problem. When crossing the barrier between having a local view on matters and having a more holistic, business process wide view, the technological discontinuity presents itself (paragraph 2.3.4) (Zee, 2001). This discontinuity is experienced as a setback. New structures replace trusted old ones and are, for the time being, (perceived as) not as good as the ones being replaced. A discussion on losing the perceived freedom of changing educational definitions up to the last moment may well be one of these new-versus-old structure discussions. The fact that these new structures are required for coping with future challenges may not be recognized by all. The effects of the technological discontinuity may, strictly speaking, not be part of the WDQM model, yet when not taken into account, they may well prevent an organization from reaching data quality maturity level three (defined).
5.6.2 Migration

In the field of organizational maturity, as in real life, no organization can skip levels. This means that in the Educator domain, Windesheim has to master level two and level three data quality maturity process areas successively, as defined by table 5, paragraph 5.1.4. The recommendations presented here aim to enable Windesheim to bridge the gap between the current and the required situation, building on the best practices identified, strengthening and accelerating the change process. The transition is defined as a two-staged process. First, data quality maturity level two (Managed) is to be implemented. Once this level is established, the second step can be initiated, moving Windesheim from data quality maturity level two (Managed) to data quality maturity level three (Defined).

Data Quality Level two (Managed)

Level two is characterized by project-based structures, reactive data cleansing processes and technologies, and staff having local knowledge of business rules, data sources and data modeling.

Structure and Process

It is recommended to evaluate and (re)confirm Windesheim's project management capabilities and strategies, and to initiate an Educator data quality improvement project. In this project:
- Extending appendix 6.13, business rules are re-established and described in great detail, identifying areas in which business rules have changed over time. An area of concern is the digital course catalogue;
- Using these detailed business rule descriptions, the Educator database is profiled (a sketch of such rule-based profiling follows below);
- Data sources are rated and new data may be acquired;
- The database is cleaned, i.e. data not matching the established business rules is repaired;
- Up-to-date data Specifications are written.

These actions will establish Referential Integrity and Specifications, and will improve the data quality maturity level two dimensions Accuracy, Consistency and Uniqueness.
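As announced in the list above, the sketch below illustrates what profiling a database against explicit business rules could look like. It is a minimal sketch under stated assumptions: the rules and column names are invented for illustration, and the real set would come from the detailed business rule descriptions of appendix 6.13.

# Hypothetical business rules expressed as executable profiling checks.
rules = {
    "grade_in_range":   lambda r: r["grade"] is None or 1.0 <= r["grade"] <= 10.0,
    "course_exists":    lambda r: r["course_id"] in known_courses,  # referential integrity
    "period_not_blank": lambda r: bool(r["study_period"]),
}

known_courses = {"ICT101", "ICT205"}   # assumed extract of the course catalogue
grade_records = [
    {"student_id": 101, "course_id": "ICT101", "grade": 8.5,  "study_period": "2009-2"},
    {"student_id": 205, "course_id": "XXX999", "grade": 11.0, "study_period": ""},
]

# Profile every record against every rule and report the violations found;
# the violation list is exactly the input for the cleaning step that follows.
for record in grade_records:
    violations = [name for name, check in rules.items() if not check(record)]
    if violations:
        print(f"record {record['student_id']}/{record['course_id']}: violates {violations}")

Because the rules are data rather than hand-written report logic, the same rule set can later drive the data quality analysis reports recommended under Technology.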
Technology

It is recommended to have data quality analysis reports developed, based on the business rules established in the previous step. This will improve Referential Integrity, Specifications, Accuracy, Consistency and Uniqueness.

Information

At this level, information is likely to remain not trusted. The WDQM level two (Managed) data quality dimensions Accuracy, Consistency, Referential Integrity, Specifications and Uniqueness should all be satisfied now, be it that this may still be in a reactive manner, using reports and data quality cleaning tools for repairs.

Data Quality Level three (Defined)

Once level two has been established, a formal transition to level three can be initiated. In this transition, the focus shifts from a local view to a more holistic, process-wide view.

Structure

It is recommended to evaluate and (re)confirm Windesheim's programme management capabilities and strategies, and to define a lasting Educator rollout programme, aimed at supporting Educator at Windesheim. It may be noted that programme management is monitored by Key Business Indicators (Ahern, Clouse, & Turner, 2008). This programme is to be guided by valid Windesheim Key Business Indicators, enabling Windesheim management to be in control of the ongoing programme. Key Business Indicators may be found by observing the study management baselines, i.e. Catalogue Management, Study Planning, Study & Grading and Progress Management, and may express the numbers of data errors in the catalogue, study plans, grade assignments, rejection letters and certificates. Furthermore, the programme includes all recommendations mentioned below.

Process

It is recommended that, where experience gave rise to new insights and changed business rules, new Educator requirements are developed and the functionality of Educator is changed accordingly. Areas of concern are the digital course catalogue and the way data is shielded from unauthorized access; during interviews, the ability to access data via back-door entries was mentioned. This action will establish maturity level three Confidentiality.

It is recommended that, when data quality related issues arise, a formal root cause analysis is initiated. This root cause analysis will identify the source of the data quality issues at hand. It is recommended to implement formal requirements development. Based on the root cause identified, new requirements will be developed and prioritized. These requirements present a foundation for system adaptations and further development.

It is recommended to improve Educator's support of verification and validation. Examples are input checks and referential integrity checks, as well as improved process support, reminding lecturers of upcoming timeframes and baselines. The verification and validation are to be based upon the business rules described earlier; a sketch follows below. The actions described above will establish maturity level three Accuracy and Reliability.
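The sketch below illustrates the pro-active, level three counterpart of the reactive profiling shown earlier: the same kind of business rules, but now enforced at the input function, so that bad data is rejected before it is stored. The function name, parameters and rules are illustrative assumptions, not Educator's actual interface.

class ValidationError(Exception):
    pass

def enter_grade(student_id, course_id, grade, known_courses, deadline_passed):
    """Accept a grade only if it satisfies the business rules up front.
    All names and rules here are hypothetical, for illustration only."""
    if course_id not in known_courses:
        raise ValidationError(f"unknown course {course_id!r} (referential integrity)")
    if not 1.0 <= grade <= 10.0:
        raise ValidationError(f"grade {grade} out of range 1.0-10.0 (accuracy)")
    if deadline_passed:
        raise ValidationError("grading deadline passed (timeliness)")
    return {"student_id": student_id, "course_id": course_id, "grade": grade}

try:
    enter_grade(101, "XXX999", 11.0, known_courses={"ICT101"}, deadline_passed=False)
except ValidationError as err:
    print(f"rejected at input: {err}")   # the error never reaches the database

The design point is the shift in where quality is guarded: instead of reports finding the error after it has propagated through the interfaces, the input function refuses it, which is what distinguishes level three Accuracy from its level two form.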
Technology

It is recommended to have Educator accept changes in course information up to the moment grades are actually assigned, thus reducing the conflict between Completeness and Timeliness of data. Data objects that must not change after a course has been selected by students are to be made mandatory during course definition; other data objects may well be made optional. It is recommended to have the current ROTAP environments and practices evaluated and formalized. It is recommended to continue data integration using near real-time system interfaces. Starting a Service Oriented Architecture altogether, however, would require a further growth in data quality maturity. These actions will support the implementation of the maturity level four dimensions Currency and Timeliness.

Information

At level three, information should have become fit for use. The WDQM level three (defined) data quality dimensions Accountability, Accuracy, Completeness, Confidentiality and Reliability should be satisfied. The presence of Reliability in particular signals the success of the data quality programme. It is recommended to develop a canonical data model, supporting a corporate-wide view on the data being exchanged between systems (a brief sketch is given at the end of this paragraph). This action will improve Uniqueness, Referential Integrity and Specifications.

Staff

When implementing data quality maturity level three, the focus shifts from a localized, departmental view to an integrated, Windesheim-wide view, and care has to be taken to overcome the technological discontinuity (Zee, 2001) (paragraph 2.3.4, Growing Pains). Interviews revealed that good results were gained from making personnel responsible for data quality throughout the data life cycle. It is therefore recommended to:
- For the variable, student-centered study period, make lecturers entering course information responsible for assigning semester (variant) plans to students' personal activity planning, as opposed to having these activities assigned to support offices;
- Make lecturers entering course information responsible for solving conflicts when course definition and course execution differ.

These practices help in making the process transparent to both the lecturer and the student. They may also help in overcoming the technological discontinuity. Staff may need to be trained in the field of root cause analysis and requirements analysis.

One issue specifically targeted is the technological discontinuity. The programme (see: Structure) should give special attention to communicating the reasons for the change, have stakeholders participate, and recognize the difficulties associated with it. It should be recognized that education is a very diverse environment, with personnel ranging from hard-core technically competent to deeply social and artistically engaged. A programme that merely rolls out a new system is a recipe for disaster. A short exploration of this specific element will reveal the multi-colored nature of Windesheim. In 'Leren veranderen'24, Caluwé and Vermaak (2006) group people into different modes of thinking, labeled by colors (Yellow for power-based, Blue for process-based, Red for relation-based, Green for learning-based and White for freedom-based thinking).
24 Learning to adapt
For each color, people appreciate and respond to change differently, need different guidance and require a specific approach. Now, what colors define Windesheim? Let's allow ourselves a little freedom of thinking in exploring the situation.

It is interesting to see that education in itself very much used to be a Blue process. A student defined a goal ('I want to become a plumber'), chose an institution and suddenly found himself part of a fixed process, in which he would transform from novice to educated plumber in a given period. Today, this process is (at least partially) transformed from Blue into Green, in which students are encouraged to evaluate and re-set their own goals, make their own choices along the way, and are offered tailored learning situations and intensified personal coaching. We may recognize this practice by the existence of the activity plan in which semester plans are selected. Schools themselves may be described as White, in which teachers are professionals, only needing space (and the removal of obstacles) to prosper and follow their calling to create and transfer knowledge. Then there are the supporting departments, like Finance, Personnel, Facility Management and IT. These departments are very Yellow by nature: rules are defined by which personnel is hired, assessed and rewarded, financial rules for accountancy are strictly observed, and instrumentation for education is highly standardized. IT, rather than Yellow, is more a Blue type of organization, trying to organize work in terms of fixed and predictable processes. Management is a main stakeholder too. Management of supporting departments tends to adopt a Yellow point of view, keeping control over their business, while management of education is more Green in nature, enabling their teachers to learn, reflect and grow. Management, therefore, is a rather multi-colored stakeholder.

This short exploration of the field of change reveals that at Windesheim we may discover at least Blue, Green, White and Yellow oriented stakeholders: a clear indication that implementing change using only one (perhaps Blue-oriented) technology-driven approach will NOT succeed in overcoming the technological discontinuity. It is therefore recommended to implement a broad programme to support change, including as many stakeholders as possible.

Data Quality Level four (Quantitatively Managed)

In order to improve Timeliness and Currency even further, it is recommended to communicate the educational process baselines and associated deadlines as clearly as possible, for example by means of posters and leaflets, distribution of up-to-date Educator manuals and training on the job. Finally, it is recommended, once level three maturity is reached, not to stop, but to start a discussion on extending data quality initiatives into WDQM level four.
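Referring back to the canonical data model recommended above under Information, the following minimal sketch shows the idea: one shared, system-neutral definition of an exchanged record onto which each system maps its local representation. It assumes, purely for illustration, that grades are the data exchanged; the field and function names are hypothetical, not an actual Windesheim design.

from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class CanonicalGrade:
    """Hypothetical canonical form of a grade exchanged between systems."""
    student_number: str      # institution-wide student identifier
    course_code: str         # code from the digital course catalogue
    study_period: str        # e.g. "2009-2"; fixed once the course is selected
    grade: float             # 1.0 - 10.0
    assessed_on: date

def from_educator(row):
    """Assumed mapping from an Educator-local record to the canonical form."""
    return CanonicalGrade(
        student_number=row["stud_nr"],
        course_code=row["course"],
        study_period=row["period"],
        grade=float(row["result"]),
        assessed_on=date.fromisoformat(row["date"]),
    )

message = from_educator({"stud_nr": "s101", "course": "ICT101",
                         "period": "2009-2", "result": "8.5", "date": "2010-01-15"})
print(message)

Because every interface and every data quality report speaks this one shared language, adding a new system means writing one mapping rather than one interface per system pair, which is what makes the canonical model support Uniqueness, Referential Integrity and Specifications.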
5.7 Concluding

5.7.1 Conclusion

In this research, it is found that data quality is related to organizational maturity. This relation is defined in the Windesheim Data Quality Maturity (WDQM) model. The model defines five separate maturity levels, each defined by the presence of specific process areas (best practices) and resulting in data having specific quality dimensions (characteristics). Levels range from Initial, through Managed, Defined and Quantitatively Managed, to, finally, Optimizing.
Interviews with specialists within Windesheim have revealed that currently, in the field of data quality maturity, Windesheim is still at maturity level one (initial). This is indicated by the incidents found to be occurring in the Educator domain, the way these incidents are being dealt with and the tools available to monitor and correct data quality faults. This indication is then confirmed by observing the process areas and data quality dimensions implemented. In order to execute the educational process as supported by Educator in a reliable and efficient manner, and to be able to implement near real-time, message-based application interfaces, mastering data quality maturity level three (defined) is required, while getting data through the process in time requires some level four process areas to be implemented. It is to be noted that the business value of data increases dramatically once an organization succeeds in implementing data quality maturity level four (quantitatively managed).
5.7.2 Recommendations

A two-phased approach is recommended, implementing data quality maturity level two first, and level three later. Data quality maturity level two (managed) is reached by solving the immediate data quality problems. This is done by starting an Educator data quality improvement project. This project will describe the study management business rules in great detail, investigate Educator database quality using these business rules and repair the errors found. At the end of the project, the data managed by Educator is documented (creating insight) and reports are present, enabling operations to compare actual data quality with the required data quality as documented.

Data quality maturity level three (defined) is reached by creating a holistic view. This means that change is managed as a coherent programme rather than in the form of multiple isolated projects, and that when problems arise a formal root cause analysis is performed first, resulting in requirements being designed and results being tested against these requirements. With respect to the study management process, the process as a whole is observed. This view may trigger new requirements and changes in the current implementation of Educator, examples of which may be the assignment of process responsibilities to teachers, organizing for more flexibility in the process, adding process schedule support and simplifying structures.

At data quality maturity level two, data quality is guarded in a rather reactive manner, using reports, analysis tools and repair tools to correct issues. At data quality maturity level three, data quality is guarded in a pro-active manner, using input checks and integrity checks to guard quality before data is stored. Extending the documentation created at level two, at data quality level three the data being communicated between systems is described using a canonical data model.

Once level three is mastered, reaching for a full implementation of data quality maturity level four (quantitatively managed) is not required. However, timing issues demand some level four process areas to be implemented and, again, it is noted that once arrived at level four, organizations start to yield great benefits from their data. It is therefore additionally recommended to communicate the educational process baselines and associated deadlines as clearly as possible, and, once level three maturity is reached, not to stop, but to extend data quality initiatives into data quality maturity level four.

It is most important that, in crossing the border between having a local departmental view and having a Windesheim-wide view, it is recognized that the organization may experience a crisis, known as the technological discontinuity (see paragraph 2.3.4, Growing Pains). Care has to be taken to overcome this discontinuity. Activities here are people-oriented: give special attention to communicating the
reasons for the change, recognize the difficulties associated with it, involve stakeholders and respect the different concerns each group of stakeholders has, and communicate process milestones using bi-annual calendars in poster format.

Issues not addressed in this research are the current status of project management, programme management and ROTAP environments at Windesheim. Investigating these issues properly requires separate research projects, for which time and resources were not available during this graduation project.
5.7.3 Stakeholder Value

At the start of the project, three groups of stakeholders were identified (paragraph 2.5). Committed stakeholders are the CIO, the Information Manager and Science. The scientific value of this project is discussed in the next paragraph. For the CIO, this research has provided an instrument guiding project portfolio management, linking the change required with business objectives. The trigger and initial problem for this research were the difficulties experienced whilst trying to become a near zero latency organization. During the research it became clear that the actual business benefit reaches much further than that. The instrument enables the CIO to fine-tune investments in improving the study management process. For the Information Manager, the instrument acts as a guide in reducing errors in data processing, increasing efficiency, managing responsibilities, improving business intelligence and saving costs by reducing rework.

Involved stakeholders are Management, Operations, Functional Support, System Integration and the Security Manager. At the end of the process, management will be provided with reliable data, supporting business process management. Operations and Functional Support will spend less time on rework and error correction. System Integration will be able to produce reliable and stable near real-time integration services. The Security Manager will notice an increased integrity and availability of data.

Affected stakeholders are the Board, Staff and Students. The Board will notice improved image, student satisfaction and process efficiency. Students will notice prompt responses when assessments are graded, reduced complexity and a reduction in errors. Staff will experience simplified administrative tasks, a clear-cut study management process and direct communication with students.
5.7.4 Achieved Reliability and Validity

Many theories on maturity and quality were discussed and balanced. The results were checked by a survey amongst specialists, chosen on the basis of their experience with data quality and organizational maturity. Population of the quality attribute values was performed in a workshop involving Windesheim specialists, enabling them to reflect on the process and results. These Windesheim specialists were chosen based on their experience with and involvement in data management in study management. Care was taken to involve participants from a department known to have had trouble ensuring data quality and from a department known for successfully solving data quality issues. In many cases, triangulation was used to cross-check results. This was done by comparing multiple aspects of an outcome, or by evaluating an observation using multiple theories. Examples of this are the observation of both process areas and maturity dimensions to ascertain the current maturity, the confrontation of the WDQM model with multiple theories, and the explicit validation of the business rules found during the interviews and the workshop. Building on multiple accepted sources, reflection on the results acquired and open discussion ensured internal validity, while applying the grounded theory approach ensured external validity.
During the interviews, the experts involved in this research agreed on the business rules and the model as presented, with only a few modifications to be made. This may, in fact, make one a bit wary: where is the discussion? The WDQM model has been crafted and used once, and one may wonder whether this is enough proof of its qualities. Whether it is balanced enough and incorporates the right process areas and goals should be ascertained in multiple assignments and open discussion, which calls for a whole new research project.
5.7.5 Scientific Value and Innovativeness

With this research, based on recent theories on data quality, an up-to-date instrument has been created and used, pinpointing the data quality dimensions required to satisfy the business process's needs, and translating these data quality requirements into organizational measures to be taken. The instrument is based upon well-established general theories on production quality improvement and specific theories on data quality, and combines these theories into one framework.
5.7.6 Generalisation

The instrument developed has been successfully used at Windesheim, yet it is not bound to the study management domain, or even to one type of organization. Even though the business rules require the study process to be implemented at WDQM level three, solving the initial business problem requires a WDQM level three implementation too, and this alone includes ALL Windesheim processes. The instrument is 'solution independent', since it is based on open and well-established models and theories, and can thus be used in other domains within Windesheim, in other higher education institutions, in organizations in other branches and even in other nations. The theories behind it apply to all organizations, as long as these organizations rely on data being processed.
5.7.7 Research Questions Answered

Main Q1: Observing theories on maturity and data quality, and external benchmarks, what positive and negative correlations between structures defining maturity and data quality attributes may be found?

1. What structures define maturity?

a. What levels of maturity exist?
Five levels of maturity have been defined, ranging from Initial through Managed, Defined and Quantitatively Managed to Optimizing.

b. What maturity structures in the field of organizational structure, process, technology, information and staff describe each level?
The maturity structures in the field of organizational structure, process, technology, information and staff are documented in table 5, paragraph 5.1.4.

2. In higher education, what positive and negative correlations between maturity and data quality may be found?

a. For this research, what is the relevant set of business rules?
The relevant set of business rules is documented in appendix 6.13.

b. How will this set of business rules evolve in time?
It is found that some parts of the educational process are perceived to be complex. A reduction of complexity may well be in order. An area mentioned is the digital course catalogue.

c. What data quality attributes are relevant for these business rules?
The data quality attributes relevant to the set of business rules are Accountability, Accuracy, Completeness, Confidentiality, Consistency, Currency, Referential Integrity, Reliability, Specifications and Timeliness.

d. What values of data quality attributes correlate with each level of maturity?
The relation between data quality attribute values and maturity levels is documented in table 9, paragraph 5.2.2.

e. What do process quality theories describe about positive correlations between quality and maturity?
In most cases, process quality theories are derived from CMMI and therefore tend to describe a common picture: structured initiatives prevail over individualistic initiatives, holistic initiatives prevail over structured initiatives, and repetitive processes including feedback loops prevail over holistic initiatives. An exception has been found in the Data Quality Management Maturity Model (paragraph 5.1.5), which adds a higher abstraction level to data management at each successive maturity level.

f. What do process quality theories describe about negative correlations between quality and maturity?
In literature, this item is not addressed explicitly.

g. Are those observations consistent?
This question has become redundant.

Main Q2: What values of data quality attributes will define the required data quality threshold and therefore the required maturity structures at Windesheim?

1. To support the business rules identified earlier, what values should data quality attributes have?
Required Accuracy is to be pro-active and Confidentiality is required to be basic. Furthermore, data should be Current and Timely.

2. What level of maturity is required to enable those data quality attribute values?
Windesheim is required to implement data quality maturity level three (defined) completely, and, for Timeliness and Currency, some level four (quantitatively managed) process areas.

3. What organizational structure, process, technology, information and staff criteria define the maturity found?
The minimal list of structure, process, technology, information and staff criteria defining the maturity found is given by the process areas of maturity levels two and three of table 5, paragraph 5.1.4.

Main Q3: What are the current organizational maturity and current values of data quality attributes?
The current organizational maturity and values of data quality attributes correspond with data quality maturity level one (initial).
Main Q4 (Central research question): What is the gap between current maturity structures & data quality threshold and required maturity structures & data quality threshold in the light of enabling Windesheim to become a near zero latency organization?

1. What is the gap between the current and required organizational structure, process, technology, information and staff criteria?
This gap is documented in paragraph 5.6.1.

2. What conclusions and recommendations may be derived from this gap?
Detailed conclusions and recommendations are documented in paragraph 5.6.2, which may be summarized as follows. In order to execute the educational process as supported by Educator in a reliable and efficient manner, and to be able to implement near real-time, message-based application interfaces, mastering data quality maturity level three (defined) is required, while getting data through the process in time requires some level four process areas to be implemented. A two-phased approach is recommended, implementing data quality maturity level two first, and level three later. At data quality maturity level two, data quality is guarded in a rather reactive manner, using reports, analysis and repair tools to correct data quality issues. At data quality maturity level three, data quality is guarded in a pro-active manner, using input checks and integrity checks to guard quality before data is stored. Extending the documentation created at level two, at data quality level three the data being communicated between systems is described using a canonical data model. Reaching for a full implementation of data quality maturity level four (quantitatively managed) is not required. However, to resolve timing issues and increase benefits, it is additionally recommended to extend data quality initiatives into WDQM level four (quantitatively managed). It is most important that care is taken to overcome the technological discontinuity by involving all stakeholders in the migration process.
5.7.8 Recommendation on further research

In this research, three elements have remained virtually untouched:
- The current and required status of project management;
- The current and required status of programme management;
- The current and required status of ROTAP environment management.

Addressing these issues is imperative for reaching WDQM levels two and three. It is recommended, therefore, to investigate the current and required status of these process areas and to advise on a migration strategy where applicable. Now that the route towards a fitting level of data quality has been designed, this route will have to be travelled. An intervention-based research project may be started, evaluating the progress of the growth in maturity, looking for problems during implementation and delivering advice on solving these problems.
5.7.9 Reflection

At the start of this graduation project, I set three goals as a target. The first goal was to deliver 'value for money': to achieve a result that would justify the investment my employer has made in my education. The second goal, equally (or perhaps even more) important to me, was to reach the end of the project with good (not just satisfactory) results. And the third goal was to make a difference: to learn and add new knowledge to the IT profession. During the execution of the project, I have witnessed reactions and gained insights to contemplate. Let's start with the last goal and work our way back to the first.

I have discussed the WDQM with experts on data quality and maturity. On two occasions, interest in the WDQM as an instrument was raised, and I was invited to publish my experiences in the form of an article in the future. In one instance, I was even invited to join in writing a book on this matter. Therefore, I feel confident that I have stumbled upon something interesting here. The box on goal number three is ticked.

Reaching the end of this graduation with good results is a more difficult goal to predict. I feel confident that the results will be satisfactory, but are they going to be good? There is a lot of uncertainty here at this moment. However, what I do know is that I have done my absolute best. I have enjoyed this graduation project and could simply not have done things any better than this. To me, the diagnosis and research part of this project is covered to the very best of my abilities. Therefore, I consider this box to be ticked too.

Yet on the first goal it was more difficult to get a grip on the matter. It was quite uncertain whether the answers to the questions asked would produce results specific enough to deliver usable advice. And in the end, the subject proved to be very comprehensive. A migration in data quality maturity involves many aspects, some of which could only be addressed briefly with the resources available for this research. Indeed, issues like project management, programme management and ROTAP management require a research project of their own, and deserve more attention than they have received right now.

What would I do differently next time? In this project, the development of an instrument to measure data quality maturity was required before the business problem could be analyzed and advice given on solving it. In fact, we may well have executed two research projects here: a design-oriented research project resulting in an instrument, and a diagnostic research project resulting in advice. The absence of a detailed and up-to-date instrument forced this research to create one, and the need to supply valuable advice required the instrument to be applied. There was no escaping combining the two types of research. It may well be that in the second part of the project I could have involved the current organization more, exploring more deeply areas that, right now, may only have been touched upon at the surface.
6. Appendices

6.1 Interview Report Windesheim Integration Team

Attendees:
Tonny Butterhoff, System Integration
Gerben Meuleman, System Integration
Gerben de Wolf, System Integration
Albert Paans, Information Management
Alex Geerts, IT Front office
Windesheim, 11/11/2009

What are the responsibilities of the system integration team?

Currently, the system integration team (formerly known as KOAVRA: Koppelen onder architectuur voor Vraagsturing25) is connecting systems at Windesheim in a service oriented architecture. First of all, the process being supported by real-time coupling is the HRM process. When a new employee is hired, account information is sent in real time, granting the new employee immediate access to information systems. A similar process, aimed at propagating student account information, is currently being tested and is planned to be accepted for production shortly. Finally, real-time, service-based information exchange processes concerning study information and supporting study processes are being built.

What technologies are being utilized?

The systems being integrated are all standard packages: the CATS student information system, Oracle HRM, Planon Facility Management, Decos document management, the Educator learning environment and the Blackboard learning environment. The Enterprise Service Bus is delivered by Cordys. For some systems, building a service interface layer was quite simple. Decos, for instance, is an up-to-date system supporting the use of web services. Planon and Oracle HRM are at the other end of the scale, offering no support for web services at all. For these packages, an interface utilizing database injection code had to be developed. In the near future, however, Planon at least promises to offer a more modern solution.

What issues in relation to data quality are found?
25 Coupling Under Architecture for student-centric education
Data quality related issues are commonly found. An example of a recurring problem is that female students enroll themselves multiple times, using either their maiden name or the family name of their spouse by mistake. Other well known and regular issues are cases in which the name of a student is completely missing, student addresses are incorrect, information is entered in wrong fields. In the facility management system, new personnel is assigned to a room using four zero‟s as a room identification. When new personnel arrives, manual processes ensure the assignment of the correct room number. Unfortunately, in many cases these processes fail to correct the number in time (or seem to correct the number not at all). In Oracle HRM, sometimes missing information is replaced by text-strings stating that „Debbie has to solve this problem‟. DECOS seems to have become a victim of migration efforts. Even though being one of the most modern systems, interfacing resulted in a myriad of errors and unexplainable results. Upon closer inspection, the database seems to be corrupted, as if multiple migration attempts had been made, all of which at some point failed, leaving about 10% of the DECOS database in ruins. Recently management focus was on correctly clearing information in case of student decease. Even though hard evidence is missing, it seems that in the recent past information was send to deceased students‟ addresses. What has been found too is that sometimes the timeframes of business processes itself causes problems. In some cases students receive no formal clearing, for instance due to the fact that study fees have not been paid. Even though those students may start their study, they do not receive a student account yet. This situation may be misinterpreted as a data quality error. A final issue is that all cooperating systems store data in different time frames. Oracle HRM is very good in keeping a historic track of all data, where the student information system CATS seems to be able to store the present situation only. What consequences arise as a result of these issues? Information about duplicate accounts is propagated into a adjacent information systems (currently using file oriented interfaces) and removing those duplicate accounts take considerable effort. Even worse perhaps, is that the existence of duplicate accounts may lead to errors in the student head-count, leading to uncertain financial budgets. Propagation of errors is an effect that is related to almost every issue found. Consequences are that information systems produce incorrect information, resulting in loss of confidence. Secondly, it is difficult and time-consuming (costly!) to find and repair those errors. Painful mistakes like sending the wrong mail to deceased students may lead to serious image damage. In the standard file-based integration processes, every day time is spend by operations checking the batch files for errors manually. In fact, manual inspections and correction processes are found everywhere. DECOS for instance is used as a source for management information, Due to the fact that the DECOS database is corrupted, those management information reports are checked manually. If suspicious information is found, the numbers are corrected manually. Not only does this ad-hoc practice consume time, it effectively renders the management reports useless. 
Where data quality errors in HRM lead the way to the solution ('Debbie should solve it' – so let's ask Debbie), problems related to facility management tend to keep everyone in the dark. When material (like a computer) is ordered for new personnel, getting the equipment delivered proves to be a challenge, since information regarding a valid location is not available. Not only does this practice lead to lost time, new personnel are not served professionally on their first day on the job, damaging Windesheim's image. Differences in storing data mutations over time may lead to incorrect system responses, like sending information to a student's new address prior to the actual date of moving. However, it is unknown whether anything like this has happened already.

Are there any causes and solutions identified already?
The use of Commercial Off-The-Shelf applications seems to contribute to the inventive use of data fields. Packages do not always support a flexible and proper implementation of a business process, and sometimes an inventive implementation has to be found. Then again, standard solutions do not always offer the input checks one would like to see. Even the national student portal Studielink (www.studielink.nl) allows students to fill out their application forms incorrectly. It would help if correctness of data were enforced 'at the source'. The distinction between correct and flawed data is not always clear. To solve this problem, development of a canonical data model is planned.
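Enforcing correctness 'at the source' boils down to rejecting invalid values at the input function rather than repairing them downstream. A minimal sketch, with made-up field rules; the real validation rules would have to come from the planned canonical data model:

```python
import re
from datetime import date

# Illustrative field rules; the actual rules would be derived from
# the canonical data model, not hard-coded like this.
POSTCODE_RE = re.compile(r"^\d{4}\s?[A-Z]{2}$")  # Dutch postcode format

def validate_enrollment(form: dict) -> list[str]:
    """Return a list of validation errors; an empty list means accepted."""
    errors = []
    if not form.get("family_name", "").strip():
        errors.append("family name is required")
    if not POSTCODE_RE.match(form.get("postcode", "")):
        errors.append("postcode must look like '8017 CA'")
    try:
        birth = date.fromisoformat(form.get("birth_date", ""))
        if birth >= date.today():
            errors.append("birth date must lie in the past")
    except ValueError:
        errors.append("birth date must be an ISO date (YYYY-MM-DD)")
    return errors

# Rejecting the record at entry prevents the error from propagating
# into adjacent systems via the integration interfaces.
print(validate_enrollment({"family_name": "", "postcode": "8017CA",
                           "birth_date": "12-04-1990"}))
```

Rejecting flawed input once, at entry, is far cheaper than the manual inspection and correction processes the interview describes as being "found everywhere".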
6.2 Interview Report WDQM Marlies van Steenbergen

Discussion with: Marlies van Steenbergen MSc, Lead Architect, Sogeti
Subject: Validity of the WDQM model
Date: 12 March 2010
First of all, it is noted that at level 4, emphasis lies on being able to manage a process quantitatively, which implies the presence of a measurement mechanism. Having no process areas defined at level 1 in the WDQM is recognized to be correct.

At level 2, the initial positioning of root cause analysis in the process column is questionable. When properly conducted, a root cause analysis leads to identification of underlying problems and enables a more lasting solution. Therefore, root cause analysis may be positioned at level 3 instead. Information being unspecified at level 1, not trusted at level 2 and structured at level 3 is not directly based on evidence in literature. The reasoning behind these labels became clear during the discussion and is recognized, but may need some further explanation.

At level 3, the focus is on being able to manage multiple changes in harmony and creating synergy. Therefore, the term project management may better be replaced by programme management. In many cases, portfolio management is used to indicate synergy. Yet portfolio management is often used in conjunction with (IT) governance, utilizing frameworks like COBIT, BiSL and ITIL. These frameworks may fit level 4, Quantitatively Managed, better, since their focus is on supporting the whole process / product life cycle. Therefore, using programme management instead of project management at level 3 seems appropriate. The explanation of the process activities at level 3 may be made more explicit. Technical solution might be more appropriate at level 2 in the technological column. And is data integration an activity that may better be positioned in the technical column at level 3? What are the relevant data integration patterns here? It may be argued that at this level, data is integrated with other sources using translation routines at the borders of each source. These translations may well be supported by translation scripts, resulting in the emergence of a canonical data model: a bottom-up description of data being transferred between sources. In the information column at level 3, we may thus see the emergence of a canonical data model.

At level 3, in the staff column, data modeling knowledge is positioned. This raises the question how personnel are able to develop information systems and solve data quality problems at level 2 in the first place. It is therefore recommended to reposition data modeling knowledge at level 2. In this cell, project management skill may be replaced by better fitting programme management skills. One may argue that at this level, staff is 'synergetically' competent, since staff has learned to create synergy from combining multiple transformations (projects).

At level 4, data is approached as a product. The presence of an information product manager at this level makes good sense. But at level 3, data may be recognized to be raw material, building blocks, a commodity perhaps. At level 3, who is responsible for this material? Since level 4 incorporates end-to-end business process management, all measurement and analysis instruments enabling level 5 may be present at level 4 already. In the technology column, which integration patterns apply here? In the information column, the canonical data model may well be used to define a common information language to which all data sources adhere.

At level 5, the absence of general theories on data quality is not completely surprising. It seems that data quality theories focus on improving data quality to an acceptable level (fit for use). Applying Six Sigma may work, yet in some cases level 5 is discarded altogether, since the organization in question has no intention of reaching this level. Delivering quality according to a service level agreement does seem to fit level 4 better; it is advised to reposition this process area at level 4. The organization being structured in a strict top-down hierarchy is based on theories from Treacy and Wiersema; this should be explained in more detail. A rather interesting issue is that from level 3 onwards, it is implicitly described that data errors are corrected at the data source, not at the place where they create havoc. This means that a continuous improvement cycle has been defined at level 3 already. What does this mean for level 5?
6.3 Interview Report Data Quality in Education Th. J.G. Thiadens

Attendees: dr. mr. ir. Th. J.G. Thiadens, lector IT Governance, Fontys University of Applied Sciences; F. Boterenbrood
Location and date: Doorn, 15-03-2010
This discussion is about data strategies in higher education. Issues discussed are the historical perspective on IT governance, the current status, regular problems and common solutions, and the future of IT in higher education.

Fontys University of Applied Sciences is characterized by 35 separate schools. The decentralized structure of the organization has resulted in the presence of about 600 simultaneous projects, all resulting in an IT solution. Of these, 10-15 projects are centrally managed; the remaining projects are local initiatives within the individual schools. The governance of the 10-15 centrally managed projects is transparent, while the remaining projects are executed without central guidance. One feels that a portfolio of IT projects should deal with all 600 projects.
This is in fact a position many universities find themselves in today. The Dutch universities of applied sciences are the result of mergers of many smaller institutions in higher education. The resulting institutions are large, be it rather decentralized, organizations. Currently, a move towards more centralized modes of governance is visible. However, data quality may not always benefit from centralizing. Procedures involving data being transferred between systems manually are prone to errors; the books of Starreveld mention that manual record keeping can lead to up to 5% errors in data quality. In many cases, data quality may be improved by shifting responsibilities as far down the hierarchy as possible. Examples are:
• Problems in grade assignment may be solved by making the lecturer directly responsible for correct and timely grading. Lecturers are corrected by students when grade assignment is late or questionable.
• Registration of lecturer availability may be much improved if the lecturer is made personally responsible for this information, and is given the right tools to manage it. The effects of not having registered the right information on time (the lecturer finds himself scheduled at undesired moments) may be a fitting incentive to keep this information up to date.
• Within schools, items are ordered and these items have to be billed. Billing processes should make the school which placed the order responsible for paying the bills. In this way, schools are directly confronted with the financial consequences of their choices, and not at the end of the year, by means of an error-prone budgeting process. This may be implemented by positioning financial controllers at decentralized positions.
• Monitoring study progress is a responsibility which could be both centralized and decentralized simultaneously. Student centered education requires study progress monitoring to be decentralized, allowing study coaches to closely monitor individual students' progress, while business intelligence processes supply management with overall corporate controls.
• Examples of responsibilities that should remain centralized are strategic management and setting the rules for employee benefits.
Transferring responsibilities to the individual is in line with the current use of technological developments like the internet, in which the individual has gained influence. Information is perceived to be an individual asset, which will lead to an individual approach to information. An example is given by Harvard University, where students are presented with individual schedules every day, including proposals for alternative classes the student may wish to attend that day. In many cases, information systems are not trusted. Often, managers rely on information acquired from alternative sources or different indices. The number of employees working at an organization, for instance, can be found by looking at the number of monthly salary deposits. In the future, it is to be expected that information processing will be centralized even further. Private cloud computing has a role to play, enabling multiple institutions to share services. Virtualization too supports the emergence of shared service centers, while respecting decentralized needs. The most difficult hurdle here is overcoming the notion that information is owned by the individual decentralized business units. This requires excellence at an academic level to be present in both management and workforce.
6.4 Interview WDQM Dimensions Report Arjen de Graaf

Attendees: Arjen de Graaf, Founder / CEO Arvix; Frank Boterenbrood
Subject: Validity of the WDQM model
Date: 9 April 2010
Introduction
As founder and CEO of Arvix, a company focused on safeguarding and improving data quality, Arjen de Graaf has deep knowledge of data quality and its relation with organizational maturity. In this meeting, the WDQM goals as described in table 9 are discussed.

Ownership, stewardship and a business case for data quality
In many organizations, an employee is assigned responsibility for the quality of data. However, once asked about the means available to monitor and correct this data, the answer is not always satisfactory. Effective means to influence data quality are absent in many cases. Absence of means results in a situation where one can feel responsible for data quality, but in reality cannot actually be responsible. In other words, the data steward, as mentioned in this research, cannot fulfill his role as caretaker of data quality if the means to effectively influence data quality do not come with the job. Since data quality is related to organizational maturity, the means required are managerial rather than technical. To ensure data quality, one may have to be prepared to restructure the organization. Instating data stewardship without the preparedness to take (perhaps drastic) managerial decisions, restructuring the fabric of an organization, may be in vain. There HAS to be a manager responsible for data quality with the authority to implement change. What can be observed is that organizations assign data quality governance not to one employee or role, but instate a business intelligence department or data quality department. This department is assigned the task of providing the organization with valid business indicators, directly influencing operational processes and management decisions. In this case, data quality and business performance are visibly connected, displaying a clear business case for data quality. Speaking of business cases: businesses are confronted with the situation that customers have direct access to operational data and demand near real-time responsiveness. Today, when data is flawed, an organization has neither the means nor the time to correct this data in internal processes and procedures, and the business runs the risk of finding itself on one of the prime-time consumer platform television shows, explaining why it all went so horribly wrong.

Value of data quality
It is important to be able to express data quality as a valuable asset of an organization. This means that data quality has value; it can and must be expressed in terms that have meaning to the business. In the current model, the approach towards data quality seems rather instrumental and the business view seems to be missing. The value of data quality can be expressed in terms of financial value or business urgency. The business management view may include elements of recognizing new patterns, generating new business based on data mining, turning data into new money. Or costs can be reduced by, for instance, recognizing patterns indicating cases of fraud. Reasons for attention turning to data quality are competitiveness (creating new business), being master of business data and therefore able not only to manage but also to lead an organization, and exploring client demand (instead of sending a mailing 'to all').
Insight
One main dimension of data quality, seemingly missing from the current model, is therefore Insight. Does the organization, the data steward, the manager responsible for data quality have insight into its data and the quality thereof? Insight into data means that it is clear to an organization which data attributes are required or available, where and why these data attributes are created, what sources were used, where these attributes are used, who guards and tests the attribute, when these attributes are outdated and, once obsolete, how they are dealt with.

Accreditation
Data quality is becoming recognized as a major contributor to business success (or, when absent, a prohibitor of it). We may expect a data quality standard to emerge in the near future, and organizations may become data quality accredited using this standard. Needless to say, Insight is one of the first dimensions required to be instated. For now, an organization may well embark on a journey towards data quality improvement because new management has just entered the organization and is in doubt about the reliability of its data: not sure whether the data is right or not. In this case, the new entrant acts as a maverick: not obstructed by any corporate rules and customs, data quality is doubted and questions are asked, demanding unambiguous answers.

Volatility
In the current model, volatility is mentioned as not being recognized at WDQM level one. This does not seem to be right, since at operations, the importance of data quality is recognized right from the start. The experts from operations, however, have a hard time communicating the importance of data quality, and at level one it is mostly management which is unaware of the importance of data quality.

Beef
Where is the beef? The current model is technically correct, yet it seems to be lacking real-world business attention. For instance, the current level labels are quite technical and difficult to understand. What is meant by 'quantitatively managed'? Who is to understand this – it is not very likely to generate management attention instantly. Please describe a 'WDQM for Dummies' using management benefits. Especially beyond level three, data quality becomes a matter of special interest to organizations, opening up a whole new realm of possibilities. What we can see beyond level three in practice today are cloud computing for data quality initiatives, new business generated, and successful one-on-one business models based on reliable data. Make data quality more sexy!
6.5 Interview Report Current Data Quality Educator Gerrit Vissinga

Attendees: Gerrit Vissinga, process engineer Educator; Frank Boterenbrood
Location and date: Windesheim, 17-03-2010
Introduction
The Educator project is in turmoil. It has taken the best part of three years now, and full implementation may well take another three. In the future, issuing diplomas and certificates will become part of Educator. The current graphical representation of Educator's scope of influence is not quite right: the process education development is not within Educator's scope.
Issues, causes and solutions
Management of study definitions in the catalogue is difficult. In particular, updating course definitions is tricky, since the user has to identify the type of update up front. If the update is identified as 'complex', a new version of the study definitions is generated. If the update is identified as 'simple', the current data is updated in place and no new version is created. To the user, the distinction between 'simple' and 'complex' updates is not made perfectly clear, and the consequences of a 'complex' update remain unknown to many. One of the consequences is that, once entered, a new version of the study definitions needs to be linked to semester variant plans. Often, this step is overlooked, resulting in study information not being made available to the student, since students add semester variant plans to their activity plans, never individual courses. Indeed, orphaned study definitions may be found in the catalogue. Errors like these are caused by an over-engineered and complex solution. Currently, a simpler, more straightforward system design is being discussed.

Another issue is caused by the fact that a student may enroll himself into a study that differs from the one agreed upon with study coordinators. This mistake is prevented by having the supporting offices assign semester plans to students' activity plans, or by having the study coordinator check activity plans in great detail. Still, some issues remain unsolved, since the focus is still on supporting the primary process. Other issues are checked by functional support or head lecturers. These tasks, however, are delegated to support offices. One may question the quality of the checks performed.

Discussion on Data Quality Dimensions
Accessibility. Seems to be OK.
Accountability. This seems to have a relationship with confidentiality. This seems to be OK.
Confidentiality. The number of roles available in Educator is rather large, resulting in complex role management.
Consistency. The technical implementation of Educator may not be adequate to prevent data from becoming inconsistent. An example is the issue regarding definition updates.
Currency.
Integrity, Referential. See Consistency. Integrity amongst different information systems seems to be a problem, since data integration with Educator is to a large extent manual.
Reliability. In some cases, once grades were assigned, courses were removed from students' activity plans, causing grades to disappear. This was caused by a notion that the study plans were in error: the reliability of the data was in question. It is not known if any solution preventing this type of error has been implemented.
Specification. Leaves room for improvement.
Timeliness.
Uniqueness.
Volatility. In particular, course definitions are prone to alteration. It seems that lecturers designing their courses change their minds on how to execute or assess their education too often.
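The orphaned-definition problem described above lends itself to a simple consistency report. A minimal sketch, using hypothetical tables for course versions and semester variant plans (not Educator's actual schema), that lists catalogue versions no variant plan refers to:

```python
# Hypothetical extracts from the catalogue; Educator's real schema differs.
course_versions = [
    {"version_id": "OE-101-v1"},
    {"version_id": "OE-101-v2"},   # new 'complex' update, never linked
    {"version_id": "OE-205-v1"},
]
semester_variant_plans = [
    {"plan_id": "SVP-A", "linked_versions": ["OE-101-v1", "OE-205-v1"]},
]

def orphaned_versions(versions, plans):
    """Course definition versions that no semester variant plan links to.

    These are invisible to students, who only ever add variant plans
    (never individual courses) to their activity plans.
    """
    linked = {v for p in plans for v in p["linked_versions"]}
    return [c["version_id"] for c in versions if c["version_id"] not in linked]

print(orphaned_versions(course_versions, semester_variant_plans))
# -> ['OE-101-v2']
```

Such a report would surface the overlooked linking step immediately after a 'complex' update, instead of when a student fails to find the course.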
6.6 Interview Report Current Data Quality Educator Gert IJszenga

Attendees: Gert IJszenga, manager education, School of Build, Environment & Transport; Frank Boterenbrood
Location and date: Windesheim, 15-03-2010
The School of Build, Environment & Transport (BET) is in the process of migrating all student information from the old student information system (SIS) CATS to the new SIS Educator. Starting from year 2008-2009, the digital course catalogue, the students' personal study planning and student grades are registered in Educator. Currently, grade information of students who started in preceding years is being migrated from CATS to Educator. When this process is concluded, the School of BET is planning to utilize Educator's portfolio capabilities.

In implementing Educator, the School of BET applies a gradual approach. First, three years ago, the processes of education development and definition, student activity planning, assessment and grade registration were formalized more strictly, creating a situation in which the School of BET was in control of these processes. Secondly, once these processes operated reliably on the current infrastructure, process support was switched from CATS to Educator. The challenge was to create a process in which:
• education, including assessment rules, is defined correctly;
• students create their study plan on time and correctly the first time round;
• freedom of choice is balanced against predictability (of resource claims);
• registration of grades is completed within a two-week window without major disruptions.

The issue here was to create a situation in which information stored in Educator could be checked against baseline documents, resulting in usable data quality controls and enabling well-informed choices in case errors have to be corrected. The specific question addressed was: "What process design ensures every student to be linked to the right courses, supporting the assignment of the right grades?" The leading principle of the School of BET implementation is that control over data entered into the system is mandatory. This principle is implemented in three areas: the Digital Educational Catalogue DOC (Digitale Onderwijs Catalogus), the student personal activity plan PAP, and grading.

DOC process control
A curriculum does not spring into existence by accident. Leading up to the registration of course information in DOC, a process of design and discussion is executed. These activities are reflected in planning and design documentation being present, resulting in a baseline enabling control over definitions in DOC. The School of BET therefore requires course planning and design documents to be present prior to entering course definitions in DOC. These documents are an instrument guiding and monitoring the quality of the course catalogue.

Personal Activity Planning
Once the student completes the propaedeutic phase, the School of BET offers a variable study programme in which the student has freedom of choice. One of the problems here is that if the student does not use Educator to enroll himself in the courses he is attending in time, grades cannot be assigned. Secondly, it is hard to plan education execution efficiently if participation of students is uncertain up to the very last moment. Therefore, the student is required to create a complete plan for his study career early in his study. To support decision making, three alternative study paths are available for each study, each study path offering limited additional freedom of choice. The School of BET has structured the choices available in a study planning chart, visualizing the different routes. Finally, if a student does not complete his personal activity planning in time, he will not be allowed to participate for one semester. This personal activity planning results in a set of study plans, which are easily converted into files and imported into Educator, linking students to courses, groups and classes. To enable this import, the use of free-format data structures in Educator (known as labels) is standardized. And again, if problems are detected, the individual study plans are a benchmark against which data in Educator can be checked. The Windesheim Educational Standards (WOS: Windesheim Onderwijs Standaard) refer to the use of semester variant plans. These semester variant plans are in fact an educational planning tool encompassing a twenty-week period. The School of BET may not use the semester variant planning structure literally, yet the process in use has exactly the same effect.

Grading
Once 1) the digital course catalogue is correct and 2) the students are enrolled in the right courses in time, assigning grades does not pose any problems. Issues the School of BET meets here are performance issues (i.e. the speed at which the system reacts to input), bugs for which workarounds have to be used, and reporting facilities which are not yet available. These issues indicate that during development and implementation Educator was still in an experimental state; they are currently being dealt with in the Educator development project. The major issue at this moment is getting a grip on the time it takes to assess student results and assign a grade. Ideally, this should be completed within a two-week period; however, instruments to control this service level are not yet available.

Key moments
Important deadlines in these processes are:
1. The moment courses are published in the digital education catalogue;
2. The moment the student submits his personal activity planning;
3. The moment grades are assigned and finalized.
Conclusion
In this discussion, not all information relevant to the research project was covered, since one hour proved to be insufficient. A second date was set in order to continue this meeting.
6.7 Interview Report Current Data Quality Educator Gert IJszenga Continued

Attendees: Gert IJszenga, manager education, School of Build, Environment & Transport; Frank Boterenbrood
Location and date: Windesheim, 25-03-2010
In this interview, the current values of data quality dimensions are discussed.
Accessibility. At this moment, reports enabling control over accessibility are missing. Some rather elaborate manual checks are available. However, due to process design, it is believed that accessibility is for the most part sufficient. There may be an issue with the assignment of grades: an estimated 80% or so is believed to be made accessible to students within 10 days after an assessment. It is mainly the lecturers' motivation that keeps accessibility within limits.
Accountability. Educator offers built-in mechanisms to safeguard accountability. An audit trail is available, logging all data updates. In the real world, however, only exams are stored; student reports and other end products are handed back to the student after examination. It is therefore not feasible to reproduce the product that was assessed. Another issue is the absence of a fall-back administration in case errors cause Educator to fail. In one instance, deletion of courses that were already being graded caused the deletion and loss of all grades, leaving the organization without a backup. The system should prevent this.
Accuracy. Course information is described in many documents outside Educator. As a result, the information entered into the system is (wrongfully) regarded to be of minor importance. This information is often less detailed than it should be. This is an area with room for improvement. How serious are we about the data in our study support systems?
Completeness. Educator requiring all course data to be available before one fixed deadline is perceived to be a problem. There is no room for a more gradual approach, in which required data is stored first and additional, more optional data is added later. The current binary method causes course information to be entered as late as possible, jeopardizing currency and timeliness.
Confidentiality. Is well taken care of. It is hardly possible to adjust a grade, since this function is protected using a token (strong authentication). Using social hacking techniques, one may gain access to student grades, albeit read-only.
Consistency. Due to strict process design, consistency is believed to be managed at a proactive level.
Currency. Some Educator functions are troubled; Educator does show hiccups from time to time. An example is the limited choice of web browsers supported by the system, making it difficult to gain access to the data in time from different devices and locations.
Integrity, Data. Course data is considered to be about 75% correct. The integrity of student activity plans and grade management may well approach a score of 100%.
Integrity, Referential. Is guarded by strengthening the process.
Reliability. At the School of Build, Environment and Transport, the data within Educator is qualified as reliable, as a result of well defined business processes.
Specification. Specification is not quite ready yet. Currently, much knowledge is still confined to Gert alone; the situation is not quite transparent, i.e. ready to be shared. There is still room for improvement here, too.
Timeliness. This is an issue which is being worked on as we speak. It is perceived to be troublesome due to the fact that processes are started reactively. It is the process stakeholder who decides on starting a process, and time and again it proves difficult to start processes in time. Time is poorly planned.
Uniqueness. The design process creates a barrier against data being duplicated. Lecturers work in teams on course development. Course data, however, is described in multiple documents, between which discrepancies are possible.
Volatility. Currently, course data may well be too volatile. The organization, learning to use the system, is changing course information much too often. A good course definition should last for a minimum of three years, and for many courses this may well be eight years. Yet courses are now updated multiple times each year.
6.8 Interview Report Current Data Quality Educator Klaas Haasjes

Attendees: Klaas Haasjes, operational support Educator; Frank Boterenbrood
Location and date: Windesheim, 17-03-2010
Introduction
Klaas Haasjes, as a member of operations, is responsible for the correct operation of Educator and the exchange of data between Educator and its adjacent information systems. Data exchange between Educator and Blackboard, for instance, is a labour-intensive process. In Educator, executing a query results in a comma-separated file, which is then imported into Blackboard, an information system supporting secured document exchange between students and teachers.

Issues, causes and solutions
It is found that data entered in Educator results in problems in Blackboard. For instance, for each course (VOE) in Educator, a module in Blackboard is generated. In this process, only one teacher is linked to each module, being the teacher responsible for the course. In some cases, multiple teachers or groups of teachers are linked to a course in Educator, which leads to one random teacher, or no teacher at all, being linked to the module in Blackboard. Operations does not correct this problem; these issues are corrected manually by the user in Blackboard.

In the past, courses in Educator could be renamed. When this happened, the consequence was that course names in Educator and Blackboard became different, rendering course selection for students a mission impossible. To prevent this confusion, Educator has been modified, preventing course names from being altered. However, ghosts from the past still remain, causing 472 errors during data integration runs.

Life cycle management of data is a problem in many cases. At www.studielink.nl, students can select a study. Once a student selects Windesheim and www.studielink.nl submits their information, an account is created at Windesheim. However, students are free to un-enroll themselves and indeed frequently do so. Their account at Windesheim is not terminated, leading to literally thousands of ghost accounts. In many cases, these ghost accounts are assigned to the mandatory part of the programme of the study the students initially enrolled for. Once that has happened, removing these accounts becomes difficult, since they have become intertwined with educational registrations. A solution for this problem is currently being investigated. Student ghost accounts may also cause havoc with software licensing strategies: when a license strategy is based on the maximum number of enrolled students, ghost accounts may cause maximum thresholds to be exceeded.

Discussion on Data Quality Dimensions
Accessibility. Many students may still not be aware of the existence of the digital catalogue. Indeed, many lecturers may not be aware of its existence. It may be observed that the educational process is seemingly not fully understood by many. Whether actions are planned or taken to improve the situation is unknown.
Accuracy. Values entered in the catalogue are checked against generally agreed upon guidelines. However, these guidelines do not seem to be known by many.
Consistency. In the past, the meaning of grades could be defined by the lecturer designing the course. This led to a plethora of grade value interpretations. One issue in particular caught the attention: grade values indicating a score being insufficient or sufficient, or a course being dispensated altogether. Insufficient and sufficient scores were represented by a 4 and a 6 respectively, much to the dissatisfaction of students graduating cum laude, who, much to their surprise, were presented with one or more sixes amongst the row of 'straight A's'. Now, grade value definitions are defined by Educator automatically. How values for sufficient, insufficient and dispensated are currently processed is unknown.
Currency. In many cases, information is entered into the system too late. This is not primarily a fault of the information system; it is the human factor causing delays. Examples are grades and course definitions being entered too late. Student plans tend to be finalized in time, since being late with student planning results in the student not being able to attend classes for one semester.
Integrity. In Educator, courses with no credits attached have been defined. It is apparently not feasible to assign checks to every data attribute entered.
Reliability. Data is reliable as long as it is entered correctly.
Specification. Documentation supporting Educator is rather thin. However, documentation is being improved.
Timeliness. The human factor proves to be a large contributor to information being available late. Knowledge on how processes rely on information being timely seems to be missing. Implementation of Educator seems to be left in the hands of the individual schools.
Uniqueness. Is a dimension which is strictly observed and guarded.
Volatility. Information in the world of Educator does not change very frequently. Peaks are found when new students enroll themselves at Windesheim.
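The ghost-account issue described in this interview is essentially a reconciliation problem between the Studielink enrollment status and local accounts. A minimal sketch, with hypothetical record layouts (the real Studielink and account data differ), that flags active accounts whose owners have since un-enrolled:

```python
# Hypothetical extracts; field names are illustrative assumptions.
studielink_enrollments = {
    "s100": "enrolled",
    "s101": "withdrawn",   # student un-enrolled via Studielink
    "s102": "withdrawn",
}
windesheim_accounts = [
    {"student": "s100", "active": True, "linked_to_programme": True},
    {"student": "s101", "active": True, "linked_to_programme": True},
    {"student": "s102", "active": True, "linked_to_programme": False},
]

def ghost_accounts(accounts, enrollments):
    """Active local accounts whose Studielink enrollment was withdrawn.

    Accounts already linked to a programme need manual untangling;
    the rest could be terminated automatically.
    """
    for acc in accounts:
        if acc["active"] and enrollments.get(acc["student"]) == "withdrawn":
            action = ("manual review" if acc["linked_to_programme"]
                      else "terminate")
            yield acc["student"], action

for student, action in ghost_accounts(windesheim_accounts,
                                      studielink_enrollments):
    print(student, "->", action)
```

Running such a reconciliation regularly would keep ghost accounts from becoming intertwined with educational registrations, and from inflating license head-counts.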
6.9 Interview Report Current Data Quality Educator Louis Klomp

Attendees: Louis Klomp, ICTO Coordinator, School of Business & Economics; Frank Boterenbrood
Location and date: Windesheim, 18-03-2010
Introduction
Louis Klomp is a teacher and ICTO coordinator (Information and Communication Technology in Education) at the School of Business & Economics (BE). Louis was engaged in the use of the first version of the digital education catalogue, and has participated in the migration to the current catalogue. As a teacher, Louis has hands-on experience with Educator. Educator does not support development of courses; the focus of the model currently presented is too wide. Since BE has used Educator from the very first moment, printing diplomas and grade certificates is supported by Educator this year. Now BE is focusing on defining and registering standards and thresholds, such as the 45 EC threshold associated with the propaedeutic phase. In time, these thresholds will be assigned to students' personal activity plans (semi-)automatically.

Issues and actions
Much to everyone's surprise, during grading teachers were confronted with the fact that when the definition of assessments of a course in the catalogue did not align with the way the course was assessed in real life, grading of that course was difficult, if not impossible. At first, teachers were supported by coordinators removing the course from students' activity plans, correcting the course definitions, re-inserting the course in the activity plans, and restoring previously earned grades manually. Later on, this support was dropped and teachers had to deal with the issues themselves. This rather rigid support policy proved to be beneficial for data quality: teachers became much more aware
of getting the definitions in the catalogue right the first time round. Now, the mindset has been transformed from a deadline being debatable and final being questionable to a deadline being the limit and final being definite.

Course definitions used to be entered by personnel of the BE supporting office. Communication regarding course definitions between lecturers and supporting personnel was based on notes and print-outs. These went missing regularly, causing mistakes and miscommunication. Now, it has become the lecturer's responsibility to enter the course definitions.

It proved to be impossible to link student requests for a re-assessment to the exact moment a course had been scheduled in the past, since in Educator the moment a course was scheduled is not registered. In order to be able to create useful management reports and to assign student rework to the correct course, all BE courses in Educator are copied and renamed each year, inserting the current year into the name of the course. The lecturer responsible for the course has to agree that the course definition is still valid. This procedure caused course information to be improved and enabled student requests to be assigned to the right, historical, course definitions.

Many reports enabling management of Educator data are still missing. Currently, it is hard to get a view on study progress, since relevant reports are not available. Migration of grades between information systems in the past has introduced errors; the lack of reports does not help in identifying these errors. Annual duplication of all course descriptions results in a growth of the database, adding to the need for management reports.

Now that Educator has been used for three years, initial assumptions on how education is organized are being re-evaluated. A redesign seems beneficial, in which the structure of the catalogue may be greatly simplified, improving availability and understandability of the catalogue. It now seems that items like OE and VOE (Onderwijs Eenheid and Variant Onderwijs Eenheid) may best be combined into one course entity, while the entity semester plan seems to be completely redundant. Having used Educator for three years also means that next year, the first section of students will have their diplomas printed by Educator. For reasons unknown, calculation of a final grade does not function properly. In rare cases, students are presented with insufficient grades, while final re-assessments, resulting in sufficient grades, should have shown a more positive result. Reports created by the software manufacturer did not clarify this mystery. Now, an approach is used in which problems are investigated once students complain. In short, many issues are related to the absence of proper management reports.

In some student activity plans, courses and grades students earned were migrated from the previous study support system manually. Again, when these courses were attended by the student and when the grades were earned was not registered in Educator; this information is now unavailable.

Discussion on Data Quality Dimensions
Accessibility. Currently, the system is over-engineered and too complex, limiting accessibility.
Accountability. Is OK.
Accuracy. Initially, accuracy proved to be a problem. By assigning responsibilities to the right functions, and confronting stakeholders with the consequences of their actions, accuracy has been improved greatly.
Completeness. See Accuracy.
Confidentiality. This is OK; Educator offers comprehensive role management functions.
Consistency. It is found that the level of detail in which courses are explained in additional descriptions is not consistent. Some teachers describe their courses in great detail, while others spend only a few words. No actions are defined to correct this situation.
Currency. Grading may well be a problem. No reports exist monitoring the grading process.
Integrity, Data. Data integrity is questioned; because many relevant management reports are missing, the real quality of the data is unknown.
Integrity, Referential. The relations between VOE, OE, semester plans and variant semester plans are questionable and in many cases absent. Simplifying the digital catalogue would greatly improve this situation.
Reliability. Even though Louis is positive about the reliability of Educator, many colleague teachers may disagree. Using Educator only once in a while, and inadequate training and documentation, may well be at the source of this attitude. In Louis' experience, teachers often make mistakes and blame the system.
Specification. On a scale of 1 to 10, where 10 equals excellent and 1 is non-existent, specification scores a poor 1.5, or 2 at most.
Timeliness. For many, the planning of the educational process is perceived as complex. When new education is to be developed, development has to start well in advance of the targeted study period in order to deliver study information in time.
Uniqueness. Is OK.
Volatility. Study information is altered annually, or every half year in some cases. Grades are created quarterly, amounting to about 230,000 grades being registered at Windesheim as a whole each study period. Study plans are extended every six months.
6.10 Interview Report Current Data Quality Educator Viola van Drogen

Attendees: Viola van Drogen, functional support Educator; Frank Boterenbrood
Location and date: Windesheim, 16-03-2010
Introduction
This research focuses on the business domain supported by Educator. To visualize this domain, the domain architecture as designed by Jansen (2006) is used as an information source. Currently, this domain architecture is being discussed.

Issues, causes and solutions
In Educator, data may be entered and updated by many stakeholders, while in many cases Educator does not offer input checks, resulting in data being in error the moment it is stored in the system. Causes identified by functional support are:
• no workflow has been defined for the specific data set;
• on individual fields, no data checks are available;
• the stakeholder operating Educator lacks vital knowledge of the effects of erroneous input;
• time to develop fitting reports is missing.
In the experience of functional support, many stakeholders agree on the fact that data needs to be correct; however, this attitude seems to be missing with regard to one's own actions. Errors in data are revealed once grade certificates are printed. On these certificates, it becomes clear that grades are not assigned to the right courses and that descriptions of courses are in error. It is revealed that in many cases errors are caused by inadequate data entry. An example of inadequate data entry is the situation in which grades are entered twice. This may seem to be an innocent mistake, since in the end the student receives the right grade. It seems that no harm is done, yet a student is granted only one chance to redo an assessment when a grade is insufficient – and entering a grade twice counts as rework!
Currently, Educator produces grade certificates for all first-year students and, at some schools, second-year students. In the study process, the digital education catalogue needs to be finalized first; the student's personal activity plan (PAP) as well as the schedule may be created next. If either the catalogue or a student's activity plan is incomplete, teachers may not be able to assign grades. It is found that the digital catalogue is used as an experimental course development stage, instead of a catalogue of predefined and finalized course definitions, resulting in frequent change requests on previously accepted definitions. Whether or not the semester plans and semester variant plans are actually being used is unknown. In order to get a grip on changes in DOC, enable smooth scheduling of classes, and support the PAP creation process, the option of freezing the digital education catalogue in April is currently being discussed. Preceding this lock-down of the digital catalogue, a mechanism using red and green 'traffic lights' may be implemented, reminding stakeholders of the effects of changes in DOC.

Discussion on Data Quality Dimensions
The research has defined a list of data quality dimensions. Which data quality dimensions are currently of importance?
Accessibility. When looking from the student's point of view, accessibility of data in Educator leaves room for improvement. Finding the right course in the catalogue proves to be a challenge at times, since naming conventions may exist yet are rarely used. This results in a plethora of course identification codes, looking rather identical in many cases. An action planned to improve this situation is the implementation of automatic course code generation, replacing the manual assignment of a course code by a code that is generated automatically based on course parameters. Another action is to assign timing information to course information, indicating the semester and period in which the course is going to be scheduled.
Accountability. Is almost 100% implemented. All actions modifying data sets in Educator are logged, creating an audit trail binding stakeholders to actions performed. Unfortunately, an audit trail is not created if an instance of a course is removed from activity plans and re-attached to those plans once the course is modified. This action, however, is under debate, since it should under normal circumstances not be required and seems to create problems when entering grades.
Accuracy. Seems to be in control, yet in some cases student data is found to be corrupted. The source of these problems is believed to be a data migration between the old and new student information systems. However, students entering faulty data in the online admission system Studielink (www.studielink.nl) are a likely source of data corruption too. Students confronted with data quality problems may have their data corrected at the student administration: their data will be corrected in the main student information system first, and then transferred to secondary systems later.
Confidentiality. Is under discussion. Functional support is able to modify student grades, and other stakeholders are able to view these data sets. It is rather likely that this is an undesired situation.
Currency. Is of importance when entering grades. At Windesheim, it is agreed that grades become available within two weeks after an assessment. However, no instruments to monitor this period are currently present.
Integrity. Integrity of data seems to be under control, albeit that in the past entities were discovered in which required data fields were missing. Specialists from both Windesheim and the software supplier were not able to find a cause for this anomaly. The situation has been corrected, yet the cause remains unknown. A report has been defined, offering a control for monitoring integrity.
Reliability. Currently, the reliability of Educator is being questioned. Educator does not offer basic reports; reports are created using Business Objects. Unexpected collapses of Business Objects result in unavailable or unreliable reports. Currently, re-instating Educator's reporting capabilities is being discussed. To solve data quality issues, an Educator database quality taskforce has been created.
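The two-week grading agreement mentioned under Currency is easy to monitor once assessment and publication dates are recorded. A minimal sketch with hypothetical data; Educator offers no such report today:

```python
from datetime import date

# Hypothetical grade records; in practice these would come from Educator.
grades = [
    {"course": "OE-101", "assessed": date(2010, 3, 1), "published": date(2010, 3, 10)},
    {"course": "OE-205", "assessed": date(2010, 3, 1), "published": date(2010, 3, 22)},
]

WINDOW_DAYS = 14  # Windesheim agreement: grades available within two weeks

def overdue_grades(records, window=WINDOW_DAYS):
    """Grades published later than the agreed window after assessment."""
    return [(r["course"], (r["published"] - r["assessed"]).days)
            for r in records
            if (r["published"] - r["assessed"]).days > window]

for course, days in overdue_grades(grades):
    print(f"{course}: published after {days} days (limit {WINDOW_DAYS})")
```

A report like this would turn the two-week agreement from an unverifiable intention into a measurable service level.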
6.11 Data Quality Workshop

On Tuesday 30 March 2010, a workshop establishing future data quality requirements was conducted. In this paragraph, the outcome of this workshop is documented.

Date and time: Tuesday, 30-03-2010, 14:00 – 16:00
Location: IT services, Windesheim
Attendees present: G. Spoelman (Teamleader Software Development), G. Kwakkel (Software Development), K. Haasjes (Operations), A. Polderdijk (Information Security), G. Vissinga (Process Design), G. IJszenga (Education Management), R. Slagter (Project Management), M. van den Berg (Operations), A. Paans (Information Management), H. Tellegen (Operations), A. Jaspers (Operations), F. Boterenbrood (Research).

Workshop Schedule:
14:00 Welcome, problem definition (A. Paans) and workshop (F. Boterenbrood) discussion
14:30 Discussing Educator process and business rules (all)
15:00 Explanation on data quality dimensions (F. Boterenbrood)
15:15 Selection of future data quality dimensions (all)
15:45 Discussion of results (all)
16:00 Wrap-up

Workshop Preparation
For this workshop, a large room providing both free space for workshop activities and a large table for a round-table discussion was arranged. For each attendee, the Educator process and a set of business rules were printed. The Educator process was divided into four main sections, each sub process resulting in a baseline as established during the interviews. For each section, a paper sheet was taped to the wall, enabling workshop attendees to visibly choose data quality dimensions suitable for the sub process discussed. For each section, a set of A4-sized sheets was printed, each sheet defining one data quality dimension. Every data quality dimension was given a value according to its position in the WDQM (value = 2^(level − 1)). An option was offered to assign a reduced value to a dimension, indicating the data quality dimension being only partly met. The value of a dimension was expressed in 'credits'. For each section, a set of 20 green and 10 red labels was provided, limiting the number of data quality dimensions to be selected. Prior to workshop execution, the selection of data quality dimensions was tested on colleagues within the School of Information Sciences. Based on the experiences collected from these tests, the data quality dimension definitions were improved.

Workshop Execution
In a round-table setting, the Educator process and business rules were discussed. This discussion resulted in some business rules being dropped, while others were altered. The data quality dimensions were discussed. Care was taken not to reveal the WDQM yet. The participants were grouped into four groups. During 15 minutes, each group discussed a sub process, assigning 20 green labels to data quality dimensions, each label corresponding with one 'credit'.
After this initial round, the groups switched and validated the data quality dimensions assigned to a sub process by another group. Alterations were indicated by red labels. The total number of labels was not to exceed 20. Finally, the results were discussed. The participants showed confidence in the results gained, yet expressed doubts regarding the way these were to be interpreted. The WDQM was discussed.

Workshop Results
The business rules were validated, in some cases altered, and agreed upon. For each sub process, data quality dimensions were assigned (sub process and milestone, DQ dimension, required, WDQM level):

o Manage Digital Education Catalogue (milestone: courses are published)
  Completeness (Must have): level 3
  Currency (Should have): level 4
  Accuracy (Should have): level 3
  Reliability: level 3
  Specifications (Should have): level 2
  Consistency (Should have): level 4
o Orientate, Select, Apply and Contract (milestone: student's activity plan is agreed upon)
  Timeliness (Must have): level 4
  Reliability: level 3
  Completeness (Should have): level 3
  Accuracy (Should have): level 3
  Accountability: level 3
o Schedule, Study and Assess (milestone: grades are assigned)
  Accuracy (Should have): level 2
  Referential Integrity (Should have): level 2
  Completeness (Should have): level 3
  Currency (Must have): level 4
  Timeliness (Should have): level 4
o Discuss Progress and Manage Study Progress (milestone: student receives a certificate)
  Completeness (Must have): level 3
  Accuracy (Must have): level 3
  Reliability: level 3
  Confidentiality (Must have): level 3
  Currency (Must have): level 4
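Combining the credit weighting from the workshop preparation with the assignments above yields a required data quality score per sub process. A small sketch of that arithmetic, using the reconstructed weighting value = 2^(level − 1); the dimension levels are the workshop outcomes, the scoring function is the only assumption:

```python
def credits(level: int) -> int:
    """Credit value of a data quality dimension at a WDQM level (2..5)."""
    return 2 ** (level - 1)

# Workshop outcome for one sub process: dimension -> assigned WDQM level.
manage_catalogue = {
    "Completeness": 3, "Currency": 4, "Accuracy": 3,
    "Reliability": 3, "Specifications": 2, "Consistency": 4,
}

total = sum(credits(level) for level in manage_catalogue.values())
print(total)  # 4 + 8 + 4 + 4 + 2 + 8 = 30
```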
Workshop Sheets Used
Beschikbaarheid
Betrouwbaarheid
Consistentie (reactief)
Beschikbaarheid beschrijft hoe lang het duurt voordat gegevens beschikbaar zijn voor de deelnemers in een bedrijfsproces.
Betrouwbaarheid beschrijft de mate waarin alle gegevens die een informatiesysteem beheert door besluitvormers vertrouwd worden.
Consistentie beschrijft in hoeverre alle data elementen hetzelfde beschrijven / betekenen. Consistentie kan worden verkregen door gegevens achteraf te corrigeren.
Eenheid :
Tijd, B = delivery time - input time + age
Eenheid :
Binair 1 of 0, Gegevens worden vertrouwd, of zij worden niet vertrouwd
Eenheid :
Hoog
:
Meetbaar met een klok
Wel
:
1
Hoog
:
1
Laag
:
Meetbaar met een kalender
Niet
:
0
Laag
:
0
Afhankelijk van: -
Credits:
Vereist (hoog)
8
Afhankelijk van: Nauwkeurigheid,Volledigheid
Gewenst (minder hoog)
4
Credits:
C = afwijkend / totaal
Afhankelijk van: -
Gegevens worden vertrouwd
4
Ratio 0 - 1, afwijkingen ten opzichte van totaal elementen.
Credits:
Vereist (laag)
2
Gewenst (minder laag)
1
Consistentie (proactief)
Herleidbaarheid
Integriteit
Consistentie beschrijft in hoeverre alle data elementen hetzelfde beschrijven / betekenen. Consistentie kan worden verkregen door systemen te conformeren aan een enterprise architecture.
Herleidbaarheid beschrijft de mate waarin herleidbaar is wie verantwoordelijk is voor welke wijziging van de waarde van gegevens.
Met de term integriteit wordt hier bedoeld dat gegevens van de allerhoogste kwaliteit moeten zijn. Gegevens zijn integer als per miljoen gegevens er minder dan 3.2 fouten optreden (Six Sigma).
Eenheid :
Eenheid :
Binair 1 of 0, Mutaties zijn herleidbaar, of zij zijn niet herleidbaar
Eenheid :
Ratio 0 - 1, afwijkingen ten opzichte van totaal elementen. C = afwijkend / totaal
Hoog
:
1
Wel
:
1
Hoog
:
Laag
:
0
Niet
:
0
Laag
:
Afhankelijk van: -
Afhankelijk van: -
Credits:
Vereist (laag)
8
Gewenst (minder laag)
4
Credits:
Sigma
6σ 3σ
σ
3.2 fout per miljoen 67K fout per miljoen (93% foutvrij)
Afhankelijk van: Diverse procesindicatoren
Mutaties zijn herleidbaar
4
Credits:
Vereist (6sigma)
16
Gewenst (3σ)
8
Accuracy, proactive (Nauwkeurigheid, proactief)
Describes the degree to which data corresponds with reality. Accuracy can be obtained by screening data beforehand.
Unit: ratio 0 - 1, errors relative to the total number of elements; N = wrong / total
High: 1; Low: 0
Depends on: -
Credits: required (low) 4; desired (less low) 2

Accuracy, reactive (Nauwkeurigheid, reactief)
Describes the degree to which data corresponds with reality. Accuracy can be obtained by correcting data afterwards.
Unit: ratio 0 - 1, errors relative to the total number of elements; N = wrong / total
High: 1; Low: 0
Depends on: -
Credits: required (low) 2; desired (less low) 1

Referential integrity (Referentiële integriteit)
Describes the degree to which related sets are recorded in accordance with their formal relation. Referential integrity is guarded by database constraints.
Unit: ratio 0 - 1, errors relative to the total number of relations; R = wrong / total
High: 1; Low: 0
Depends on: -
Credits: required (low) 2; desired (less low) 1
Workshop Sheets Used (continued)
Specifications (Specificatie)
Describes whether the data collection and the business rules are sufficiently documented.
Unit: binary 0 - 1
Complies: 1; Low: 0
Depends on: -
Credits: required (complies) 2; desired (almost complies) 1

Timeliness (Tijdigheid)
Describes the degree to which data is available and suitable for use.
Unit: undetermined, T = Volatility * Availability
High: B = wavelength of volatility; Low: 0
Depends on: Volatility, Availability
Credits: required (high) 8; desired (less high) 4

Accessibility (Toegankelijkheid)
Describes the degree to which access to data arises before the data becomes irrelevant.
Unit: ratio, T = 1 - (delivery time - input time) / (outdated time - input time)
Depends on: Availability
Credits: required (high) 8; desired (less high) 4

Uniqueness (Uniciteit)
Describes the degree to which data is unambiguously obtained, stored and represented.
Unit: ratio 0 - 1, duplicates relative to the total number of entities; U = duplicate / total
High: 1; Low: 0
Depends on: -
Credits: required (low) 2; desired (less low) 1

Confidentiality (Vertrouwelijkheid)
Describes the degree to which all data is shielded from unauthorized use.
Unit: confidentiality rests on many measures; classification criterion: BIV coding
High: essential; Middle: important; Low: desirable
Depends on: Availability and Integrity
Credits: essential 8; important 4; desirable 2

Volatility (Vluchtigheid)
Describes the speed with which data changes in the business domain.
Unit: frequency
High: daily (f > 5/week); Fair: weekly (f > 5/month); Moderate: monthly (f > 5/semester); Low: semester
Depends on: -
Credits: required 0; desired 0

Completeness (Volledigheid)
Describes the degree to which all data required for the process has been recorded.
Unit: ratio 0 - 1, missing elements relative to the total; V = missing / total
High: 1; Low: 0
Depends on: completeness may be at odds with Timeliness
Credits: required (low) 4; desired (less low) 2
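Most of these sheets measure a 0 - 1 ratio of affected elements against a total, and the Accessibility sheet relates delivery time to the moment data becomes outdated. A minimal Python sketch of both shapes follows; the function names and the example figures are illustrative only, not values taken from the workshops.

```python
def defect_ratio(affected: int, total: int) -> float:
    """Shared shape of the ratio sheets (0 = no defects, 1 = all elements affected):
    consistency C = deviating/total, accuracy N = wrong/total, referential
    integrity R = wrong/total relations, uniqueness U = duplicates/total,
    completeness V = missing/total. For all of these the required score is low."""
    return affected / total if total else 0.0

def accessibility(input_time: float, delivery_time: float, outdated_time: float) -> float:
    """Accessibility sheet: T = 1 - (delivery time - input time) / (outdated time - input time).
    All arguments on one time axis (e.g. days); T falls to 0 when data is
    delivered only at the moment it becomes irrelevant."""
    return 1.0 - (delivery_time - input_time) / (outdated_time - input_time)

print(defect_ratio(12, 4800))           # 0.0025: close to the required low anchor
print(accessibility(0.0, 14.0, 140.0))  # 0.9: delivered well before becoming outdated
```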
6.12 Business Rules According to the Windesheim Educational Standards

The Windesheim Educational Standards (Iersel, Loo, Serail, & Smulders, 2009) identify a set of business rules in the form of high-level descriptions guiding the behavior of an organization (Agrawal, Calo, Lee, Lobo, & Verma, 2008):

- The educational model is student centered and competence based.
- Students will be offered a broad set of choices.
- Students will be guided in acquiring internationally accepted qualifications (CROHO26 competences).
- Students will be guided in acquiring nationally accepted generic domain competences.
- Students will be coached during their study.
- A school offers one or more educational programmes. The effort an educational programme requires is measured in EC (European Credits).
- A programme will be constructed using one major and two minors.
- A major is a set of courses and workshops. A major defines the mandatory part of a programme of education. A major is 180 EC in size.
- A minor is a set of courses and workshops. A minor defines the optional part of a programme of education. A minor is 30 EC in size.
- At least one minor will result in the student having completed the first cycle (bachelor level).
- A course is defined as an onderwijseenheid (OE). The maximum size of an onderwijseenheid is 30 EC. The minimum size of an onderwijseenheid is advised to be 3 EC.
- Every onderwijseenheid will result in at least one variant (VOE).
- Onderwijseenheden are clustered into a semesterplan. Variants of an onderwijseenheid are clustered into a semestervariantplan.
- Students are free to choose minors from within their programme of education, from another programme of education, or from another institution, nationally or internationally.
- Programmes may restrict the choice of minors, based on their contribution to the competences to be acquired.
- Assessments are competence based. Competence-based assessments observe students' knowledge, insights, skills and attitude.
- Every onderwijseenheid is concluded by an assessment. An onderwijseenheid is either project-based or theoretical in nature.
- Every programme has a propaedeutics phase. The propaedeutics phase has a size of 60 EC. The propaedeutics phase is concluded with a propaedeutics assessment.
- A student is advised whether or not to continue his study, based on the results of the propaedeutics assessment. The advice is a mandatory opinion.
- Windesheim supports the Associate degree (Ad). The effort to acquire an Associate degree is at least 120 EC.
- Windesheim supports the second cycle (Master degree). Education in the second cycle has no major/minor structure.
26 Centraal Register Opleidingen Hoger Onderwijs: Central Registration of Schools in Higher Education
- During his study, the student will receive personal guidance. The effort required for personal development will amount to 8 EC at least and 16 EC at most. Personal development will be assessed.
- Windesheim offers part-time studies and courses. A part-time study does not necessarily have a major/minor structure.
6.13 Detailed Business Rules

Manage Digital Education Catalogue

1. When the development of a course is completed, it will be described in the Digital Education Catalogue.
2. When a course is described in the Digital Education Catalogue, it will be assigned to a semesterplan.
3. When a course is described in the Digital Education Catalogue, it will be assigned to a major or a minor.
4. When a course is described in the Digital Education Catalogue, for each type of education (daytime education, part-time education) a variant will be described.
5. When a variant of a course is described in the Digital Education Catalogue, it will be assigned to a variant semesterplan.
Orientate

6. When a student engages a new semester, he will work on his Personal Activity Plan (PAP).
7. When a student works on his PAP, he may use the Digital Education Catalogue as a source to choose from.
Select

8. When a student works on his PAP, he may choose semester variant plans from the Educational Catalogue and add them to his PAP, thus creating an individual study programme.
9. When a student is enlisted in a study, the mandatory major of his programme will have to be executed first.
10. A student's personal activity plan in Educator may not be managed by the student. It may actually be managed by the back-office of a School27.

Apply

11. When a student adds a semester variant plan to his PAP, including only minors offered by the programme the student initially enlisted for, the addition is agreed upon automatically.
12. When a student adds a semester variant plan to his PAP, including minors offered by programmes other than the one the student initially enlisted for, an examination committee will have to agree first.
13. When a minor is either full or cancelled, the student may have to choose another semester variant plan for his PAP.
27 As identified in the workshop of 30-03-2010, see appendix 6.11
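Rules 11 and 12 together form a single decision: an addition drawn entirely from the student's own programme is agreed upon automatically, while anything else requires the examination committee. A minimal Python sketch of that decision follows, using assumed data shapes rather than Educator's actual model.

```python
def pap_addition_agreed(student_programme: str, minor_programmes: set) -> str:
    """Rules 11-12: an addition containing only minors from the student's own
    programme is agreed upon automatically; any other addition needs the
    examination committee first."""
    if minor_programmes <= {student_programme}:
        return "agreed automatically"
    return "requires examination committee approval"

print(pap_addition_agreed("ICT", {"ICT"}))                # agreed automatically
print(pap_addition_agreed("ICT", {"ICT", "Journalism"}))  # requires committee approval
```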
Contract

14. When a PAP is agreed upon, and the minor(s) selected by the student is/are still available and not booked already, the PAP is finalized.

Schedule

15. When the execution of minors is agreed upon, a schedule is created by individual or collaborating Schools.
16. When a schedule is created, it takes into account the number of students attending a course, the specific characteristics and educational needs of a course (type and size of classrooms and equipment), the availability of teaching staff assigned to the course, and the order in which courses are to be scheduled.
17. When the schedule is finalized, it is published.

Study

18. When the student is working on his study, he will create a portfolio.
19. When the student is working on his study, he may work with other students on a project.
20. When students work in projects, they will share items in their portfolio.

Assess

21. When an item in a portfolio is ready for assessment, the student will transfer ownership of that item to the teacher.
22. When an item is assessed, a grade will be assigned to it.
23. When a grade is assigned to an item, it may no longer be changed.
24. When all assessments of a course are finalized, the end result will be calculated.
25. When an end result is calculated, the rules as defined in the Digital Course Catalog for the course at hand are executed.
26. When all results exceed the minimal requirements as defined in the Digital Course Catalog for the course, the student is granted the European Credits (EC) associated with this course as defined in the Digital Course Catalog.

Discuss Progress

27. When a semester is finished, the student's progress is discussed.
28. When a student fails to collect the required ECs during the propaedeutics phase within a limited period, the student is not allowed to continue his study at Windesheim.
29. In some cases, when the student has collected 120 EC, an Associate degree may be assigned.
30. When the student has collected 210 EC, the (final) graduation minor may be started.
31. When the student has executed the graduation minor successfully, the first cycle is completed and a Bachelor's degree is assigned.
32. When a student wishes to earn a Master degree, he may engage in a study for the second cycle.
33. When, while studying in the second cycle, the student collects a minimum of 60 EC, the second cycle is completed and the Master degree is granted.
Manage Study Progress

34. When a product from a student is assessed and graded, the grade is stored digitally and made available to the student.
35. When credits are granted to a student, these credits are stored digitally and made available to the student.
36. When a first attempt to be assessed is not successful, a second assessment will be offered during the same study year.
37. When a course is changed between assessments, the rules and number of credits associated with the course the student originally attended apply.
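Rules 22, 23 and 26 lend themselves to direct encoding as input checks. The sketch below is illustrative only: the record layout is assumed, and "exceed the minimal requirements" is read here as "at least the minimum".

```python
def assign_grade(item: dict, grade: float) -> None:
    """Rules 22-23: a grade is assigned to an assessed item exactly once."""
    if "grade" in item:
        raise ValueError("Rule 23: a grade, once assigned, may no longer be changed")
    item["grade"] = grade

def grant_credits(results: list, minimum: float, course_ec: int) -> int:
    """Rule 26: grant the course's EC only when all results meet the minimal
    requirements defined for the course in the Digital Course Catalog."""
    return course_ec if all(r >= minimum for r in results) else 0

item = {"owner": "teacher"}  # rule 21: ownership transferred for assessment
assign_grade(item, 7.5)
print(grant_credits([7.5, 6.0], minimum=5.5, course_ec=5))  # 5: EC granted
```

Guarding these rules at the input functions, rather than repairing violations afterwards, is exactly the proactive stance the maturity instrument recommends.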
6.14 Project Flow

While discussing the project's progress, the constituent of the project drew a map representing the flow of the project as he visualized it. Since this map was indeed an accurate description of the project, it was agreed to document the flow, so that it can be used to discuss progress in the future. This appendix contains the constituent's vision on the flow of the project.
[Diagram: the WDQM model, relating theory (data quality, maturity), issues in current practice, and data quality practices (current, desired, solutions)]
What can be seen is that, based on theories on data quality and maturity, a data quality maturity model is created. This model is used to investigate issues as experienced in current practice. Data quality practices, defined by the model, describe a desired situation in terms of solutions. This, finally, leads to new information, adding to the body of knowledge (theory).
6.15 Literature

Agrawal, D., Calo, S., Lee, K.-W., Lobo, J., & Verma, D. (2008). Policy Technologies for Self-Managing Systems. Boston: IBM Press.
Ahern, D. M., Clouse, A., & Turner, R. (2008). CMMI Distilled: A Practical Introduction to Integrated Process Improvement, Third Edition. Boston: Pearson Education, Inc.
Arvix. (2009). Wacht u tot de rookmelder afgaat. Retrieved November 8, 2009, from www.arvix.com: http://www.arvix.com/user_files/file/wacht_u_tot_de_rookmelder_af_gaat_v12_web.pdf
Baida, Z. S. (2002). Architecture Visualization, Master Thesis in Computer Science. Amsterdam: VU University.
Bakker, J. G. (2006). De (on)betrouwbaarheid van informatie, je staande houden in het informatiegeweld. Benelux: Pearson Education.
Batini, C., & Scannapieco, M. (1998). Data Quality: Concepts, Methodologies and Techniques. New York: Springer Berlin Heidelberg.
Besouw, F. v. (2009). Samenhang tussen bedrijfsregels, bedrijfsprocessen en gegevenskwaliteit. Retrieved November 8, 2009, from Arvix: http://www.arvix.com/user_files/file/samenhang_bedrijfsregels_bedrijfsprocessen_gk.pdf
Boer, S. d., Andharia, R., Harteveld, M., Ho, L. C., Musto, P. L., & Prickel, S. (2006). Six Sigma for IT Management. Zaltbommel: Van Haren Publishing.
Boterenbrood, F., Hoek, J. W., & Kurk, J. (2005). De Informatievoorzieningsarchitectuur als scharnier. Den Haag: Academic Service.
Broers, H. (2007). Onrust in de wijngaard, de wording van Windesheim. Zwolle: Waanders.
Caballero, I., & Piattini, M. (2003). CALDEA: A Data Quality Model Based on Maturity Levels. Proceedings of the Third International Conference on Quality Software (pp. 380-387). Washington: IEEE Computer Society.
Caluwé, L. d., & Vermaak, H. (2006). Leren veranderen: Een handboek voor de veranderkundige. Deventer: Kluwer.
Champlin, B. (2002, 01 14). Beyond the CMM: Why Implementing the SEI's Capability Maturity Model Is Insufficient to Deliver Quality Information Systems in Real-World Corporate IT Organizations. Retrieved 02 07, 2010, from DAMA Michigan: www.dama-michican.org
Chen, P. (1976). The Entity Relationship Model: Toward a Unified View of Data. ACM Transactions on Database Systems, 166-193.
Conway, S. D., & Conway, M. E. (2008). Essentials of Enterprise Compliance. Hoboken, New Jersey: John Wiley & Sons.
Curtis, B., Hefley, W. E., & Miller, S. A. (2009). People CMM: A Framework for Human Capital Management, Second Edition. Boston, MA: Pearson Education, Inc.
Data Quality Task Force. (2004, 12). Forum Guide to Building a Culture of Quality Data. Retrieved 11 30, 2009, from IES National Center for Education Statistics: http://nces.ed.gov/forum/pub_2005801.asp
Davis, J. (2009). Open Source SOA. Greenwich: Manning Publications.
English, L. P. (2009). Information Quality Applied: Best Practices for Improving Business Information, Processes and Systems. Indianapolis: John Wiley & Sons.
European Commission. (2005). The Framework of Qualifications for the European Higher Education Area. Retrieved 03 10, 2010, from the official Bologna Process website: http://www.ond.vlaanderen.be/hogeronderwijs/bologna
Fishman, N. A. (2009). Viral Data in SOA: An Enterprise Pandemic. Boston: Pearson plc, publishing as IBM Press.
Friedman, T. (2009, 09 09). Gartner Webinar: Data Quality Do's and Don'ts. Retrieved 02 10, 2010, from Gartner: www.gartner.com
Gack, G. A. (2009). Connecting Six Sigma to CMMI Measurement and Analysis. Retrieved 12 9, 2009, from iSixSigma: http://software.isixsigma.com/library/content/c050316b.asp
Gartner. (2007, 02 07). Gartner's Data Quality Maturity Model. Retrieved 02 10, 2010, from Gartner Research: http://my.gartner.com
Goodhue, D. L., Wybo, M. D., & Kirsch, L. J. (1992, September). The Impact of Data Integration on the Costs and Benefits of Information Systems. MIS Quarterly, Vol. 16, No. 3, 293-311.
Graham, I. (2007). Business Rules Management and Service Oriented Architecture. Hoboken: John Wiley & Sons.
HBO-raad Lectorenplatform. (2006). Lectoren bij hogescholen. Diemen: Villa Grafica.
Hendriks, P. (2000). De noodzaak van een nieuwe norm voor procesverbetering? Wat behelst ISO 15504 - SPICE? Retrieved 12 9, 2009, from Esprit project no. 27700: http://www.serc.nl/espinode/informatie/SPICE.htm
Hoermann, K., Mueller, M., Dittmann, L., & Zimmer, J. (2008). Automotive SPICE in Practice: Surviving Interpretation and Assessment. Santa Barbara: Rocky Nook.
Hohpe, G., & Woolf, B. (2008). Enterprise Integration Patterns. Boston: Pearson Education, Inc.
Iersel, J. v., Loo, F. v., Serail, I., & Smulders, L. (2009). Windesheim Onderwijs Standaard versie 5.0. Zwolle: Windesheim.
Jansen, J. (2006). Domeinarchitectuur vraaggestuurd onderwijs Windesheim. Zwolle: Windesheim.
Johnson, E., & Jones, J. (2008). A Developer's Guide to Data Modeling for SQL Server: Covering SQL Server 2005 and 2008. Boston: Pearson Education, Inc.
Kneuper, R. (2008). CMMI: Capability Maturity Model Integration, A Process Improvement Approach. Santa Barbara, CA: Rocky Nook.
Kovac, R., Lee, Y. W., & Pipino, L. L. (1997, 10). Total Data Quality Management: The Case of IRI. Retrieved 02 24, 2010, from The MIT Total Data Quality Management Program: http://web.mit.edu
Lankhorst, M. (2005). Enterprise Architecture at Work. Berlin: Springer-Verlag Berlin and Heidelberg GmbH & Co. KG.
Lee, Y. W., Pipino, L. L., Funk, J. D., & Wang, R. Y. (2006). Journey to Data Quality. Cambridge, Massachusetts: The MIT Press.
Loshin, D. (2001). Enterprise Knowledge Management: The Data Quality Approach. San Diego: Academic Press.
Loshin, D. (2008). Master Data Management. Burlington: Morgan Kaufmann OMG Press.
Marble, R. P. (1992). A stage theoretic approach of information system planning in existing entities of recently established market economies. Retrieved 11 11, 2009, from System Dynamics Society: http://www.systemdynamics.org/conferences/1992/proceed/pdfs/marbl405.pdf
McGilvray, D. (2008). Executing Data Quality Projects. Burlington, MA: Elsevier, Inc.
Mosley, M. (2008). DM BOK: Data Management Body of Knowledge. Retrieved 11 07, 2009, from Data Management International: www.dama.org
Nolan, R. (1979, March-April). Managing the crisis in data processing. Harvard Business Review, no. 79206.
Object Management Group. (2008, 06 01). Business Process Maturity Model (BPMM). Retrieved 12 9, 2009, from Object Management Group: http://www.omg.org/spec/BPMM/
Olle, T. W. (1978). The Codasyl Approach to Data Base Management. New York: John Wiley & Sons.
Pant, K., & Juric, M. (2008). Business Process Driven SOA using BPMN and BPEL: From Business Process Modeling to Orchestration and Service Oriented Architecture. Birmingham: Packt Publishing.
Pascale, R., Peters, T., & Waterman, R. (2009). McKinsey's 7-S framework model. Retrieved 12 9, 2009, from Value Based Management.net: http://www.valuebasedmanagement.net/methods_7s.html
Porter, M., & Millar, V. (1985, July-August). How information gives you competitive advantage. Harvard Business Review.
Project Management Institute. (2008). Organizational Project Management Maturity Model OPM3. Newtown Square, Pennsylvania: Project Management Institute.
Riet, P. v. (2009, 10). Knelpunten in de plannings- en roosteringsprocessen van de hogescholen. Retrieved 02 18, 2010, from Lectoraat ICT en Onderwijsinnovatie: www.licto.nl
Ryu, K.-S., Park, J.-S., & Park, J.-H. (2006). A Data Quality Management Maturity Model. ETRI Journal, vol. 28, no. 2, April 2006, 191-204.
Schumacher, M., Fernandez-Buglioni, E., Hybertson, D., Buschmann, F., & Sommerlad, P. (2006). Security Patterns: Integrating Security and System Engineering. Chichester: John Wiley & Sons Ltd.
Software Engineering Institute. (2009). Capability Maturity Model Integration Overview. Retrieved 12 9, 2009, from Software Engineering Institute / Carnegie Mellon: http://www.sei.cmu.edu/cmmi/
Starreveld, R., Leeuwen, O. v., & Nimwegen, H. v. (2004). Bestuurlijke informatieverzorging, deel 2a: Fasen van de waardekringloop. Leiden: Stenfert Kroese.
Tan, D. (2003). Van Informatie management naar Informatie Infrastructuur management. Leiderdorp: Lansa Publishing.
Treacy, M., & Wiersema, F. (1997). The Discipline of Market Leaders: Choose Your Customers, Narrow Your Focus, Dominate Your Market. New York: Perseus Books.
Vermeer, B. H. (2001). Data Quality and Data Alignment in E-business. Eindhoven: CIP-Data Library Technische Universiteit Eindhoven.
Verreck, O., Graaf, A. d., & Sanden, W. v. (2005, August). Meten en verbeteren van gegevenskwaliteit. Tiem, 9, pp. 36-42.
Vught, F. A., & Huisman, J. (2009). Mapping the Higher Education Landscape. Dordrecht: Springer Science + Business Media B.V.
Windesheim. (2004). IT Architectuur 2004 ICT v3.doc. Zwolle: Windesheim dienst ICT.
Zee, P. d. (2001). Business Transformatie en IT: Vervlechting en ontvlechting van ondernemingen en informatietechnologie. Retrieved 11 11, 2009, from Management en Consulting: http://managementconsult.profpages.nl/man_bib/ora/vanderzee01.pdf
Zeist, B. v., Hendriks, P., Paulussen, R., & Trieneken, J. (1996). Kwaliteit van Softwareprodukten: Ervaringen met een kwaliteitsmodel. Deventer: Kluwer Bedrijfswetenschappen.
6.16 List of Figures and Tables

Figure 01: Windesheim Context Diagram (page 4)
Figure 02: The Windesheim application landscape (page 11)
Figure 03: IT service department and system integration organization (page 11)
Figure 04: Nolan's stage model (page 15)
Figure 05: Eras and discontinuities (Zee, 2001) (page 16)
Figure 06: Project stakeholders (page 20)
Figure 07: Research Model (page 25)
Figure 08: Concepts Used (page 28)
Figure 09: Graphical representation of the WDQM (page 40)
Figure 10: A Data Quality Management Maturity Model (Ryu, Park, & Park, 2006) (page 41)
Figure 11: Related Dimensions (page 53)
Figure 12: Domain architecture student centered education Windesheim (page 55)

Table 01: Stakeholder analysis (page 20)
Table 02: Research Material (page 29)
Table 03: Practices and structure, process, technology, information and staff (page 36)
Table 04: A combined view on maturity (page 37)
Table 05: Windesheim Data Quality Maturity model WDQM (page 38)
Table 06: A combined view on the WDQM and the Gartner Data Quality Maturity model (page 43)
Table 07: An overview of data quality dimensions (page 45)
Table 08: Dimensions of data quality (page 46)
Table 09: WDQM Goals expressed in Data Quality Dimensions, Practices and Attributes (page 52)
Table 10: Current data quality dimension values (page 60)
Table 11: Data quality dimension assessment workshop results (page 61)
6.17 Glossary

Accessibility: Ease of attainability of the data
Accountability: The property that actions affecting enterprise assets can be traced to the actor responsible for the action
Accuracy: Closeness of the value of the data to the value in the real world
Business Rules: A set of high-level descriptions guiding the behavior of an organization
Business rule matching: Comparing data values found in a database with valid values according to business rules
Canonical Data Model: A thesaurus of all data being exchanged between systems
Completeness: The degree to which elements are not missing from a set
Confidentiality: The property that data is disclosed only as intended by the enterprise
Consistency: The degree to which values and formats of data elements are in line with semantic rules over this set of data items
Correctness: The degree to which values and formats of data elements are in line with the current state of the object in the physical world represented by the data
COTS: Commercial Off The Shelf. An acronym for packaged, ready-to-use applications
CRUD services: Create, Retrieve, Update and Delete data manipulation services
Currency: Concerns how promptly data are updated
Data profiling: A set of algorithms for statistical analysis and assessment of the quality of data values within a data set, as well as for exploring relationships that exist between value collections within and across data sets
Discontinuity: A change in values perceived to be a setback
DMAIC: Quality improvement cycle including Define, Measure, Analyze, Improve and Control phases
Endless loop: See loop, endless
Information: Data, fit for use, available in a context
Input check: A control guarding data quality when entered
IP: Information Product
Integrity, Data: The degree to which data is fit for use
Integrity, Referential: The degree to which related sets of data are consistent
Latency: Idle time in a process
Loop, endless: See endless loop
MDM: Master Data Management maturity model
New data acquisition: An activity in which suspect data is replaced by newly retrieved data
Overloading: Assigning values to a variable, indicating a system state the variable was not originally intended to signal
OPM3: Organizational Project Management Maturity Model
Process: A set of business rules, started by a single trigger, which when executed results in a predictable outcome
Process area: A cluster of related practices, part of a maturity level
Reliability: The degree to which data is perceived to represent reality
Root cause analysis: A technique to identify the underlying root cause, the primary source resulting in the problems experienced
ROTAP: Research, Ontwikkel (Development), Test, Acceptance and Production environments
Schema cleaning: Transforming a conceptual schema in order to achieve or optimize a given set of qualities
Schema matching: Creating a mapping between semantically correspondent elements of two database schemas
Service center: Department within an organization supporting the main business processes
SOA: Service Oriented Architecture
Source Rating: Assessing sources on the basis of the quality of data they provide to other sources
Specifications: A measure of the existence, completeness, quality and documentation of data standards
Staff: Personnel involved in a process
Structure: Describes the way an organization is structured
Technology: Tooling required to execute a process
TDQM: Total Data Quality Methodology
Timeliness: Or Availability; a measure of the degree to which data are current and available for use
TIQM: Total Information Quality Methodology
Uniqueness: Refers to requirements that entities are captured, represented, and referenced uniquely
Volatility: Characterizes the frequency with which data vary in time
WDQM: Windesheim Data Quality Maturity Model