Applying data models to big data architectures
P. O’Sullivan, G. Thompson, and A. Clifford
Digital Object Identifier: 10.1147/JRD.2014.2352474
A key message from the early adopters of big data is that technologies such as Hadoop**, NoSQL (Not Only Structured Query Language) databases, and stream computing should not be seen as completely separate technologies, but are more valuable when deployed in conjunction with more traditional data management components. There is an urgent need for an overall blueprint for treating both the new and traditional data management components in a holistic and integrated manner. A models-driven approach ensures consistency across this data management landscape in terms of management, governance, and efficiency. This paper focuses on the data modeling considerations relating to big data deployment, using the examples of transaction data and mixed unstructured data, to ensure that data components are evolved to maximize business value and development efficiencies.
Introduction
Most of the fundamental components and infrastructure of the data warehouse landscape were defined in the late 1980s and early 1990s [1]. Core principles were established, such as a central cross line-of-business (LOB) data warehouse that focused on storing historical data in a normalized format, layers of extract-transform-load (ETL) processing to populate this central warehouse from source systems, and a series of associated data marts to provide the data in formats suited to the various different end users. These components are underpinned by security and metadata layers. Significant technological advancements in the intervening years have enabled storage of more data, faster processing of real-time data, and new end-user tools to analyze this data [2]. However, the key principles as defined in the original publications have remained remarkably stable [3]. A common feature of this data warehouse landscape is the use of data models to assist with the development and ongoing maintenance activities [4]. Models are used as a means of ensuring that there is adequate communication between the business sponsors and the IT development staff, ensuring a consistent definition of data elements and enabling the managed growth of the enterprise-wide data warehouse.
The advent of big data technologies has prompted a review of the traditional data management landscape into which they are being deployed. In many ways, their capabilities are revolutionizing the way in which the data management landscape is being used [5]. Both Hadoop** [6] and NoSQL (Not Only Structured Query Language) [7] technologies allow organizations to store radically more data in a more efficient manner. Streaming technologies allow the dynamic processing of incoming time-sensitive data, and federated search technologies allow the querying of multiple new sources of data in real time. Far from heralding the extinction of the existing data warehouse infrastructures, there is growing consensus that this collection of new technologies should be seen as a means of supplementing these traditional data management capabilities [8]. A challenge facing organizations is how to grow their existing data warehouse deployment and data models in a way that protects their previous investments but also enables them to exploit business benefits from big data technologies [9].
Big data architecture
Recent publications have outlined the possible new architectures to accommodate big data technologies and how they might coexist with more traditional data management infrastructure [10]. Before describing how models can interact with big data technology it is necessary to first describe the typical big data architecture components:
• Data sources: These sources include all of the different types of data that might be input to the big data architecture. This ranges from more traditional structured data originating from well-understood internal source systems to less structured data from internal sources (such as voicemail, email, documents, and call centers), to external data from social media, streaming sensor data, or simply traditional data in much higher volumes.
• Information ingestion: This component is needed to assimilate the different data from various sources in an appropriate, timely, and efficient manner. This component might be a combination of traditional ETL technology, data federation, streaming, and MapReduce capabilities. This component might also provide the initial persistence of operational information, for master data management (MDM) or operational data stores (ODS).
• Real-time analytics: This set of data capabilities is used in identifying and acting on the patterns emerging in the data as it comes into the big data architecture in real time.
• Landing area: These repositories of data are stored for use over time by different aspects of the enterprise. This component may also support functions concerning exploration and archiving activities. As well as providing the basis for the data to be subsequently input to the data warehouse, this area is also an important source for the exploration activities of the data scientists. In addition, the landing area may also be used to store output from the activities of data scientists.
• Data warehouse: This set of data stores includes relational warehouses, data marts, and analytic appliances that contain the subset of data needed for specific reporting and analysis by the broader population of business users. The data warehouse may also be augmented with Hadoop technology as required by the characteristics of the data.
• Information governance: This set of components provides the necessary metadata, lineage, and security capabilities.
• Analytical applications: These applications are the consumers of the data assets of the big data architecture, whether it is for the reporting and analytical needs of the majority of business users or the advanced exploration and discovery activities of data scientists.
Models and the big data architecture
In the traditional data management landscape, there is a varied use of different types of models by organizations addressing a multitude of objectives. The range and type of models used traditionally would depend on a number of factors, such as past experience with models (what types of models had worked and what did not), the collective experience and knowledge of the key technologists in the organization, and the overall culture in an organization towards the models [11]. In the case of the big data architecture, it is reasonable to expect a similarly varied adoption of models. Figure 1
illustrates the categories of data design and business models that are potentially significant in the definition and management of a big data landscape.
Figure 1 Relationship between modeling and runtime environment components.
Business vocabularies
Business vocabularies are sets of models that enforce a common semantic understanding across business users or between business and IT. Business vocabularies can also be used to define key performance indicators (KPIs), associated business rules, and business policies. Business vocabulary models include simple glossaries of terms, more precise hierarchical taxonomies, or a representation of structured knowledge such as an ontology. Business vocabularies support the semantic consistency of either the design-time or run-time activities. A close connection between the semantics at design and run time provides a firm business grounding for any design activities and can provide a degree of discipline or consistency to run-time vocabularies that might otherwise be missing [12]. In the context of the big data architecture, a standard semantic reference point that can be mapped to various technical components is an important mechanism in enforcing consistency of meaning across this hybrid data management landscape.
Analysis data models
Analysis data models are the first description of business relationships and constraints in formal modeling structures such as Unified Modeling Language (UML) or Entity Relationship (ER) models. Often referred to as conceptual models, these analysis data models are typically cross-LOB and independent of both platform and any design intent. In some cases, the roles of these models are played by structured business vocabularies such as taxonomies to provide the necessary constraints [13]. These are logical data models and are the basis from which the various design data models are derived. This provides a degree of structural consistency across the different design data models being deployed across the big data architecture.
Design data models
Design data models provide the necessary definition of the different technologies that make up the hybrid big data architecture. The various technologies will require design data models to encapsulate the different design intent for each component. Thus, the design data models that are used in conjunction with the big data architecture would often be as varied as the physical components they are describing. Design data models include relational atomic warehouse models (AWM), which have a normalized nature optimized for central storage, and dimensional warehouse models (DWM), which are optimized for supporting reporting requirements. Design data models also relate to more
operationally focused models describing the repositories in the information ingestion area, and non-relational models describing certain parts of the landing area. A key theme across these design data models is that they should retain as much of the lineage from the overarching conceptual data model as possible. Achieving such a level of semantic and structural consistency assists in enabling reuse and integration of the different physical components of the big data architecture.
Using an integrated set of models
The range and types of models used in specific instances will vary greatly, depending on the complexity of the big data architecture being deployed. For some organizations, the presence of models is limited to data models used to define common relational database components, with the other components being deployed from pre-built applications. However, other organizations might seek to enforce as much consistency as possible across this landscape using an integrated set of models to provide a common semantic and structural standardization across all components. This would
underpin and assist the overall governance, growth, and ongoing management of the big data architecture [14].
Considerations influencing the data modeling
When approaching the creation of models to describe and define the different aspects of the overall big data landscape, it is helpful to think of the different considerations that influence either the models themselves or the means by which the models are deployed to the physical environment. These considerations should be taken into account when building design data models intended to reflect the precise aspects of a physical design. Depending on the type of design artifact intended to be generated from the particular model, the different considerations, illustrated in Figure 2, will influence the shaping of these design data models for different parts of the big data landscape.
Figure 2 Six modeling considerations for big data zones.
Schema considerations
A schema refers to the formal definition of physical data structures. There are various considerations relating to the level of schema required to be enforced as part of the design of the data structures. This might range from little or no schema to be applied in the case of persisting incoming sensor data or social media data in the landing area, to a very precise schema being required in the case of defining a portion of the data warehouse. The increasing use of SQL-style interfaces such as IBM Big SQL when deploying and using Hadoop allows the use of model-driven schemas alongside "schema-less" deployments. In general, the higher the level of schema for an artifact, the more compelling it is to consider deploying it from the models. For example, if one is planning to store trillions of device sensor records where one knows the data structures in advance, then one should define the data structure through a model.
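As a concrete illustration of that last point, the sketch below shows how a model-defined structure for device sensor readings might be deployed over Hadoop through an SQL-style interface such as Hive or IBM Big SQL. The table and column names are illustrative assumptions rather than part of any particular industry model, and the exact DDL dialect will vary by product.
-- Minimal sketch: a sensor-reading structure generated from a design data model
-- and deployed as a Hive/Big SQL-style table over the Hadoop warehouse.
CREATE TABLE sensor_reading (
  device_id     STRING,      -- assumed business key for the instrumented device
  reading_ts    TIMESTAMP,   -- time the measurement was taken
  metric_code   STRING,      -- which measurement the row carries (temperature, pressure, ...)
  metric_value  DOUBLE       -- the measured value
)
PARTITIONED BY (reading_date STRING)   -- partition pruning helps at very large volumes
STORED AS PARQUET;                     -- columnar file format; ORC would be an equally valid choice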
Storage considerations
Storage considerations are related to the expected size or storage limitations associated with a particular physical design. In many cases, such storage considerations will inherently shape the modeling activities for a particular physical design. For example, in deciding whether a model is to be deployed to Hadoop versus a relational database management system (RDBMS), the intended storage or size considerations would influence the design, such as the level of normalization, that is, the process of organizing the tables and columns of a relational model to minimize redundancy and dependency, or the use of "associative" tables that store complex relationships between tables. It might be decided that flexibility is somewhat sacrificed in order to focus on the storage of very large amounts of data.
Ownership considerations
Ownership considerations are related to which users or organizations will own the deployed data structures. With centralized or enterprise-level ownership, the need to enforce a common schema becomes critical, whereas with locally or personally owned artifacts, or with artifacts owned by an external organization, such schema considerations are far less important. A high level of enterprise or central ownership of a particular set of artifacts means a stronger need for a commonly agreed upon structure or definition, and hence an increased need to have such artifacts derived from a model.
Access considerations
Access considerations are related to the expected level of end-user access to the artifacts deployed from the particular design model. Aspects that are important here include frequency of access, number of concurrent users, level of aggregation or transformation to present to users, and expected level of technical skill of the intended users. While direct access to the Hadoop layer might initially be limited to data scientists and analysts, these users can still use model-based structures. This is especially important as big data solutions make the transition from experimentation to production, where business reporting processes begin to access certain data stored in Hadoop.
Data latency considerations
Data latency considerations are related to the anticipated temporal nature of the design artifacts being deployed from the model. This would range from streaming or real-time data, to data updated on a near real-time basis, to data only requiring periodic batch updates. Latency considerations would have a strong influence on the design data model structures, so typically there would be a tendency to denormalize, that is, to introduce some managed redundancy for performance. Typically, there is a strong correlation between the level of data latency and the level of data processing.
Data processing-level considerations
Data processing-level considerations are related to the level of processing that needs to be applied to the data in a particular design in order for it to fulfill its intended function. These considerations would range from the need to onboard raw data as-is from a source, through the need for various levels of adjustment, transformation, and cleansing, to data structures that store highly calculated or aggregated data. As big data is transformed and aggregated, there is an
information governance imperative to ensure that it conforms to enterprise dimensional definitions, and hence a greater need for models to enforce this conformity.
Trading off the considerations
The above considerations are not addressed in isolation from one another; rather, there is a degree of trade-off and cross-referencing between them that greatly influences the different design data models in the big data landscape. The level of schema will have a high correlation with the data processing level of the data. Similarly, it would be difficult to address the necessary storage considerations without also considering the level of schema. For example, a dimensional warehouse model specifies information that has a well-defined schema and a high level of processing but is often implemented in data marts owned by different LOBs. On the other hand, the atomic warehouse model has a very high degree of common or enterprise ownership with a high level of specified schemas, and also has some degree of processed data that is typically not intended for direct user access. While the landing area would also be seen as an enterprise-owned asset, the level of processing and data access are quite low, and parts of this area would have very weak schema definition or would be schema-less. In the following two sections, the examples of transaction data and mixed unstructured data are used to show how these considerations can influence typical big data deployment patterns.
Applying modeling considerations to transaction data
Enterprises are making the pragmatic decision to focus their big data activities on extracting insight from existing internal transaction data [15]. Transaction data is generated by a wide variety of source systems such as financial payments, Internet traffic logs, instrumented medical devices, and mobile technology. This data is often well understood by data analysts and can be directly related to business processes and outcomes. Raw transaction data is represented within the existing business intelligence (BI) and warehouse solutions, but is often only available for analysis in aggregated data marts that were designed to meet specific analysis use cases. This is due to the costs and challenges of scaling existing RDBMS-based warehouses in order to process and maintain data at the petabyte level [16]. When historical transaction data is retained, it is often placed in tape archives to support regulatory or audit requirements, where it is not available for the purposes of data exploration and discovery. Once the decision has been made to retain transactions as a source for analytics, for example all historical financial transactions in a bank, it is necessary to extend existing
information governance and data modeling methods to include this archive. A key challenge in providing access to fine-grain information for various analytical purposes is ensuring that any existing data aggregations are correctly supported. Transaction data models must support data security and governance processes that will ensure that the results of analytics throughout the information supply chain can be trusted and do not contradict one another.
Figure 3 shows the flow of both internal and external detailed financial transactions across the big data landscape. Trusted transaction data from internal core banking systems is loaded directly to the Hadoop warehouse to form the basis of the transaction archive (see label 1 in Figure 3). Transaction data from third-party sources is persisted in the landing area in its raw format, for example JSON (JavaScript** Object Notation) or CSV (Comma Separated Values) (2). Data scientists experiment with data in the landing area and identify a subset of the raw data that can be safely combined into the conformed schema in the Hadoop warehouse (3). Real-time financial analytics are supported by making transactions from the last four financial quarters available in the high-performance RDBMS warehouse (4). Complex aggregations of transactions are performed in Hadoop and loaded directly into data marts (5). Analysts can trust the historical data in the Hadoop warehouse when calculating or predicting transaction profitability (6), or relate transaction aggregates to conformed entities such as customer or account in the data marts (7).
Figure 3 Overview of transaction landscape.
Schema considerations
There might be multiple transaction source systems for any given data entity in the warehouse, each with its own schema. Source schemas will also be subject to change, especially in the case of external sources where the enterprise might not have control. While each source may have a well-defined interface, such as a message specification or API (application programming interface), it is necessary to conform the data into the schemas of the warehouse. The source schema can be mapped to both design data models and the conceptual data model. Raw transaction data can be persisted in its source schema in the Hadoop landing area to build a deep data resource available for analysis by data scientists. Although the data has not been transformed at write-time to the AWM, the mappings from the source schema to the conceptual data model at the time of analysis provide a late-binding of data to schema. Transaction data is typically mapped, conformed, and aggregated during ingest to the data warehouse. The same transaction record might be transformed in multiple different ways to populate the data repositories that each support different analytical and access needs. For example, the logical AWM provides the same design data model for both
the RDBMS and Hadoop areas of the data warehouse. A strong link between these mappings and the conceptual data model is critical if the downstream combination of data from the different repositories is to be accurate and trusted. Aggregations of transaction data are typically modeled by the dimensional schemas of the DWM, optimized for specific analytical purposes such as profitability or operational analysis of transactions over time, geography, or business unit. Analysts working with such aggregates need consistent and stable data structures with which they can routinely make predictions and drive business processes.
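As a sketch of how a single AWM structure might be deployed to both repositories, the DDL below defines the same illustrative transaction structure once for the RDBMS warehouse and once as a Hive/Big SQL-style table over the Hadoop warehouse. The schema, table, and column names are assumptions for illustration; only the storage-related clauses differ between the two deployments.
-- RDBMS deployment of an assumed AWM transaction entity.
CREATE TABLE awm.financial_txn (
  txn_id       BIGINT         NOT NULL,
  account_id   BIGINT         NOT NULL,
  txn_ts       TIMESTAMP      NOT NULL,
  txn_type_cd  CHAR(3)        NOT NULL,
  txn_amount   DECIMAL(15,2)  NOT NULL,
  currency_cd  CHAR(3)        NOT NULL
);

-- The same logical structure deployed over the Hadoop warehouse (Hive/Big SQL-style DDL),
-- intended to hold the full multi-year transaction archive.
CREATE TABLE awm_hadoop.financial_txn (
  txn_id       BIGINT,
  account_id   BIGINT,
  txn_ts       TIMESTAMP,
  txn_type_cd  STRING,
  txn_amount   DECIMAL(15,2),
  currency_cd  STRING
)
STORED AS PARQUET;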
Storage considerations
Transactions are typically generated at very high volumes, and the buildup over time leads to storage challenges. Two strategies for handling this in the traditional data warehouse are a) limiting the number of attributes from each transaction record that are ingested, and b) frequently archiving fine-grain transaction data to tape or other lower-cost storage devices. Hadoop allows transaction data to be persisted to form a historical data asset that can be queried and manipulated in a cost-efficient manner. These storage economies also facilitate the duplication and movement of transaction data to sandbox areas for exploratory analysis and experimentation. With the existing data warehouse schemas extended and applied to these large volumes of data, data analysts can execute queries in familiar languages such as SQL (Structured Query Language). Similarly, the sharing of a common schema, using the AWM, for example, between the repositories simplifies the historical off-load of data from the RDBMS to Hadoop for archive purposes.
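A minimal sketch of the off-load that such a shared schema makes possible, continuing the assumed financial_txn tables from the previous sketch. The statements are written as if a single SQL engine (or a federation capability) could see both tables; in practice the movement is usually an export/load or federated insert, and the cutoff date shown is purely illustrative.
-- Copy transactions that have aged out of the RDBMS retention window into the Hadoop archive.
INSERT INTO awm_hadoop.financial_txn
SELECT txn_id, account_id, txn_ts, txn_type_cd, txn_amount, currency_cd
FROM   awm.financial_txn
WHERE  txn_ts < TIMESTAMP '2013-10-01 00:00:00';   -- illustrative cutoff: keep the last four quarters in the RDBMS

-- Remove the archived rows from the RDBMS warehouse once the off-load is verified.
DELETE FROM awm.financial_txn
WHERE  txn_ts < TIMESTAMP '2013-10-01 00:00:00';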
Ownership considerations
The mapping of the transaction data to business vocabularies and models helps identify which part of the enterprise is the internal owner of the data. The business unit that owns the source system data should not overly influence the use and schema of transaction data as it is deployed into the warehouse. For example, while the marketing or on-line application teams may see themselves as the owners of web site transaction data, they should not have the power to limit the types or volumes of transactions made available to other users.
Access considerations
The manner in which transaction data is accessed is determined by the type of analysis being carried out on it. End users using a variety of analytical tools are not concerned about how and where the large-volume data is stored, provided it is available to them in a suitable format for their analysis tool [17]. The requirement is often to make structured data available through standard data interfaces such as SQL. This allows users to access the data either through existing reporting applications or through direct queries. Performance considerations apply in the choice of technology for the data warehouse repositories; for example, a columnar RDBMS is more suited to fast access times for dimensionally modeled data, whereas a conventional row-organized RDBMS or Hadoop is better suited to highly normalized repositories used for historical reference. In contrast to the business users, data analysts are typically more adept at finding and manipulating data using a variety of tools. Information governance and metadata catalogues that facilitate publication of design data models associated with transaction data allow analysts to both identify and evaluate the data that they have access to. Metadata associated with the schema can assist in tasks associated with working with large-volume data sets, for example ensuring that data sampling does not lead to skewed distributions.
Data latency considerations
Transaction data is commonly generated in a constant manner, available either as a real-time stream or in high-frequency batches. The desire to match the velocity of data analysis to the velocity of data ingest means that there is a focus on the latency of data moving through the various analytical data sources. For example, individual financial transactions from payment devices might be streamed instantly into the warehouse, while the reconciled aggregates of the transactions for a given account might be loaded from a core banking system some hours later. Such latency differences are managed in the metadata and mapped to conceptual data models to avoid any confusion or inappropriate combination of data.
Data processing-level considerations
The manner in which transaction data is to be processed and analyzed strongly influences the design data models. These influences can be characterized by the types of users or applications that are performing the analysis, and the variety or flexibility of the questions being asked of the data.
Transaction data that is aggregated for reporting shows decision makers what is happening in their business. For example, it may show the average value of sales for customers in each store on the previous day. As source transaction data is often raw and structured for operational efficiency rather than business understanding, it must be transformed for end-user consumption. The resulting information needs to be conformed to the conceptual data model and business vocabularies appropriate to the specific domain of business users, for example, marketing or regulatory compliance, allowing them to examine different facets or dimensions of business events related to the transactions. Data scientists work with fine-grained transactions to analyze and understand patterns in the data [18]. Data exploration and predictive analytics require a rich data set where interesting features or anomalies have not been cleansed or filtered to meet standard reporting needs. Data models are required for predictive analytics to combine data sets from multiple sources, as they ensure that combined data sets are being interpreted appropriately. This includes the late-binding of schema, where the raw source data structures are mapped to the common logical data model only at analysis time. This can reduce data transformation effort while ensuring analytical results can be safely combined with other data in the model.
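To make the two ideas in this subsection concrete, the sketch below first shows a late-binding view that maps an assumed raw landing-area structure to terms from the common logical model at analysis time, and then expresses the store-level reporting aggregation described above against that view. All table, view, and column names are illustrative assumptions, and because date arithmetic syntax varies by SQL engine, a literal date stands in for "the previous day".
-- Late binding: expose raw source columns under model-conformed names without transforming the stored data.
CREATE VIEW sales_txn_conformed AS
SELECT src_txn_id    AS txn_id,        -- raw source column names are assumptions
       src_store_cd  AS store_id,
       src_amount    AS sale_amount,
       src_txn_dt    AS txn_date
FROM   raw_sales_txn_landing;

-- Reporting aggregation: average sale value per store for the previous day.
SELECT store_id,
       AVG(sale_amount) AS avg_sale_value
FROM   sales_txn_conformed
WHERE  txn_date = DATE '2014-06-30'    -- illustrative stand-in for the previous day
GROUP BY store_id;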
Applying modeling considerations to unstructured data
It is estimated that more than 80% of the data that exists in corporations is unstructured, that is, data that has no predefined schema or organization [19]. In a 2012 survey of 255 BI and data management professionals, 45% of respondents mentioned "human-sourced" information as a data source being used or planned to be used in their big data projects [20]. This human-sourced information includes external social data (such as Twitter** and Facebook**), image content (e.g., pictures and video), streaming audio (such as call center voice logs), and human-generated documents such as email. A promise of big data is not just that more and better number crunching can be carried out on large volumes of traditional structured data sources such as transactions, but rather that significant operational efficiency and insight can be obtained by combining these traditional sources with other new unstructured data sources. However, big data considerably broadens the integration challenge across the hybrid data management landscape, with much of the data remaining in Hadoop rather than the RDBMS [21]. Applying the business vocabularies, the conceptual data model, and the design data models will help to ensure consistency across the landscape. While there is obviously a broad spectrum of unstructured data sources, one interesting example is the call center.
Figure 4 Overview of the unstructured landscape.
It is worth looking at two aspects: legal requirements and requirements to improve operational efficiency. Figure 4 outlines the movement of data across the big data landscape for the call center. Enterprises are subject to litigation and governmental investigations that require the preservation of potential evidence such as e-mail, documents, and audio [22]. In order to comply with these legal requirements, the original audio recording of the conversation with the customer is required to be retained in its original form for a specified period, for example twelve calendar months. For this reason, the audio recordings are persisted for the duration to the landing area (see label 1 in Figure 4). Additionally, for operational insight, automatic speech recognition (ASR) processing is performed on the audio recording to generate text transcripts for further analysis. These transcripts are written to the standardized schema in the Hadoop data warehouse (2). Text analytics using natural language processing (NLP) is performed on the call transcript and any notes entered by the agent. This analysis includes entity and relationship extraction, for example mentions of a specific product or the customer’s employer, and also obtains a sentiment polarity for the call. The output of this analysis is loaded to the RDBMS warehouse and combined with metrics
from structured data sources such as customer relationship management (CRM) systems using the customer MDM identifier (3). From there, the combined data is aggregated in the dimensionally modeled schema in the data marts and used for various operational purposes, including developing a better understanding of the customer to provide an improved call experience and help prevent churn (4).
Schema considerations
The raw audio recordings are schema-less, and are simply onboarded to the Hadoop landing area; they are therefore not featured in the enterprise design data models. The resulting textual transcript of the call exhibits a low level of schema, typically including the customer MDM identifier and some metadata such as the time of the call, the agent identifier, and the unstructured transcript payload. Over time, the schema of the transcript and associated metadata might evolve and could benefit from the agility offered by a flexible schema format such as JSON. The features and sentiment polarity extracted from the unstructured transcript payload, and the resulting conformed dimensional representation, are entirely structured, and the enterprise design data models are applied here. The same schema structure is asserted for this data in both the Hadoop and RDBMS areas of the data warehouse.
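A minimal sketch of the lightly modeled transcript structure described above, written as Hive/Big SQL-style DDL for the Hadoop warehouse; the column names are assumptions, and a flexible-schema representation such as a JSON document store would be an equally valid target.
-- Call transcript record: a handful of modeled columns plus the unstructured payload.
CREATE TABLE call_transcript (
  customer_mdm_id  STRING,      -- customer master data management identifier
  call_ts          TIMESTAMP,   -- time of the call
  agent_id         STRING,      -- call-center agent identifier
  transcript_text  STRING       -- unstructured ASR transcript payload
)
STORED AS PARQUET;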
Storage considerations
The raw audio recordings are expected to be relatively large, at an estimated 3 MB per minute of recording. The audio recording is an immutable, "write-once" piece of data, and is ideally suited for persistence in binary format in the landing area. As the raw recordings are required to be retained for a finite period of time only, the data placement should be such as to allow for convenient rolling-out of aged data, for example by nesting directories on the Hadoop file system by year and month. The generated call transcripts for subsequent text analysis are retained indefinitely as a deep data resource, to allow for iterative improvements in the analytical models constructed by the data scientist as new patterns are discovered over time. The grain of extracted features and sentiment polarity is such that the data volumes involved are well within the design capacity of the RDBMS data warehouse. This extracted data is persisted in the RDBMS warehouse according to the enterprise retention policies, and indefinitely thereafter on the Hadoop warehouse as a queryable archive. The same design data model is applied to these repositories, and the sharing of a common schema simplifies this historical off-load.
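One way to make the year/month roll-out convenient is to track the recordings through a partitioned external table whose partitions map onto the nested directories. The sketch below is Hive-style and entirely illustrative (table name, columns, and paths are assumptions). Dropping a partition rolls the aged month out of the queryable catalog; for an external table the underlying files are then removed in a separate housekeeping step.
-- Illustrative external table over the landing-area audio recordings,
-- with one partition per year/month directory.
CREATE EXTERNAL TABLE call_audio_recording (
  call_id          STRING,
  customer_mdm_id  STRING,
  audio_file_path  STRING      -- location of the binary recording within the partition directory
)
PARTITIONED BY (call_year INT, call_month INT)
LOCATION '/landing/call_audio';

-- Roll out a month that has aged beyond the twelve-month retention period.
ALTER TABLE call_audio_recording DROP PARTITION (call_year = 2013, call_month = 6);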
Ownership considerations
The call center is an enterprise-wide resource, with centralized ownership and data models for these shared assets in the data warehouse. The call transcripts can include personally identifying, personally sensitive, or commercially sensitive data. Therefore, this data must be subject to the necessary security and access requirements as applied in the call center source system. Data protection and data anonymization rules apply here.
Access considerations
The raw audio recordings residing in the landing area are expected to be accessed on an infrequent basis only, when a litigation or governmental investigation requires recordings to be retrieved and presented as evidence. Therefore, information governance metadata is captured for the audio recordings so that they are linked to traditional structured data via the conceptual data model. The raw transcripts of the calls, stored in the Hadoop data warehouse, will initially be accessed by the data scientist only, for experimentation in order to build and refine the text-analytics models. The metadata catalog helps the data scientist to identify and evaluate candidate data sources to complement the features extracted from the transcripts. Later, when the insight gained is deemed to be of value, the structured data extracted from the audio transcript, which is of a determined level of veracity, is combined with metrics from transaction data sources using the customer MDM identifier and written to the RDBMS warehouse in schema structures deployed from the AWM. From there, this combined data can be aggregated and written to the DWM schema in the data marts for reporting and analysis by the various LOB users for operational efficiency purposes. Additionally, the raw transcripts are indexed by words, phrases, and names to create topic clusters and are aligned with existing dimensions for exploration using a full-text search engine. Logical organization and clustering of the information is made through taxonomies and ontologies derived from the business vocabularies. These metadata "tags" on unstructured data can be aligned to dimension keys in the RDBMS, allowing the linkage of the unstructured and structured data for detailed analysis and exploration [23].
Data latency considerations
In many cases, batch analysis will suffice for unstructured data processing. However, real-time insight might be required, for example, when the call center relates to an emergency service. In such cases, the audio recordings and call transcripts can be ingested using a streaming service. Business rules defined in the business vocabularies might be incorporated in this case to act on the streaming data in real time to determine events or the next best action.
Data processing-level considerations
The raw audio recordings are simply onboarded to the landing area with no processing and consequently do not feature in the design data models. The resulting transcripts are subject to additional text analytics processing, and later indexing, driven by the Hadoop framework. This processing will benefit from some lightly modeled schema as asserted in the AWM. The need to combine unstructured data with existing enterprise data from transaction systems requires various levels of adjustment, transformation, and cleansing. This level of processing implies the data is strongly governed and subject to the enterprise design data models.
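As a sketch of how the structured output of the text analytics can be combined with transaction-derived metrics (step 3 in Figure 4), the query below joins an assumed call-sentiment structure to an assumed CRM-derived metric on the customer MDM identifier; all table and column names are illustrative.
-- Combine call-derived sentiment with a structured CRM metric for the same customer,
-- linking unstructured-derived and structured data through the MDM identifier.
SELECT c.customer_mdm_id,
       AVG(s.sentiment_polarity) AS avg_call_sentiment,   -- extracted from call transcripts
       MAX(c.churn_risk_score)   AS churn_risk_score      -- assumed CRM-sourced metric
FROM   call_sentiment s
JOIN   crm_customer_metric c
  ON   c.customer_mdm_id = s.customer_mdm_id
GROUP BY c.customer_mdm_id;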
Conclusion
This paper has outlined the role that various types of data and vocabulary models can play in supporting the managed and ordered evolution of a data management infrastructure that leverages big data technology. The reasons that led organizations to use models to underpin the development of the different components of the data management landscape still remain in the age of big data. Indeed, the increased level of complexity and variety of technologies being employed point to an even greater need for the
enforcement of standardization and consistency of structure. As the big data landscape continues to evolve, there is a need for a parallel growth of models that support the description and definition of structures aligned with emerging business opportunities.
It is not only necessary to look at the evolution of the business content in the models and their deployment capabilities, but also to look at the changing relationship between models and the data management landscape they describe. The traditional distinction of design-time models deploying to separate run-time environments is diminishing; instead, there is a growing focus on the fusion of the design-time and run-time worlds. An example of this is the desire to push the ownership and definition of the business vocabulary towards the business users and away from the data or business analyst, in order to get better buy-in and visibility from the end user. Another example is the growth of self-service BI, enabling departments to define and select their own database subsets for their own use. The existence of an integrated set of models that accurately describe the business and IT aspects of the data warehouse landscape is key to the efficient growth of this democratization of the landscape.
This paper has predominantly focused on the Hadoop-related aspects of the big data infrastructure, with only brief references to other big data technologies such as streaming and data virtualization. However, in these other areas it is expected that there will still be a requirement for the same model-driven consistency and standardization, albeit in potentially different forms than would have been seen up to now. Additionally, it is important to remember that these new big data technologies are not deployed in isolation from the existing traditional data management technologies but need to coexist with them. There is a need for an evolution of data and vocabulary models in terms of their business coverage, the technologies they deploy to, the associated modeling methods, and the creation of new types of models to support totally new physical artifacts.
The ongoing expansion of the big data landscape is increasing the pressures that led organizations to create and represent cross-domain models in the past: the need to enforce standardization of data structures across different technologies, the need to have a common business language between the business and IT, and the need for a blueprint to guide the growth of an ever-more complex environment. The creation of an integrated set of cross-domain models is essential for the successful growth, governance, and management of a hybrid big data landscape.
**Trademark, service mark, or registered trademark of Apache Software Foundation, Sun Microsystems, Twitter, Inc., or Facebook, Inc., in the United States, other countries, or both.
References
1. B. Devlin, Data Warehouse: From Architecture to Implementation. Reading, MA, USA: Addison-Wesley, 1996, vol. 27.
2. C. Ballard, F. Hasegawa, G. Owens, S. R. Pedersen, and K. Subtil, Moving Forward With the On Demand Real-Time Enterprise. New York, NY, USA: IBM Redbooks, 2006.
3. W. H. Inmon, C. Imhoff, and R. Sousa, Corporate Information Factory. Hoboken, NJ, USA: Wiley, 2001.
4. M. Delbaere and R. Ferreira, "Addressing the data aspects of compliance with industry models," IBM Syst. J., vol. 46, no. 2, pp. 319–334, 2007.
5. J. P. Isson and J. S. Harriott, Win With Advanced Business Analytics: Creating Business Value From Your Data. Hoboken, NJ, USA: Wiley, 2013.
6. T. White, Hadoop: The Definitive Guide, 3rd ed. Sebastopol, CA, USA: O’Reilly, 2012.
7. P. J. Sadalage and M. Fowler, NoSQL Distilled: A Brief Guide to the Emerging World of Polyglot Persistence. Reading, MA, USA: Addison-Wesley, 2012.
8. P. Russom, Best Practices Report: Evolving Data Warehouse Architectures in the Age of Big Data. Renton, WA, USA: TDWI Research, 2014.
9. T. H. Davenport, Enterprise Analytics: Optimise Performance, Process, Decisions Through Big Data. Upper Saddle River, NJ, USA: FT Press, 2013.
10. K. Krishnan, Data Warehousing in the Age of Big Data. San Mateo, CA, USA: Morgan Kaufmann, 2013.
11. J. G. Carney, "Industry models for enterprise data management in financial markets," IBM J. Res. Develop., vol. 54, no. 2, pp. 6:1–6:13, Mar./Apr. 2010.
12. M. Godinez, E. Hechler, K. Koenig, S. Lockwood, M. Oberhofer, and M. Schroeck, The Art of Enterprise Information Architecture: A Systems-Based Approach for Unlocking Business Insight. Cranbury, NJ, USA: IBM Press, 2010.
13. W. H. Inmon, B. O’Neil, and L. Fryman, Business Metadata: Capturing Enterprise Knowledge. San Mateo, CA, USA: Morgan Kaufmann, 2008.
14. M. D. Chisholm, Definitions in Information Management. San Francisco, CA, USA: DesignMedia, 2010.
15. M. Schroeck, R. Shockley, J. Smart, D. Romero-Morales, and P. Tufano, Analytics: The Real-World Use of Big Data: How Innovative Organizations Are Extracting Value From Uncertain Data. New York, NY, USA: IBM Institute for Business Value, 2012. [Online]. Available: http://www.ibm.com/smarterplanet/global/files/se_sv_se_intelligence_Analytics-The_real-world_use_of_big_data.pdf
16. A. Thusoo, J. S. Sarma, N. Jain, Z. Shao, P. Chakka, N. Zhang, S. Antony, H. Liu, and R. Murthy, "Hive - a petabyte scale data warehouse using Hadoop," in Proc. IEEE 26th ICDE, 2010, pp. 996–1005.
17. C. C. Pasquale, B. Ionescu, D. Ionescu, V. DiLecce, and A. Gurerero, "Virtual data warehouse architecture for real-time WebGIS," in Proc. IEEE Conf. VECIMS, 2008, pp. 80–85.
18. M. Chessell and H. Smith, Patterns of Information Management. New York, NY, USA: IBM Press, 2013.
19. W. H. Inmon, D. Strauss, and G. Neushloss, DW 2.0: The Architecture for the Next Generation of Data Warehousing. San Mateo, CA, USA: Morgan Kaufmann, 2008.
20. B. Devlin, S. Rogers, and J. Myers, Big Data Comes of Age. New York, NY, USA: IBM. [Online]. Available: http://www.9sight.com/big-data-survey-2012.htm
21. R. Kimball and M. Ross, The Data Warehouse Toolkit: The Definitive Guide to Dimensional Modeling, 3rd ed. Hoboken, NJ, USA: Wiley, 2013.
22. S. Mohanty, M. Jagadeesh, and H. Srivatsa, Big Data Imperatives: Enterprise Big Data Warehouse, BI Implementations and Analytics. New York, NY, USA: Apress, 2013.
23. A. Reeve, Managing Data in Motion: Data Integration Best Practice Techniques and Technologies. San Mateo, CA, USA: Morgan Kaufmann, 2013.
Received April 3, 2014; accepted for publication May 12, 2014
Pat O’Sullivan IBM Software Group, IBM Technology Campus, Damastown Industrial Estate, Mulhuddart, Dublin 15, Ireland ([email protected]). Mr. O’Sullivan is a Senior Technical Staff Member at the IBM Research - Ireland Lab. He received a B.Sc. degree in computer applications from Cork Institute of Technology in 1985. He has more than 20 years of experience in data warehousing and data models with IBM. He received IBM Outstanding Achievement Awards in 2007 for his work on the IBM Banking Data Warehouse and in 2014 for his work on IBM industry models and big data.
Gary Thompson IBM Software Group, IBM Technology Campus, Damastown Industrial Estate, Mulhuddart, Dublin 15, Ireland ([email protected]). Mr. Thompson is an information architect at the IBM Research - Ireland Lab. He received a B.A. degree in management science and information systems from Trinity College Dublin in 1998. Mr. Thompson has 15 years of experience in systems integration and data warehouses with IBM. He received an IBM Outstanding Achievement Award in 2014 for his work on IBM industry models and big data.
Austin Clifford IBM Software Group, IBM Technology Campus, Damastown Industrial Estate, Mulhuddart, Dublin 15, Ireland ([email protected]). Mr. Clifford is a Lead Data Warehouse Specialist at the IBM Research - Ireland Lab. He received a Bachelor of Engineering degree in 1993 and a Master of Management Science degree in 1994, both from University College Dublin. Mr. Clifford has 19 years of industry experience in data warehouse and database technologies. He holds two U.S. patents and has another seven patents pending. Mr. Clifford received an IBM Outstanding Technical Achievement Award in 2012 for his work on very large databases and in 2014 for his work on IBM industry models and big data.