A Statistical Process Monitoring Perspective on “Big Data”

Fadel M. Megahed and L. Allison Jones-Farmer
Abstract: As our information infrastructure evolves, our ability to store, extract, and analyze data is rapidly changing. Big data is a popular term that is used to describe the large, diverse, complex and/or longitudinal datasets generated from a variety of instruments, sensors and/or computer-based transactions. The term big data refers not only to the size or volume of data, but also to the variety of data and the velocity or speed of data accrual. As the volume, variety, and velocity of data increase, our existing analytical methodologies are stretched to new limits. These changes pose new opportunities for researchers in statistical methodology, including those interested in surveillance and statistical process control methods. Although it is well documented that harnessing big data to make better decisions can serve as a basis for innovative solutions in industry, healthcare, and science, these solutions can be found more easily with sound statistical methodologies. In this paper, we discuss several big data applications to highlight the opportunities and challenges for applied statisticians interested in surveillance and statistical process control. Our goal is to bring the research issues into better focus and encourage methodological developments for big data analysis in these areas.

Keywords: analytics; change-point; control charts; cyber-security; data mining; image-monitoring; innovation; surveillance; text mining; visual analytics

Fadel M. Megahed, Department of Industrial and Systems Engineering, Auburn University, AL 36849, USA, e-mail: [email protected]
L. Allison Jones-Farmer, Department of Aviation and Supply Chain Management, Auburn University, AL 36849, USA, e-mail: [email protected]
1 Introduction

The volume of data produced from purchase transactions, social media sites, production sensors, healthcare information systems, cellphone GPS signals, and other sources continues to expand at an estimated annual rate of 60-80% (The Economist 2010a). To put these numbers into perspective, an 80% annual rate, if compounded daily, means that ~90% of the data existing in the world today has been created since 2011. The acquisition of massive datasets is expected to transform approaches to major problems such as combating crime, preventing diseases, and predicting the occurrence of natural disasters. Additionally, businesses are using their collected data to spot business trends and customize their services according to customer profiles. For example, Wal-Mart, the leading U.S. discount retailer, processes more than 1 million customer transactions per hour, resulting in databases estimated to be on the order of 2,500 terabytes (The Economist 2010b). Wal-Mart is not the only corporation that handles and stores such massive amounts of data. Others include social media corporations, gaming companies, airlines, insurance companies, healthcare providers, and electric companies. Indeed, many industries are experiencing data inflation and facing the growing need to make sense of big data.

The acquisition of data does not automatically translate into new knowledge about the system under study. This notion is not new to the statistical community. W.E. Deming (2000, p. 106) said “. . . Information, no matter how complete and speedy, is not knowledge. Knowledge has temporal spread. Knowledge comes from theory. Without theory, there is no way to use the information that comes to us on the instant.” Deming’s perception holds true today. A recent survey in The Economist indicates that 62% of workers report that the quality of their work is hampered because they cannot make sense of the data that they already have (The Economist 2011).
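The back-of-the-envelope arithmetic behind the "~90% since 2011" claim can be checked directly. The sketch below is ours, not the authors'; it assumes an 80% nominal annual rate compounded daily over a roughly three-year window (2011 to the time of writing):

```python
# Growth arithmetic behind the "~90% of data created since 2011" claim.
# Assumptions (ours): 80% nominal annual growth compounded daily, ~3 years.
annual_rate = 0.80
yearly_factor = (1 + annual_rate / 365) ** 365   # ~2.22x growth per year
years = 3
total_factor = yearly_factor ** years            # ~11x growth over the window
fraction_recent = 1 - 1 / total_factor           # share of data created in window
print(f"{fraction_recent:.1%}")                  # roughly 90%
```

Daily compounding of a nominal 80% rate approaches continuous compounding (e^0.8 ≈ 2.23 per year), which is what makes the figure so dramatic over even a short window.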
Unfortunately, for many organizations in the private and public sector, processing and making sense of big data is still somewhat of a challenge. In 2010, the Library of Congress (LOC) signed an agreement with Twitter to archive all public tweets since 2006. The LOC receives and processes nearly 500,000,000 tweets per day, and as of December 2012, had amassed 170 billion tweets, or 133.2 terabytes of compressed data. A single search of the archive can take up to 24 hours. As scholars anxiously await access to the tweet archive, the LOC released a public statement noting that, “It is clear that technology to allow for scholarship access to large data sets is not nearly as advanced as the technology for creating and distributing that data” (Library of Congress 2013).

In this article we give an overview of big data and how researchers in statistical surveillance and statistical process control (SPC) can aid in making sense of big data. We begin in Section 2 by trying to answer the question “What is Big Data?”, with the understanding that there is no clear-cut answer to this question. In Section 3 we discuss the challenges with data quality.
We provide a description of the analytics process in Section 4. In Section 5 we give some examples of how SPC and statistics are being used in several big data application domains, highlighting a few of the challenges. In Section 6, we discuss some differences between traditional and big data applications of SPC. We provide concluding remarks in Section 7.
2 What is Big Data?

Big data is a popular term that is used to describe the large, diverse, complex and/or longitudinal datasets generated from a variety of instruments, sensors and/or computer-based transactions. Much has been written on big data and how it can result in innovation and growth (SAS 2013; Manyika et al. 2011; Zikopoulos et al. 2013). To be able to gain knowledge from big data, it is imperative to understand both its scale and scope. The challenges with processing and analyzing big data are not limited only to the size of the data. These challenges include the size, or volume, as well as the variety and velocity of the data (Zikopoulos et al. 2012). Known as the 3V’s, the volume, variety, and/or velocity of the data are the three main characteristics that distinguish big data from the data we have had in the past.

Volume. There are two major factors that contribute to the increase in data volume (Carter 2011). First, the continued advancements in sensing and measurement technologies have made it possible to collect large amounts of data in manufacturing, healthcare, service, telecommunications, and military applications. Second, decreasing storage costs have made it economically feasible to store large amounts of data. With increasing technology, storage can take different forms, from traditional databases to cloud storage (e.g. Dropbox, Google Drive and Amazon Web Services). It is interesting to note that advancements in sensing/measurement technology and inexpensive storage are two of the driving forces behind advancements in statistical methodology, including SPC methodology. With specific regard to SPC, the 1920s were characterized by low levels of automation and high levels of manual inspection. Therefore, control charts focused on small samples and univariate quality characteristics.
With automation and increased computational abilities, there has been increased emphasis on multivariate control charts, time-series methods and profile monitoring (Montgomery 2013; Noorossana et al. 2011). Variety. With increased sensing technology, the explosion in social media and networking, and the willingness of companies to store data, we have new challenges in terms of data variety. For example, manufacturers are no longer restricted by traditional dimensional measurement since they accumulate sensor data on their equipment, images for inspecting product aesthetics, radiofrequency identification (RFID) technology in supply chain management, and
social network data to capture their customers’ feedback. In theory, this variety of data should allow for a 360° view of manufacturing quality. In reality, even if we had the infrastructure to match customer feedback to exact product identification and tracking through the supply chain, our methods to analyze and gain knowledge from this data are limited. Most statistical methods (including control charts) are classified according to the type of data with which they should be used. For example, few methods are available for monitoring multivariate processes with a mixture of categorical and continuous data (Jones-Farmer et al. 2013; Ning and Tsung 2010). Monitoring and analyzing mixed data types becomes more challenging when we include nonnumeric, unstructured data. Some reports estimate that nearly 80% of an organization’s data is not numeric (SAS 2013). Although there are many approaches to text analytics, our experience with text analytics suggests there remain many limitations. Most text analytics approaches convert unstructured text into numerical measures (e.g. word counts and/or measures of sentiment captured in a phrase). The numerical measures are then analyzed using traditional data analytic approaches (e.g. data mining). While a thorough discussion of text analytics methods is well beyond the scope of this article, it is important to note that most approaches have been developed in a specific context (e.g. dictionaries that assign sentiment to word combinations from literature or from interpreting political manifestos). The applicability of these methods to the cryptic and evolving language of social media is questionable. Thus, the analyses based on the numerical scores resulting from these methods may not be meaningful. A major challenge of big data analysis is how to automatically process and translate such data into new knowledge.

Velocity. Zikopoulos et al.
(2012) stated that the conventional notion of data velocity is the rate at which the data arrives, is stored, and is retrieved for processing. These authors noted, however, that this definition of velocity should be expanded to include not only the growth rate of data, but also the speed at which data are flowing. The rate of flow can be quite massive in certain domains. Considering microblog data, data consumers have the option to purchase near real-time access to the entire Twitter stream or a randomly selected portion of the data over a fixed prospective time range. Recalling that Twitter processes around 500,000,000 tweets daily, the entire data set is appropriately referred to as the “fire hose”. Social media is not the only domain with high velocity data. Recall the estimated 1,000,000 transactions per hour processed by Wal-Mart. One of the main challenges for any organization with high velocity data is how fast it can transform the high velocity data into knowledge. Near real-time responsiveness to high velocity data may allow for competitive advantage in terms of managing stock-outs, product introductions, customer service, and public relations interventions. Similar advantages may be found with faster responsiveness to high velocity data in areas including public health, network security, safety, traffic, and other surveillance domains. Essentially,
the opportunity cost clock starts once the data point has been generated. Zikopoulos et al. (2012) stated “velocity is perhaps one of the most overlooked areas in the Big Data craze . . . .” A similar sentiment is echoed by SAS (2013) in a recent white paper that states, “reacting quickly enough to deal with velocity is a challenge to most organizations.” Another challenge related to data velocity is the fact that social media and transactional data can produce highly variable flow rates, with enormously large peaks, and near zero valleys. In the context of social media, the phrase “gone viral” is often used to describe a video, blog, or microblog that has been shared on a large scale very rapidly. With internet and social media data, transactions or activity on a particular site/topic can go from a manageable baseline velocity up to a level that is beyond the capacity of the infrastructure to handle in a matter of seconds. For example, in September 2011, Target’s website was flooded with visitors and transactions within hours of making a limited-edition line of clothing and home products by Missoni available to customers. The influx of transactions crashed the entire website, rendering it unavailable for the better part of a day. In an attempt to manage unpredictable data velocity and volume, there has been an increasing market for computer Infrastructure as a Service (IaaS) providers, where customers pay as they go for scalable computing, storage and data access over the internet (for a detailed introduction on IaaS see Mell and Grance 2011; Prodan and Ostermann 2009; and U.S. General Services Administration 2013). While the emergence of IaaS providers has allowed companies a failsafe for some of the computing and storage needs that arise from the variable nature of big data, analyzing and monitoring data with extreme fluctuations in velocity remains a difficult problem. 
In addition to potential data overload problems, extreme variations in data velocity make it very difficult to develop a baseline to account for daily, seasonal, and event-triggered loads, and to detect performance changes and trends (SAS 2013).
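As a concrete illustration of the variety challenge discussed above, most text-analytics pipelines first reduce unstructured text to numeric features such as word counts and dictionary-based sentiment scores, which standard methods can then consume. The sketch below is ours; the tiny sentiment dictionary is hypothetical, not a real lexicon:

```python
from collections import Counter

# Toy sketch (ours): converting unstructured text into numeric features,
# as most text-analytics pipelines do before applying standard methods.
# The sentiment dictionary below is hypothetical, not from a real lexicon.
SENTIMENT = {"great": 1, "love": 1, "broken": -1, "terrible": -1}

def featurize(text):
    words = text.lower().split()
    counts = Counter(words)
    # Dictionary-based sentiment: sum the score of each matched word
    sentiment = sum(SENTIMENT.get(w, 0) * n for w, n in counts.items())
    return {"n_words": len(words), "sentiment": sentiment}

print(featurize("Love this product but the hinge arrived broken"))
```

The example also hints at why such scores can mislead: the mixed sentence above nets to a sentiment of zero, and slang or sarcasm common in social media would be missed entirely by a fixed dictionary.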
3 Beyond the 3V’s

The volume, variety, and velocity of data are three fundamental characteristics that distinguish big data from any other type of data. Many have suggested that there are more V’s that are important to the big data problem, such as veracity and value (IEEE BigData 2013). Veracity refers to the trustworthiness of the data, and value refers to the value that the data adds to creating knowledge about a topic. While we agree that these are important data characteristics, we do not see these as key features that distinguish big data from regular data. It is important to evaluate the veracity and value of all data, both big and small. Both veracity and value are related to the concept of data quality, an important research area in the Information Systems (IS) literature for more
than 50 years. The research literature discussing the aspects and measures of data quality is extensive in the IS field, but seems to have reached a general agreement that the multiple aspects of data quality can be grouped into several broad categories (Wang and Strong 1996). Two of the categories relevant here are contextual and intrinsic dimensions of data quality. Contextual aspects of data quality are context specific measures that are subjective in nature, including concepts like value-added, believability, and relevance. Many of the measures for these contextual dimensions are based on surveys of the end-users of the data. Batini et al. (2009) provided an overview of the previously used measures regarding contextual data quality measures. Intrinsic aspects of data quality are more concrete in nature, and include four main dimensions: accuracy, timeliness, consistency, and completeness (Batini et al. 2009; Huang et al. 1999; Parssian 2006; Scannapieco and Catarci 2002). The term accuracy implies data which are as free from error as possible. Timeliness implies that data are as up-to-date as possible. Consistency measures how closely the data’s content and occurrences are structured in the same way in every instance. Completeness refers to data which are full and complete in content, with no missing data. Jones-Farmer et al. (2013) gave an overview of data quality and discussed opportunities for research in statistical methods for evaluating and monitoring the quality of data. From our perspective, many of the contextual and intrinsic aspects of data quality are related to the veracity and value of the data. That said, big data presents new challenges in conceptualizing, evaluating, and monitoring data quality. An excellent example of the unique difficulties with big data quality, including value and veracity, occurs with social media data. 
On April 23, 2013, at 1:07 p.m. Eastern Time, a tweet from the Associated Press (AP) account stated “Breaking: Two Explosions in the White House and Barack Obama is injured” (Strauss et al. 2013). The fraudulent tweet, which originated from the hacked AP Twitter account, led to an immediate drop in the Dow Jones industrial average. Although the Dow quickly recovered following a retraction from the AP and a press release from the White House, this example illustrates the immediate and dramatic effects of poor-quality data. There are other examples of problems with the veracity and value of big data in the social media realm. For example, political analysts covering the 2012 Mexican presidential election struggled with tweets due to the creation of a large number of fake Twitter accounts that polluted the political discussion and introduced derogatory hashtags (Zikopoulos et al. 2013, pp. 14-15). It became very difficult to separate the spam tweets from the tweets of voters, and even to understand the impact of spam tweets on the voting population.
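The intrinsic data-quality dimensions discussed above lend themselves to simple quantitative checks. The sketch below is ours: it computes two of the four dimensions, completeness (share of non-missing fields) and timeliness (age of the stalest record), on a made-up set of records:

```python
from datetime import date

# Toy sketch (ours): quantifying two intrinsic data-quality dimensions.
# The records below are made up for illustration.
records = [
    {"id": 1, "value": 10.5, "updated": date(2013, 4, 1)},
    {"id": 2, "value": None, "updated": date(2012, 1, 15)},  # missing value
]

def completeness(recs):
    # Fraction of non-missing cells across all records
    cells = [v for r in recs for v in r.values()]
    return sum(v is not None for v in cells) / len(cells)

def max_age_days(recs, today):
    # Timeliness proxy: age in days of the stalest record
    return max((today - r["updated"]).days for r in recs)

print(completeness(records), max_age_days(records, date(2013, 5, 1)))
```

Accuracy and consistency are harder to score automatically, since they require a reference standard or cross-record structural rules; in practice they are often assessed against a trusted sample.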
4 Big Data Analytics

Data mining, business intelligence, business analytics, and predictive modeling are terms that may have subtly different meanings, but all relate to the analysis of large data sets. The term data mining generally refers to the process of mining through large data sets to search for useful information. Part of the Knowledge Discovery in Databases (KDD) process, data mining is considered a field that merges computer science, artificial intelligence, and statistics. Business intelligence is a broadly defined term that refers to computer-based reporting structures used to support decision making. Business intelligence is an umbrella term that includes data management activities along with analysis, data mining, and reporting. Business analytics is a term that was made popular by Tom Davenport’s book, Competing on Analytics: The New Science of Winning (Davenport and Harris 2007). Davenport and Harris (2007) describe how business leaders use statistical analysis and predictive models to strategically lead their organizations. It is implied that business analytics refers to more than just data mining, but includes gathering, processing, and analyzing the data using a variety of methods, and communicating the results of the analysis, matching them to key strategic goals. The subtle difference between business analytics and business intelligence is difficult to pinpoint. Predictive modeling is a term that refers to a portion of any of the data mining, business intelligence, or business analytics processes that involves the use of statistical models for the prediction of future outcomes. For the purposes of this paper, we use the term “analytics process” to describe the entire process of extracting knowledge from large data sets. Han and Kamber (2011) divided the analytics process into seven steps. First, the data should be cleaned to remove noise and any inconsistent data. The second step involves the integration of data from multiple data sources.
Third, the analyst should select the data which is relevant to the analysis task and retrieve it from the database(s). The transformation of data into structures appropriate for mining constitutes the fourth step. Fifth, data mining algorithms are applied to extract data patterns. With big data applications, these algorithms typically exploit parallel and/or online computations to overcome the storage and computing limitations that arise from handling massive datasets. The sixth step involves the application of some interestingness measures to identify the truly interesting patterns representing new knowledge extracted from the data. The final step is knowledge presentation, where visualization techniques are used to present the mined knowledge in an understandable format. It should be noted that the order of applying these methods can differ in some applications, and that some steps can be combined or skipped. However, the major components remain the same: the data are first preprocessed, then analyzed through the application of algorithms, and finally evaluated and presented to the different stakeholders. We refer the reader to the seminal work of Fayyad et al. (1996) for an in-depth discussion of the KDD process. Practitioners can consult Dyche (2012)
for an introduction to the steps needed to deploy big data analytics in an organization. The multidisciplinary nature of the analytics process has resulted in extensive applications in many domains. Although there are many ways to do so, we classify big data analytics problems based loosely on the types of data to be analyzed. Although this may be an imperfect approach, classifying statistical methods by types of data is a common approach in classical statistics and SPC. For example, in their review of recent developments in SPC, Woodall and Montgomery (2013) divided their discussion according to the type of data that is being monitored. By grouping the analytics approaches in terms of types of data, we are able to cut across disciplines and recognize common themes that may be present among the different application domains. This can serve as a thrust for innovation, as highlighted in Box and Woodall (2011). In the next section, we discuss several examples of how SPC is currently being used or could potentially be used with big data, highlighting some differences between traditional and big data applications of SPC. The types of data we discuss include a) labeled data, b) unlabeled data, c) functional data, d) graph data, and e) data from multiple streams. The variety of the data (traditional numeric measurements, text, images, and video), together with the volume and velocity of the data in each application, dictates the types of preprocessing methods and analysis tools used.
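The seven analytics steps of Han and Kamber can be caricatured in a few lines of code. The sketch below is ours; each stage is deliberately trivial (the "mining" step is just a mean) and stands in for the far richer algorithms used in practice:

```python
# Minimal sketch (ours) of the seven-step analytics process, applied to two
# hypothetical numeric data sources. Each placeholder stage is trivial.
def run_pipeline(source_a, source_b):
    data = [x for x in source_a + source_b if x is not None]  # 1. clean
    # 2. integrate: the two sources were already merged above
    relevant = [x for x in data if x >= 0]                    # 3. select relevant data
    transformed = [float(x) for x in relevant]                # 4. transform for mining
    pattern = sum(transformed) / len(transformed)             # 5. mine (here: a mean)
    interesting = pattern > 1.0                               # 6. interestingness filter
    return f"mean={pattern:.2f}, interesting={interesting}"   # 7. present knowledge

print(run_pipeline([1, 2, None, -5], [3, 4]))
```

In a big data setting, steps 1-5 would typically run in parallel over partitions of the data, but the logical sequence is the same.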
5 Big Data and SPC

Labeled Data. Many of the big data analytics algorithms used today are classified as machine learning. Machine learning algorithms are typically classified as supervised or unsupervised learning algorithms. The focus of supervised algorithms is to classify data based on a training set that includes known classifications. In the case of supervised learning, the classifications or dependent variables are considered labeled. Specifically, the training set consists of pairs (x, y) where x is a vector of independent variables (or features) and y is the classification value for x (the label). The objective of the analysis is to discover a function y = f(x) that best predicts the value of y associated with unseen values of x (Rajaraman et al. 2012). It is important to note that y is not necessarily a real number, but might be a dichotomous or polytomous variable. In some cases, such as text analytics, y can represent a near-infinite set of classes. Recently, machine learning algorithms have been proposed for use within SPC (see, e.g., Chinnam 2002; Cook and Chiu 1998; Deng et al. 2012; Hwang et al. 2007; Sun and Tsung 2003). Although not all of these methods are presented in a big data setting, we discuss them since they may be scalable to high volume, high velocity data. Several of these methods can be used with mixed
data types and data measured using different measurement scales. Usually, an artificial data set is generated to represent out-of-control data, which is used to train a classifier. This approach converts the monitoring problem into a supervised classification problem that classifies future observations as either in- or out-of-control. Supervised learning methods have also been used in Phase I analysis, in an attempt to identify an in-control baseline sample (e.g., Li et al. 2006; Deng et al. 2012).

Unlabeled Data. Unsupervised learning algorithms are commonly used when data are unlabeled, often because it is expensive, difficult, or impossible to assign a label identifying the classification or outcome. The goal of unsupervised learning methods is to assign each point to one of a finite set of classifications. There are several approaches to unsupervised learning, the most common of which include cluster analysis and mixture modeling. Several authors have applied unsupervised learning methods to unlabeled data in Phase I. Sullivan (2002) introduced a clustering method to detect multiple outliers in a univariate continuous process. Thissen et al. (2005) used Gaussian mixture models to establish an in-control reference sample in a Phase I analysis. Jobe and Pokojovy (2009) introduced a computer-intensive multi-step clustering method for retrospective outlier detection in multivariate processes. More recently, Zhang et al. (2010) introduced a univariate clustering-based method for finding an in-control reference sample from a long historical stream of univariate continuous data. There remain many opportunities to investigate the strengths and limitations of unsupervised learning methods for Phase I analysis. We see opportunities for using machine learning for both Phase I and Phase II applications in SPC. Much work and process understanding is often required in the transition between a Phase I and a Phase II analysis (Woodall 2000).
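The artificial-data approach described above, in which out-of-control observations are simulated in order to train a classifier, can be sketched in a few lines. This illustration is ours and uses a deliberately simple nearest-mean rule rather than any of the cited methods:

```python
import random

# Sketch (ours) of casting monitoring as supervised classification.
# An artificial out-of-control sample (a mean shift of 3 standard
# deviations) is generated to train a nearest-mean classifier, which
# then labels future observations.
random.seed(1)
in_control = [random.gauss(0, 1) for _ in range(200)]      # labeled y = "in"
out_of_control = [random.gauss(3, 1) for _ in range(200)]  # artificial y = "out"

mu_in = sum(in_control) / len(in_control)
mu_out = sum(out_of_control) / len(out_of_control)

def classify(x):
    # f(x): assign the label of the nearer class mean
    return "in" if abs(x - mu_in) < abs(x - mu_out) else "out"

print(classify(0.2), classify(2.9))
```

The key design choice, and the main caveat of the approach, is that the detection performance depends on how well the artificial out-of-control sample represents the shifts one actually wishes to detect.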
With big data applications, machine learning algorithms can be used to develop insights and understand the root causes of underlying problems. For example, Wenke et al. (1999) included a classification step as a part of a general framework for intrusion detection in computer systems. Cruz and Wishart (2006) provided a review on using machine learning algorithms in cancer prediction and prognosis. In their findings, they stated that “machine learning is also helping to improve our basic understanding of cancer development and progression.”

Functional Data. Functional data analysis originated within the fields of chemometrics and climatology in the 1960s. The topic has received a great deal of attention from the statistics community ever since, because it covers a wide range of important statistical topics, including classification, inference, factor-based analysis, regression, resampling, time-series and random processes. Additionally, it allows for “infinitesimal calculations” due to the continuous nature of the data (Ferraty and Romain 2011). The reader is referred to Ramsay and Silverman (2002, 2005) for an in-depth coverage of the topics involved in functional data analysis (FDA). Ferraty and Romain (2011) provided an
excellent discussion on various benchmarking methodologies and the fundamental mathematical aspects that are related to the big data aspect of FDA. Woodall et al. (2004) related the application of profile monitoring to FDA. Profile monitoring is a control charting approach used when the quality of a process or product can be characterized by a functional relationship between a response variable and explanatory variable(s). In such applications, features such as regression coefficients are often extracted from the monitored curves and then used as input to standard control charting methods. This transformation represents an initial phase of preprocessing that has not been explicitly identified in the traditional SPC literature. Extensions of the existing profile monitoring techniques can be seen in image analysis and other related areas. We provide an overview of these applications in the paragraphs below.

We view the use of images for process monitoring as a very promising area of statistical research with a wide range of applications. In manufacturing settings, image-based monitoring adds the capability of monitoring a wide variety of quality characteristics, such as dimensional data, product geometry, surface defect patterns, and surface finish, in real-time. This is somewhat different from traditional SPC applications that focus on dimensional and discrete process data. Additionally, image monitoring plays an important role in many medical, military and scientific applications. Megahed et al. (2011) provided a review of the use of control charts with image data. In their review, they noted that the use of control charting with image data is not yet well-developed (with the exception of applications in chemometrics). Megahed et al. (2012a) proposed several metrics to evaluate the effectiveness of a monitoring method in detecting process shifts through the use of image data.
These metrics take into account the spatiotemporal nature of image monitoring, where images are taken over time and a problem (a product fault or a tumor) represents the spatial features of the image. Specifically, they divided the metrics into two groups. The first group assesses the run length performance of the control charting technique. The diagnostic aspect of the detection method (i.e. change-point detection and the ability to estimate the problem location on the image) is examined in the second group.

Three-dimensional (3D) images are widely used for diagnostic purposes by medical practitioners. Examples of these images include x-rays, functional Magnetic Resonance Imaging (fMRI), and ultrasound images. With the expected adoption of electronic medical records, we foresee an increased need for surveillance, classification and spatiotemporal analysis that can be used to extract knowledge from these images. The data are “remarkably complex (correlated in time and in space in ways that are still not fully understood) and massive (a typical number might be hundreds of thousands of time series for a single subject . . . )” (Lazar 2008, p. vii). For these data, there is a focus on change-point detection. See, for example, Aston and Kirch (2012a, b, c) and Lindquist and Wager (2008).
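The preprocessing step underlying profile monitoring can be illustrated simply: each observed curve is reduced to regression coefficients, which then feed a standard control chart. The least-squares slope computation below is a sketch of ours on a made-up linear profile:

```python
# Sketch (ours) of the profile-monitoring preprocessing step: reduce each
# observed curve to a regression coefficient (here, a least-squares slope)
# that a standard control chart can then monitor. The profile is made up.
def fit_slope(xs, ys):
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    sxx = sum((x - xbar) ** 2 for x in xs)
    return sxy / sxx  # least-squares estimate of the profile slope

xs = [0, 1, 2, 3, 4]
profile = [0.1, 2.0, 4.1, 5.9, 8.0]   # one hypothetical measured curve
print(round(fit_slope(xs, profile), 2))
```

In Phase II, one would compute this slope (and typically the intercept and residual variance) for each new profile and chart the resulting statistics over time.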
A related area is the use of 3D surface scans, also called point clouds, for process monitoring. 3D scanners provide an opportunity to collect millions of data points, allowing for a product’s entire surface geometry to be represented. Most often the resulting scans have been used for inspection purposes, where non-conforming items are separated from conforming items, or for reverse engineering. We see opportunities for using such data in detecting process changes prior to the production of non-conforming items. Wells et al. (2012) discussed some issues in using 3D scanners in process monitoring, and provided a nonparametric profile monitoring approach for detecting/diagnosing process changes. In general, data from 3D laser scanners can be seen as an extension of process monitoring applications involving coordinate measurement machines (CMMs), which are widely implemented in the automotive industry. We believe that 3D scans are more useful in process monitoring, since one can detect the occurrence of unexpected fault patterns, i.e. faults that would not normally impact preset CMM measurement points.

A common theme in the literature discussing FDA is the lack of statistical and mathematical tools that enable these data-rich sources to achieve their full potential. For example, Ferraty and Romain (2011, pp. viii-ix) identified several challenges for the future in the context of FDA, including the need for semiparametric methods, the nonlinear structure in high-dimensional spaces, and the increasing sophistication of the datasets. These challenges are important research areas to consider, especially since surveillance requires Phase I analyses where the baseline is established and the parameters are estimated. In addition to these challenges, there exist several control charting opportunities that are specific to 2D images and point cloud data.
For example, there is no discussion of the effect of estimation error on image-based control charts, and therefore it is not known what Phase I sample sizes are needed and whether they would be practical. Generally speaking, there is a need to evaluate the assumptions inherent in many of the papers on control charting for image and point cloud data. It may be beneficial to create a large repository of functional data and share it with researchers to accelerate scientific discovery in this data-driven area. Such repositories exist for non-surveillance applications involving functional data (see, e.g., http://archive.ics.uci.edu/ml/, http://www.ncbi.nlm.nih.gov/geo/, and http://aws.amazon.com/publicdatasets/). Graph Data. In several big data applications, there is interest in extracting relationships between the different entities in the data. This relational knowledge is best described through a graph representation (Cook and Holder 2006). A graph allows the entities to be represented by nodes; two nodes are connected by an edge if there is a relationship between them. According to Chakrabarti and Faloutsos (2012), the graph representation encompasses a large number of applications, including: a) social networks, such as Facebook and LinkedIn, which present a natural opportunity for representing friend networks as graphs; this type of network generalizes to all contact networks, e.g., who-calls-whom, who-texts-whom, or who collaborates with
whom; b) cyber-security and computer networks, where the goal is to detect intrusions; c) influence propagation, where the rate of adoption of products and concepts is measured. Here the focus is on answering questions such as: What can we do to accelerate a word-of-mouth campaign for our new product? Or conversely, what can we do to stop the propagation of the flu?; d) e-commerce, utilizing product ratings by users on Amazon, Netflix, and smart phone app stores; and e) other application domains, e.g., in biology, protein-protein interaction networks can be modeled as graphs. These applications present ample opportunities for applying SPC techniques. We present some examples below. McCulloh and Carley (2008) proposed an SPC-based approach to detect change in social networks, based on the notion that the data collected on such networks are similar in structure to the continuous variables that are typically monitored by control charts in manufacturing applications. In their approach, they calculated the average graph measures for density, closeness, and betweenness centrality for several consecutive time periods of the social network. They estimated the in-control mean and variance for each of these three measures by taking the sample average and sample variance of the stabilized measures. Then, they used a CUSUM chart (with a reference value k = 0.5 and control limit h = 4) to monitor each of these three measures. Their control charts were deployed to detect structural changes in two publicly available social network datasets: the Tactical Officer Education Program e-mail network and an open-source Al-Qaeda communications network. Using the CUSUM chart, they were able to identify significant changes in both networks, with the Al-Qaeda network structurally changing prior to the attacks of September 11th. In addition, the authors discussed in detail the limitations of their method and some future research extensions that should be examined.
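A single-measure version of McCulloh and Carley's scheme can be sketched as follows. The Erdos-Renyi graph simulation, network size, and Phase I sample size are hypothetical; the tabular CUSUM uses the reference value k = 0.5 and decision limit h = 4 (in standardized units) cited above.

```python
import numpy as np

rng = np.random.default_rng(2)

def graph_density(n_nodes, p_edge):
    """Density of a simulated Erdos-Renyi graph: observed edge count
    over the n*(n-1)/2 possible undirected edges."""
    possible = n_nodes * (n_nodes - 1) // 2
    return rng.binomial(possible, p_edge) / possible

# Phase I: estimate the in-control mean and sd of the density measure
baseline = np.array([graph_density(50, 0.10) for _ in range(30)])
mu, sd = baseline.mean(), baseline.std(ddof=1)

# Tabular (upper) CUSUM with k = 0.5 and h = 4 in standardized units
k, h = 0.5, 4.0
cplus, signal_at = 0.0, None
for t in range(40):
    # hypothetical structural change: edge probability rises after period 20
    z = (graph_density(50, 0.10 if t < 20 else 0.14) - mu) / sd
    cplus = max(0.0, cplus + z - k)
    if cplus > h and signal_at is None:
        signal_at = t
```

In practice all three measures (density, closeness, and betweenness centrality) would be charted simultaneously, raising the usual questions about the joint false alarm rate across charts.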
We agree with McCulloh and Carley's recommendations and suggest further investigation in this area. Wu et al. (2007) presented a simple approach using a Shewhart control chart to detect distributed denial of service (DDoS) attacks on a computer network. Underbrink et al. (2012) suggested using Shewhart control charts in the context of testing cyber-attack penetration of software applications. Hale and Rowe (2012) provided an introduction to SPC and suggested its use to detect and eliminate problems in developing software for military applications. From this discussion, one can see that the use of control charting for intrusion detection in cyber-security applications remains introductory. This can be seen in the absence of the phrase “control chart” from the Guide to Intrusion Detection and Prevention Systems developed by the U.S. National Institute of Standards and Technology (Scarfone and Mell 2012). We see the cyber-security (CS) problem as far more complex than presented in this work, because a computer network is continuously changing and the state of the network can be represented by data of different types (e.g., counts, continuous variables, and probabilities/ratios). At this time, we know of only a limited class of multivariate control charting methods that can be used to successfully
monitor multivariate data of mixed data types. These methods are generally in the class of supervised machine learning techniques (e.g., neural-network-based control charts), and the properties of these methods are still being discussed in the academic literature. Thus, further work in this area warrants a detailed investigation of these methods. Multistream Data. The increased availability of data and data tracking sources makes it possible to simultaneously monitor data from many, often disparate, sources. This presents opportunities and challenges in terms of maintaining reasonable false alarm rates while preserving the ability to quickly detect unusual events. A control chart for monitoring multiple streams collected at regular time intervals was introduced by Boyd (1950). This approach plots the maximum and minimum subgroup means from the group of streams on a control chart to detect location changes. Mortell and Runger (1995) proposed alternative control charts based on the pooled subgroup means for detecting overall mean changes and on the range of the subgroup means for detecting location changes in an individual stream. Liu et al. (2008) compared the method of Mortell and Runger (1995) to several approaches in the context of monitoring multiple gauges in a truck assembly process. Liu et al.'s (2008) proposed methods were based on F-tests comparing the between-gauge to the within-gauge variability and likelihood ratio tests to detect changes in only one stream. Meneces et al. (2008) considered the effect of correlation across the multiple streams on the use of individual Shewhart X-bar charts to monitor each stream. Using simulation, Meneces et al. (2008) suggested alternative control chart constants to maintain a desired false alarm probability when monitoring multiple correlated streams. Lanning et al.
(2002) used an adaptive fractional-sampling approach to monitor a large number of independent streams when it is only possible to monitor a fraction of the streams. Recently, Jirasettapong and Rojanarowan (2011) provided a framework for selecting a control chart method for monitoring multiple streams based on the stream characteristics, including the degree of correlation among the streams, the number of streams, and the shift size to be detected. While some methods for monitoring multiple streams have been addressed in the SPC literature, most focus on a manufacturing context. The need to monitor multiple, often correlated, data streams occurs outside of manufacturing in several big data contexts. These include public health surveillance, bioterrorism, social media, and others. Many authors have considered the statistical challenges present in monitoring multiple data streams, particularly in the biosurveillance context. Shmueli and Burkom (2010) noted that the context of biosurveillance, including the data types, the mechanisms underlying the time series, and the applications, all deviate from the traditional industrial setting for which control chart methods were developed. Rolka et al. (2007) gave an overview of the methodological issues in public health and bioterrorism surveillance in the context of multiple stream data. In addition to traditional methods, including the multivariate cumulative sum and
exponentially weighted moving average approaches, Rolka et al. (2007) considered the spatiotemporal aspect of biosurveillance data and the use of spatial scan statistics as well as Bayesian network models to combine multiple data streams. Tsui et al. (2008) also considered public health and disease surveillance, giving an overview of the most popular methods and discussing research opportunities in spatiotemporal surveillance. Fraker et al. (2008) and Megahed et al. (2012b) discussed the advantages and disadvantages of several performance metrics applied to health surveillance, recommending metrics based on time-to-signal properties. The advances in the use of the Web as an information sharing tool, the simplicity with which publicly available data can be merged and filtered, and the willingness of individuals to openly broadcast through social media are making the already challenging area of multiple stream surveillance even richer. For example, Brownstein et al. (2009) list a sample of Web-based digital resources for disease detection and discuss the use of search-term surveillance. They gave an example using Google Insights to investigate the search volume for several terms related to a salmonella outbreak attributed to contaminated peanut butter in January 2009. Chunara et al. (2012) illustrated the effectiveness of informal data sources such as microblog and search-query volumes in gaining early insight into the 2010 cholera outbreak in Haiti. In both the salmonella and cholera cases, researchers were able to look retrospectively at a known event, asking the question “could we have detected this event using alternate sources?” In addition to the statistical challenges of monitoring multiple data streams, the real challenge may be non-statistical: knowing what to listen and look for in these emerging data sources.
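A group chart in the spirit of Boyd (1950), where only the extreme subgroup means across the streams are compared to the limits, can be sketched as follows; the number of streams, subgroup size, and shift are hypothetical, and a real application would widen the limits (e.g., Bonferroni-style) to control the overall false alarm rate across streams.

```python
import numpy as np

rng = np.random.default_rng(3)
n_streams, n_periods, subgroup = 10, 30, 5

# Limits for a subgroup mean of size 5 from in-control N(0, 1) streams
sigma_xbar = 1 / np.sqrt(subgroup)
ucl, lcl = 3 * sigma_xbar, -3 * sigma_xbar

signals = []
for t in range(n_periods):
    data = rng.normal(0, 1, (n_streams, subgroup))
    if t >= 20:
        data[3] += 2.0  # hypothetical shift: stream 3 moves up after period 20
    means = data.mean(axis=1)
    # Group chart: only the extreme stream means are plotted and tested
    if means.max() > ucl or means.min() < lcl:
        signals.append((t, int(np.argmax(np.abs(means)))))
```

Recording which stream produced the extreme value, as done above, is what makes the chart diagnostic; the correlated-stream and nonstationary settings described in this section are precisely where this simple scheme breaks down.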
6 Some Challenges in Big Data SPC The big data surveillance problem differs somewhat from traditional SPC applications. For example, it is clear from the seven-step hierarchy of Han and Kamber (2011) that noise reduction, data preprocessing, and outlier detection are often necessary before any surveillance can be done. This analysis is more exhaustive than in standard SPC applications. Additionally, data processing speed is a major area of focus in big data applications, especially with 100% sampling; therefore, methods compete on overall running time. The focus on data processing speed can also shape how the surveillance problem is approached. This effect can be seen in the monitoring of massive amounts of labeled data, where historical data are not stored and the analysis is done on the fly (Rajaraman et al. 2012). Accordingly, the analysis is often carried out using a representative set of points from the data and a selective set of summary statistics that are useful for moment
calculations. See Zhang et al. (1996), Bradley et al. (1998), and Guha et al. (1998) for three highly cited examples in this area. When applying surveillance methods to multidimensional data, it is unlikely that all of the monitored variables will shift at the same time; thus, some method is necessary to identify the process variables that have changed. In high-dimensional datasets, the decomposition methods used with multivariate control charts can become very computationally expensive. Several authors have considered variable selection methods combined with control charts to quickly detect process changes in a variety of practical scenarios, including fault detection, multistage processes, and profile monitoring. Zou and Qiu (2009) combined the least absolute shrinkage and selection operator (LASSO) test statistic of Tibshirani (1996) with a multivariate exponentially weighted moving average (EWMA) control chart to quickly detect process changes and identify the shifted mean components. Zou et al. (2012) considered the use of the LASSO statistic for profile monitoring. Zou et al. (2011) developed a LASSO-based diagnostic framework for statistical process control that applies to high-dimensional datasets. Capizzi and Masarotto (2011) developed a multivariate EWMA control chart using the Least Angle Regression (LAR) algorithm to detect changes in both the mean and variance of a high-dimensional process. All of these variable-selection-based techniques rest on the idea of monitoring subsets of potentially faulty variables. Capizzi and Masarotto (2011) listed several important research areas in this domain, including the need to adapt these methods for nonparametric process monitoring, better monitoring of process dispersion, and adapting the methods to variable sampling interval (VSI) schemes to improve detection performance. In addition, variable reduction methods are needed to better identify shifts.
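The intuition behind the LASSO-based diagnostics can be sketched in a simplified setting: with an identity covariance matrix, the LASSO-penalized estimate of the mean vector reduces to coordinatewise soft thresholding, which yields a sparse set of flagged components. This illustrates the idea only and is not the charting statistic of Zou and Qiu (2009); the dimension, shift, and penalty below are hypothetical.

```python
import numpy as np

def soft_threshold(x, lam):
    """With an identity covariance matrix, the LASSO-penalized mean
    estimate argmin 0.5*||x - mu||^2 + lam*||mu||_1 is coordinatewise
    soft thresholding of the observation vector."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

rng = np.random.default_rng(4)
p = 50
shift = np.zeros(p)
shift[[2, 17]] = 5.0               # only two components actually shift
x = rng.normal(0, 1, p) + shift    # one multivariate observation

mu_hat = soft_threshold(x, lam=2.0)
flagged = np.flatnonzero(mu_hat)   # sparse set of suspect components
```

Zou and Qiu (2009) apply this penalization to a multivariate EWMA statistic rather than a single observation and choose the penalty adaptively; the point here is only that the L1 penalty yields a sparse diagnosis without an expensive decomposition.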
We believe that further work combining variable selection methods and surveillance is important for quickly and efficiently diagnosing changes in high-dimensional data. Another glaring difference is that traditional SPC methods focus only on numeric datasets, whereas most big data applications are concerned with non-numeric data obtained from several databases. This introduces several challenges. First, data quality becomes a focal point. Pipino et al. (2002) defined data quality to have several dimensions, including completeness, consistency, free-of-error, security, timeliness, and understandability. These metrics are essential to evaluate prior to performing any analysis, since the reliability of any resulting model depends on the quality of the input data. Moreover, this evaluation is not straightforward, since the data are often non-numeric, unstructured, and entered by a large number of users. Second, the modeling of the data is often based on disciplines not traditionally studied by statisticians and quality engineers. With text data, the models draw from the linguistic sciences, psychology, and computer science. The models may also need to integrate data entered in different languages. The third challenge is that the arrival rate of the data fluctuates based on factors that are often not
understood prior to analyzing the data; for online content, this phenomenon is referred to as trending or going viral. With big data, there is an increased likelihood of autocorrelation. This issue is a common theme across the data classifications described in Sec. 5. Even though autocorrelated data have received considerable attention in the SPC literature (see Psarakis and Papaleonida 2007 for a detailed review), there is little understanding of how to deal with autocorrelation when the time-series model is not known. This is often the case in big data problems, where trending and high fluctuations in the observed data are common. We believe that this research area needs more work and will be significant to further developments in the surveillance of big data. It is very beneficial to carefully distinguish between the Phase I analysis of historical data and the Phase II monitoring stage. Current papers on the monitoring of big data do not make a clear distinction between the two phases. Distinguishing between the in-control and out-of-control states becomes difficult in some applications due to issues regarding data quality, autocorrelation, the repeatability and reproducibility of the data collection system, and the difficulty of collecting data on the covariates that may affect model performance. These are all issues that relate to the preprocessing of the data. We refer to the preprocessing phase as Phase 0 to highlight its importance to the surveillance problem. Thus, Phase 0 analysis comprises cleaning the data, reducing the dataset without losing valuable information (i.e., the feature selection step), evaluating data quality, and assessing the accuracy of the measurement system(s).
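As a small illustration of such a Phase 0 check, Pipino et al.'s (2002) simple-ratio metrics for completeness and free-of-error can be computed directly; the records and the plausibility rule below are hypothetical.

```python
# Hypothetical sensor records: None marks a missing field
records = [
    {"id": 1, "temp": 72.4, "ts": "2013-05-01T10:00"},
    {"id": 2, "temp": None, "ts": "2013-05-01T10:01"},
    {"id": 3, "temp": 9999.0, "ts": None},  # 9999.0: a sensor error code
]
fields = ["id", "temp", "ts"]
n_cells = len(records) * len(fields)

# Completeness: fraction of non-missing cells (a simple-ratio metric)
missing = sum(1 for r in records for f in fields if r[f] is None)
completeness = 1 - missing / n_cells

# Free-of-error: fraction of populated temperatures passing a plausibility
# rule (the -50..150 range is an assumed spec, not from Pipino et al.)
temps = [r["temp"] for r in records if r["temp"] is not None]
free_of_error = sum(1 for t in temps if -50 <= t <= 150) / len(temps)
```

Here completeness is 7/9 and free-of-error is 1/2. Tracking such ratios over time is itself a natural target for control charting (cf. Jones-Farmer et al. 2013).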
7 Concluding Remarks In this paper, we provided an overview of big data and the role of statisticians and quality engineers in understanding and advancing big data analytics. It is clear that big data analytics is an evolving field with numerous applications, some of which can provide solutions to global challenges in public health and science. SPC and statistics are currently being deployed in some of these applications; however, we encourage further research in these areas. In particular, work is needed to examine how we can best preprocess large datasets and evaluate this preprocessing phase. This includes a better understanding of how to evaluate the quality of the data, select important features from the dataset, and perform measurement system analysis. Additionally, establishing the baseline is much more complex than in standard SPC applications, since it requires domain knowledge and the data are heavily autocorrelated and complex. The development of visualization techniques may play a role in better understanding the problem. In Phase I, we believe that classification and clustering approaches will become more common. We expect the monitoring of functional, graph, and multiple stream data to become increasingly important. In summary, big data provides many opportunities for statisticians and quality engineers, whose role will continue to grow.
References 1. Aston JAD, Kirch C. (2012a). Detecting and estimating changes in dependent functional data. Journal of Multivariate Analysis, 109:204-220.
doi:10.1016/j.jmva.2012.03.006. 2. Aston JAD, Kirch C. (2012b). Evaluating Stationarity via change-point alternatives with applications to FMRI Data. Annals of Applied Statistics, 6 (4):1906-1948.
doi:10.1214/12-aoas565. 3. Aston JAD, Kirch C. (2012c). Supplement to Evaluating Stationarity via change-point alternatives with applications to FMRI Data. Annals of Applied Statistics.
http://lib.stat.cmu.edu/aoas/565/supplement.pdf 4. Batini C, Cappiello C, Francalanci C, Maurino A. (2009). Methodologies for Data Quality Assessment and Improvement. ACM Computing Surveys, 41 (3):16-16.52.
doi:10.1145/1541880.1541883. 5. Box GEP, Woodall WH. (2011). Innovation, Quality Engineering, and Statistics. Quality Engineering, 24 (1):20-29. doi:10.1080/08982112.2012.627003. 6. Boyd DF. (1950). Applying the group chart for X and R. Industrial Quality Control, 7 (3):22-25. 7. Bradley PS, Fayyad U, Reina C. (1998). Scaling clustering algorithms to large databases. Knowledge Discovery and Data Mining:9-15. http://www.aaai.org/
Papers/KDD/1998/KDD98-002.pdf 8. Brownstein JS, Freifeld CC, Madoff LC. (2009). Digital Disease Detection — Harnessing the Web for Public Health Surveillance. New England Journal of Medicine, 360 (21):2153-2157. doi:10.1056/NEJMp0900702. 9. Capizzi G, Masarotto G. (2011). A Least Angle Regression Control Chart for Multidimensional Data. Technometrics, 53 (3):285-296.
doi:10.1198/Tech.2011.10027. 10. Carter P. (2011). Big Data Analytics: Future Architectures, Skills and Roadmaps for the CIO. International Data Corporation (IDC). http://www.sas.com/resources/asset/BigDataAnalytics-FutureArchitectures-Skills-RoadmapsfortheCIO.pdf
11. Chakrabarti D, Faloutsos C. (2012). Graph Mining: Laws, Tools, and Case Studies. Synthesis Lectures on Data Mining and Knowledge Discovery, 3 (3):1-207.
doi:10.2200/S00449ED1V01Y201209DMK006. 12. Chinnam RB. (2002). Support vector machines for recognizing shifts in correlated and other manufacturing processes. International Journal of Production Research, 40 (17):4449-4466. 13. Chunara R, Andrews JR, Brownstein JS. (2012). Social and news media enable estimation of epidemiological patterns early in the 2010 Haitian cholera outbreak. Am J Trop Med Hyg, 86 (1):39-45. doi:10.4269/ajtmh.2012.11-0597. 14. Cook DF, Chiu CC. (1998). Using radial basis function neural networks to recognize shifts in correlated manufacturing process parameters. IIE Transactions, 30 (3):227-234. 15. Cook DJ, Holder LB. (2006). Mining graph data. John Wiley & Sons, Inc., Hoboken, NJ. 16. Cruz JA, Wishart DS. (2006). Applications of machine learning in cancer prediction and prognosis. Cancer Inform, 2:59-77. http://www.ncbi.nlm.nih.gov/
pubmed/19458758
17. Davenport TH, Harris JG. (2007). Competing on analytics: the new science of winning. Harvard Business School Press. 18. Deming WE. (2000). The new economics: for industry, government, education. 2nd edn. The MIT Press, Cambridge, MA. 19. Deng H, Runger GC, Tuv E. (2012). Systems Monitoring with Real Time Contrasts. Journal of Quality Technology, 44 (1):9-27. 20. Dyche J. (2012). The seven steps of big data delivery. SAScom Online, Quarter (4):2-4. http://www.sas.com/news/sascom/big-data-delivery.html 21. Fayyad U, Piatetsky-Shapiro G, Smyth P. (1996). From data mining to knowledge discovery in databases. AI magazine, 17 (3):37-54. 22. Ferraty F, Romain Y. (2011). The Oxford handbook of functional data analysis. Oxford handbooks. Oxford University Press, Oxford ; New York. 23. Fraker SE, Woodall WH, Mousavi S. (2008). Performance Metrics for Surveillance Schemes. Quality Engineering, 20 (4):451-464.
doi:10.1080/08982110701810444. 24. Guha S, Rastogi R, Shim K. (1998). CURE: an efficient clustering algorithm for large databases. Paper presented at the Proceedings of the 1998 ACM SIGMOD international conference on Management of data, Seattle, Washington, USA. 25. Hale C, Rowe M. (2012). Do Not Get Out of Control: Achieving Real-time Quality and Performance. CrossTalk: The Journal of Defense Software Engineering, 25 (1):4-8. http://www.dtic.mil/cgi-bin/GetTRDoc?Location=U2&doc=
GetTRDoc.pdf&AD=ADA554677 26. Han J, Kamber M. (2011). Data mining: concepts and techniques. 3rd edn. Elsevier, Burlington, MA. 27. Huang K-T, Lee YW, Wang RY. (1999). Quality information and knowledge. Prentice Hall PTR, Upper Saddle River, NJ. 28. Hwang W, Runger G, Tuv E. (2007). Multivariate statistical process control with artificial contrasts. IIE Transactions, 39 (6):659-669. 29. IEEE BigData (2013). http://www.ischool.drexel.edu/bigdata/bigdata2013/ (Accessed 05/19/2013) 30. Jirasettapong P, Rojanarowan N. (2011). A guideline to select control charts for multiple stream processes control. Engineering Journal, 15 (3):1-14.
doi:10.4186/ej.2011.15.3.1. 31. Jobe JM, Pokojovy M. (2009). A multistep, cluster-based multivariate chart for retrospective monitoring of individuals. Journal of Quality Technology, 41 (4):323-339. 32. Jones-Farmer LA, Ezell JD, Hazen BT. (2013). Applying control chart methods to enhance data quality. To appear in Technometrics. 33. Lanning JW, Montgomery DC, Runger GC. (2002). Monitoring a multiple stream filling operation using fractional samples. Quality Engineering, 15 (2):183-195.
doi:10.1081/QEN-120015851. 34. Lazar NA. (2008). The statistical analysis of functional MRI data. Statistics for biology and health. Springer, New York. 35. Li F, Runger GC, Tuv E. (2006). Supervised learning for change-point detection. International Journal of Production Research, 44 (14):2853-2868. 36. Library of Congress. (2013). Update on the Twitter Archive at the Library of Congress. http://www.loc.gov/today/pr/2013/files/twitter_report_2013jan.pdf (Accessed 05/19/2013) 37. Lindquist M, Wager TD (2008). Application of change-point theory to modeling state-related activity in fMRI. In: Cohen P (ed) Applied Data Analytic Techniques for “Turning Points Research”. Multivariate applications series. Routledge, New York. 38. Liu X, MacKay RJ, Steiner SH. (2008). Monitoring Multiple Stream Processes. Quality Engineering, 20 (3):296-308. doi:10.1080/08982110802035404. 39. Manyika J, Chui M, Brown B, Bughin J, Dobbs R, Roxburgh C, Byers AH. (2011). Big data: The next frontier for innovation, competition, and productivity. McKinsey Global Institute. http://www.mckinsey.com/insights/business_
technology/big_data_the_next_frontier_for_innovation (Accessed 5/8/2013). 40. McCulloh IA, Carley KM. (2008). Social Network Change Detection. Center for the Computational Analysis of Social and Organizational Systems Technical Report. Carnegie Mellon University, Pittsburgh, PA. http://www.dtic.mil/cgi-bin/GetTRDoc?Location=U2&doc=GetTRDoc.pdf&AD=ADA488427 41. Megahed FM, Wells LJ, Camelio JA, Woodall WH. (2012a). A spatiotemporal method for the monitoring of image data. Quality and Reliability Engineering International, 28 (8):967-980. doi:10.1002/Qre.1287. 42. Megahed FM, Fraker SE, Woodall WH. (2012b). A note on two performance metrics for public-health surveillance schemes. Journal of Applied Probability and Statistics, 7 (1):35-41. 43. Megahed FM, Woodall WH, Camelio JA. (2011). A review and perspective on control charting with image data. Journal of Quality Technology, 43 (2):83-98. 44. Mell P, Grance T. (2011). The NIST definition of cloud computing. National Institute of Standards and Technology. http://docs.lib.noaa.gov/
noaa_documents/NOAA_related_docs/NIST/special_publication/sp_800-145.pdf 45. Meneces NS, Olivera SA, Saccone CD, Tessore J. (2008). Statistical control of multiple-stream processes: a Shewhart control chart for each stream. Quality Engineering, 20 (2):185-194. doi:10.1080/08982110701241608. 46. Montgomery DC. (2013). Introduction to statistical quality control. 7th edn. Wiley, Hoboken, NJ. 47. Mortell RR, Runger GC. (1995). Statistical process control of multiple stream processes. Journal of Quality Technology, 27 (1):1-12. 48. Ning X, Tsung F (2010). Monitoring a process with mixed-type and high-dimensional data. In: Industrial Engineering and Engineering Management (IEEM), 2010 IEEE International Conference on, 7-10 Dec. 2010. pp 1430-1432.
doi:10.1109/IEEM.2010.5674333. 49. Noorossana R, Saghaei A, Amiri A. (2011). Statistical analysis of profile monitoring. Wiley series in probability and statistics. Wiley, Hoboken, N.J. 50. Parssian A. (2006). Managerial decision support with knowledge of accuracy and completeness of the relational aggregate functions. Decision Support Systems, 42 (3):1494-1502. doi:10.1016/j.dss.2005.12.005. 51. Pipino LL, Lee YW, Wang RY. (2002). Data quality assessment. Communications of the ACM, 45 (4):211-218. doi:10.1145/505248.506010. 52. Prodan R, Ostermann S. (2009) A survey and taxonomy of infrastructure as a service and web hosting cloud providers. In: Grid Computing, 10th IEEE/ACM International Conference on, 13-15 Oct. 2009. pp 17-25. doi:10.1109/GRID.2009.5353074. 53. Psarakis S, Papaleonida G. (2007). SPC procedures for monitoring autocorrelated processes. Quality Technology and Quantitative Management, 4 (4):501-540. 54. Rajaraman A, Leskovec J, Ullman JD. (2012). Mining of massive datasets. ISBN:9781-107-01535-7. 55. Ramsay JO, Silverman BW. (2002). Applied functional data analysis: methods and case studies. Springer series in statistics. Springer, New York. 56. Ramsay JO, Silverman BW. (2005). Functional data analysis. Springer series in statistics, 2nd edn. Springer, New York. 57. Rolka H, Burkom H, Cooper GF, Kulldorff M, Madigan D, Wong WK. (2007). Issues in applied statistics for public health bioterrorism surveillance using multiple data streams: Research needs. Statistics in Medicine, 26 (8):1834-1856.
doi:10.1002/Sim.2793. 58. SAS (2013). Big data - What is it? http://www.sas.com/big-data/ (Accessed 5/8/2013). 59. Scannapieco M, Catarci T. (2002). Data quality under a computer science perspective. Archivi & Computer, 2:1-15.
60. Scarfone K, Mell P. (2012). Guide to intrusion detection and prevention systems (IDPS) (Draft): Recommendations of the National Institute of Standards and Technology. http://csrc.nist.gov/publications/drafts/800-94-rev1/draft_sp800-94-rev1.pdf 61. Shmueli G, Burkom H. (2010). Statistical challenges facing early outbreak detection in biosurveillance. Technometrics, 52 (1):39-51.
doi:10.1198/Tech.2010.06134. 62. Strauss G, Shell A, Yu R, Acohido B. (2013). Hoax and ensuing crash on Wall Street show the new dangers of our light-speed media world.
http://www.usatoday.com/story/news/nation/2013/04/23/hack-attack-on-associated-press-shows-vulnerable-media/2106985/ (Accessed 05/16/2013). 63. Sullivan JH. (2002). Detection of multiple change points from clustering individual observations. Journal of Quality Technology, 34 (4):371-383. 64. Sun R, Tsung F. (2003). A kernel-distance-based multivariate control chart using support vector methods. International Journal of Production Research, 41 (13):2975-2989. 65. The Economist (2010a). All too much. http://www.economist.com/node/15557421 (Accessed 5/8/2013). 66. The Economist (2010b). Data, data everywhere. http://www.economist.com/node/15557443 (Accessed 5/8/2013). 67. The Economist (2011). Schumpeter: Too much buzz. http://www.economist.com/node/21542154 (Accessed 5/8/2013). 68. Thissen U, Swierenga H, de Weijer AP, Wehrens R, Melssen WJ, Buydens LMC. (2005). Multivariate statistical process control using mixture modelling. Journal of Chemometrics, 19 (1):23-31. doi:10.1002/Cem.903. 69. Tibshirani R. (1996). Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society Series B, 58 (1):267-288. 70. Tsui K-L, Wenchi C, Gierlich P, Goldsman D, Xuyuan L, Maschek T. (2008). A review of healthcare, public health, and syndromic surveillance. Quality Engineering, 20 (4):435-450. doi:10.1080/08982110802334138. 71. Underbrink A, Potter A, Jaenisch H, Reifer DJ (2012). Application stress testing: Achieving cyber security by testing cyber attacks. In: Homeland Security (HST), 2012 IEEE Conference on Technologies for, 13-15 Nov. 2012. pp 556-561.
doi:10.1109/THS.2012.6459909. 72. U.S. General Services Administration. (2013). Infrastructure as a Service (IaaS). http://www.gsa.gov/portal/content/112063 (Accessed 5/8/2013). 73. Wang RY, Strong DM. (1996). Beyond accuracy: What data quality means to data consumers. Journal of Management Information Systems, 12 (4):5-33. 74. Wells LJ, Megahed FM, Niziolek CB, Camelio JA, Woodall WH. (2012). Statistical process monitoring approach for high-density point clouds. Journal of Intelligent Manufacturing:1-13. doi:10.1007/s10845-012-0665-2. 75. Wenke L, Stolfo SJ, Mok KW (1999). A data mining framework for building intrusion detection models. In: Proceedings of the IEEE Symposium on Security and Privacy 1999, pp 120-132. doi:10.1109/SECPRI.1999.766909. 76. Woodall WH. (2000). Controversies and contradictions in statistical process control. Journal of Quality Technology, 32 (4):341-350. 77. Woodall WH, Montgomery DC. (2013). Some current directions in the theory and application of statistical process monitoring. To appear in the Journal of Quality Technology. 78. Woodall WH, Spitzner DJ, Montgomery DC, Gupta S. (2004). Using control charts to monitor process and product quality profiles. Journal of Quality Technology, 36 (3):309-320.
79. Wu Q, Zhang H, Pu J. (2007). Mitigating distributed denial-of-service attacks using network connection control charts. In: Proceedings of the 2nd international conference on Scalable information systems, Suzhou, China. 80. Zhang H, Albin SL, Wagner SR, Nolet DA, Gupta S. (2010). Determining statistical process control baseline periods in long historical data streams. Journal of Quality Technology, 42 (1):21-35. 81. Zhang T, Ramakrishnan R, Livny M. (1996). BIRCH: an efficient data clustering method for very large databases. In: Proceedings of the 1996 ACM SIGMOD international conference on Management of data, Montreal, Quebec, Canada. 82. Zikopoulos P, deRoos D, Parasuraman K, Deutsch T, Corrigan D, Giles J. (2013). Harness the Power of Big Data: The IBM Big Data Platform. ISBN:978-0-071808187. 83. Zikopoulos P, Eaton C, deRoos D, Deutsch T, Lapis G. (2012). Understanding big data: Analytics for enterprise class hadoop and streaming data. McGraw-Hill, New York. 84. Zou C, Ning X, Tsung F. (2012). LASSO-based multivariate linear profile monitoring. Annals of Operations Research, 192 (1):3-19.
doi:10.1007/s10479-010-0797-8. 85. Zou CL, Jiang W, Tsung F. (2011). A LASSO-based diagnostic framework for multivariate statistical process control. Technometrics, 53 (3):297-309.
doi:10.1198/Tech.2011.10034. 86. Zou CL, Qiu PH. (2009). Multivariate statistical process control using LASSO. Journal of the American Statistical Association, 104 (488):1586-1596.
doi:10.1198/jasa.2009.tm08128.