An Efficient Approach to Intelligent Real-time Monitoring Using Ontologies and Hadoop
Tomasz Wiktor Wlodarczyk, Chunming Rong
Department of Electrical Engineering and Computer Science, University of Stavanger, N-4036 Stavanger, Norway
[email protected]
Csongor I. Nyulas, Mark A. Musen
Stanford Center for Biomedical Informatics Research, Stanford University, Stanford, USA
ABSTRACT
This paper describes an approach to how ontologies can be used for modeling real-time monitoring systems that provide both efficiency and intelligence. An ontology-based framework is described that allows for the fusion of various data-processing and reasoning techniques. This solution supports the construction of detailed models of data dependencies and their validation. We also describe a set of tools that provides automated deployment from the ontology-based model. Deployment is performed using the Hadoop MapReduce implementation to provide efficiency in dealing with a vast amount of data. We explain its possible use for real-time monitoring of an extensive underwater sensor network.
KEYWORDS: cloud computing, ontology, real time, reasoning
1. INTRODUCTION
One of the biggest problems posed by real-time monitoring is how to achieve efficiency while preserving intelligence, especially when a large amount of data is analyzed. In this paper we propose a twofold approach to this problem. First, we apply different data-processing methods at different stages of processing; some of these methods focus more on supporting efficiency, others on supporting intelligence. The aim is to create simple events from raw data in an efficient way. Later, these simple events are combined into complex events in an intelligent way. Secondly, an ontology-based approach is applied to provide a uniform modeling and deployment environment based on Protégé [1] and the Hadoop [2] MapReduce implementation.
This approach focuses mainly on data-intensive problems. MapReduce is used as a deployment platform as it is currently one of the most tested, reliable and scalable tools for distributed data-intensive problems. In practice it is one of the most common tools used in the creation of cloud computing services, which is why the work presented in this paper can be an important contribution to cloud computing research.
The most prominent application of MapReduce is probably log analysis; others include analysis of raw sensor data or streams of data (such as messages in social networks). While there is a growing variety of methodologies and tools to assist in the various stages of building such systems, they usually do not cover the whole design and deployment process. Moreover, existing frameworks do not explicitly facilitate the use of technologies that would allow for more intelligent data processing.
The new approach should be characterized by the expressiveness required by the programmer and the level of generality necessary for the data analyst, who does not need to have deep implementation knowledge. An important element is the possibility of integrating the interaction between various environments in one place in a declarative manner, in order to be able to validate the model across those environments. A definition language should facilitate the usage of intelligent elements, e.g. logical reasoning, both for definition and execution purposes. Techniques for progressive reasoning should be enabled in the system to assist real-time analysis. All definitions should be computer-interpretable and detailed enough to enable automatic deployment.
In this paper, real time means that there exists an activity with a given latency requirement. However, it is different from low-latency applications such as Voice over IP (VoIP) or control systems. In the presented use case, underwater sensor network monitoring, the final decision is always made by a human. The goal of the system is to provide real-time decision support. Therefore, the framework is designed from the perspective that data processing should be optimized in such a way that when new data arrive, only these new data have to be processed.
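To illustrate this incremental-processing principle (this sketch is not part of the described framework and all names in it are invented), a per-sensor aggregate can be maintained so that each update touches only the newly arrived readings:

# Minimal illustration of incremental processing: only newly arrived
# readings are processed when the aggregate is updated.
class RunningMean:
    """Keeps a per-sensor mean that can be updated with new data only."""

    def __init__(self):
        self.count = {}   # sensor id -> number of readings seen so far
        self.total = {}   # sensor id -> sum of readings seen so far

    def update(self, new_readings):
        """new_readings: iterable of (sensor_id, value) tuples (new data only)."""
        for sensor_id, value in new_readings:
            self.count[sensor_id] = self.count.get(sensor_id, 0) + 1
            self.total[sensor_id] = self.total.get(sensor_id, 0.0) + value

    def mean(self, sensor_id):
        return self.total[sensor_id] / self.count[sensor_id]

# Usage: historic data is processed once; later batches touch only new data.
agg = RunningMean()
agg.update([("S1", 10.0), ("S1", 12.0), ("S2", 7.1)])  # historic batch
agg.update([("S1", 14.0)])                             # newly arrived data only
print(agg.mean("S1"))                                  # 12.0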
In this paper an ontology-based framework is outlined. It supports the aforementioned approach for building scalable, intelligent real-time data monitoring systems based on cloud technologies. This framework is targeted at Hadoop and uses the Web Ontology Language (OWL) as its modeling tool. It comes with a set of open-source tools for managing the various stages. The rich modeling language provides tools for automatic model validation, which might be of great value in complex distributed systems. Using OWL enables more intelligent data analysis thanks to the availability of DL reasoners and the SWRL engine. The framework has its roots in the BioSTORM project [3, 20], which was an ontology-driven framework for deploying JADE agent systems. Whenever we refer to a "framework" we mean the general set of ontologies and tools that we created; when we refer to a "system" we mean a particular implementation of the framework.
This work is based on other research both in the area of distributed computing (e.g. Hadoop, Pig) and in ontology-driven design. Its originality lies in providing an approach that connects them in order to solve the problem of efficiency and intelligence in real-time monitoring, and in describing tools that support this approach.
2. RELATED WORK
The core of the aforementioned approach was implemented first in the BioSTORM project. The current framework was built on the basis of BioSTORM, improving on its structure to facilitate real-time monitoring and enabling it for cloud-based deployment. Ontologies have been used for the development of agent-based systems in [4] and [5]. Those projects used formal models to describe various aspects of a system but, to our knowledge, none of them used ontologies to describe all major aspects of a system with the final goal of automatic generation and deployment.
There exist several workflow definition systems for Hadoop, mainly Cascading [6], Oozie [7] and CloudWF [8]. Oozie is not stable at the time of writing. Cascading is targeted at expert programmers. CloudWF uses XML to define its workflows. It is simpler than Cascading and enables easier integration of external elements into the system. However, its language provides only the simplest constructs and does not allow the expression of details related to complex data analysis. Pig [9] is a project that allows for easy definition of dataflows and is used as part of our implementation as a facilitator for defining Hadoop map-reduce jobs. The Chukwa [10] project, which was created to simplify log analysis in Hadoop, is used as a tool for introducing data streams from various sources into the Hadoop environment.
There exist various systems for Complex Event Processing (CEP) [11], which can often be based on rule engines [12]. However, apart from strictly business applications [13], they are not tightly integrated into bigger data-processing frameworks, although such integration is of course possible in general. At the moment, there does not exist any other integrated ontology-based framework that defines how a whole real-time data-processing system can be automatically deployed on a cloud-based system.
3. APPROACH TO FRAMEWORK DESIGN
A framework should provide an expressive definition language. It should allow the description of all data analysis and processing steps, and the interaction between them. It must be descriptive enough so that it can serve as a basis for actual deployment. It needs to describe data sources and adaptors for them, processing steps with inputs, outputs and methods, and specify the content of the information being exchanged between processing steps. At the same time this language should provide a level of generality suitable for the data analyst, who does not need to have deep implementation knowledge. This framework makes use of OWL as the definition language. OWL is descriptive enough to meet the aforementioned requirements. At the same time it can provide a more general overview of defined models using various plugins and widgets, e.g. the Graph Widget (see Fig. 1).
Another element of the framework should be a tool to validate the dataflow through the whole algorithm. Therefore, an important element of the models is integrating the interaction between various environments in one place in a declarative manner, in order to be able to validate the model across those environments. By building on OWL and Protégé, it is possible to provide the validation tool relatively easily.
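As a rough illustration of what such a declarative description might look like, the Python sketch below builds a tiny RDF model of one task with its method and its input and output tags using the rdflib library; the namespace, class and property names are invented for this example and are not the framework's actual ontologies, which are edited in Protégé.

# Sketch only: builds a tiny RDF/OWL-style description of one processing task
# with its input and output tags. The namespace, class and property names are
# invented for illustration; the real framework defines these in Protege.
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, RDFS

EX = Namespace("http://example.org/monitoring#")

g = Graph()
g.bind("ex", EX)

# A task that computes a mean temperature, its method, and its input/output tags.
g.add((EX.ComputeMeanTemperature, RDF.type, EX.Task))
g.add((EX.ComputeMeanTemperature, EX.usesMethod, EX.MeanOfVectorMethod))
g.add((EX.ComputeMeanTemperature, EX.readsTag, EX.TemperatureReadingTag))
g.add((EX.ComputeMeanTemperature, EX.writesTag, EX.MeanTemperatureTag))
g.add((EX.TemperatureReadingTag, RDFS.label, Literal("sensor id, timestamp, temperature")))

print(g.serialize(format="turtle"))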
Figure 1. Example Algorithm Design
The definition language should facilitate the usage of intelligent elements, e.g. a logical reasoner, both for definition and execution purposes. As OWL is based on Description Logic (DL), it gives access to several DL reasoners during the definition phase. It is also possible to access the reasoner during execution.
Techniques for progressive reasoning should be introduced in the system to assist real-time analysis. Currently, recommendation and decision-support systems may require recalculation of all analyses when new data are appended or some changes are made to the reasoning process. This is not suitable for real-time decision support. That is why in this framework we propose two mechanisms. First, two main stages of dataflow are introduced. There are map-reduce jobs that can efficiently perform heavy batch processing of incoming data and produce reduced data sets with events. Those events are consumed by a rule engine which evaluates SWRL rules. SWRL rules can easily be integrated with OWL ontology development and can supply the necessary intelligence. This approach is similar in its essence to CEP. However, it is extended by integration with the map-reduce stage to enhance scalability while dealing with raw data. Moreover, both stages are integrated in one algorithm in the design phase in Protégé. This allows for more transparent model definition and more complete model validation. Second, there is stress on an explicit division between old data (that might be used to support analysis) and real-time/new data. This should assist in reprocessing only the necessary data. We assume that these two mechanisms, if properly supported in the design and deployment phases, can efficiently support real-time data processing. This assumption will be the subject of verification in future tests.
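To make the first, batch stage more concrete, the sketch below shows a mapper that reduces raw sensor readings to simple threshold events. It is written as a Hadoop Streaming job in Python purely for illustration; the described implementation defines such jobs through Pig, and the record layout and threshold used here are assumptions.

#!/usr/bin/env python
# Sketch of the batch stage: reduce raw readings to simple "events".
# Intended to run as a Hadoop Streaming mapper (stdin -> stdout);
# the input format "timestamp,sensor_id,temperature" and the threshold
# are assumptions made for this example.
import sys

THRESHOLD = 80.0  # hypothetical alarm level

def main():
    for line in sys.stdin:
        fields = line.strip().split(",")
        if len(fields) != 3:
            continue  # skip malformed records
        timestamp, sensor_id, value = fields
        try:
            reading = float(value)
        except ValueError:
            continue
        if reading > THRESHOLD:
            # Emit a simple event keyed by sensor; a reducer (or the rule
            # engine downstream) can combine these into complex events.
            print("%s\t%s,%s,HIGH_TEMPERATURE" % (sensor_id, timestamp, reading))

if __name__ == "__main__":
    main()

A corresponding reducer, or the downstream rule engine, would then group these simple events per sensor and derive the complex events described above.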
4. SYSTEM DESIGN
The design of the system takes place in the Knowledge Layer. The Knowledge Layer consists of several ontologies and software packages that let one completely specify the functionality of the system. They are depicted in the Knowledge layer in Fig. 3. The Task-Method Ontology specifies the configuration of a problem described using the task-method decomposition approach. We have adopted approaches from the knowledge-modeling community to declaratively represent the procedural structure of these algorithms [15-17]. These approaches model knowledge about systems with respect to their goal or the task that they perform, and most share a methodology referred to as task analysis [18]. This methodology is used to construct algorithms or problem-solving methods (PSMs) to solve particular tasks. In this methodology, a task defines "what has to be done".
Tasks are accomplished by the application of a method, which defines "how to perform a task". The Task-Method Ontology defines classes to model tasks, methods, connectors (that specify communication paths among tasks) and algorithms (that consist of a collection of related tasks), together with detailed properties of those classes. The modeler of a specific system will define all the necessary domain- and problem-specific subclasses of the above classes and configure them with appropriate properties. For example, a method to compute the mean of a vector of values must define at least the properties vector and mean. The description of the problem will be realized by instantiating the user-defined subclasses and setting appropriate values for their properties. These property values together specify which task will use which method, which variables play which roles in the instantiated methods, and which tags are written and read by a task. Thus, for example, we may instantiate the method to compute the mean of a vector of values for the task of computing the mean value of the temperature readings of a given sensor, specifying that the vector property is associated with a sequence of temperature-reading values and the mean property is associated with an output variable.
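The mean-of-vector example can be pictured roughly as in the following plain-Python sketch; the classes are a simplified stand-in for the ontology individuals and are not the framework's actual API.

# Plain-Python caricature of a task-method instantiation from the ontology;
# class and property names mirror the example in the text but are not the
# framework's actual API.
class MeanOfVectorMethod:
    """Method: defines *how* a task is performed (here, computing a mean)."""
    def run(self, vector):
        return sum(vector) / len(vector)

class Task:
    """Task: defines *what* has to be done, configured with a method and tags."""
    def __init__(self, name, method, reads_tag, writes_tag):
        self.name = name
        self.method = method
        self.reads_tag = reads_tag      # input tag, e.g. temperature readings
        self.writes_tag = writes_tag    # output tag, e.g. mean temperature

    def execute(self, tag_store):
        vector = tag_store[self.reads_tag]
        tag_store[self.writes_tag] = self.method.run(vector)

# Instantiation: compute the mean temperature reading of a given sensor.
task = Task("ComputeMeanTemperature", MeanOfVectorMethod(),
            reads_tag="sensor42/temperature", writes_tag="sensor42/mean_temperature")
store = {"sensor42/temperature": [4.2, 4.6, 4.4]}
task.execute(store)
print(store["sensor42/mean_temperature"])  # approximately 4.4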
A model validation tool is associated with the Task-Method Ontology and is accessible through the Protégé plug-in interface. It can work in two modes: offline and online. It allows the automatic checking of the consistency of dataflow in the model, and it can automatically correct simple common mistakes.
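The core of such a consistency check is verifying that every tag a task reads is produced by an upstream task or declared as a data source. The sketch below shows that idea on a simple dictionary representation of a model; the real tool operates on the OWL model through the Protégé plug-in interface, which is not shown here.

# Sketch of a dataflow consistency check: every tag read by a task must be
# written by some other task or come from a declared data source. The
# dictionary model below is a stand-in for the OWL algorithm description.
def validate_dataflow(tasks, source_tags):
    """tasks: list of dicts with 'name', 'reads' and 'writes' tag lists."""
    produced = set(source_tags)
    for task in tasks:
        produced.update(task["writes"])
    errors = []
    for task in tasks:
        for tag in task["reads"]:
            if tag not in produced:
                errors.append("task %s reads undefined tag %s" % (task["name"], tag))
    return errors

model = [
    {"name": "ExtractEvents", "reads": ["raw_readings"], "writes": ["events"]},
    {"name": "CombineEvents", "reads": ["events", "historic_events"], "writes": ["alerts"]},
]
print(validate_dataflow(model, source_tags=["raw_readings"]))
# -> ["task CombineEvents reads undefined tag historic_events"]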
Our system design also contains a Methods Library, which is a collection of software implementations of the methods used later by tasks in the algorithm. It consists of typical methods (basic math functions, data transformation functions, etc.) and can easily be extended with new methods according to the needs of a particular implementation.
The Data Source Ontology describes the data elements that system processes and jobs have to deal with at runtime. In this ontology, data elements are referenced as variables (e.g., a temperature reading from a sensor). A tag represents a bundle of variables that are logically related to each other (e.g., the date and time, sensor identifier, sensor location and sensor reading). These bundles are used to describe the inputs and outputs of a task. In our architecture, the tag represents the basic communication unit between the elements of the system. It can usually, though not always, be compared to a relation in Pig or to files in HDFS (the Hadoop Distributed File System) [14]. The Data Source Ontology focuses on describing relatively simple data structures and does not directly tackle transformation and integration issues. However, combined with the Task-Method Ontology, it also allows the necessary transformations of source data to be performed. Such transformations can be grouped as subtasks in order to improve the readability of the whole algorithm and can be reused later.
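For illustration only, a tag of the kind described above (date and time, sensor identifier, sensor location and reading) could be represented at runtime roughly as follows; the field names and values are assumptions made for this sketch.

# Illustration of a "tag" as a bundle of logically related variables,
# the basic communication unit between system elements. Field names are
# assumptions made for this sketch.
from collections import namedtuple

SensorReadingTag = namedtuple(
    "SensorReadingTag",
    ["timestamp", "sensor_id", "sensor_location", "reading"],
)

tag = SensorReadingTag(
    timestamp="2010-03-09T12:00:00Z",
    sensor_id="S-042",
    sensor_location="riser base, -310 m",
    reading=4.6,
)
# A tag can be serialized as one line of a file in HDFS or one Pig record.
print("\t".join(str(value) for value in tag))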
The Deployment Ontology defines deployment configurations for the overall task defined in the Task-Method Ontology. It specifies all the necessary information for a given system deployment, including values for the system configuration variables, system output variables, variables for profiling and performance measurement, etc.
5. SYSTEM DEPLOYMENT
The deployment platform consists of several processes encapsulated by a GUI. These processes parse and instantiate Knowledge Layer information into an external environment, including the Hadoop cluster and monitored nodes. They are depicted in the Deployment layer in Fig. 3. The configurator process translates OWL-encoded deployment configurations in the surveillance deployment ontology (including task-methods and data structures) into system elements and scripts, providing all the necessary information for the controller process to create and initialize the tasks involved in the execution of a deployment.
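In a heavily simplified form, the kind of translation the configurator performs can be pictured as in the sketch below, which turns a small deployment description into a Pig script; the structure of the description and the generated script are invented for this example and do not reproduce the actual surveillance deployment ontology.

# Heavily simplified sketch of the configurator idea: translate a declarative
# deployment description into a concrete script. The description structure and
# the generated Pig script are invented for illustration only.
deployment = {
    "input_path": "/chukwa/sensor-logs",
    "output_path": "/monitoring/events",
    "threshold": 80.0,
}

PIG_TEMPLATE = """raw = LOAD '{input_path}' USING PigStorage(',')
        AS (ts:chararray, sensor:chararray, temp:double);
events = FILTER raw BY temp > {threshold};
STORE events INTO '{output_path}';
"""

def generate_pig_script(config):
    return PIG_TEMPLATE.format(**config)

if __name__ == "__main__":
    script = generate_pig_script(deployment)
    with open("detect_events.pig", "w") as out:
        out.write(script)
    print("Run with: pig detect_events.pig")

In the actual framework the generated elements and scripts are handed to the controller process rather than run by hand.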
The controller process controls the initialization, initiation and interruption of a deployment. It is responsible for the creation and initialization of all the tasks, as well as for the implementation of specific distribution policies. A controller process receives a deployment configuration from a configurator process. The monitor process observes the progress of operations performed by the controller process during deployment and later execution, serving as the main source of information for the GUI after deployment is initiated. The interface also lets one specify which information is to be monitored and logged. An example monitor GUI during tests is depicted in Fig. 2.
The framework allows for different elements to be implemented in various environments. This is necessary both to be able to integrate all the elements in one framework and to provide appropriate efficiency in the deployment phase. Data collection is performed externally through a web-services interface. The main raw-data processing is performed in a Hadoop cluster using mainly Chukwa and Pig. When the raw data are reduced to events, these are processed in a rule engine based on SWRL.
Figure 2. Example GUI during Tests
6. SYSTEM ARCHITECTURE
In Fig. 3, the general system architecture is presented. It shows the dependencies between the ontologies in the knowledge layer. Subsequently, one can see that the knowledge layer constitutes the basis for the deployment platform, where it is interpreted by the controller process and monitored by the monitor process. The controller process instantiates data processing, which is performed mainly on a Hadoop cluster. In the data layer one can see what the data location for each data-processing stage is. It is important to notice that map-reduce can consist of many processing steps (depending on the algorithm); it is simplified in the diagram for clarity of presentation. Boundaries in the data layer between "Source" and "HDFS", and between "HDFS" and "In memory", are also marked on the diagram.
The ability to describe dataflow across those boundaries in the knowledge layer and later deploy it through the deployment layer is the most important quality of the framework described in this paper. Having all the stages of dataflow described in one algorithm in the knowledge layer allows for more complete model validation. This should result in better quality of the final system. Moreover, as the knowledge layer is expressive enough to allow automated deployment of all tasks, there is only one central deployment platform. This should provide a better overview of and control over the whole process, compared to a situation where each task would have to be deployed separately.
It is important to have control over the boundary between "Source" and "HDFS" in order to see how source data are related to input data. This might require some transformations that need to be described in detail in the framework and later encapsulated into a subtask for better algorithm analysis. The boundary between "HDFS" and "In memory" demonstrates that different dataflow stages require different processing techniques that might work better using different data-storage methods. We have decided to use a rule engine as the final data-processing stage as it provides a natural way to express dependencies between various event data. At this point the amount of data should be significantly reduced, making the "Events" dataset small enough to handle in the memory of a modern server or even a PC. Holding these data in memory (with a backup in HDFS) should provide both efficiency and stability to the system.
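As a stand-in illustration of this final, in-memory stage (the described system evaluates SWRL rules in a rule engine; the hard-coded Python rule below only mimics that behavior), a complex event could be derived from simple events as follows:

# Stand-in for the final, in-memory rule stage: the real system evaluates
# SWRL rules over OWL individuals, while this sketch applies one hard-coded
# rule to a small in-memory list of simple events.
events = [
    {"sensor": "S-042", "type": "HIGH_TEMPERATURE", "ts": 1000},
    {"sensor": "S-042", "type": "HIGH_TEMPERATURE", "ts": 1060},
    {"sensor": "S-043", "type": "HIGH_TEMPERATURE", "ts": 1070},
]

def sustained_high_temperature(events, window=120, min_count=2):
    """Rule: at least `min_count` HIGH_TEMPERATURE events from the same
    sensor within `window` seconds form one complex event (an alert)."""
    alerts = []
    by_sensor = {}
    for e in sorted(events, key=lambda e: e["ts"]):
        if e["type"] != "HIGH_TEMPERATURE":
            continue
        recent = [t for t in by_sensor.get(e["sensor"], []) if e["ts"] - t <= window]
        recent.append(e["ts"])
        by_sensor[e["sensor"]] = recent
        if len(recent) >= min_count:
            alerts.append({"sensor": e["sensor"], "type": "SUSTAINED_HIGH_TEMPERATURE",
                           "ts": e["ts"]})
    return alerts

print(sustained_high_temperature(events))
# -> one alert for sensor S-042 (two high readings within 120 seconds)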
7. FUTURE WORK
The processing core of the system is implemented and deployed on a 15-node Hadoop cluster at the University of Stavanger. In the future, we want to perform more tests regarding the Chukwa agents that introduce data into the system and their behavior with different types of data sources. SWRL can at the moment only be used in the design phase of the system; we want to allow ad-hoc modifications to this part of the system directly from the GUI during execution. Furthermore, we will prepare a public implementation of the framework for Amazon Web Services [19], based on a research grant we have received from Amazon for this work. Finally, we plan to integrate more methods into the library and release the system as open source.
Figure 3. General System Architecture (Knowledge layer: Task-Method Ontology, Deployment Ontology, Data-Source Ontology and Method Library; Deployment layer: GUI with configurator, controller and monitor processes; Processing layer: Hadoop with a collector, suppliers, M/R data processing and a SWRL reasoner; Data layer: source data (real-time raw data logs, historic raw data in XML, raw sensor data streamed via web services), data in HDFS, and events held in memory)
8. SUMMARY
This paper has described an approach to how ontologies can be used for the modeling of real-time monitoring systems that provide both efficiency and intelligence. Using the Semantic Web ontology language OWL, the Hadoop platform and the Semantic Web Rule Language (SWRL), a number of models and associated software tools were constructed with the aim of providing an end-to-end solution for designing and deploying efficient, intelligent real-time data-monitoring systems. This solution supports the construction and validation of detailed models of data dependencies. It also enables automated generation and deployment from those models. We have discussed the importance of modeling the whole system in the form of one algorithm across the boundaries of several physical environments, which aims to increase the quality of model validation results. We have explained the possible use of our framework for real-time monitoring of a vast underwater sensor network, and discussed preliminary results.
ACKNOWLEDGEMENTS
We would like to thank Ben "Shevek" Mankin for interesting and valuable discussions.
REFERENCES
[1] "The Protégé Ontology Editor and Knowledge Acquisition System," http://protege.stanford.edu/. Last accessed on March 9, 2010.
[2] "Welcome to Apache Hadoop!" http://hadoop.apache.org/. Last accessed on March 9, 2010.
[3] "Welcome to BioSTORM," http://biostorm.stanford.edu/doku.php. Last accessed on March 9, 2010.
[4] M. Laclavík, Z. Balogh, and M. Babík, "AgentOWL: Semantic Knowledge Model and Agent Architecture," Computing and Informatics, vol. 25, pp. 419-437.
[5] G. Tian and C. Cao, "An Ontology-Driven Multi-Agent Architecture for Knowledge Acquisition from Text in NKI," Proceedings of the International Conference on Computational Intelligence for Modelling, Control and Automation and International Conference on Intelligent Agents, Web Technologies and Internet Commerce (CIMCA-IAWTIC'06), Vol. 2, IEEE Computer Society, 2005, pp. 704-70.
[6] "Cascading," http://www.cascading.org/. Last accessed on March 9, 2010.
[7] "[#HADOOP-5303] Oozie, Hadoop Workflow System - ASF JIRA," http://issues.apache.org/jira/browse/HADOOP-5303. Last accessed on March 9, 2010.
[8] C. Zhang and H. De Sterck, "CloudWF: A Computational Workflow System for Clouds Based on Hadoop," Cloud Computing, 2009, pp. 393-404.
[9] "Welcome to Pig!" http://hadoop.apache.org/pig/. Last accessed on March 9, 2010.
[10] "Welcome to Chukwa!" http://hadoop.apache.org/chukwa/. Last accessed on March 9, 2010.
[11] "Complex Event Processing," http://complexevents.com/. Last accessed on March 9, 2010.
[12] "Jess, the Rule Engine for the Java Platform," http://www.jessrules.com/. Last accessed on March 9, 2010.
[13] "Complex Event Processing," http://www.oracle.com/technologies/soa/complex-event-processing.html. Last accessed on March 9, 2010.
[14] "HDFS User Guide," http://hadoop.apache.org/common/docs/current/hdfs_user_guide.html. Last accessed on March 9, 2010.
[15] B. Chandrasekaran and T.R. Johnson, "Generic tasks and task structures: history, critique and new directions," Second Generation Expert Systems, Springer-Verlag, New York, 1993, pp. 232-272.
[16] L. Steels, "Components of expertise," AI Magazine, vol. 11, 1990, pp. 30-49.
[17] D. Fensel, E. Motta, F. van Harmelen, V.R. Benjamins, M. Crubezy, S. Decker, M. Gaspari, R. Groenboom, W. Grosso, M. Musen, E. Plaza, G. Schreiber, R. Studer, and B. Wielinga, "The unified problem-solving method development language UPML," Knowledge and Information Systems, vol. 5, 2003, pp. 83-131.
[18] B. Chandrasekaran, T.R. Johnson, and J.W. Smith, "Task-structure analysis for knowledge modeling," Communications of the ACM, vol. 35, 1992, pp. 124-137.
[19] "Amazon Web Services," http://aws.amazon.com/. Last accessed on March 9, 2010.
[20] C.I. Nyulas, M.J. O'Connor, S.W. Tu, D.L. Buckeridge, A. Okhmatovskaia, and M.A. Musen, "An Ontology-Driven Framework for Deploying JADE Agent Systems," Proceedings of the IEEE/WIC/ACM International Joint Conference on Web Intelligence and Intelligent Agent Technology (WI-IAT'08), 2008.