Searching Against Distributed Data Using a Web Service Architecture

Tom Jackson, Mark Jessop, Andy Pasley, Jim Austin
Department of Computer Science, University of York, Heslington, York, YO10 5DD, UK.
[arp, tom.jackson, mark.jessop, austin]@cs.york.ac.uk
Abstract

Many condition health-monitoring applications require access to distributed data assets. The DAME project has investigated one such example based upon condition monitoring of civil aero-engine sensor data. A service-based solution is introduced that has been implemented within the Globus Grid framework. It provides a general architecture for distributed search and identifies the generic functionality that is required.
1. Introduction
Modern aircraft engine diagnostics and maintenance is assisted by the collection of operational data during flight. This is supported by a wide variety of tools for viewing and analysing the data collected. The Distributed Aircraft Maintenance Environment (DAME) project [1] has demonstrated the use of the grid to implement a distributed decision support system for deployment in maintenance applications and environments, with particular interest in the field of Rolls-Royce aero-engines. The DAME project is a collaboration between four Universities (York, Leeds, Sheffield and Oxford) and industrial partners Rolls-Royce, Data Systems and Solutions, and Cybula Limited. The project has implemented a condition health monitoring demonstrator system, using the White Rose Grid [2] as a test bed and accessed through a web portal.

The DAME portal uses web and grid services to deploy, within a single framework, various distributed diagnostic tools and datasets. These service assets can be owned and administered by geographically and organisationally separate entities (forming a Virtual Organisation [3]), each providing their own area of expertise to the DAME system. The theme of the DAME project is the design and implementation of a fault diagnosis and prognosis system
based on the grid computing paradigm and the deployment of grid services. In particular, DAME focuses on developing an improved computer-based fault diagnosis and prognostic capability and integrating that capability with a predictive maintenance system in the context of aero-engine maintenance. Here, we use the term "predictive maintenance" to mean that there is a sufficient time interval between the detection of a behaviour that departs from normal (via a fault threshold) and the actual occurrence of a failure. The DAME system will deploy grid services within this time window to develop a diagnosis of why an engine has deviated from normal behaviour, to provide a prognosis (understanding what will happen) and to plan maintenance actions that may be taken at a safe and convenient point when the impact of maintenance is minimized.

We begin with an outline of the characteristics of the problem domain and illustrate the reasons why it is amenable to a grid-based solution. Condition monitoring techniques are deployed across many diverse IT domains, for example, medicine, engineering, transport, and aerospace. However, regardless of the application domain, fault diagnosis and prognosis systems share a number of operating and design characteristics:

• These systems are data centric. Monitoring and analysis of sensor data and domain specific knowledge is critical to the diagnostic process;
• They typically require complex interactions among multiple agents or stakeholders;
• They are often distributed;
• They need to provide supporting or qualifying evidence for the diagnosis or prognosis offered;
• They can be business critical, and typically have stringent dependability requirements.
The emerging grid computing paradigm [3] appears to offer an inherently practical framework in which to build and manage systems to meet these requirements.
2. Demonstration Context
Modern aero-engines operate in highly demanding operational environments and do so with extremely high reliability. To achieve this, the engines combine advanced mechanical engineering systems with tightly coupled electronic control systems. As one would expect, such critical systems are fitted with extensive sensing and monitoring capabilities for performance analysis. To facilitate engine fleet management, engine sensor data are routinely analyzed using the COMPASS health monitoring application developed by Rolls-Royce and prognostic applications employed by Data Systems and Solutions. The resulting commercial services are subscribed to by many commercial airlines.

The basis of monitoring is to detect the earliest signs of deviation from normal operating behaviour. COMPASS achieves this by comparing snap-shots of engine sensor data against ideal engine models. The relatively small data sets may be transmitted in flight or downloaded once on the ground. There is scope to increase the effectiveness of monitoring by looking at more data in greater detail. For this reason, Rolls-Royce has collaborated with Oxford University and has developed an advanced monitoring system called QUICK [4]. QUICK performs engine analysis on data derived from continuous monitoring of broadband engine vibration. The analysis is achieved through data fusion of the vibration data with instantaneous performance measurements.

QUICK does not store data from many flights or cross-reference data from the rest of the fleet. However, a ground-based system can maintain fleet-wide databases of flight data and other maintenance related information and can use this additional data to perform various analyses. This analysis will enable unknown anomalies to be correlated to root causes and appropriate remedial actions taken.
The DAME project offers the prospect of combining the bandwidth-rich QUICK approach with the sophisticated time series, fleet-wide repositories available to COMPASS, with the goal of an enhanced ability to anticipate maintenance requirements. Developing grid-based diagnostic systems to facilitate the processing of data in a ground-based system presents the DAME project with three principal challenges:

• The type of data captured by QUICK involves real-valued variables monitored over time. Each flight can produce up to 1 Gigabyte of monitored sensor data, which, if scaled to the fleet level, implies many Terabytes of data per fleet per year. The storage of this data will require vast data repositories which will be distributed across many geographic and operational boundaries, but which must be accessible to health monitoring services;
• Advanced pattern matching and data mining methods must be developed to detect features and analyse the type of data produced by the engine. These methods must be able to operate on Terabytes of data and give a response time that meets operational demands;
• The diagnostic processes require collaboration among a number of diverse actors within the stakeholder organizations, who may need to deploy a range of different engineering and computational tools to analyse the problem. Thus, any grid-based solution must allow a Virtual Organisation to support the services, individuals and systems involved.

In the following sections we describe a grid-based distributed data mining architecture that addresses many of these issues. Other aspects of the DAME system, for example, middleware services, the portal, and security requirements, are reported elsewhere [1].

3. The Data Mining Problem

The context for the data mining and pattern matching problem in the DAME architecture is the matching of novel fault signatures in the monitored sensor data from an individual engine against an archived, historical fleet-wide dataset. This search task is made more complex by the fact that the DAME demonstration scenario presupposes that the fleet archive is remote and distributed; sensor data will be downloaded at airports all over the world. Furthermore, each individual engine is likely to have its data downloaded at many different airports. In traditional database terms, this is analogous to an end user wishing to simultaneously submit an SQL query to a range of different and remote database repositories. The search process also requires correlation and ranking, via a data fusion process, so that the results of pattern matching at each distributed node can be interpreted in a coherent manner. There are many possible reasons why data may have to remain distributed, but two reasons encountered within the DAME project that are likely to apply to other domains are:
• Prohibitive cost of providing suitable network infrastructure, in particular meeting the bandwidth requirements for delivering very large data assets to a central repository.
• Legal constraints, especially where data is owned by different stakeholders, which prevent its storage at a single site.
This paper considers the provision, within a grid framework, of data mining tools for a distributed engine vibration dataset. A fundamental aim of this work has been to devise a solution that separates the distributed nature of the problem from the searching/pattern matching problem. From this architecture, re-usable generic components have been highlighted that could be incorporated directly into other grid applications of distributed search. In addition to making the data mining architecture generic, the following objectives for the system were identified:

• Scalable. The system should be designed to operate on large data sets (terabytes) and across hundreds to thousands of data nodes.
• Robust. The system should have high availability and produce partial results where one or more nodes fail. As a consequence, there should not be a central point of failure.
• Transparent. The distribution (and therefore the architecture) should be hidden from the end user. Client developers should need to access only a single point to request searching against the full data set.
• Efficient. The architecture should introduce minimal overheads and bandwidth requirements, and should minimise the amount of data that must be moved across the network infrastructure in achieving the data mining objectives.
• Flexible. It must be possible to add and remove nodes from the system dynamically, and for nodes to contribute optimally to the performance of a search process.
• Concurrent. It must support multiple simultaneous, independent and asynchronous searches.
• Storage Format Independent. It should be independent of the underlying database technology used to store the data repositories, and should operate transparently across a heterogeneous, distributed data archive facility.
4. The Proposed Architecture
This section introduces a service-based architecture for deploying pattern matching and/or search functionality against data distributed over many geographically separate locations. The overall DAME data-mining and data management architecture is shown in figure 1. It is based around three primary web service enabled components:

• A data-management system for handling data arriving from an aircraft or other data asset;
• A distributed query system;
• A virtual data archiving service.
We will focus our description of the architecture on the distributed query service and the virtual data archiving service. At the heart of the distributed query process are a range of web services that encompass the Pattern Match Controller (PMC), the Pattern Match Service (PMS) and the Search Constraints Builder (SCB). The architecture assumes that queries are formulated via a client service. In the case of DAME this is the Signal Data Explorer (SDE) indicated on the diagram. The SDE is a GUI that defines the search query. Its functionality is outside the scope of this paper, but further details can be found in [5]. The architecture has been implemented as grid services developed under Globus Toolkit 3, but will ultimately be ported to web services and WSRF under GT4.

The distributed concept of DAME dictates that the PMC and PMS services are hosted remotely and replicated at the diverse data repositories (e.g. at each potential airport where data is downloaded and stored). These services and the data repository form a 'Data Node' within the DAME architecture. For the purpose of this architecture definition, each node will be treated as a single resource. In practice, it is likely that each node will utilise many resources, for example, high performance clusters, tape archives, desktop PCs and laptops. The key assumption made is that the communication bandwidth available between resources at a single node is significantly higher than that available between nodes. The role of each service is outlined in the following sections.
Figure 1: The DAME data-mining architecture and the PMC.
4.1. The Pattern Match Controller & Search Process
The Pattern Match Controller (PMC) is the front-end service for distributed search operations. The PMC controls the search process at each node and provides all communication between nodes. An instance of the PMC resides at each node. Each PMC: • Has a catalogue of all other nodes in the system, to ensure that search requests can be sent to all nodes. • Has access to a local Pattern Matching Service (PMS), which it can instruct to carry out searches. • Manages results for all ‘active’ searches at its own node.
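The PMC responsibilities above can be summarised as a minimal service sketch. This is not the DAME implementation: the operation names echo those used in the architecture (Search, NodeSearch, GetResults, ReturnResults), but the class, its fields and the in-process calls standing in for web service invocations are all assumptions.

```python
# Illustrative sketch of the PMC role: master fan-out, local PMS search,
# and result accumulation. Names follow the architecture; the data
# structures are assumptions, not the DAME implementation.
import uuid


class PatternMatchController:
    def __init__(self, local_pms, node_catalogue):
        self.pms = local_pms          # local Pattern Matching Service
        self.nodes = node_catalogue   # proxies for all other PMCs
        self.active = {}              # search_id -> accumulated results

    def Search(self, query):
        """Client entry point: this PMC becomes the 'master' for the search."""
        search_id = str(uuid.uuid4())
        self.active[search_id] = []
        for node in self.nodes:       # fan the request out to all slave PMCs
            node.NodeSearch(search_id, query, master=self)
        # The master also searches its own local data.
        self.active[search_id].extend(self.pms.search(query))
        return search_id

    def NodeSearch(self, search_id, query, master):
        """Slave entry point: search local data, return results to the master."""
        master.ReturnResults(search_id, self.pms.search(query))

    def ReturnResults(self, search_id, results):
        """Master merges results from a slave node as they arrive."""
        self.active[search_id].extend(results)

    def GetResults(self, search_id):
        """Client polls for the (possibly partial) result set."""
        return self.active[search_id]
```

In the real system each call crosses a web service boundary between nodes; here ordinary method calls stand in for those invocations.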
Pattern match searches can be initiated either by an end-user or by automatic workflows in the task brokerage system. In both cases, a client service communicates, via web service protocols, with its nearest PMC service to request a search. The query takes the form of a request to match a fault signature against the stored fleet data at each node. The PMC node that receives the request becomes the 'master' node for that unique search task. All other nodes in the system are referred to as 'slaves' for that search. A search scenario is shown in Figure 2. Numbered arrows show communication between services. The exact order of some communication depends upon when individual pattern matching services complete their search.

1. A client delivers a search request to a PMC. This becomes the master PMC for this search, and returns a unique identifier for this search that the client can use in later communication.
2. The master PMC passes the search request to all available slave PMCs.
3. All PMCs (including the master) pass the search request to their local Pattern Matching Service. At each node searching commences.
4. At each node, as the Pattern Matching Service completes its search, it passes the result to its local PMC. Pattern Matching Services then clean up, discarding the search results. At the master node, the result is merged into the overall result set.
5. Slave PMCs pass the results to the master PMC. The slave PMCs can now clean up and discard the search results. The master PMC merges the new results into the overall result set as they arrive.
6. The client makes a request for the results. The master PMC returns the complete results with additional information informing the client that the search is complete.
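From the client's point of view, the six steps above reduce to a submit-then-poll loop. The helper below is hypothetical; the method names `search` and `get_results` are assumptions, not the DAME client API.

```python
import time


def run_search(pmc_client, signature, poll_interval=5.0):
    """Submit a search to the nearest PMC and poll until it completes.

    Hypothetical client helper: `pmc_client.search` returns a unique
    search identifier (step 1); `get_results` returns the current
    result set and a completion flag (step 6).
    """
    search_id = pmc_client.search(signature)
    while True:
        results, complete = pmc_client.get_results(search_id)
        if complete:
            return results
        time.sleep(poll_interval)   # client polls at regular intervals
```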
The client can request the current search results at any time. The master PMC returns the current result set with additional information such as how many nodes have completed their search task. Typically, a client will request results at regular intervals until the search is finished.

One of the stated aims of DAME is to build a generic framework for diagnostics. An API has been designed to support this objective; it does not contain any domain specific data types or structures. The API only specifies how components should interact with the pattern match architecture. This means that application specific components can be built and configured as required by particular implementations to work with the PMC. The PMC receives domain specific data as part of a search request, but does not need to understand it, only how to deliver it to the appropriate services. The PMC is thus a reusable component, appropriate for distributing search requests regardless of the problem domain. Each node can behave as a completely stand-alone pattern matching entity, capable of matching against data stored at its location.

The first PMC in a system is configured as a single entity, with no knowledge of any other nodes. As new PMC nodes are added, each one is configured with the address of one other known PMC. When one of these nodes starts up it contacts the known PMC and requests a list of all nodes in the system. It then contacts each one in turn to register itself. From this point on, all searches will include the new node.
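The bootstrap-and-register protocol described above can be sketched as follows. The class and method names are illustrative assumptions; in DAME the registration calls would be web service invocations between PMCs.

```python
class NodeCatalogue:
    """Sketch of PMC node registration: a new node knows the address of
    one existing PMC, fetches the full node list from it, then registers
    itself with every node so that later searches include it."""

    def __init__(self, own_address):
        self.own_address = own_address
        self.nodes = {own_address: self}   # address -> PMC proxy (self included)

    def list_nodes(self):
        """Return the catalogue of all known nodes."""
        return dict(self.nodes)

    def register(self, address, proxy):
        """Called by a new node to add itself to this node's catalogue."""
        self.nodes[address] = proxy

    def join_via(self, known_pmc):
        """Bootstrap from one known PMC, then register with every node."""
        for address, proxy in known_pmc.list_nodes().items():
            self.nodes[address] = proxy
            proxy.register(self.own_address, self)
```

Note that only one existing address needs to be configured per new node; the catalogue then propagates to all members.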
Each PMC maintains a list (or catalogue) of all the nodes within the system. For each node, the IP address of the PMC at that node is the only information required. If a node fails, its entry is kept in the node list maintained by each PMC. This allows search results to reflect the fact that the entire dataset is not being searched. If a node is to be permanently removed, a de-registration process is invoked.

These architectural features address the requirements for scalable, efficient and flexible functionality. Any number of nodes can be supported, each operates independently, and searches are carried out asynchronously. The PMC passes search requests to and from other services without requiring any understanding of the nature of the search request. Any data that can be contained within the search request structure can be supplied as a search request. Achieving appropriate results depends upon providing PMS implementations that can understand and process the request.
4.2. Pattern Match Service
The Pattern Matching Service is responsible for performing the search across the data held at a node. This means that the pattern-matching process is handled entirely independently of the search requesting mechanism. This provides flexibility in the architecture, and does not constrain the query process to any particular pattern-matching algorithm; the PMS can deploy any available pattern matching and/or indexing algorithms as appropriate. It receives search requests from its local1 PMC and, upon completion of the search, sends the results back to the PMC. Where there is no relevant data to search against at a node, the PMS simply returns an empty result set. At each node the Pattern Matching Service can access the data stored there without consideration of data stored at other nodes. This feature addresses the requirement for a scalable and concurrent system by permitting the query process to be carried out as an inherently parallel operation at each data node. A separate service, seen in figure 1, the Search Constraints Builder, permits queries to be formulated with a range of operational constraints, such as time scales, data ranges, user domains etc.
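A minimal sketch of the node-local PMS behaviour described above, assuming the matching algorithm is supplied as a pluggable callable (the class and names are illustrative, not the DAME API):

```python
class PatternMatchingService:
    """Sketch of the node-local PMS: the matching algorithm is pluggable,
    and a node with no relevant data simply returns an empty result set."""

    def __init__(self, local_data, algorithm):
        self.local_data = local_data   # data held at this node only
        self.algorithm = algorithm     # any pattern-matching callable

    def search(self, query):
        if not self.local_data:        # no relevant data at this node
            return []
        return self.algorithm(self.local_data, query)
```

Because the PMS only ever sees its own node's data, every node can run its search in parallel with the others.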
1 Throughout this document, data or services referred to as 'local' to each other reside at the same node.
Figure 2: The communication required to complete a typical search process.
4.3. DAME Query Process
To provide an example of what functionality a PMS might perform, we describe the search process in the DAME system: pattern matching vibration signatures against the historical flight data. The aim is to find vibration patterns that are 'similar' to the anomaly just observed (the 'query' pattern, q). The problem can be stated as finding a set of time series sub-sequences R:

R = { x ∈ X | d(x, q) ≤ τ }

where τ is the maximum distance (using an appropriate metric) that a vibration pattern can be from the query and still be considered a match. This is an instance of the 'query by content' or 'query by example' problem, a significant area of data mining research [6,7,8]. Where τ is varied to return a set of a fixed number (k) of matches in R, this is referred to as a k-nearest neighbour (k-NN) search.

Within the DAME system this search algorithm is implemented using the AURA [9] technology, which provides high performance neural network based pattern matching. AURA is pivotal to the DAME requirements for meeting search requests within tight operational time constraints. It is a highly parallelised and massively scalable search engine that can implement the k-nearest neighbour algorithm extremely efficiently on massive datasets. However, as already stated, the PMS architecture is algorithm independent and the AURA pattern match engine can easily be replaced or accompanied by other pattern matching algorithms as required for the relevant data set. Simple APIs allow any algorithm to be made available to the service.
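The range query and its k-NN variant can be illustrated with a brute-force sketch, assuming Euclidean distance over fixed-length sub-sequences. AURA replaces this linear scan with an indexed, highly parallel search; the code below only demonstrates the definitions.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def subsequences(series, w):
    """X: all length-w sub-sequences of the time series."""
    return [tuple(series[i:i + w]) for i in range(len(series) - w + 1)]


def range_query(series, q, tau, d=euclidean):
    """R = { x in X : d(x, q) <= tau } -- all sub-sequences within
    distance tau of the query pattern q."""
    return [x for x in subsequences(series, len(q)) if d(x, q) <= tau]


def knn_query(series, q, k, d=euclidean):
    """k-NN variant: tau is effectively chosen so that exactly the k
    closest sub-sequences fall in R."""
    return sorted(subsequences(series, len(q)), key=lambda x: d(x, q))[:k]
```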
4.4. PMS Data Transfer
Passing large volumes of data (that may not be required) around the system is likely to increase search times and/or bandwidth requirements. To reduce the volume of network traffic generated (both within nodes and between nodes, and between the client and master PMC), the actual data for each match is not passed between services. Details for each match include an identifier that specifies the location of the data for that result. If a client wishes to examine a result it must fetch the data, which could be located at any node in the system. Clients should be able to access/retrieve result data through a
single service, without concern for which node the data is located at, i.e. treating the dataset as a single entity at a single location. A data management system is required that will provide a single logical view to the distributed dataset and allow clients to access data from the same service, regardless of location. This data management service is responsible for ‘looking up’ the location of the data and fetching it.
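The pass-by-reference scheme described above can be sketched as follows. The `Match` record and `DataManager` interface are assumptions for illustration; in DAME the logical reference would be an SRB file handle and the look-up would be performed by the data management service.

```python
# Sketch of passing results by reference rather than by value: a match
# carries only a logical identifier, and a single data-management service
# resolves it to the node that actually holds the data.
from dataclasses import dataclass


@dataclass
class Match:
    score: float
    data_ref: str          # logical identifier, e.g. an SRB-style handle


class DataManager:
    """Single logical view over the distributed dataset: clients resolve a
    logical reference here without knowing which node holds the bytes."""

    def __init__(self, location_index, fetchers):
        self.location_index = location_index   # data_ref -> node name
        self.fetchers = fetchers               # node name -> fetch function

    def fetch(self, data_ref):
        node = self.location_index[data_ref]   # 'look up' the location
        return self.fetchers[node](data_ref)   # fetch from that node
```

Only when the client actually examines a match does any bulk data cross the network.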
4.5. Storage Resource Broker (SRB)
The Storage Resource Broker [10] was selected as the storage mechanism to provide a virtual file indexing system at each node within the DAME demonstrator. SRB is a tool for managing distributed storage resources, from large disc arrays to tape backup systems. Files are indexed via SRB and can then be referenced by logical file handles that require no knowledge of where the file physically exists. A Meta-data Catalogue (MCAT) is maintained which maps logical handles to physical file locations. Additional system-specified meta-data can be added to each record. This means that the catalogue can be queried in a data centric way, e.g. by engine serial number and flight date/number, or via some characteristic of the data, rather than in a location centric fashion. When data is requested from storage, it is delivered via parallel IP streams, to maximise network throughput, using protocols provided by SRB.

SRB can operate in heterogeneous environments and in many different configurations, from completely stand-alone (one disc resource, one SRB server and one MCAT) to completely distributed, with many resources, many SRB servers and completely federated MCATs. In the latter configuration a user can query their local SRB system and yet work with files that are hosted remotely in a number of diverse database systems, such as Oracle, DB2 etc. This meets the requirements for a data mining architecture that is independent of the underlying storage medium or database technologies.

Within DAME the arrival of aircraft vibration data is simulated at each node; new data is stored directly into SRB on a local resource. Each location stores engine data under a logical file structure based on node location, engine type, engine serial number and flight information. All data is visible to any user or client regardless of their location.
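The data-centric catalogue query described above can be illustrated with a toy MCAT-style structure. The record fields and handle format are assumptions based on the logical file structure described (node location, engine type, serial number, flight), not SRB's actual schema.

```python
# Toy MCAT-style catalogue: each record maps a logical handle to a
# physical location and carries metadata enabling data-centric queries.
records = [
    {"handle": "lhr/trent/esn123/f001", "physical": "/disk3/a9.bin",
     "engine_serial": "esn123", "flight": "f001"},
    {"handle": "jfk/trent/esn123/f002", "physical": "/tape1/b2.bin",
     "engine_serial": "esn123", "flight": "f002"},
]


def query_catalogue(records, **criteria):
    """Query by metadata (data-centric, e.g. engine serial number),
    not by where the file physically lives."""
    return [r["handle"] for r in records
            if all(r.get(k) == v for k, v in criteria.items())]
```

A client asking for all flights of one engine never needs to know that the matching files sit on a disc array at one airport and a tape archive at another.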
Pattern Matching Services access local data using standard file access protocols, the efficiency of this depending upon the architecture and specification of individual nodes. Alternatively they can use SRB, querying the SRB MCAT to locate all data at that node.
The total volume of network traffic created by all the communication between services is small, because large data assets are not transferred. The client application can access SRB directly to fetch the data for individual matches as required. This means that the client application does not need to know, or be concerned with, where the data is actually located. SRB's logical view allows PMCs and pattern match services to operate on a distributed data set while accessing only local data, while other DAME services can treat the data as if it were held at a central repository.
5. Summary
We have proposed an architecture that permits data mining and pattern matching queries to be carried out on highly distributed data, and on data that could potentially be managed in a highly heterogeneous range of database technologies. The architecture is generic in that the services are not constrained to any particular search task or any set of search algorithms. Considerable effort has been expended to make the architecture robust and scalable, and to separate the mechanisms for requesting and managing the results of queries from the process of pattern matching across the diverse data repositories. It has been deployed in a real demonstration domain and found to be highly effective in managing the problems associated with searching distributed data assets. Quantitative analysis of the system will be the subject of a future journal paper on the PMC architecture.

One limitation to the flexibility of the architecture is that it has been designed to support searches that require only coarse-grained parallelism. Further research is investigating support for complex queries that require more fine-grained parallel behaviour. This work will develop methods to allow search requests to be broken up into smaller tasks, with communication between nodes during a search allowing results from one node to influence the searching of other nodes. Such functionality could allow search requests to be made in a manner similar to complex SQL queries, utilising inter-node communication to process operations that are equivalent to database joins more efficiently.
6. Acknowledgements
This project was funded by a UK eScience Pilot grant from the EPSRC (GR/R67668/01), and supported by our industrial collaborators: Rolls-Royce, Data Systems and Solutions, and Cybula Ltd.
References

[1] Austin, Jackson, et al., "Predictive Maintenance: Distributed Aircraft Engine Diagnostics", Chapter 5 in The Grid: 2nd Edition, edited by Ian Foster and Carl Kesselman, MKP/Elsevier, Oct 2003.
[2] The White Rose Grid, http://www.wrgrid.org.uk/index.html.
[3] I. Foster, C. Kesselman and S. Tuecke, "The Anatomy of the Grid: Enabling Scalable Virtual Organizations", Int. J. Supercomputer Applications, vol. 15, no. 3, 2001.
[4] A. Nairac, N. Townsend, R. Carr, S. King, P. Cowley, L. Tarassenko, "A System for the Analysis of Jet Engine Vibration Data", Integrated Computer-Aided Engineering, pp. 53-65, 1999.
[5] B. Laing and J. Austin, "A Grid Enabled Visual Tool for Time Series Pattern Match", in Proceedings of the UK e-Science All Hands Meeting 2004, Nottingham, UK.
[6] R. Agrawal, C. Faloutsos, and A. Swami, "Efficient Similarity Search in Sequence Databases", in Proc. 4th Int. Conf. Foundations of Data Organization and Algorithms (FODO), 1993, pp. 69-84.
[7] E. Keogh, K. Chakrabarti, M. Pazzani, and S. Mehrotra, "Dimensionality Reduction for Fast Similarity Search in Large Time-Series Databases", Knowl. Inf. Syst., vol. 3, no. 3, pp. 263-286, 2001.
[8] E. Keogh and S. Kasetty, "On the Need for Time-Series Data Mining Benchmarks: A Survey and Empirical Demonstration", in Proc. 8th ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining, 2002, pp. 102-111.
[9] J. Austin, J. Kennedy, and K. Lees, "The Advanced Uncertain Reasoning Architecture", in Weightless Neural Network Workshop, 1995.
[10] The Storage Resource Broker, http://www.npaci.edu/DICE/SRB/.