REVIEW OF SCIENTIFIC INSTRUMENTS 77, 10F514 (2006)

Search and retrieval of plasma wave forms: Structural pattern recognition approach

S. Dormido-Canto a) and G. Farias Departamento de Informática y Automática, UNED, C/Juan del Rosal 16 5a, 28040 Madrid, Spain

J. Vega Asociación EURATOM/CIEMAT para FUSIÓN, Avenida Complutense 22, 28040 Madrid, Spain

R. Dormido, J. Sánchez, and N. Duro Departamento de Informática y Automática, UNED, C/Juan del Rosal 16 5a, 28040 Madrid, Spain

M. Santos, J. A. Martin, and G. Pajares Departamento de Arquitectura de Computadores y Automática, UCM, Ciudad Universitaria, 28040 Madrid, Spain

(Received 5 May 2006; presented on 11 May 2006; accepted 28 May 2006; published online 29 September 2006)

Databases for fusion experiments are designed to store several million wave forms. Temporal evolution signals show the same patterns under the same plasma conditions and, therefore, pattern recognition techniques can allow identification of similar plasma behaviors. Further developments in this area must be focused on four aspects: large databases, feature extraction, similarity function, and search/retrieval efficiency. This article describes an approach for pattern searching within wave forms. The technique is performed in three stages. Firstly, the signals are filtered. Secondly, signals are encoded according to a discrete set of values (code alphabet). Finally, pattern recognition is carried out via string comparisons. The definition of code alphabets enables the description of wave forms as strings, instead of representing the signals in terms of multidimensional data vectors. An alphabet of just five letters can be enough to describe any signal. In this way, signals can be stored as a sequence of characters in a relational database, thereby allowing the use of powerful structured query languages to search for patterns and also ensuring quick data access. © 2006 American Institute of Physics. [DOI: 10.1063/1.2219409]

I. INTRODUCTION

Visual data analysis is an essential tool in plasma physics. A simple visual inspection of signals is enough to recognize a typical plasma evolution or to distinguish the presence of interesting events. A researcher identifies the plasma behavior through the recognition of patterns inside wave forms: bumps, unexpected amplitude changes, abrupt peaks, or sinusoidal components. Therefore, a big challenge in data access is the creation of fast means to look for patterns within wave forms. These techniques will allow the development of intelligent data retrieval methods instead of manual searches by pulse number (in general) or by identifiable time interval (in long pulse operation).

There are some previous works on pattern recognition in fusion databases. In an earlier approach, efforts were concentrated on looking for similar full wave forms, i.e., signals covering the full plasma life.1–3 In another approach, the interest is centered on searching for patterns within wave forms. A pioneering work4 describes the search of patterns based on one major frequency component. However, more general methods are required to look for general patterns.

II. SYNTACTIC AND STRUCTURAL PATTERN RECOGNITION APPROACH

The syntactic approach takes the view that a pattern is composed of simpler subpatterns.5 The most elementary subpatterns are known as primitives. A complex pattern is then expressed in terms of relationships among its primitives. An analogy between the structures of patterns and the theory of formal languages is used to establish the foundation for syntactic pattern recognition. The patterns represent the sentences in a language, while the primitives constitute the alphabet of the language. A grammar for a language generates and identifies sentences belonging to that language by employing its rules. The idea that a potentially large set of related complex patterns can be described by a finite number of primitives and grammatical rules makes this approach appealing. There are many applications where patterns can be described in terms of primitives and their relations. However, sometimes a grammar is not suitable for a pattern class description because the patterns under consideration lack regularities and cannot be defined by rules. In such a case, the structural approach to pattern recognition can be adopted.

In structural pattern recognition, we use symbolic data structures, such as strings, for the representation of individual patterns, similar to the syntactic approach. However, rather than use a grammar, we represent pattern classes through a number of primitives. Consequently, the recognition problem turns into a pattern-matching problem. For example, given a pattern decomposed into primitives (a set of characters, i.e., a string), the final goal is to find the most similar pattern from a database of strings.

The description task of a structural pattern recognition system is difficult to implement because there is no general solution for extracting structural features (primitives) from data. The result is that primitive extractors for structural pattern recognition systems are developed to extract either the simplest and most generic primitives possible or the domain specific primitives that best support the subsequent searching task. Simplistic primitives are domain independent; therefore, a deeper interpretation is postponed until the searching. At the other extreme, domain specific primitives can be developed with the assistance of a domain expert, but obtaining and formalizing the necessary domain knowledge can be problematic.

a) Electronic mail: [email protected]

0034-6748/2006/77(10)/10F514/4/$23.00

FIG. 1. The flow chart of our recognition system.

FIG. 2. Discriminants and labels for the classification of the angle of the fitted straight line.

Downloaded 05 Feb 2007 to 161.111.120.151. Redistribution subject to AIP license or copyright, see http://rsi.aip.org/rsi/copyright.jsp

III. APPLICATIONS TO TIME-SERIES DATA

Identification problems involving time-series data (or wave forms) constitute a subset of pattern recognition applications that is of particular interest because of the large number of domains that involve such data (for instance, fusion databases). Structural approaches are particularly appropriate in domains where domain experts classify time-series data sets based on the arrangement of morphological events evident in the wave form (e.g., speech recognition, electrocardiogram diagnosis, seismic activity identification, radar signal detection, and process control). We are nevertheless interested in a domain-independent structural pattern recognition system, that is, one capable of acting as a "black box" that extracts primitives and performs searching without the need for domain knowledge.

Our method is applied to fusion databases with the aim of looking for similar patterns within wave forms. The technique consists of three stages. First, we preprocess the signal by applying a low-pass filter for smoothing purposes. Then we extract primitives, which encode the most elementary pieces of structural information of the pattern. Finally, we find similar patterns in a database of strings, where the patterns represent the temporal evolution of physical properties. It should be remarked that the technique allows searching for patterns of any time length. The general flow graph of our pattern searching method is shown in Fig. 1.

FIG. 3. Cost function for a signal.

FIG. 4. Primitive sequences in the coarse searching.

IV. COMPUTATION OF PRIMITIVES

Selection of primitives is an essential issue in the structural pattern recognition of wave forms because they determine what types of structural components we can construct. There are many ways to compute the primitives of wave forms, such as constant, straight, exponential, sinusoidal, triangular, and trapezoidal structures. We have used the straight structure (line segment), which is easy and fast to calculate. In our method we divide the original signal into segments (all with the same number of samples), each of which is fitted with a straight line obtained by a least squares minimization procedure. Then we encode these segments into a string of primitives. We give a label to each segment, and we calculate the amplitude between the first and the last sample of the primitive.

The labeling of the segment $\{(x_i, y_i), (x_j, y_j)\}$ is based on the classification of the slope of the fitted straight line. We find the primitive $P$ to which the angle of the line with the x axis belongs [Eq. (1)]:

$$ \mathrm{Lab}(\{(x_i, y_i), (x_j, y_j)\}) = P \quad \text{if} \quad \text{lower limit of } P < \arctan\left[\frac{y_j - y_i}{x_j - x_i}\right] \le \text{upper limit of } P. \tag{1} $$

Our discriminant values and the primitive labels are depicted in Fig. 2. The classification of the angle gives us all the elementary structural information needed to construct more complex subpatterns in wave form recognition. The amplitude of a segment is computed in a straightforward manner [Eq. (2)]:

$$ \mathrm{Amplitude}_{i,j} = |y_j - y_i|. \tag{2} $$

FIG. 5. Direct and inverse primitive sequences.

FIG. 6. Application scheme.

FIG. 8. Some shots returned and their matches.
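The segmentation, labeling, and amplitude steps of Eqs. (1) and (2) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the angle limits in `THRESHOLDS` are hypothetical placeholders (the actual discriminant values are those of Fig. 2, which are not reproduced in the text), the x axis is taken to be the sample index, and the amplitude is measured between the end points of the fitted line. The names `fit_line`, `label`, and `encode` are illustrative.

```python
import math

# Hypothetical angle limits in degrees (lower, upper, primitive).
# The real discriminant values are those of Fig. 2 and may differ.
THRESHOLDS = [
    (-90.0, -60.0, "z"),  # extremely steep descent
    (-60.0, -5.0, "d"),   # descending
    (-5.0, 5.0, "e"),     # flat
    (5.0, 60.0, "c"),     # ascending
    (60.0, 90.0, "a"),    # extremely steep ascent
]


def fit_line(ys):
    """Least-squares slope and intercept for samples at x = 0, 1, ..., len(ys) - 1."""
    n = len(ys)
    mean_x = (n - 1) / 2.0
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def label(angle_deg):
    """Eq. (1): primitive whose (lower, upper] interval contains the angle."""
    for lower, upper, primitive in THRESHOLDS:
        if lower < angle_deg <= upper:
            return primitive
    raise ValueError(f"angle {angle_deg} is outside all intervals")


def encode(signal, samples_per_primitive):
    """Return (string of primitives, list of amplitudes) for equal-length segments."""
    if samples_per_primitive < 2:
        raise ValueError("a segment needs at least two samples")
    codes, amplitudes = [], []
    last_start = len(signal) - samples_per_primitive
    for start in range(0, last_start + 1, samples_per_primitive):
        segment = signal[start:start + samples_per_primitive]
        slope, intercept = fit_line(segment)
        y_first = intercept
        y_last = intercept + slope * (len(segment) - 1)
        codes.append(label(math.degrees(math.atan(slope))))
        amplitudes.append(abs(y_last - y_first))  # Eq. (2)
    return "".join(codes), amplitudes
```

With these placeholder limits, a flat run followed by a unit-slope rise and a unit-slope fall is encoded as `ecd`. Because the angle depends on how the two axes are scaled, any real use would need the signal normalized consistently before labeling.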

Thus, the input to the system is composed of (1) a string of n primitives, where n is the number of samples divided by the number of samples per primitive, and (2) an array with the amplitude of each primitive. We use five different values (a, c, e, d, and z) to represent the classes of the angle. With a larger number of primitive classes we could have expressed the structure of the signal more accurately, but the final string would have been more complex. On the other hand, these five codes are just enough for a typical plasma evolution analysis. The code e represents a flat part of a signal, codes c and d represent the ascending and descending angles, and codes a and z represent the extremely steep ascending and descending slopes, respectively.

In order to obtain a suitable number of primitives (n), we minimize a cost function [Eq. (3)],

$$ J = \alpha A + (1 - \alpha) B, \tag{3} $$

where $0 \le \alpha \le 1$, A is the fitted error, and B is the primitive error (number of primitives). The fitted error is estimated between the filtered signal and the straight lines from the least squares minimization procedure. We evaluate the function until we obtain 100 primitives, and we select the value with minimum cost. Figure 3 shows an example of the evaluation of a cost function.
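A minimal sketch of how the cost function of Eq. (3) might be evaluated over candidate numbers of primitives follows. The text does not say how A and B are scaled before they are combined, so this sketch assumes that A is the sum of squared residuals divided by its maximum over all candidates and that B is n divided by the largest candidate. The names `fit_error` and `best_primitive_count` and the default `alpha=0.5` are illustrative choices, not values from the text.

```python
def fit_error(signal, n_primitives):
    """Sum of squared residuals of least-squares lines over n equal segments.

    Samples left over when len(signal) is not a multiple of n are ignored.
    """
    seg_len = len(signal) // n_primitives
    total = 0.0
    for k in range(n_primitives):
        ys = signal[k * seg_len:(k + 1) * seg_len]
        m = len(ys)
        mean_x = (m - 1) / 2.0
        mean_y = sum(ys) / m
        sxx = sum((x - mean_x) ** 2 for x in range(m))
        sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(ys))
        slope = sxy / sxx if sxx else 0.0
        total += sum(
            (y - (mean_y + slope * (x - mean_x))) ** 2 for x, y in enumerate(ys)
        )
    return total


def best_primitive_count(signal, alpha=0.5, max_primitives=100):
    """Return the n that minimizes J = alpha * A + (1 - alpha) * B, Eq. (3)."""
    max_n = min(max_primitives, len(signal) // 2)  # at least two samples each
    counts = list(range(1, max_n + 1))
    errors = [fit_error(signal, n) for n in counts]
    scale = max(errors) or 1.0  # assumed normalization of A
    costs = [
        alpha * err / scale + (1 - alpha) * n / max_n  # assumed normalization of B
        for n, err in zip(counts, errors)
    ]
    return counts[costs.index(min(costs))]
```

For a triangular signal that rises for ten samples and falls for ten, two primitives already give a zero fitted error, so the minimum cost is reached at n = 2. A larger alpha weights the fit more heavily and tends to select more primitives.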

V. TYPES OF SEARCHING

In our case, we define two types of searching: (1) fine searching and (2) coarse searching. In the fine searching we obtain from the database the patterns whose sequences of primitives match the input pattern sequence exactly. In the coarse searching, it is possible to associate different primitives with the same label. For example, if a ⇔ c and z ⇔ d, we would have the following set of primitive labels: [a or c], e, and [z or d], where [a or c] represents the ascending angle, e represents a flat part, and [z or d] the descending angle. In this case, if the input pattern sequence is aceedze, we will obtain from the database those sequences that fit the following combined sequence: [a or c], [c or a], e, e, [d or z], [z or d], and e. Observe that, in this example, there are 16 possible sequences (Fig. 4 shows two of these).

In the application it is also possible to search the database for wave forms with inverse polarity with regard to the input signal. For example, if the input pattern sequence is aacedzze, we would substitute the labels of the primitives by their symmetric counterparts (zzdecaae), as shown in Fig. 5.

FIG. 7. Input signal and section to search.

TABLE I. SQL results of searching example: codes.

Shot    Code
10108   aaazzzddeeeeeezaazzdzzeeeeeeeeeeeeeeeeee
10109   aaazzzzzzzzzdeedeeeeeeeeeeeeeeeeeeeeeeee
10110   aaazzzzzzzzzeeeeceeeeeeeeeeeeeeeeeeeeeee
10112   aazaedzdddeezdzaadzzzzeededeeeeeeeeedeee
10115   aaazzzzzzzzzdddecdeeeeeeeeeeeeeeeeeeeeee
10116   aaazzzzzzzeeeeecaddeeedeeeeeeeedeeeeeeee
10119   aeaazzzzzzzzddzcazzddddedeeeeeeeeeeeeeee
10120   aaazdzzzzzzzzdzaaazzdzzdeeeeeeeeeeeeeeeeee
10121   aaazeeddddddzdzaaezzzzzzzzzzdedeeeeeeeee
10152   azaaaaaaaaeaacaaaaaezcaezeaezzdaazzzzace
10153   aaaaaaccaaedaaeecaeeeezzedzeazzcaazzaaae
10155   aaaaezzeeddecezzeeeedzaazeaaadzacedcaeze
10176   aaazzzzzzdeeeeeeeeeeeeeeeeeeeeeeeeeeeeee
10194   aaaaaaccccccccceeeeeccececaccdeaaaaaczze
10222   aaaczzzzzzdeeeeceeeeeeeaaaeeeecaecccecee
10227   aazzzeeeeeecceeecceccaaaaezzdcceacdeeeee
10228   aazzzdeeeeeeeeceecacaaaaadzeezdceeaeecce
10104   aadzddeeeeeeeeeeeeceeecaazzzzzcczzaazzce

TABLE II. SQL results of searching example: distance errors.

Shot    Error M1
10108   0.155 175 645 91
10109   0.338 439 721 8
10110   0.266 122 453 8
10112   1.774 495 9321E-10
10115   0.283 615 803 3
10116   0.205 821 807 5
10119   0.107 452 797 05
10120   0.190 652 696 1
10121   0.147 236 874 64
10152   0.407 779 733 38
10153   0.247 111 540 27
10155   0.245 258 614 59
10176   0.642 049 205 7
10194   0.135 704 051 57
10222   0.132 745 187 24
10227   0.101 018 156 59
10228   0.079 871 870 697
10104   0.153 679 361 30

Error M2 (three entries; their rows are not recoverable from the source layout): 0.233 227 516 3, 0.114 520 788 97, 0.533 216 355 46
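The fine, coarse, and inverse-polarity searches of Sec. V can be illustrated with ordinary string matching. The sketch below is a Python illustration rather than the MATLAB and SQL implementation used in this work. The equivalences a ⇔ c and z ⇔ d and the inversion rule (a ↔ z, c ↔ d, e unchanged) are taken from the examples in the text, while the function names and the use of regular expressions are assumptions made for the example.

```python
import re

# Coarse-search equivalence classes from the text: a <-> c and z <-> d.
COARSE = {"a": "[ac]", "c": "[ac]", "e": "e", "d": "[dz]", "z": "[dz]"}

# Polarity inversion swaps ascending and descending primitives; e is unchanged.
INVERSE = str.maketrans("acedz", "zdeca")


def invert(pattern):
    """Return the primitive string of the same pattern with inverse polarity."""
    return pattern.translate(INVERSE)


def coarse_regex(pattern):
    """Compile a regular expression that accepts every coarse variant of pattern."""
    return re.compile("".join(COARSE[p] for p in pattern))


def search(codes, pattern, coarse=False, inverted=False):
    """Return (shot, start index) for each match in a {shot: code string} mapping.

    Matches are non-overlapping within one code string. With inverted=True the
    inverse-polarity pattern is searched in addition to the direct one.
    """
    targets = [pattern] + ([invert(pattern)] if inverted else [])
    hits = []
    for shot, code in codes.items():
        for target in targets:
            regex = coarse_regex(target) if coarse else re.compile(re.escape(target))
            hits.extend((shot, m.start()) for m in regex.finditer(code))
    return hits
```

For the input sequence aceedze, `coarse_regex` yields `[ac][ac]ee[dz][dz]e`, which accepts exactly the 16 sequences mentioned above, and `invert("aacedzze")` returns `zzdecaae`.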

VI. APPLICATION SCHEME

The types of searching mentioned above can be carried out with a relational database management system using Structured Query Language (SQL). In this work, we have used MICROSOFT ACCESS™ because it makes it easy to build a database and test our approach. Preprocessing and primitive computation of wave forms were done by means of MATLAB™; the MATLAB Database Toolbox6 was used for the link between MATLAB and ACCESS. The application algorithm is the following: first a user selects a shot, then chooses a section of the signal (pattern), and asks the application for a type of searching (fine or coarse). The MATLAB application carries out the preprocessing and primitive computation. After that, an SQL query is built according to the searching type selected. Finally, ACCESS sends the SQL results back to MATLAB, and the application shows all matches in the returned signals. Figure 6 shows the application scheme.

VII. SEARCHING EXAMPLE

In this example we select patterns from the TJ-II stellarator database. The input is a signal from a bolometer (BOL5) in shot 10112, as shown in Fig. 7. After that, a coarse direct and inverted searching is chosen. In this case, the SQL sentence is

select * from codes where Type like 'BOL5' and (code like '% [z, d] [z, d] [z, d] [a, c] [a, c] %' or code like '% [a, c] [a, c] [a, c] [z, d] [z, d] %')

Once ACCESS has executed the SQL query, it sends the results back to MATLAB, and finally the application shows the matches. Figure 8 depicts some of the matches, and Tables I and II show all SQL results. Note that the errors in the matches can be used to sort the SQL results.

VIII. DISCUSSION

We present a structural approach based on high level codification and SQL for pattern searching within wave forms. We have tested the system with several wave forms, and we have obtained excellent results in the detection of subpatterns. The computation time is suitable for interactive searching because we have used the power of a relational database management system for retrieving similar wave forms.

1 H. Nakanishi, T. Hotchin, M. Kojima, and LABCOM group, Fusion Eng. Des. 71, 189 (2004).
2 S. Dormido-Canto, G. Farias, R. Dormido, J. Vega, J. Sánchez, M. Santos, and TJ-II Team, Rev. Sci. Instrum. 75, 4254 (2004).
3 G. Farias, S. Dormido-Canto, J. Vega, J. Sánchez, N. Duro, R. Dormido, M. Santos, and G. Pajares, Fusion Eng. Des. (in press).
4 H. Nakanishi, T. Hotchin, M. Kojima, and LABCOM group, Fusion Eng. Des. (in press).
5 K. S. Fu, Syntactic Pattern Recognition and Applications (Prentice-Hall, Englewood Cliffs, NJ, 1982).
6 The MathWorks, Inc., Database Toolbox for use with MATLAB, User's Guide, Version 3, 1998–2006.
