Sequential Analysis, 2017, Vol. 36, No. 4, 528–540
https://doi.org/10.1080/07474946.2017.1394719
Symbolic pattern recognition for sequential data

Oguz Akbilgic^a and J. Andrew Howe^b

^a Center for Biomedical Informatics and Department of Preventive Medicine, University of Tennessee Health Science Center, Memphis, Tennessee, USA; ^b Independent Researcher, Riyadh, Saudi Arabia

ABSTRACT
Sources of sequential data surround and pervade our lives. Our bodies continuously generate sequential data such as heart rate and blood pressure. In global finance, stock indices and currency exchange rates change every second. The movement of clouds, the coordinates of the planets, the score of a soccer game, etc., are all examples of sequential data transitioning from one state to another in regular time steps. There is a mature body of literature on modeling time-dependent sequential data, or time series, in which every model relies upon specific assumptions for applicability to data. However, there are other kinds of sequential data that are not collected with respect to time; for example, DNA sequences (if we ignore gene mutations). Modeling of this type of sequential data does not have a body of literature that is as mature. This is somewhat perplexing, since sequential data modeling should cover both time-dependent and non-time-dependent data. In this study, we introduce a framework called Symbolic Pattern Recognition for modeling the pattern transition behavior of sequential data that is expressed by a finite alphabet of symbols. Our framework can be used to characterize, predict, simulate, cluster, and classify multiple series based on their observed pattern transition behaviors. We document our proposed framework and apply it to perform unsupervised clustering of 13 different species of mollusks based on their DNA sequences. Our model correctly clusters the mollusks into their respective genera, matching results obtained via more traditional, supervised genetic analysis.

ARTICLE HISTORY
Received 15 May 2016; Revised 25 May 2017, 23 June 2017; Accepted 8 July 2017

KEYWORDS
Genetic analysis; pattern recognition; sequential data; symbolic data; time series modeling

SUBJECT CLASSIFICATIONS
62H30; 60Jxx; 62P10
1. Introduction

Time series analysis for continuous data is a mature field of research with a robust literature. Most time series models are of an autoregressive nature, in which we find a functional stochastic relationship between a series and its lags, for example x_t ~ f(x_{t-1}, x_{t-2}, x_{t-3}, ...). Additionally, several time series techniques attempt to model the joint behavior of multiple series, such as the vector autoregressive integrated moving average model (Chatfield, 2009), which generalizes ARIMA (Granger and Joyeux, 1980). However, these models make the assumption, explicit or not, that the observations are drawn from a real-valued continuous distribution.
[email protected] UTHSC-ORNL Center for Biomedical Informatics, 50 N. Dunlap Ave., Suite 490R, Memphis, TN 38103, USA. Color versions of one or more of the figures in the article can be found online at www.tandfonline.com/lsqa. Recommended by Nitis Mukhopadhyay © 2017 Taylor & Francis
Another area of strong interest in time series modeling is the clustering of series based on their similarities. One method for comparing series is to align the observations by warping the time axis (Gullo et al., 2009); dynamic time warping is an algorithm based on this idea (Jayadevan et al., 2009; Rabiner and Juang, 1993). In addition, several methods express time series data as a collection of discrete events via piecewise discontinuous functions, including piecewise aggregate approximation (Keogh and Pazzani, 2000), symbolic aggregate approximation (Lin et al., 2003), and discrete wavelet transforms (Chan and Fu, 1999). All of these methods convert time series data into static data so that existing algorithms for clustering static data can be used directly (Liao, 2005).

Furthermore, time series data are a special case of sequential data, in which patterns and functional relationships depend on the sequence of observations. The genetic information encoded in DNA is an example of sequential data that has no time component. Since time series are a special case, a general, inclusive framework for modeling symbolic sequential data should be applicable to time series data as well.

In this work, we develop a new statistical modeling framework called Symbolic Pattern Recognition (SPR) to model pattern transition behavior in sequential discrete data. The only requirement is that a modeled series be composed of a sequence drawn from a finite, constant alphabet of symbols. Though our examples focus on letters, a "symbol" can be anything symbolic: numbers, letters, images, directions of movement, etc. SPR can be seen as a generalization of discrete Markov chains, in that both model a system with finite states and transition probabilities that define the likelihood of moving from one state to another (Kelleher et al., 2015). Markov chains have the property that each state depends only on the previous state. SPR, however, models the sequential behavior of a series by considering the transition behavior of longer patterns of states.

By doing so, we are in eminent company. In his seminal work on communication theory, Shannon showed that simulated sequences over an alphabet of letters generated better and better approximations to real English as the frequencies of longer patterns were considered, as shown in these examples (Shannon and Weaver, 1949, p. 43):

First-order: OCRO HLI RGWR NMIELWIS EU LL NBNESEBYA TH EEI ALHENHTTPA OOBTTVA NAH BRL

Second-order: ON IE ANTSOUTINYS ARE T INCTORE ST BE S DEAMY ACHIN D ILONASIVE TUCOOWE AT TEASONARE FUSO TIZIN ANDY TOBE SEACE CTISBE

Third-order: IN NO IST LAT WHEY CRATICT FROURE BIRS GROCID PONDENOME OF DEMONSTURES OF THE REPTAGIN IS REGOACTIONA OF CRE

As opposed to the first-order sequence, the sequences generated using digram and trigram frequencies better capture the transition behavior of real English. (A minimal code sketch of this digram-based generation follows the section outline below.)

In introducing our SPR framework, since it is designed for series expressed with a finite alphabet of symbols, we first discuss discretization of continuous series. In Section 3.1, we show how to identify the optimal pattern memory to model, which is akin to selecting the optimal number of autoregressive lags. We show how to predict the next symbol, conditioning on recent observations, in Section 3.2, and then propose a robust simulation algorithm in Section 3.3.
Section 3.4 details measuring similarity between series, and we show how this can be leveraged to perform unsupervised clustering of series.
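To make the digram idea concrete, here is a minimal Python sketch of second-order (digram) text generation in the spirit of Shannon's examples. It is our illustration, not code from Shannon or from the SPR framework; the corpus, the function name digram_generate, and the parameter choices are all hypothetical.

```python
import random
from collections import Counter, defaultdict

def digram_generate(text, length, seed=None):
    # Build, for each symbol, a frequency table of its observed successors.
    successors = defaultdict(Counter)
    for a, b in zip(text, text[1:]):
        successors[a][b] += 1
    rng = random.Random(seed)
    out = [rng.choice(text)]  # start from a randomly drawn symbol
    for _ in range(length - 1):
        nxt = successors.get(out[-1])
        if not nxt:
            # Dead end: the last symbol was never observed with a successor.
            out.append(rng.choice(text))
            continue
        symbols, weights = zip(*nxt.items())
        # Draw the next symbol in proportion to its digram frequency.
        out.append(rng.choices(symbols, weights=weights, k=1)[0])
    return "".join(out)

# Illustrative corpus; any English text would do.
corpus = "THE QUICK BROWN FOX JUMPS OVER THE LAZY DOG "
print(digram_generate(corpus, 60, seed=1))
```

With a richer corpus, the output drifts toward more English-like letter combinations, which is exactly the effect the excerpts above illustrate.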
For expository purposes, we use a sample discrete series, S = {aabcabccbabcabcbaabc}. In Section 4, we apply the SPR methods to DNA data from several species of mollusks. The data are available from the National Center for Biotechnology Information's nucleotide resource website, http://www.ncbi.nlm.nih.gov/nuccore. However, the problem of classification of mollusks was inspired by a student research activity entitled "Comparing DNA Sequences to Determine Evolutionary Relationships among Mollusks," published by the Howard Hughes Medical Institute (2009; http://media.hhmi.org/biointeractive/activities/shells/shell-dna.pdf).
2. Discretizing continuous data

Though SPR is designed for discrete data, it can also be valuable to apply the methods to continuous data. Indeed, continuous data can be considered discrete with an infinite alphabet of symbols. Before we can use SPR to model continuous data, we must first apply a preprocessing step to discretize it into a finite alphabet. Though it may seem counterintuitive, many statistical learning algorithms are known to perform better on discretized continuous data (Kotsiantis and Kanellopoulos, 2006).

A continuous variable X is discretized by assigning a discrete symbol to each observation. This discretization can be done in many different ways. For example, if we can assume that X is stationary and follows a normal distribution N(µ, σ), we could discretize X into quartiles, resulting in an alphabet of {a, b, c, d}:

a: −∞ < x ≤ Φ⁻¹(0.25; µ, σ)
b: Φ⁻¹(0.25; µ, σ) < x ≤ Φ⁻¹(0.50; µ, σ)
c: Φ⁻¹(0.50; µ, σ) < x ≤ Φ⁻¹(0.75; µ, σ)
d: Φ⁻¹(0.75; µ, σ) < x < ∞

where Φ⁻¹(p; µ, σ) denotes the quantile function of N(µ, σ). This is also shown visually in Figure 1.

Consider a sample stationary data set X with observations X = [40.38, 38.25, 40.84, 42.40, 40.33, 40.68, 42.31, 43.14, 41.19, 39.24, 41.93, 42.84, 39.64, 41.20, 43.06, 41.46, 40.20, 39.68, 41.97, 42.23], which can be described by a normal distribution with µ = 42 and σ = 2. The quartile cutoff points are X_25 = 40.65, X_50 = 42, and X_75 = 43.35. Using these three cutoff points, X is discretized as S = {aabcabccbabcabcbaabc}.

The number of quantiles determines the number of symbols, n_s, in the discretized alphabet. A small n_s discards too much information, whereas a large n_s increases computational expense. Thus, identification of the optimal n_s is an important consideration when applying SPR methods to continuous data. In addition to the equal-probability discretization shown here, other methods could be used, such as uniform-width binning, clustering methods like k-means, symbolic aggregate approximation (Lin et al., 2003), or even more complex sequence-adaptive techniques. Depending on the data and problem, domain expert knowledge can also be used for discretization. As an example, consider measurement of a patient's systolic blood pressure over time. Rather than using static cutoff points to discretize observations as low, normal, or high, it can make more sense to vary the thresholds in response to the potentially dynamic clinical condition of the patient.
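As a concrete companion to the example above, the following is a minimal Python sketch of quantile discretization under the stated N(µ = 42, σ = 2) assumption; the function name discretize is ours, and the code is an illustration rather than the authors' implementation.

```python
from bisect import bisect_left
from statistics import NormalDist

def discretize(xs, mu, sigma, probs=(0.25, 0.50, 0.75), alphabet="abcd"):
    # Cutoffs are the quantile function Phi^{-1}(p; mu, sigma) at each p.
    dist = NormalDist(mu, sigma)
    cutoffs = [dist.inv_cdf(p) for p in probs]
    # bisect_left reproduces the closed-on-the-right intervals above:
    # x <= cutoff maps to the lower symbol, x > cutoff to the higher one.
    return "".join(alphabet[bisect_left(cutoffs, x)] for x in xs)

X = [40.38, 38.25, 40.84, 42.40, 40.33, 40.68, 42.31, 43.14, 41.19, 39.24,
     41.93, 42.84, 39.64, 41.20, 43.06, 41.46, 40.20, 39.68, 41.97, 42.23]
print(discretize(X, mu=42, sigma=2))  # -> aabcabccbabcabcbaabc
```

Note that no observation exceeds X_75 = 43.35, so the symbol d does not appear in S.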
Figure 1. Discretization using quantiles.
3. Symbolic pattern recognition (SPR)

Our SPR framework is based on learning pattern transition behaviors in sequential data. Based on these probabilistic transition behaviors, we can predict future behavior, simulate similar series, and cluster series.

3.1. Learning pattern transition behavior

SPR aims to learn and model the pattern transition behavior in a discrete series represented with n_s unique symbols, in a way similar to n-grams in natural language processing (Broder et al., 1997). To do this, SPR looks for joint occurrences of observed patterns of length n_p followed by a single symbol; for example, a two-symbol pattern ba followed by d, or a three-symbol pattern acd followed by c. This maximum pattern length n_p is an important parameter of SPR, especially when very long sequences of data are modeled. By observing the frequency with which these patterns and transitions occur, we can infer the transition probabilities governing how the series evolves.

As an example, consider our sample series S = {aabcabccbabcabcbaabc}, defined over the alphabet {a, b, c}. We see the pattern ab occurring five times, always followed by a c: {a|abc|abc|cb|abc|abc|ba|abc|}. Table 1 shows the i-symbol pattern transition frequencies (PTF_i) and probabilities (PTP_i) for S, with patterns of length i = n_p = 2, in lexicographic order. We see that the pattern bc is observed in S four times followed by another symbol (fifth column, Total); specifically, we observe bca twice, and bcb and bcc once each. According to PTP_2, if we later observe aa, ca, or cb, we can reasonably expect that the next symbol will be b, assuming the observed sequence S appropriately captures the underlying pattern transition behavior.

For an alphabet of n_s symbols, there are n_s^{n_p} possible n_p-length patterns. Hence, PTF_i is a matrix of size at most n_s^i × (n_s + 1), and PTP_i is at most n_s^i × n_s. If any of the possible patterns are unobserved, the matrices will be accordingly smaller; note how ac is never observed in S, so it is not shown in Table 1. A short computational sketch follows Table 1.
Table 1. PTF_2 and PTP_2 for S. Columns a, b, c give the next symbol following each pattern.

                PTF_2                       PTP_2
Pattern     a    b    c   Total        a      b      c
aa          0    2    0     2        0.00   1.00   0.00
ab          0    0    5     5        0.00   0.00   1.00
ba          1    1    0     2        0.50   0.50   0.00
bc          2    1    1     4        0.50   0.25   0.25
ca          0    2    0     2        0.00   1.00   0.00
cb          2    0    0     2        1.00   0.00   0.00
cc          0    1    0     1        0.00   1.00   0.00
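The sketch below (our illustration, not the authors' code) computes PTF_i and PTP_i for a symbol string using nested frequency tables; the name pattern_transitions is ours. Running it on S reproduces the bc and ab rows of Table 1.

```python
from collections import Counter, defaultdict

def pattern_transitions(S, i):
    # PTF_i: for each observed length-i pattern, count each successor symbol.
    ptf = defaultdict(Counter)
    for k in range(len(S) - i):
        ptf[S[k:k + i]][S[k + i]] += 1
    # PTP_i: normalize each row of counts into transition probabilities.
    ptp = {pat: {sym: cnt / sum(row.values()) for sym, cnt in row.items()}
           for pat, row in ptf.items()}
    return ptf, ptp

S = "aabcabccbabcabcbaabc"
ptf2, ptp2 = pattern_transitions(S, 2)
print(sorted(ptf2["bc"].items()))  # [('a', 2), ('b', 1), ('c', 1)], as in Table 1
print(ptp2["ab"])                  # {'c': 1.0}: ab is always followed by c
```

Because only observed patterns appear as keys, the unobserved pattern ac is simply absent, mirroring its omission from Table 1.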
For a given series of length n, up to n − 1 pattern transition matrices can be calculated, one for each pattern length i = 1, 2, ..., n − 1. Optimal determination of n_p is important to balance computational cost against the information modeled. When n_p is small, the method may not truly learn the pattern transition behavior of the series. On the other hand, setting n_p too high is computationally expensive and results in a sparse transition matrix. To select the optimal value of n_p, we propose a criterion called the sparsity index (SI_i), defined as the ratio of the number of observed patterns to the number of possible i-length patterns. For example, for the digram patterns of S, we observed seven patterns in Table 1, though there are nine possible such patterns, so SI_2 = 7/9 = 0.778. The sparsity indices for S are shown in Table 2.

In the SPR algorithm, we sequentially compute PTP_i and SI_i, looping over i = 1, 2, ..., n − 1 and using a subjectively set threshold t_np to determine n_p:

1) Set i = 1.
2) Compute PTP_i and SI_i.
3) If SI_i ≥ t_np, increment i and go to step 2; otherwise, set n_p = i − 1 and exit the loop.

Using the sparsity indices for S in Table 2, we set n_p = 2 when the threshold is 0.5, but if t_np = 0.1, we would prefer n_p = 4. Admittedly, this merely pushes the optimization back from determining n_p to determining t_np. However, since the primary purpose of t_np is as a stopping criterion that allows a sufficiently large n_p, we can set it to a low value, such as 1E−4. Alternatively, for series that are not too long, we may instead compute all transition matrices for i = 1, 2, ..., n − 1 and plot i against SI_i. As with the interpretation of scree plots, we can then select n_p by identifying the largest deflection point, or the point where the curve becomes almost horizontal. A small sketch of this selection rule follows Table 2.

Table 2. Sparsity indices SI_i for S.

  i    Observed    Possible     SI_i
  1        3             3     1.000
  2        7             9     0.778
  3       10            27     0.370
  4       13            81     0.161
  5       13           243     0.054
  6       13           729     0.018
  7       13         2,187     0.006
  8       12         6,561     0.002
  9       11        19,683     0.001
 ≥10     ≤10       ≥59,049    <0.001
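To close the subsection, here is a minimal Python sketch (ours, with hypothetical names sparsity_index and select_np) of the sparsity index and the threshold-based selection of n_p. Following Table 2, it counts only patterns that are followed by at least one symbol, and it reproduces SI_2 = 0.778 as well as n_p = 2 at threshold 0.5 and n_p = 4 at threshold 0.1.

```python
def sparsity_index(S, i, ns):
    # SI_i: distinct observed length-i patterns (those followed by at least
    # one symbol) divided by the ns**i possible length-i patterns.
    observed = {S[k:k + i] for k in range(len(S) - i)}
    return len(observed) / ns ** i

def select_np(S, ns, threshold=1e-4):
    # Grow the pattern length while SI_i stays at or above the threshold.
    i = 1
    while i < len(S) and sparsity_index(S, i, ns) >= threshold:
        i += 1
    return i - 1

S = "aabcabccbabcabcbaabc"
print(round(sparsity_index(S, 2, ns=3), 3))  # 0.778, matching Table 2
print(select_np(S, ns=3, threshold=0.5))     # np = 2
print(select_np(S, ns=3, threshold=0.1))     # np = 4
```

For long series, computing SI_i incrementally and stopping at the threshold avoids enumerating the exponentially many possible patterns, which is the practical motivation for the loop above.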