AN ASSOCIATIVE MEMORY MODEL FOR UNSUPERVISED SEQUENCE PROCESSING

Stefan V. Pantazi, Jochen R. Moehr
School of Health Information Science, University of Victoria, Victoria, BC, Canada
{spantazi,jmoehr}@uvic.ca

ABSTRACT

We introduce the design principles and present formally the building block of an associative memory model capable of unsupervised sequence processing: the constrained partially ordered set. We then use the model in a series of experiments, presented in increasing order of complexity, and conclude that it demonstrates interesting information processing capabilities which warrant future development.
1. INTRODUCTION

Algorithmic Information Theory (AIT) and the notion of Kolmogorov complexity [1] unify the fields of computer science and information theory. It is also said [2] that rather than focusing on ensembles and probability distributions as in classical information theory, AIT focuses on the algorithmic properties of individual objects. In the context of this pattern processing perspective on information theory, the notion of randomness hinges on the ability (or inability) to detect patterns in data [3]. We focus our attention on connectionist information processing models that work on parallel and distributed processing (PDP) principles [4], and which are able to detect certain types of patterns (or regularities) present in input data in an on-line, completely unsupervised manner. If such a model finds regularities in input data, it will create compressed representations of that input data. Further, we want a representation to be a growing layered hierarchy, with smaller patterns occurring at lower layers of the model, and more complicated patterns, built on previously identified smaller patterns, in higher layers (Figure 1 and Figure 2; compositional hierarchy [5]). From a mathematical perspective, such a structure is a collection of partially ordered sets, or posets. Because the growth of the hierarchy is controlled by constraints, we name this structural design approach a constrained compositional hierarchy or a constrained poset (Figure 2).
Similarity-based retrieval is difficult on serial computers. Although not the object of this paper, we expect a simulation of a model built on PDP principles to advance similarity-based retrieval, a process we consider key to human cognition [6]. In addition to building and retaining a compressed representation of the input data, we require a representation to be complete, i.e., input data should be completely retrievable from the model. Such a design is functionally an associative memory (AM), capable of pattern recognition and able to retrieve input data based on similarities with a query. This paper is inspired by the work of Elman in [7]. Accounts of compositionality and associative memory models can be found in cognitive science (e.g., [8]), psychology (e.g., [9]), as well as in recent doctoral research [5, 10].

2. FORMAL DESCRIPTION OF MODEL

A poset is a base set X together with a binary relation
≤ which is reflexive, antisymmetric and transitive. Posets are generalizations of trees and can be depicted using Hasse diagrams such as in Figure 1 and Figure 2.
Figure 1. Unconstrained, complete poset (compositional hierarchy) built from the input sequence ghij. The inputs are associated with symbols {g, h, i, j}.
Definition 1. Let Σ be an alphabet. Let s∈Σ+. Let U = {w | s = xwy, w∈Σ+, x,y∈Σ*}, i.e., the set of all non-empty, unique substrings (words) of s, including s itself. Let X be the set of pairs (w,c)∈U×Z+ such that c∈Z+ is the count of occurrences in s of the string w∈U. Define a binary relation ≤ on X by (w1,c1) ≤ (w2,c2), for (w1,c1),(w2,c2)∈X, such
that w2=xw1 or w2=w1x for some x∈Σ. The poset P=(X, ≤) is an unconstrained poset. For example, in Figure 1 we present the unconstrained poset built from s=ghij, i.e., in the particular case when the length of the string s is d(s)=|Σ|=4. From the definition it follows that, in this case, the cardinality of U, |U|=|X|=4×(4+1)/2=10, i.e., equal to the number of unique substrings of ghij.
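As an illustration of Definition 1, the following is a minimal, hypothetical Python sketch (not the authors' code): it enumerates the base set of an unconstrained poset, i.e., every unique non-empty substring of an input string together with its exact occurrence count. The function name and the brute-force enumeration are our own assumptions; the authors' implementation instead works on-line with lower-bound count estimates (see the discussion of Definition 2 below).

# Hypothetical illustration of Definition 1 (not the authors' code): the base
# set of the unconstrained poset is every unique non-empty substring of s,
# paired with its exact occurrence count.
from collections import Counter

def unconstrained_base_set(s: str) -> dict:
    """Return {substring: count} for all unique non-empty substrings of s."""
    counts = Counter()
    for i in range(len(s)):
        for j in range(i + 1, len(s) + 1):
            counts[s[i:j]] += 1   # count every (possibly overlapping) occurrence
    return dict(counts)

# The example of Figure 1: s = "ghij" has 4*(4+1)/2 = 10 unique substrings,
# each occurring exactly once.
print(len(unconstrained_base_set("ghij")))   # 10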
Figure 2. Two-layer model built from the input sequence abcdefcdefababefcdefabcdcdefabcdefefefabcdef. The oval input nodes of the second layer are elements in the first layer. Each node contains a lower-bound estimation of the pattern count (e.g., (cde,4)).
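One way to read the caption's statement that the second-layer inputs are first-layer elements is that, once a layer has acquired its patterns, the input is segmented into those patterns and the resulting pattern sequence becomes the input of the next layer. The following Python sketch illustrates such a step under assumptions of our own: the paper does not specify the segmentation procedure, so greedy longest-match segmentation and the function name segment are hypothetical.

# Hypothetical sketch: segment an input string into already acquired patterns so
# that the resulting token sequence can feed the next layer (greedy longest
# match is an assumption; the paper does not specify this procedure).
def segment(s: str, patterns: set) -> list:
    """Greedy longest-match segmentation into known patterns (single symbols as fallback)."""
    out, i = [], 0
    max_len = max(map(len, patterns), default=1)
    while i < len(s):
        for length in range(min(max_len, len(s) - i), 0, -1):
            piece = s[i:i + length]
            if length == 1 or piece in patterns:
                out.append(piece)
                i += length
                break
    return out

# Toy example with a small first-layer pattern set:
print(segment("virvezloxmorhuj", {"vir", "vez", "lox", "mor", "huj"}))
# ['vir', 'vez', 'lox', 'mor', 'huj'] -- this token sequence would feed the next layer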
Definition 2. An unconstrained poset P=(X, ≤) is a constrained poset (associative memory layer) relative to α,β∈N if d(w2)·log2(c2+β) − d(w1)·log2(c1) ≥ 0 and c1,c2 > α for all (w1,c1),(w2,c2)∈X such that (w1,c1) ≤ (w2,c2).
In our implementation, pattern counts are lower-bound estimations of real counts (e.g., the actual count of cde in Figure 2 is 6) due to combinatorial complexity. The problem of determining the base set X of a constrained poset P, given an input string s, is the problem of unsupervised lexical acquisition. This constrained approach solves the problem of hidden unit allocation and provides ways of coping with combinatorial complexity issues through the constraint parameters α and β. An intuitive way to think of α and β is as a combination of inhibiting and facilitating forces that drive the hierarchical growth of the model. The configurations of constraint parameters are referred to as normal constraints (α=1, β=0), tight constraints (α>1), relaxed constraints (α=1, β>0) and no constraints (α=0, β≥0). For the input string of length 44 from Figure 2, normal constraint conditions result in X={(a,6), (b,6), (c,7), (d,7), (e,8), (f,8), (ab,6), (fa,4), (ef,8), (bc,4), (cd,7), (de,5), (cde,4)} for the first layer. No constraints would cause the cardinality of X, |X|, to become 44×(44+1)/2=990 elements, compared to 13 in the normally constrained case. X would include the input sequence itself as well as many other long substrings which appear only once in the input (e.g., befc). The constraints are therefore paramount for dealing with combinatorial issues and for driving the poset's growth process.

Hebbian learning is a simple, effective and biologically plausible strategy for associative learning: "if two nodes on either side of a synapse (connection) are activated simultaneously (i.e., synchronously) then the strength of that synapse is selectively increased" [11]. Our approach to modeling a Hebbian synapse is built into the definition of a constrained poset on the counts of related patterns (e.g., a and ab). These counts form a time-dependent, local and correlational mechanism that can be employed in various information-theoretic models of measuring association strength. For example, the higher and closer the counts of connected (similar) patterns (e.g., (ab,6) with (a,6) and (b,6) in Figure 2), the stronger their "association." We measure and compare such "associations" using the logarithms of counts (Definition 2) because of the importance of logarithmic functions in information theory and in modeling biological systems, and because of our positive empirical evaluations of such a measure.
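To make the constraint mechanism concrete, the following is a minimal, hypothetical sketch of one greedy reading of Definition 2. It is not the authors' implementation, which works on-line with lower-bound count estimates, so its output will not exactly match the layer listed above; the function names, the maximum pattern length and the exact brute-force counting are all our assumptions.

# Hypothetical, greedy reading of Definition 2 (not the authors' implementation):
# a pattern w2 is admitted into the layer only if it extends an already admitted
# pattern w1 by one symbol, its count exceeds alpha, and
#     d(w2)*log2(c2 + beta) - d(w1)*log2(c1) >= 0
# holds for every such admitted w1.
import math
from collections import Counter

def substring_counts(s: str, max_len: int) -> Counter:
    """Exact occurrence counts of all substrings of s up to max_len symbols."""
    counts = Counter()
    for i in range(len(s)):
        for j in range(i + 1, min(i + max_len, len(s)) + 1):
            counts[s[i:j]] += 1
    return counts

def build_layer(s: str, alpha: int = 1, beta: int = 0, max_len: int = 8) -> dict:
    """Grow one associative memory layer under the (alpha, beta) constraints."""
    counts = substring_counts(s, max_len)
    layer = {w: c for w, c in counts.items() if len(w) == 1 and c > alpha}
    for length in range(2, max_len + 1):
        for w2, c2 in counts.items():
            if len(w2) != length or c2 <= alpha:
                continue
            parents = [w for w in (w2[:-1], w2[1:]) if w in layer]
            if parents and all(
                length * math.log2(c2 + beta) - len(w1) * math.log2(layer[w1]) >= 0
                for w1 in parents
            ):
                layer[w2] = c2
    return layer

# Normal constraints (alpha=1, beta=0) on the length-44 sequence of Figure 2:
s = "abcdefcdefababefcdefabcdcdefabcdefefefabcdef"
print(sorted(build_layer(s).items(), key=lambda p: (len(p[0]), p[0])))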
3. EXPERIMENTS AND RESULTS
Artificial data which contains nonsensical lexical items brings human processors closer to a more primitive information processing model by removing their powerful semantic processing capabilities. This may provide insights into processing mechanisms. We therefore begin the exploration of the AM model's performance with experiments based on such artificially created data.

3.1. Experiment #1

virvezloxmorhujsantegvirzopsancowlic
poshujhujloxzopmarposmorcowfanvirlic
morfanbacsansanfanzopvirposkilposfan
loxlicfanloxhujlictegzopmarkilbacsan
licsanzoptegsanmorposzopsanmormorhuj
virloxzopfanlicsancowtegtegmorposlic
posloxzopkilvezhujcowhujsanlichujzop
posbacmormarfanzopmarloxvirhujvezsan
sanmarloxmormarvezbacsanbacsanfancow
fanmorsanfancowtegvezposhujtegkillox
markiltegloxvirloxmormormorcowfanhuj
tegposzoplicloxkilposlicmarmorteglox
backilcowloxposmarloxlicloxvirtegmar
virvirsancowmarlicbacmarmarposbacvir
morhujvirzoploxvirtegteglicvirsanhuj
Table 1. Artificial input sequences for the first experiment. The highlighted patterns, which arise from the 2-dimensional arrangement of the 1-dimensional sequences, are discussed in experiment #4.
For the first experiment we used a restricted lexicon of 15 "words" of 3 symbols each. We created 15 sequences of 36 symbols using 12 randomly selected "words" from the lexicon, with no separators at the word boundaries (a sketch of this generation step is given after Table 2).

         Layer 1                          Layer 2
Pattern  Count   Pattern  Count   Pattern  Count   Pattern  Count
z        5       vir      14      vir      14      vez      5
w        9       san      17      san      17      cow      9
f        11      huj      13      huj      13      fan      11
b        8       pos      13      pos      13      bac      8
k        7       lic      13      lic      13      kil      7
ve       5       teg      13      teg      13
an       11      zop      12      zop      12
co       9       mor      15      mor      15
ac       8       lox      17      lox      17
il       7       mar      13      mar      13
Table 2. Patterns acquired in the first two layers of the model. The second layer contains the complete lexicon of 15 "words".
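For reference, the following is a hypothetical Python sketch of the data-generation step described above; the lexicon is read off Table 2, while the function name, the seed and the random-number source are illustrative assumptions, so it will not reproduce Table 1 exactly.

# Hypothetical sketch of the data generation used in experiment #1: 15 sequences
# of 12 randomly chosen three-letter "words" (36 symbols each), concatenated
# without separators. The lexicon is taken from Table 2.
import random

LEXICON = ["vir", "san", "huj", "pos", "lic", "teg", "zop", "mor",
           "lox", "mar", "vez", "cow", "fan", "bac", "kil"]

def make_sequences(n_sequences: int = 15, words_per_sequence: int = 12,
                   seed: int = 0) -> list:
    rng = random.Random(seed)
    return ["".join(rng.choice(LEXICON) for _ in range(words_per_sequence))
            for _ in range(n_sequences)]

for line in make_sequences():
    print(line)   # one 36-symbol sequence per line, no word boundaries marked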
The sequences in Table 1 are fed into the AM model, which builds a hierarchical representation of the input using normal constraints. The lexicon is not presented to the model explicitly. In the representation, the AM model "discovers" the letter patterns that make up the lexicon (unsupervised learning). The patterns discovered are shown in Table 2. The second layer shows all and nothing but the 15 "words" used to create the input. The reason the network did not "discover" the entire lexicon in the first layer has to do with the ambiguity of some patterns which contain identical symbols (e.g., "a" in san and fan, or "o" in cow, lox and zop). Non-ambiguous patterns (e.g., huj) or patterns with little ambiguity (e.g., teg) are represented in the first layer. Constraint relaxation yields an interesting result: the model is able to pick up longer patterns (i.e., phrases such as morhuj, zopmar, mormor, loxvir, morcow, bacsan).

3.2. Experiment #2
[Table 3 data: layer-by-layer pattern listing. The first layer contains single letters and short fragments (e.g., l, c, z, s, w, h, m, r, x, n, f, po, ow, ki, an, he, mae, aop, sae, fao); the second layer contains partially and fully acquired "words"; the third layer contains the complete 30-"word" lexicon: vir, san, huj, pos, lic, teg, zop, mor, lox, mar, vez, cow, fan, bac, kil, heuj, lioc, pois, cuow, teag, zaop, viur, saen, baic, faon, maer, loix, meor, veez, kiel.]
Table 3. Patterns acquired in the first three layers of the model. The complete acquisition of the lexicon is attained in the third layer due to the increased ambiguity of input data.
The second experiment is very similar to the previous one, except that we enriched the lexicon with an additional 15 "words." The additional lexical items are 4-letter patterns derived from the original lexicon (e.g., heuj from huj), in order to increase the ambiguity of the input. The input
data we created contains 30 sequences of variable length, each with 15 randomly selected words from the new lexicon and without separators at word boundaries. Under normal constraints, the model performs well and discovers the entire lexicon in the third layer (Table 3). Slight constraint relaxation yields 2-word "phrases" such as liczop, faonsan, posmar, virsan, licfan, liockil, cuowteag, maervez, kielmaer, etc. The fact that 3-word patterns are not found is also interesting and gives a hint about the properties of the input data, which indeed, upon visual inspection, does not seem to contain 3-word "phrases."

3.3. Experiment #3
The third experiment involves the creation of artificial input data using 1000 randomly selected words from a lexicon of only three words but of different lengths (ba, dii and guuu), which results in sequences such as baguuudiiguuuguuudiidii…. This setup was originally described in [7] in order to test a recurrent artificial neural network model particularly suitable for sequence processing, using a supervised learning paradigm. The task of the network was to predict the next letter in the sequence. The error signal was plotted, showing an increase at word boundaries, i.e., before each of the consonants b, d and g [7]. In our experiment, the AM model is able not only to discover the lexicon completely, but also to pick up higher-level regularities of the input data, i.e., "phrases," "propositions" and "sentences" which comprise combinations of several "words." In [7] Elman refers to auto-associative models and briefly acknowledges their potential use for sequence processing. Aside from the more powerful unsupervised learning paradigm of the model, the results obtained here demonstrate superior sequence processing capabilities and provide a closer look at what exactly happens inside the model, i.e., how processing is achieved. In this particular experiment, the input data does not contain any ambiguity: symbols which make up individual words are specific to those words (e.g., b and a to ba, g and u to guuu). From a sequence processing, pattern acquisition or lexical acquisition standpoint the problem should be trivial. As expected, under normal constraints the AM acquires the lexical items ba, dii and guuu. However, in the same first layer, the model also discovers patterns such as iba, idii and diidi in an attempt to achieve an additional compression of the internal representation of the input. In subsequent layers, the AM acquires 2-word "phrases" such as baguuu, guuuba, baba, diiba, diiguuu, diidii and guuuguuu as well as multi-word patterns such as diidiidii, diidiiba, badiiguuu, guuudiidiidii, diidiiguuu etc. As
expected, constraint relaxation causes the AM to pick up "sentences" such as diidiibaguuuguuu, badiibaguuuguuu, guuudiiguuudiidiiba, and even baguuudiiguuuguuudiiguuu, a 7-word "sentence" that appears twice in the input.

3.4. Experiment #4
For this experiment we use "real" data consisting of DNA and protein sequences (e.g., TACCCAGGAAAAGC…) of the SARS virus [12]. As in previous experiments, we expect the model to be able to acquire "words," but this time the "words of genetics" corresponding to the Universal Genetic Code, shared by all living organisms. Normal constraints cause the AM to acquire certain patterns in its first layer and to exhibit relatively small differences of acquisition in higher layers. We used our knowledge that amino acids' codes (i.e., codons, the "words" of the genome) are always sequences of 3 nucleotides. The acquisition of some 4-letter patterns indicated the need to tighten the pattern acquisition constraints. This action yields the results in Table 4, showing an improved acquisition of the codons.
Code  a-acid    Code  a-acid    Code  a-acid    Code  a-acid
G               ACA   Thr       TGT   Cys       AAA   Lys
C               TGC   Cys       CAA   Gln       ATT   Ile
A               ACT   Thr       TTT   Phe       TTG   Leu
T               AGA   Arg       ATG   Met       CAG   Gln
                TGA   Stop      CTA   Leu       AAG   Lys
                CTT   Leu
Table 4. The 20 most frequent patterns acquired with tightened constraints.
The protein sequences derived from the SARS genome are represented by sequences of amino acids using a single-letter coding (e.g., MESLVLGVNEKTHVQLSLPVLQ…). Unexpectedly, the AM model is hardly able to pick up 2-letter patterns (e.g., LL, LS, AL) from the sequence. This is a sign that the randomness present in this particular input data is greater than in the data used in previous experiments. This randomness would translate into the primary structure of SARS proteins. The potential implications of this insight for proteomics, if any, are beyond the scope of this paper. However, we take the opportunity to acknowledge some limitations of the representation of 1-dimensional sequences (primary structure) generated from objects (e.g., proteins) which are known to also possess additional (secondary, tertiary and quaternary) structural properties which determine their 3-dimensional configuration. In other words, non-contiguous regularities which are determined by particular dispositions of a sequence in a 2- or 3-dimensional space are not going to be acquired by this model. For example, the 2-dimensional configuration of the input sequences used in experiment #1 exhibits
certain regularities (ml, 9 times; rr and aa, 8 times; ioo, 6 times; oooa, 4 times) when considered as a transposed collection of 36 column sequences of 15 symbols rather than as a collection of 15 row sequences of 36 symbols. Such patterns will never be picked up by the model except in the case of a transposed input. Detecting first-diagonal patterns such as lag and orv would also require the appropriate transposition of the input data.
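For illustration, the following is a hypothetical sketch of the transposition mentioned above; the helper name and the use of only the first three Table 1 rows are our own choices.

# Hypothetical illustration: reading the Table 1 sequences column-wise exposes
# vertical patterns (e.g., ml, rr, aa) that the 1-dimensional model cannot see
# in row order.
def transpose(rows: list) -> list:
    """Turn n rows of equal length into one string per column, read top to bottom."""
    return ["".join(row[i] for row in rows) for i in range(len(rows[0]))]

rows = ["virvezloxmorhujsantegvirzopsancowlic",
        "poshujhujloxzopmarposmorcowfanvirlic",
        "morfanbacsansanfanzopvirposkilposfan"]   # first three rows of Table 1 only
for column in transpose(rows):
    print(column)   # each printed string is one column of the 2-dimensional arrangement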
4. CONCLUSIONS

The AM model demonstrates interesting information processing behavior by being able to discover and memorize patterns from input data. It resembles the adaptiveness of biological associative memories, which are also able to memorize (short-term, long-term) patterns and perform similarity-based retrieval.

Acknowledgments. We wish to thank Dr. M. Tara for discussions and Dr. T. Spircu for suggestions.

5. REFERENCES

1. Li, M. and P.M.B. Vitányi, An Introduction to Kolmogorov Complexity and Its Applications. 2nd ed. 1997, New York: Springer.
2. Chaitin, G.J., Gödel's Theorem and Information. International Journal of Theoretical Physics, 1982. 22: p. 941-954.
3. Chaitin, G.J., Randomness and Mathematical Proof. Scientific American, 1975. 232(5): p. 47-52.
4. Rumelhart, D.E., G.E. Hinton, and R.J. Williams, Parallel Distributed Processing, ed. D.E. Rumelhart and J.L. McClelland. Vol. 1-2. 1986: MIT Press. p. 318-362.
5. Pfleger, K., On-Line Learning of Predictive Compositional Hierarchies, Department of Computer Science. 2002, Stanford University.
6. Pantazi, S.V., J.F. Arocha, and J.R. Moehr, Case-based Medical Informatics. BMC Medical Informatics and Decision Making, 2004. 4(1).
7. Elman, J.L., Finding Structure in Time. Cognitive Science, 1990. 14(2): p. 179-211.
8. Hammer, B., Recurrent networks for structured data - A unifying approach and its properties. Cognitive Systems Research, 2002. 3(2): p. 145-165.
9. Bower, G.H., An Associative Theory of Implicit and Explicit Memory, in Theories of Memory, M.A. Conway, S.E. Gathercole, and C. Cornoldi, Editors. 1998, Psychology Press: Hove, East Sussex. p. 25-60.
10. Wichert, A.M., Associative Computation, Faculty of Informatics. 2000, University of Ulm: Ulm.
11. Haykin, S.S., Neural Networks: A Comprehensive Foundation. 1994, New York; Toronto: Macmillan.
12. Balotta, C., et al., SARS coronavirus AS, complete genome. 2004.